Neural-symbolic action transformers for video question answering

US12175384B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12175384-B2
Application number	US-202117381408-A
Country	US
Kind code	B2
Filing date	Jul 21, 2021
Priority date	Jul 21, 2021
Publication date	Dec 24, 2024
Grant date	Dec 24, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Mechanisms are provided for performing artificial intelligence-based video question answering. A video parser parses an input video data sequence to generate situation data structure(s), each situation data structure comprising data elements corresponding to entities, and first relationships between entities, identified by the video parser as present in images of the input video data sequence. First machine learning computer model(s) operate on the situation data structure(s) to predict second relationship(s) between the situation data structure(s). Second machine learning computer model(s) execute on a received input question to predict an executable program to execute to answer the received question. The program is executed on the situation data structure(s) and predicted second relationship(s). An answer to the question is output based on results of executing the program.
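As a rough illustration of the data flow in the abstract, the sketch below models a "situation data structure" as a set of entities plus relationships between them, and flattens several situations into a token sequence. The class names, field names, and the `<frame:N>` token format are assumptions made for this example; the patent describes a learned situation encoder, which this hand-written function only stands in for.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relationship:
    """A first relationship between two entities inside one situation."""
    subject: str
    predicate: str
    obj: str


@dataclass
class Situation:
    """Hypothetical situation data structure for one image or frame."""
    frame: int
    entities: set[str]
    relationships: list[Relationship] = field(default_factory=list)


def encode_situations(situations: list[Situation]) -> list[str]:
    """Flatten situations into a token sequence, ordered by frame.

    A toy stand-in for the situation encoder: it emits a frame marker
    followed by (subject, predicate, object) tokens for each relationship.
    """
    tokens: list[str] = []
    for situation in sorted(situations, key=lambda s: s.frame):
        tokens.append(f"<frame:{situation.frame}>")
        for rel in situation.relationships:
            tokens.extend([rel.subject, rel.predicate, rel.obj])
    return tokens


# Usage: two parsed frames, passed in out of order on purpose.
frame0 = Situation(0, {"person", "cup"}, [Relationship("person", "holds", "cup")])
frame1 = Situation(1, {"person", "table"}, [Relationship("person", "near", "table")])
print(encode_situations([frame1, frame0]))
```

In the claimed method, a sequence like this would be passed to a trained dynamics transformer that predicts the tokens that follow; that model is not sketched here.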

First claim

Opening claim text (preview).

What is claimed is:

1. A method, in a data processing system, for performing artificial intelligence-based video question answering, the method comprising: parsing, by a video parser of the data processing system, an input video data sequence to generate a plurality of situation data structures, each situation data structure comprising data elements corresponding to entities, and first relationships between entities, identified by the video parser as present in one or more images of the input video data sequence; executing at least one first machine learning computer model of the data processing system on the plurality of situation data structures to predict one or more second relationships between at least two of the situation data structures in the plurality of situation data structures; determining, by at least one second machine learning computer model of the data processing system executing on a received input natural language question, an executable program to execute to answer the received input natural language question; executing, by the data processing system, the determined executable program on the plurality of situation data structures and the predicted one or more second relationships; and outputting, by the data processing system, an answer to the input natural language question based on results of executing the determined executable program, wherein the at least one first machine learning computer model comprises a situation encoder that encodes the situation data structures to generate a generated token sequence corresponding to the entities and relationships represented in the plurality of situation data structures, and a machine learning trained dynamics transformer computer model that processes the generated token sequence and predicts a subsequent token sequence, subsequent to the generated token sequence, to thereby generate a predicted token sequence.

2. The method of claim 1, wherein the at least one first machine learning computer model further comprises a sequence decoder, and wherein the sequence decoder generates the predicted one or more second relationships, based on the predicted token sequence, as one or more predicted hypergraph data structures.

3. The method of claim 2, wherein the at least one second machine learning computer model comprises a language/program parser and program executor, and wherein determining the executable program comprises: processing, by the language/program parser, the input natural language question to predict, based on terms or phrases in the input natural language question, a plurality of program modules to execute to answer the input natural language question, wherein the plurality of program modules are selected as a subset of a set of predefined program modules stored in a program module library; dynamically combining, by a program executor, the plurality of program modules into an executable program that is executed on the one or more predicted hypergraph data structures to generate a final answer to the input natural language question; and outputting, by the data processing system, the final answer to the input natural language question.

4. The method of claim 2, wherein each hypergraph data structure comprises one or more hyperedges connecting a first situation data structure in the plurality of situation data structures, to at least one second situation data structure.

5. The method of claim 4, wherein each hyperedge in the one or more hyperedges comprises a predicted action corresponding to at least one first entity in the first situation data structure with at least one second entity in the at least one second entity data structure.

6. The method of claim 1, wherein the input natural language question is a logical reasoning question of either an interaction question type, a sequence question type, a prediction question type, or a feasibility question type.

7. A computer program product comprising a non-transitory computer readable storage medium having a computer readable program stored therein, wherein the computer readable program, when executed in a data processing system, causes the data processing system to: parse, by a video parser of the data processing system, an input video data sequence to generate a plurality of situation data structures, each situation data structure comprising data elements corresponding to entities, and first relationships between entities, identified by the video parser as present in one or more images of the input video data sequence; execute at least one first machine learning computer model of the data processing system on the plurality of situation data structures to predict one or more second relationships between at least two of the situation data structures in the plurality of situation data structures; determine, by at least one second machine learning computer model of the data processing system executing on a received input natural language question, an executable program to execute to answer the received input natural language question; execute, by the data processing system, the determined executable program on the plurality of situation data structures and the predicted one or more second relationships; and output, by the data processing system, an answer to the input natural language question based on results of executing the determined executable program, wherein the at least one first machine learning computer model comprises a situation encoder that encodes the situation data structures to generate a generated token sequence corresponding to the entities and relationships represented in the plurality of situation data structures, and a machine learning trained dynamics transformer computer model that processes the generated token sequence and predicts a subsequent token sequence, subsequent to the generated token sequence, to thereby generate a predicted token sequence.

8. The computer program product of claim 7, wherein the at least one first machine learning computer model further comprises a sequence decoder, and wherein the sequence decoder generates the predicted one or more second relationships, based on the predicted token sequence, as one or more predicted hypergraph data structures.

9. The computer program product of claim 8, wherein the at least one second machine learning computer model comprises a language/program parser and program executor, and wherein the computer readable program further causes the data processing system to determine the executable program at least by: processing, by the language/program parser, the input natural language question to predict, based on terms or phrases in the input natural language question, a plurality of program modules to execute to answer the input natural language question, wherein the plurality of program modules are selected as a subset of a set of predefined program modules stored in a program module library; dynamically combining, by a program executor, the plurality of program modules into an executable program that is executed on the one or more predicted hypergraph data structures to generate a final answer to the input natural language question; and outputting, by the data processing system, the final answer to the input natural language question.

10. The computer program product of claim 8, wherein each hypergraph data structure comprises one or more hyperedges connecting a first situation data structure in the plurality of situation data structures, to at least one second situation data structure.

11. The computer program product of claim 10, wherein each hyperedge in the one or more hyperedges comprises a predicted action corresponding to at least one first entity in the first situation data structure wi…

Assignees

Inventors

Classifications

G06V20/49
Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
G06F40/205
Parsing
G06N5/02
Knowledge representation; Symbolic representation
G06F40/284
Lexical analysis, e.g. tokenisation or collocates
G06N20/20
Ensemble learning

Patent family

Related publications grouped by family.

View patent family 84976959

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12175384B2 cover?: Mechanisms are provided for performing artificial intelligence-based video question answering. A video parser parses an input video data sequence to generate situation data structure(s), each situation data structure comprising data elements corresponding to entities, and first relationships between entities, identified by the video parser as present in images of the input video data sequence. …
Who is the assignee on this patent?: IBM
What technology area does this patent fall under?: Primary CPC classification G06N5/04. Mapped technology areas include Physics.
When was this patent published?: Publication date Dec 24, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Related patents

Self-supervised visual-relationship probing

Systems and methods for high-order modeling of predictive hypotheses

Scene-Aware Video Dialog

Activation of remote devices in a networked system

Multimodal Entity and Coreference Resolution for Assistant Systems

Evaluating user responses based on bootstrapped knowledge acquisition from a limited knowledge domain

Progressively Extending Conversation Scope in Multi-User Messaging Platform
