Neural-symbolic action transformers for video question answering

US12175384B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12175384-B2
Application number  US-202117381408-A
Country             US
Kind code           B2
Filing date         Jul 21, 2021
Priority date       Jul 21, 2021
Publication date    Dec 24, 2024
Grant date          Dec 24, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Mechanisms are provided for performing artificial intelligence-based video question answering. A video parser parses an input video data sequence to generate situation data structure(s), each situation data structure comprising data elements corresponding to entities, and first relationships between entities, identified by the video parser as present in images of the input video data sequence. First machine learning computer model(s) operate on the situation data structure(s) to predict second relationship(s) between the situation data structure(s). Second machine learning computer model(s) execute on a received input question to predict an executable program to execute to answer the received question. The program is executed on the situation data structure(s) and predicted second relationship(s). An answer to the question is output based on results of executing the program.
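
As an illustration of the data the abstract describes, the following is a minimal Python sketch and not the patented implementation. The class names `Situation` and `Hyperedge`, their fields, and the sample entities are all hypothetical. The sketch models a situation data structure as a set of entities plus (subject, predicate, object) triples for the first relationships, and a predicted second relationship as an action linking two situations. The prediction here is hand-written; in the patent it comes from a machine learning model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Situation:
    """Hypothetical situation data structure for one image or clip."""
    frame: int
    entities: frozenset
    relations: frozenset  # first relationships as (subject, predicate, object)


@dataclass(frozen=True)
class Hyperedge:
    """Hypothetical second relationship: an action spanning several situations."""
    action: str
    actors: tuple
    frames: tuple  # frames of the situations this edge connects


# Hand-written stand-in for what a video parser might emit.
situations = [
    Situation(0, frozenset({"person", "cup", "table"}),
              frozenset({("cup", "on", "table"), ("person", "near", "table")})),
    Situation(1, frozenset({"person", "cup", "table"}),
              frozenset({("person", "holding", "cup")})),
]

# Hand-written stand-in for a model-predicted second relationship.
hyperedges = [Hyperedge("pick_up", ("person", "cup"), (0, 1))]


def actions_involving(entity, edges):
    """Return the actions of every hyperedge that mentions the entity."""
    return [edge.action for edge in edges if entity in edge.actors]


print(actions_involving("cup", hyperedges))
```

A question such as "What happened to the cup?" would then reduce to a query like `actions_involving("cup", hyperedges)` over the structured data rather than over raw pixels.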

First claim

Opening claim text (preview).

What is claimed is:

1. A method, in a data processing system, for performing artificial intelligence-based video question answering, the method comprising: parsing, by a video parser of the data processing system, an input video data sequence to generate a plurality of situation data structures, each situation data structure comprising data elements corresponding to entities, and first relationships between entities, identified by the video parser as present in one or more images of the input video data sequence; executing at least one first machine learning computer model of the data processing system on the plurality of situation data structures to predict one or more second relationships between at least two of the situation data structures in the plurality of situation data structures; determining, by at least one second machine learning computer model of the data processing system executing on a received input natural language question, an executable program to execute to answer the received input natural language question; executing, by the data processing system, the determined executable program on the plurality of situation data structures and the predicted one or more second relationships; and outputting, by the data processing system, an answer to the input natural language question based on results of executing the determined executable program, wherein the at least one first machine learning computer model comprises a situation encoder that encodes the situation data structures to generate a generated token sequence corresponding to the entities and relationships represented in the plurality of situation data structures, and a machine learning trained dynamics transformer computer model that processes the generated token sequence and predicts a subsequent token sequence, subsequent to the generated token sequence, to thereby generate a predicted token sequence.

2. The method of claim 1, wherein the at least one first machine learning computer model further comprises a sequence decoder, and wherein the sequence decoder generates the predicted one or more second relationships, based on the predicted token sequence, as one or more predicted hypergraph data structures.

3. The method of claim 2, wherein the at least one second machine learning computer model comprises a language/program parser and program executor, and wherein determining the executable program comprises: processing, by the language/program parser, the input natural language question to predict, based on terms or phrases in the input natural language question, a plurality of program modules to execute to answer the input natural language question, wherein the plurality of program modules are selected as a subset of a set of predefined program modules stored in a program module library; dynamically combining, by a program executor, the plurality of program modules into an executable program that is executed on the one or more predicted hypergraph data structures to generate a final answer to the input natural language question; and outputting, by the data processing system, the final answer to the input natural language question.

4. The method of claim 2, wherein each hypergraph data structure comprises one or more hyperedges connecting a first situation data structure in the plurality of situation data structures, to at least one second situation data structure.

5. The method of claim 4, wherein each hyperedge in the one or more hyperedges comprises a predicted action corresponding to at least one first entity in the first situation data structure with at least one second entity in the at least one second entity data structure.

6. The method of claim 1, wherein the input natural language question is a logical reasoning question of either an interaction question type, a sequence question type, a prediction question type, or a feasibility question type.

7. A computer program product comprising a non-transitory computer readable storage medium having a computer readable program stored therein, wherein the computer readable program, when executed in a data processing system, causes the data processing system to: parse, by a video parser of the data processing system, an input video data sequence to generate a plurality of situation data structures, each situation data structure comprising data elements corresponding to entities, and first relationships between entities, identified by the video parser as present in one or more images of the input video data sequence; execute at least one first machine learning computer model of the data processing system on the plurality of situation data structures to predict one or more second relationships between at least two of the situation data structures in the plurality of situation data structures; determine, by at least one second machine learning computer model of the data processing system executing on a received input natural language question, an executable program to execute to answer the received input natural language question; execute, by the data processing system, the determined executable program on the plurality of situation data structures and the predicted one or more second relationships; and output, by the data processing system, an answer to the input natural language question based on results of executing the determined executable program, wherein the at least one first machine learning computer model comprises a situation encoder that encodes the situation data structures to generate a generated token sequence corresponding to the entities and relationships represented in the plurality of situation data structures, and a machine learning trained dynamics transformer computer model that processes the generated token sequence and predicts a subsequent token sequence, subsequent to the generated token sequence, to thereby generate a predicted token sequence.

8. The computer program product of claim 7, wherein the at least one first machine learning computer model further comprises a sequence decoder, and wherein the sequence decoder generates the predicted one or more second relationships, based on the predicted token sequence, as one or more predicted hypergraph data structures.

9. The computer program product of claim 8, wherein the at least one second machine learning computer model comprises a language/program parser and program executor, and wherein the computer readable program further causes the data processing system to determine the executable program at least by: processing, by the language/program parser, the input natural language question to predict, based on terms or phrases in the input natural language question, a plurality of program modules to execute to answer the input natural language question, wherein the plurality of program modules are selected as a subset of a set of predefined program modules stored in a program module library; dynamically combining, by a program executor, the plurality of program modules into an executable program that is executed on the one or more predicted hypergraph data structures to generate a final answer to the input natural language question; and outputting, by the data processing system, the final answer to the input natural language question.

10. The computer program product of claim 8, wherein each hypergraph data structure comprises one or more hyperedges connecting a first situation data structure in the plurality of situation data structures, to at least one second situation data structure.

11. The computer program product of claim 10, wherein each hyperedge in the one or more hyperedges comprises a predicted action corresponding to at least one first entity in the first situation data structure wi
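
Two steps named in the claims can be sketched in a few lines of Python: encoding situation data structures into a token sequence (claim 1) and combining predefined program modules from a library into an executable program that runs over predicted hyperedges (claim 3). This is a hedged illustration only. The function names, the `<frame:N>` token format, the module names `filter_actor` and `query_action`, and the sample data are assumptions that do not appear in the patent. The dynamics transformer and the language/program parser are not implemented; their outputs are hand-written stand-ins.

```python
def encode_situations(situations):
    """Flatten (frame, triples) pairs into one token sequence.

    Stand-in for the situation encoder; the token format is an assumption.
    """
    tokens = []
    for frame, triples in situations:
        tokens.append(f"<frame:{frame}>")
        for subj, pred, obj in sorted(triples):
            tokens.extend([subj, pred, obj])
    return tokens


# Hypothetical program module library: each module transforms the running state.
def filter_actor(edges, actor):
    """Keep only hyperedges whose actors include the given entity."""
    return [edge for edge in edges if actor in edge["actors"]]


def query_action(edges):
    """Read out the action label of each remaining hyperedge."""
    return [edge["action"] for edge in edges]


MODULE_LIBRARY = {"filter_actor": filter_actor, "query_action": query_action}


def run_program(program, hyperedges):
    """Chain the selected modules, feeding each output into the next module."""
    state = hyperedges
    for name, args in program:
        state = MODULE_LIBRARY[name](state, *args)
    return state


situations = [
    (0, {("cup", "on", "table")}),
    (1, {("person", "holding", "cup")}),
]
tokens = encode_situations(situations)

# Stand-in for a hyperedge decoded from a predicted token sequence.
hyperedges = [{"action": "pick_up", "actors": ("person", "cup"), "frames": (0, 1)}]

# Stand-in for the parser's output on "What did the person do with the cup?"
program = [("filter_actor", ("cup",)), ("query_action", ())]
answer = run_program(program, hyperedges)

print(tokens)
print(answer)
```

Keeping the modules in a dictionary keyed by name mirrors the idea of selecting a subset of predefined modules per question: a different question yields a different list of `(name, args)` steps while the executor stays the same.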

Assignees

IBM

Inventors

Classifications

  • Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

  • Parsing

  • Knowledge representation; Symbolic representation

  • Lexical analysis, e.g. tokenisation or collocates

  • Ensemble learning

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12175384B2 cover?
Mechanisms are provided for performing artificial intelligence-based video question answering. A video parser parses an input video data sequence to generate situation data structure(s), each situation data structure comprising data elements corresponding to entities, and first relationships between entities, identified by the video parser as present in images of the input video data sequence. …
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06N5/04. Mapped technology areas include Physics.
When was this patent published?
Publication date: Dec 24, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).