Self-supervised visual-relationship probing
US-2022147838-A1 · May 12, 2022 · US
US12175384B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12175384-B2 |
| Application number | US-202117381408-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jul 21, 2021 |
| Priority date | Jul 21, 2021 |
| Publication date | Dec 24, 2024 |
| Grant date | Dec 24, 2024 |
Official abstract text for this publication.
Mechanisms are provided for performing artificial intelligence-based video question answering. A video parser parses an input video data sequence to generate situation data structure(s), each situation data structure comprising data elements corresponding to entities, and first relationships between entities, identified by the video parser as present in images of the input video data sequence. First machine learning computer model(s) operate on the situation data structure(s) to predict second relationship(s) between the situation data structure(s). Second machine learning computer model(s) execute on a received input question to predict an executable program to execute to answer the received question. The program is executed on the situation data structure(s) and predicted second relationship(s). An answer to the question is output based on results of executing the program.
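The flow in the abstract (situation data structures in, a predicted program executed over them, an answer out) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `Relationship`, `Situation`, `MODULES`, and `execute`, the three toy modules, and the hand-written inputs are hypothetical and do not come from the patent. The learned components (video parser, relationship predictor, question-to-program model) are replaced by fixed data, so this shows only the data shapes and the execution step.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass(frozen=True)
class Relationship:
    """A first relationship between two entities within one situation."""
    subject: str
    predicate: str
    obj: str


@dataclass
class Situation:
    """Hypothetical situation data structure for one or more video frames."""
    frame_index: int
    entities: Set[str]
    relationships: List[Relationship] = field(default_factory=list)


# Toy program modules (assumed, not the patent's module library).
def filter_entity(situations: List[Situation], entity: str) -> List[Situation]:
    """Keep only the situations in which the entity appears."""
    return [s for s in situations if entity in s.entities]


def query_predicates(situations: List[Situation], subject: str, obj: str) -> List[str]:
    """Collect, in order, the predicates linking subject to obj."""
    return [
        r.predicate
        for s in situations
        for r in s.relationships
        if r.subject == subject and r.obj == obj
    ]


def last(items: list):
    """Return the final item, or None if there is none."""
    return items[-1] if items else None


MODULES: Dict[str, Callable] = {
    "filter_entity": filter_entity,
    "query_predicates": query_predicates,
    "last": last,
}

Program = List[Tuple[str, tuple]]


def execute(program: Program, situations: List[Situation]):
    """Run each module on the output of the previous one."""
    state = situations
    for name, args in program:
        state = MODULES[name](state, *args)
    return state


# Hand-written stand-ins for the parser output and the predicted program.
situations = [
    Situation(0, {"person", "cup"}, [Relationship("person", "holding", "cup")]),
    Situation(1, {"person", "cup", "table"}, [Relationship("person", "putting_down", "cup")]),
]
program: Program = [
    ("filter_entity", ("cup",)),
    ("query_predicates", ("person", "cup")),
    ("last", ()),
]
answer = execute(program, situations)
print(answer)
```

In this sketch the program is a flat chain of modules; a real system could just as well compose modules as a tree, which the abstract does not rule out.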
Opening claim text (preview).
What is claimed is:

1. A method, in a data processing system, for performing artificial intelligence-based video question answering, the method comprising:
   parsing, by a video parser of the data processing system, an input video data sequence to generate a plurality of situation data structures, each situation data structure comprising data elements corresponding to entities, and first relationships between entities, identified by the video parser as present in one or more images of the input video data sequence;
   executing at least one first machine learning computer model of the data processing system on the plurality of situation data structures to predict one or more second relationships between at least two of the situation data structures in the plurality of situation data structures;
   determining, by at least one second machine learning computer model of the data processing system executing on a received input natural language question, an executable program to execute to answer the received input natural language question;
   executing, by the data processing system, the determined executable program on the plurality of situation data structures and the predicted one or more second relationships; and
   outputting, by the data processing system, an answer to the input natural language question based on results of executing the determined executable program,
   wherein the at least one first machine learning computer model comprises a situation encoder that encodes the situation data structures to generate a generated token sequence corresponding to the entities and relationships represented in the plurality of situation data structures, and a machine learning trained dynamics transformer computer model that processes the generated token sequence and predicts a subsequent token sequence, subsequent to the generated token sequence, to thereby generate a predicted token sequence.

2. The method of claim 1, wherein the at least one first machine learning computer model further comprises a sequence decoder, and wherein the sequence decoder generates the predicted one or more second relationships, based on the predicted token sequence, as one or more predicted hypergraph data structures.

3. The method of claim 2, wherein the at least one second machine learning computer model comprises a language/program parser and program executor, and wherein determining the executable program comprises:
   processing, by the language/program parser, the input natural language question to predict, based on terms or phrases in the input natural language question, a plurality of program modules to execute to answer the input natural language question, wherein the plurality of program modules are selected as a subset of a set of predefined program modules stored in a program module library;
   dynamically combining, by a program executor, the plurality of program modules into an executable program that is executed on the one or more predicted hypergraph data structures to generate a final answer to the input natural language question; and
   outputting, by the data processing system, the final answer to the input natural language question.

4. The method of claim 2, wherein each hypergraph data structure comprises one or more hyperedges connecting a first situation data structure in the plurality of situation data structures, to at least one second situation data structure.

5. The method of claim 4, wherein each hyperedge in the one or more hyperedges comprises a predicted action corresponding to at least one first entity in the first situation data structure with at least one second entity in the at least one second entity data structure.

6. The method of claim 1, wherein the input natural language question is a logical reasoning question of either an interaction question type, a sequence question type, a prediction question type, or a feasibility question type.

7. A computer program product comprising a non-transitory computer readable storage medium having a computer readable program stored therein, wherein the computer readable program, when executed in a data processing system, causes the data processing system to:
   parse, by a video parser of the data processing system, an input video data sequence to generate a plurality of situation data structures, each situation data structure comprising data elements corresponding to entities, and first relationships between entities, identified by the video parser as present in one or more images of the input video data sequence;
   execute at least one first machine learning computer model of the data processing system on the plurality of situation data structures to predict one or more second relationships between at least two of the situation data structures in the plurality of situation data structures;
   determine, by at least one second machine learning computer model of the data processing system executing on a received input natural language question, an executable program to execute to answer the received input natural language question;
   execute, by the data processing system, the determined executable program on the plurality of situation data structures and the predicted one or more second relationships; and
   output, by the data processing system, an answer to the input natural language question based on results of executing the determined executable program,
   wherein the at least one first machine learning computer model comprises a situation encoder that encodes the situation data structures to generate a generated token sequence corresponding to the entities and relationships represented in the plurality of situation data structures, and a machine learning trained dynamics transformer computer model that processes the generated token sequence and predicts a subsequent token sequence, subsequent to the generated token sequence, to thereby generate a predicted token sequence.

8. The computer program product of claim 7, wherein the at least one first machine learning computer model further comprises a sequence decoder, and wherein the sequence decoder generates the predicted one or more second relationships, based on the predicted token sequence, as one or more predicted hypergraph data structures.

9. The computer program product of claim 8, wherein the at least one second machine learning computer model comprises a language/program parser and program executor, and wherein the computer readable program further causes the data processing system to determine the executable program at least by:
   processing, by the language/program parser, the input natural language question to predict, based on terms or phrases in the input natural language question, a plurality of program modules to execute to answer the input natural language question, wherein the plurality of program modules are selected as a subset of a set of predefined program modules stored in a program module library;
   dynamically combining, by a program executor, the plurality of program modules into an executable program that is executed on the one or more predicted hypergraph data structures to generate a final answer to the input natural language question; and
   outputting, by the data processing system, the final answer to the input natural language question.

10. The computer program product of claim 8, wherein each hypergraph data structure comprises one or more hyperedges connecting a first situation data structure in the plurality of situation data structures, to at least one second situation data structure.

11. The computer program product of claim 10, wherein each hyperedge in the one or more hyperedges comprises a predicted action corresponding to at least one first entity in the first situation data structure wi
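
Claims 1, 2, 4 and 5 describe two data shapes that a short sketch may make easier to picture: a token sequence produced from situation data structures, and hyperedges that link one situation to one or more others through a predicted action. The following is a rough illustration under stated assumptions, not the claimed encoder or decoder. The triple layout, the `[SIT]` separator token, and the names `encode_situations`, `Hyperedge`, and `linked_situations` are all hypothetical, and the dynamics transformer that would predict the next tokens is deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed layout: (subject entity, relationship, object entity).
Triple = Tuple[str, str, str]
SEP = "[SIT]"  # hypothetical token marking the start of each situation


def encode_situations(situations: List[List[Triple]]) -> List[str]:
    """Flatten situations into one token sequence, one separator per situation."""
    tokens: List[str] = []
    for triples in situations:
        tokens.append(SEP)
        for subject, relation, obj in triples:
            tokens.extend([subject, relation, obj])
    return tokens


@dataclass(frozen=True)
class Hyperedge:
    """A predicted action from an entity in one situation to entities in others."""
    action: str
    source: Tuple[int, str]                # (situation index, entity)
    targets: Tuple[Tuple[int, str], ...]   # one or more (situation index, entity)


def linked_situations(edges: List[Hyperedge], source_index: int) -> List[int]:
    """Indices of situations reachable from source_index through any hyperedge."""
    linked = {
        idx
        for edge in edges
        if edge.source[0] == source_index
        for idx, _ in edge.targets
    }
    return sorted(linked)


situations = [
    [("person", "holding", "cup")],
    [("person", "putting_down", "cup")],
]
tokens = encode_situations(situations)

# Hand-written stand-in for what a sequence decoder might emit.
edges = [Hyperedge("put_down", (0, "person"), ((1, "cup"),))]
print(tokens)
print(linked_situations(edges, 0))
```

Storing targets as a tuple lets one hyperedge connect a situation to several later situations at once, which is the property that distinguishes a hypergraph from an ordinary graph.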
Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes · CPC title
Parsing · CPC title
Knowledge representation; Symbolic representation · CPC title
Lexical analysis, e.g. tokenisation or collocates · CPC title
Ensemble learning · CPC title