System and method for content comprehension and response
US-2022138433-A1 · May 5, 2022 · US
US12561522B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12561522-B2 |
| Application number | US-202318387728-A |
| Country | US |
| Kind code | B2 |
| Filing date | Nov 7, 2023 |
| Priority date | Nov 9, 2022 |
| Publication date | Feb 24, 2026 |
| Grant date | Feb 24, 2026 |
Official abstract text for this publication.
A method of performing visual question answering (VQA), including: obtaining an image and a question corresponding to the image; generating a plurality of feature predictions about at least one object included in the image by providing the image to an artificial intelligence (AI) scene perception model; generating a plurality of symbolic programs and a plurality of program confidence scores by providing the question to an AI question parsing model; selecting a symbolic program associated with a program confidence score which is highest among the plurality of program confidence scores; executing the selected symbolic program by performing a set of logic operations included in the selected symbolic program on the plurality of feature predictions; and determining a natural language answer to the question based on a result of the set of logic operations.
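The flow in the abstract (perceive the scene, parse the question into candidate programs, pick the highest-confidence program, execute it, phrase the answer) can be sketched as below. This is only an illustrative sketch: `perceive_scene` and `parse_question` are hypothetical stand-ins that return fixed values in place of the AI models, and the operation names `filter_by` and `count` are assumptions, not terms taken from the patent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

ObjectFeatures = Dict[str, str]          # e.g. {"color": "red", "shape": "cube"}
Operation = Callable[[object], object]   # one logic operation on the running state


@dataclass
class SymbolicProgram:
    operations: List[Operation]  # the set of logic operations, in execution order
    confidence: float            # program confidence score


def filter_by(key: str, value: str) -> Operation:
    """Hypothetical logic operation: keep objects whose feature `key` equals `value`."""
    def op(state):
        return [obj for obj in state if obj.get(key) == value]
    return op


def count(state) -> int:
    """Hypothetical logic operation: count the objects that remain."""
    return len(state)


def perceive_scene(image) -> List[ObjectFeatures]:
    """Stand-in for the AI scene perception model (fixed output, ignores `image`)."""
    return [
        {"color": "red", "shape": "cube"},
        {"color": "red", "shape": "sphere"},
        {"color": "red", "shape": "cylinder"},
        {"color": "blue", "shape": "cube"},
    ]


def parse_question(question: str) -> List[SymbolicProgram]:
    """Stand-in for the AI question parsing model (fixed candidates, ignores `question`)."""
    return [
        SymbolicProgram([filter_by("shape", "cube"), count], confidence=0.4),
        SymbolicProgram([filter_by("color", "red"), count], confidence=0.9),
    ]


def select_program(programs: List[SymbolicProgram]) -> SymbolicProgram:
    """Pick the program whose confidence score is highest."""
    return max(programs, key=lambda p: p.confidence)


def execute(program: SymbolicProgram, features: List[ObjectFeatures]):
    """Apply each logic operation in turn, starting from the feature predictions."""
    state = features
    for operation in program.operations:
        state = operation(state)
    return state


def to_answer(result) -> str:
    """Turn the execution result into a short natural language answer."""
    if isinstance(result, bool):
        return "yes" if result else "no"
    return str(result)


def answer_question(image, question: str) -> str:
    features = perceive_scene(image)
    best = select_program(parse_question(question))
    return to_answer(execute(best, features))
```

With the fixed stand-in outputs above, `answer_question(None, "How many red objects are there?")` selects the program scored 0.9 (filter by red, then count) and returns `"3"`.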
Opening claim text (preview).
What is claimed is:

1. A method of performing visual question answering (VQA), the method comprising: obtaining an image and a question corresponding to the image; generating a plurality of feature predictions about at least one object included in the image by providing the image to an artificial intelligence (AI) scene perception model; generating a plurality of symbolic programs and a plurality of program confidence scores by providing the question to an AI question parsing model, wherein each symbolic program of the plurality of symbolic programs comprises a set of logic operations and is associated with a corresponding program confidence score from among the plurality of program confidence scores; selecting a symbolic program associated with a program confidence score which is highest among the plurality of program confidence scores; executing the selected symbolic program by performing the set of logic operations included in the selected symbolic program on the plurality of feature predictions; and determining a natural language answer to the question based on a result of the set of logic operations, wherein the set of logic operations included in the selected symbolic program is associated with a plurality of operation confidence scores, and wherein each logic operation of the set of logic operations is associated with an operation confidence score from among the plurality of operation confidence scores.
2. The method of claim 1, wherein the image and the question are received from a user, and wherein the method further comprises providing the natural language answer to the user as a response to the question.
3. The method of claim 1, wherein the plurality of feature predictions is associated with a plurality of feature confidence scores generated by the AI scene perception model.
4. The method of claim 3, wherein the operation confidence score corresponding to each logic operation of the set of logic operations is determined based on at least one of the program confidence score and the plurality of feature confidence scores.
5. The method of claim 4, further comprising: based on at least one confidence score of the plurality of program confidence scores, the plurality of feature confidence scores, and the plurality of operation confidence scores being below a threshold value: obtaining user input corresponding to the at least one confidence score; and adjusting the at least one confidence score based on the user input.
6. The method of claim 4, further comprising determining an answer confidence score corresponding to the natural language answer based on the plurality of operation confidence scores.
7. The method of claim 1, further comprising: generating augmented training data based on the plurality of symbolic programs; and training the AI scene perception model based on the augmented training data.
8. The method of claim 7, wherein the generating of the augmented training data comprises: generating a first ranking of the plurality of symbolic programs based on the plurality of program confidence scores; generating a second ranking of the plurality of symbolic programs based on a plurality of agreement losses between the plurality of program confidence scores and the question; selecting a subset of the plurality of symbolic programs based on the first ranking and the second ranking; and generating the augmented training data based on outputs of the subset of the plurality of symbolic programs.
9. An apparatus for performing visual question answering (VQA), the apparatus comprising: a memory configured to store instructions; and at least one processor configured to execute the instructions to: obtain an image and a question corresponding to the image; generate a plurality of feature predictions about at least one object included in the image by providing the image to an artificial intelligence (AI) scene perception model; generate a plurality of symbolic programs and a plurality of program confidence scores by providing the question to an AI question parsing model, wherein each symbolic program of the plurality of symbolic programs comprises a set of logic operations and is associated with a corresponding program confidence score from among the plurality of program confidence scores; select a symbolic program associated with a program confidence score which is highest among the plurality of program confidence scores; execute the selected symbolic program by performing the set of logic operations included in the selected symbolic program on the plurality of feature predictions; and determine a natural language answer to the question based on a result of the set of logic operations, wherein the set of logic operations included in the selected symbolic program is associated with a plurality of operation confidence scores, and wherein each logic operation of the set of logic operations is associated with an operation confidence score from among the plurality of operation confidence scores.
10. The apparatus of claim 9, wherein the image and the question are received from a user, and wherein the at least one processor is further configured to execute the instructions to provide the natural language answer to the user as a response to the question.
11. The apparatus of claim 9, wherein the plurality of feature predictions is associated with a plurality of feature confidence scores generated by the AI scene perception model.
12. The apparatus of claim 11, wherein the operation confidence score corresponding to each logic operation of the set of logic operations is determined based on at least one of the program confidence score and the plurality of feature confidence scores.
13. The apparatus of claim 12, wherein the at least one processor is further configured to execute the instructions to: based on at least one confidence score of the plurality of program confidence scores, the plurality of feature confidence scores, and the plurality of operation confidence scores being below a threshold value: obtain user input corresponding to the at least one confidence score; and adjust the at least one confidence score based on the user input.
14. The apparatus of claim 12, wherein the at least one processor is further configured to execute the instructions to determine an answer confidence score corresponding to the natural language answer based on the plurality of operation confidence scores.
15. The apparatus of claim 9, wherein the at least one processor is further configured to execute the instructions to: generate augmented training data based on the plurality of symbolic programs; and train the AI scene perception model based on the augmented training data.
16. The apparatus of claim 15, wherein to generate the augmented training data, the at least one processor is further configured to: generate a first ranking of the plurality of symbolic programs based on the plurality of program confidence scores; generate a second ranking of the plurality of symbolic programs based on a plurality of agreement losses between the plurality of program confidence scores and the question; select a subset of the plurality of symbolic programs based on the first ranking and the second ranking; and generate the augmented training data based on outputs of the subset of the plurality of symbolic programs.
17. A non-transitory computer readable medium storing instructions which, when executed by at least one processor of a device for performing visual question answering (
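The dependent claims on confidence scores (claims 4 to 6, mirrored in claims 12 to 14) can be illustrated with a small sketch. The claims state only what each score is "based on", not how it is computed, so the combination rules here are assumptions chosen for illustration: an operation score is taken as the program confidence times the lowest feature confidence the operation touches, and the answer score as the lowest operation score. All function names are hypothetical.

```python
from typing import Dict, List


def operation_confidences(program_conf: float,
                          feature_confs_per_op: List[List[float]]) -> List[float]:
    """Assumed rule: program confidence times the weakest feature confidence used.

    An operation that touches no feature prediction inherits the program score.
    """
    scores = []
    for confs in feature_confs_per_op:
        feature_part = min(confs) if confs else 1.0
        scores.append(program_conf * feature_part)
    return scores


def answer_confidence(op_confs: List[float]) -> float:
    """Assumed rule: the answer is only as reliable as its weakest operation."""
    return min(op_confs) if op_confs else 0.0


def below_threshold(scores: Dict[str, float], threshold: float) -> List[str]:
    """Names of the scores that would trigger a request for user input."""
    return [name for name, score in scores.items() if score < threshold]


def apply_user_input(scores: Dict[str, float],
                     user_input: Dict[str, float]) -> Dict[str, float]:
    """Return a copy of `scores` with user-supplied values replacing known entries."""
    updated = dict(scores)
    for name, value in user_input.items():
        if name in updated:
            updated[name] = value
    return updated
```

A short usage example: with a program confidence of 0.9 and two operations, the first touching features scored 0.8 and 0.5 and the second touching none, the operation scores come out as roughly 0.45 and 0.9, and the answer score as roughly 0.45. With a threshold of 0.5, only the first operation would be flagged for user input.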
using ranking · CPC title
Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound · CPC title
based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO] · CPC title
Backpropagation, e.g. using gradient descent · CPC title
Knowledge-based neural networks; Logical representations of neural networks · CPC title