Confidence-based interactable neural-symbolic visual question answering

US12561522B2 · US · B2

Patent metadata
Field | Value
Publication number | US-12561522-B2
Application number | US-202318387728-A
Country | US
Kind code | B2
Filing date | Nov 7, 2023
Priority date | Nov 9, 2022
Publication date | Feb 24, 2026
Grant date | Feb 24, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method of performing visual question answering (VQA), including: obtaining an image and a question corresponding to the image; generating a plurality of feature predictions about at least one object included in the image by providing the image to an artificial intelligence (AI) scene perception model; generating a plurality of symbolic programs and a plurality of program confidence scores by providing the question to an AI question parsing model; selecting a symbolic program associated with a program confidence score which is highest among the plurality of program confidence scores; executing the selected symbolic program by performing a set of logic operations included in the selected symbolic program on the plurality of feature predictions; and determining a natural language answer to the question based on a result of the set of logic operations.
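The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the scene format, the `Program` class, and the four operation names (`filter_color`, `filter_shape`, `count`, `exist`) are hypothetical stand-ins for the outputs of the scene perception and question parsing models, which are not specified on this page.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed scene format: one dict of predicted attributes per detected object.
Scene = list[dict[str, str]]


@dataclass
class Program:
    """A candidate symbolic program plus the parser's confidence in it."""
    ops: list[tuple[str, Optional[str]]]  # (operation name, optional argument)
    confidence: float


def execute(program: Program, scene: Scene):
    """Run each logic operation in order on the feature predictions."""
    state = scene
    for name, arg in program.ops:
        if name == "filter_color":
            state = [obj for obj in state if obj["color"] == arg]
        elif name == "filter_shape":
            state = [obj for obj in state if obj["shape"] == arg]
        elif name == "count":
            state = len(state)
        elif name == "exist":
            state = len(state) > 0
        else:
            raise ValueError(f"unknown operation: {name}")
    return state


def answer(scene: Scene, candidates: list[Program]) -> str:
    """Select the highest-confidence program, execute it, and phrase the result."""
    best = max(candidates, key=lambda p: p.confidence)
    result = execute(best, scene)
    if isinstance(result, bool):  # check bool before falling back to str()
        return "yes" if result else "no"
    return str(result)


# Hypothetical inputs for the question "How many red objects are there?"
scene = [
    {"color": "red", "shape": "cube"},
    {"color": "red", "shape": "sphere"},
    {"color": "blue", "shape": "cube"},
]
candidates = [
    Program(ops=[("filter_color", "red"), ("count", None)], confidence=0.9),
    Program(ops=[("filter_color", "blue"), ("filter_shape", "sphere"), ("exist", None)],
            confidence=0.1),
]
print(answer(scene, candidates))
```

Only the selection of the highest-confidence program and the sequential execution of its operations come from the abstract; everything else is an illustrative assumption.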

First claim

Opening claim text (preview).

What is claimed is:

1. A method of performing visual question answering (VQA), the method comprising: obtaining an image and a question corresponding to the image; generating a plurality of feature predictions about at least one object included in the image by providing the image to an artificial intelligence (AI) scene perception model; generating a plurality of symbolic programs and a plurality of program confidence scores by providing the question to an AI question parsing model, wherein each symbolic program of the plurality of symbolic programs comprises a set of logic operations and is associated with a corresponding program confidence score from among the plurality of program confidence scores; selecting a symbolic program associated with a program confidence score which is highest among the plurality of program confidence scores; executing the selected symbolic program by performing the set of logic operations included in the selected symbolic program on the plurality of feature predictions; and determining a natural language answer to the question based on a result of the set of logic operations, wherein the set of logic operations included in the selected symbolic program is associated with a plurality of operation confidence scores, and wherein each logic operation of the set of logic operations is associated with an operation confidence score from among the plurality of operation confidence scores.

2. The method of claim 1, wherein the image and the question are received from a user, and wherein the method further comprises providing the natural language answer to the user as a response to the question.

3. The method of claim 1, wherein the plurality of feature predictions is associated with a plurality of feature confidence scores generated by the AI scene perception model.

4. The method of claim 3, wherein the operation confidence score corresponding to each logic operation of the set of logic operations is determined based on at least one of the program confidence score and the plurality of feature confidence scores.

5. The method of claim 4, further comprising: based on at least one confidence score of the plurality of program confidence scores, the plurality of feature confidence scores, and the plurality of operation confidence scores being below a threshold value: obtaining user input corresponding to the at least one confidence score; and adjusting the at least one confidence score based on the user input.

6. The method of claim 4, further comprising determining an answer confidence score corresponding to the natural language answer based on the plurality of operation confidence scores.

7. The method of claim 1, further comprising: generating augmented training data based on the plurality of symbolic programs; and training the AI scene perception model based on the augmented training data.

8. The method of claim 7, wherein the generating of the augmented training data comprises: generating a first ranking of the plurality of symbolic programs based on the plurality of program confidence scores; generating a second ranking of the plurality of symbolic programs based on a plurality of agreement losses between the plurality of program confidence scores and the question; selecting a subset of the plurality of symbolic programs based on the first ranking and the second ranking; and generating the augmented training data based on outputs of the subset of the plurality of symbolic programs.

9. An apparatus for performing visual question answering (VQA), the apparatus comprising: a memory configured to store instructions; and at least one processor configured to execute the instructions to: obtain an image and a question corresponding to the image; generate a plurality of feature predictions about at least one object included in the image by providing the image to an artificial intelligence (AI) scene perception model; generate a plurality of symbolic programs and a plurality of program confidence scores by providing the question to an AI question parsing model, wherein each symbolic program of the plurality of symbolic programs comprises a set of logic operations and is associated with a corresponding program confidence score from among the plurality of program confidence scores; select a symbolic program associated with a program confidence score which is highest among the plurality of program confidence scores; execute the selected symbolic program by performing the set of logic operations included in the selected symbolic program on the plurality of feature predictions; and determine a natural language answer to the question based on a result of the set of logic operations, wherein the set of logic operations included in the selected symbolic program is associated with a plurality of operation confidence scores, and wherein each logic operation of the set of logic operations is associated with an operation confidence score from among the plurality of operation confidence scores.

10. The apparatus of claim 9, wherein the image and the question are received from a user, and wherein the at least one processor is further configured to execute the instructions to provide the natural language answer to the user as a response to the question.

11. The apparatus of claim 9, wherein the plurality of feature predictions is associated with a plurality of feature confidence scores generated by the AI scene perception model.

12. The apparatus of claim 11, wherein the operation confidence score corresponding to each logic operation of the set of logic operations is determined based on at least one of the program confidence score and the plurality of feature confidence scores.

13. The apparatus of claim 12, wherein the at least one processor is further configured to execute the instructions to: based on at least one confidence score of the plurality of program confidence scores, the plurality of feature confidence scores, and the plurality of operation confidence scores being below a threshold value: obtain user input corresponding to the at least one confidence score; and adjust the at least one confidence score based on the user input.

14. The apparatus of claim 12, wherein the at least one processor is further configured to execute the instructions to determine an answer confidence score corresponding to the natural language answer based on the plurality of operation confidence scores.

15. The apparatus of claim 9, wherein the at least one processor is further configured to execute the instructions to: generate augmented training data based on the plurality of symbolic programs; and train the AI scene perception model based on the augmented training data.

16. The apparatus of claim 15, wherein to generate the augmented training data, the at least one processor is further configured to: generate a first ranking of the plurality of symbolic programs based on the plurality of program confidence scores; generate a second ranking of the plurality of symbolic programs based on a plurality of agreement losses between the plurality of program confidence scores and the question; select a subset of the plurality of symbolic programs based on the first ranking and the second ranking; and generate the augmented training data based on outputs of the subset of the plurality of symbolic programs.

17. A non-transitory computer readable medium storing instructions which, when executed by at least one processor of a device for performing visual question answering (
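The confidence handling in the dependent method claims 4 to 6 (operation scores derived from program and feature scores, user adjustment of scores below a threshold, and an answer score derived from the operation scores) can be sketched as follows. The claims do not state how scores are combined, so the product-with-minimum rule for operations and the minimum rule for the answer are assumptions chosen for illustration, and all function names are hypothetical.

```python
def operation_confidences(program_conf: float,
                          feature_confs_per_op: list[list[float]]) -> list[float]:
    """One score per logic operation.

    Assumed rule: program confidence times the weakest feature confidence
    the operation relies on (1.0 if it touches no feature prediction).
    """
    return [program_conf * min(confs, default=1.0) for confs in feature_confs_per_op]


def low_confidence_indices(scores: list[float], threshold: float) -> list[int]:
    """Indices of scores below the threshold, i.e. candidates for user review."""
    return [i for i, score in enumerate(scores) if score < threshold]


def apply_user_feedback(scores: list[float], feedback: dict[int, float]) -> list[float]:
    """Return a copy of the scores with user-supplied values substituted in."""
    adjusted = list(scores)
    for index, value in feedback.items():
        adjusted[index] = value
    return adjusted


def answer_confidence(op_confs: list[float]) -> float:
    """Assumed rule: the answer is only as reliable as its weakest operation."""
    return min(op_confs, default=0.0)


# Hypothetical program with three operations; the second uses no feature prediction.
op_confs = operation_confidences(0.9, [[0.8, 0.5], [], [1.0]])
flagged = low_confidence_indices(op_confs, threshold=0.5)

# Suppose the user confirms the flagged operation is correct (score set to 1.0).
adjusted = apply_user_feedback(op_confs, {i: 1.0 for i in flagged})
print(flagged, answer_confidence(op_confs), answer_confidence(adjusted))
```

Other combination rules, such as a product or a weighted average, would fit the claim language equally well; the sketch only shows where each score enters the flow.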

Assignees

Samsung Electronics Co Ltd

Inventors

Classifications

  • using ranking

  • Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound

  • based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]

  • Backpropagation, e.g. using gradient descent

  • Knowledge-based neural networks; Logical representations of neural networks

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12561522B2 cover?
A method of performing visual question answering (VQA), including: obtaining an image and a question corresponding to the image; generating a plurality of feature predictions about at least one object included in the image by providing the image to an artificial intelligence (AI) scene perception model; generating a plurality of symbolic programs and a plurality of program confidence scores by …
Who is the assignee on this patent?
Samsung Electronics Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06F40/205. Mapped technology areas include Physics.
When was this patent published?
Publication date: Feb 24, 2026 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).