Confidence-based interactable neural-symbolic visual question answering

US12561522B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12561522-B2
Application number	US-202318387728-A
Country	US
Kind code	B2
Filing date	Nov 7, 2023
Priority date	Nov 9, 2022
Publication date	Feb 24, 2026
Grant date	Feb 24, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method of performing visual question answering (VQA), including: obtaining an image and a question corresponding to the image; generating a plurality of feature predictions about at least one object included in the image by providing the image to an artificial intelligence (AI) scene perception model; generating a plurality of symbolic programs and a plurality of program confidence scores by providing the question to an AI question parsing model; selecting a symbolic program associated with a program confidence score which is highest among the plurality of program confidence scores; executing the selected symbolic program by performing a set of logic operations included in the selected symbolic program on the plurality of feature predictions; and determining a natural language answer to the question based on a result of the set of logic operations.
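The steps in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Program` type, the `filter_*` and `count` operation names, and the hard-coded outputs of `perceive` and `parse`, which stand in for the AI scene perception model and the AI question parsing model. The publication does not disclose an implementation in this form.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Program:
    # Hypothetical representation: a list of (operation, argument) pairs.
    ops: List[Tuple[str, Optional[str]]]
    confidence: float  # program confidence score


def perceive(image: object) -> List[Dict[str, str]]:
    """Stand-in for the scene perception model: one feature dict per object."""
    return [
        {"shape": "cube", "color": "red"},
        {"shape": "sphere", "color": "red"},
        {"shape": "cube", "color": "blue"},
    ]


def parse(question: str) -> List[Program]:
    """Stand-in for the question parsing model: candidate programs with scores."""
    return [
        Program([("filter_color", "red"), ("count", None)], 0.91),
        Program([("filter_shape", "cube"), ("count", None)], 0.07),
    ]


def execute(program: Program, scene: List[Dict[str, str]]):
    """Run each logic operation in order on the feature predictions."""
    state = scene
    for op, arg in program.ops:
        if op.startswith("filter_"):
            key = op[len("filter_"):]
            state = [obj for obj in state if obj[key] == arg]
        elif op == "count":
            state = len(state)
        else:
            raise ValueError(f"unknown operation: {op}")
    return state


def answer(image: object, question: str) -> str:
    scene = perceive(image)
    candidates = parse(question)
    # Select the program whose confidence score is highest.
    best = max(candidates, key=lambda p: p.confidence)
    result = execute(best, scene)
    # Turn the execution result into a natural language answer.
    return f"The answer is {result}."


print(answer(None, "How many red objects are there?"))
```

In this sketch the red-filter program wins the selection because its assumed score (0.91) is the highest, so only that program is executed against the predicted features.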

First claim

Opening claim text (preview).

What is claimed is:

1. A method of performing visual question answering (VQA), the method comprising: obtaining an image and a question corresponding to the image; generating a plurality of feature predictions about at least one object included in the image by providing the image to an artificial intelligence (AI) scene perception model; generating a plurality of symbolic programs and a plurality of program confidence scores by providing the question to an AI question parsing model, wherein each symbolic program of the plurality of symbolic programs comprises a set of logic operations and is associated with a corresponding program confidence score from among the plurality of program confidence scores; selecting a symbolic program associated with a program confidence score which is highest among the plurality of program confidence scores; executing the selected symbolic program by performing the set of logic operations included in the selected symbolic program on the plurality of feature predictions; and determining a natural language answer to the question based on a result of the set of logic operations, wherein the set of logic operations included in the selected symbolic program is associated with a plurality of operation confidence scores, and wherein each logic operation of the set of logic operations is associated with an operation confidence score from among the plurality of operation confidence scores.

2. The method of claim 1, wherein the image and the question are received from a user, and wherein the method further comprises providing the natural language answer to the user as a response to the question.

3. The method of claim 1, wherein the plurality of feature predictions is associated with a plurality of feature confidence scores generated by the AI scene perception model.

4. The method of claim 3, wherein the operation confidence score corresponding to each logic operation of the set of logic operations is determined based on at least one of the program confidence score and the plurality of feature confidence scores.

5. The method of claim 4, further comprising: based on at least one confidence score of the plurality of program confidence scores, the plurality of feature confidence scores, and the plurality of operation confidence scores being below a threshold value: obtaining user input corresponding to the at least one confidence score; and adjusting the at least one confidence score based on the user input.

6. The method of claim 4, further comprising determining an answer confidence score corresponding to the natural language answer based on the plurality of operation confidence scores.

7. The method of claim 1, further comprising: generating augmented training data based on the plurality of symbolic programs; and training the AI scene perception model based on the augmented training data.

8. The method of claim 7, wherein the generating of the augmented training data comprises: generating a first ranking of the plurality of symbolic programs based on the plurality of program confidence scores; generating a second ranking of the plurality of symbolic programs based on a plurality of agreement losses between the plurality of program confidence scores and the question; selecting a subset of the plurality of symbolic programs based on the first ranking and the second ranking; and generating the augmented training data based on outputs of the subset of the plurality of symbolic programs.

9. An apparatus for performing visual question answering (VQA), the apparatus comprising: a memory configured to store instructions; and at least one processor configured to execute the instructions to: obtain an image and a question corresponding to the image; generate a plurality of feature predictions about at least one object included in the image by providing the image to an artificial intelligence (AI) scene perception model; generate a plurality of symbolic programs and a plurality of program confidence scores by providing the question to an AI question parsing model, wherein each symbolic program of the plurality of symbolic programs comprises a set of logic operations and is associated with a corresponding program confidence score from among the plurality of program confidence scores; select a symbolic program associated with a program confidence score which is highest among the plurality of program confidence scores; execute the selected symbolic program by performing the set of logic operations included in the selected symbolic program on the plurality of feature predictions; and determine a natural language answer to the question based on a result of the set of logic operations, wherein the set of logic operations included in the selected symbolic program is associated with a plurality of operation confidence scores, and wherein each logic operation of the set of logic operations is associated with an operation confidence score from among the plurality of operation confidence scores.

10. The apparatus of claim 9, wherein the image and the question are received from a user, and wherein the at least one processor is further configured to execute the instructions to provide the natural language answer to the user as a response to the question.

11. The apparatus of claim 9, wherein the plurality of feature predictions is associated with a plurality of feature confidence scores generated by the AI scene perception model.

12. The apparatus of claim 11, wherein the operation confidence score corresponding to each logic operation of the set of logic operations is determined based on at least one of the program confidence score and the plurality of feature confidence scores.

13. The apparatus of claim 12, wherein the at least one processor is further configured to execute the instructions to: based on at least one confidence score of the plurality of program confidence scores, the plurality of feature confidence scores, and the plurality of operation confidence scores being below a threshold value: obtain user input corresponding to the at least one confidence score; and adjust the at least one confidence score based on the user input.

14. The apparatus of claim 12, wherein the at least one processor is further configured to execute the instructions to determine an answer confidence score corresponding to the natural language answer based on the plurality of operation confidence scores.

15. The apparatus of claim 9, wherein the at least one processor is further configured to execute the instructions to: generate augmented training data based on the plurality of symbolic programs; and train the AI scene perception model based on the augmented training data.

16. The apparatus of claim 15, wherein to generate the augmented training data, the at least one processor is further configured to: generate a first ranking of the plurality of symbolic programs based on the plurality of program confidence scores; generate a second ranking of the plurality of symbolic programs based on a plurality of agreement losses between the plurality of program confidence scores and the question; select a subset of the plurality of symbolic programs based on the first ranking and the second ranking; and generate the augmented training data based on outputs of the subset of the plurality of symbolic programs.

17. A non-transitory computer readable medium storing instructions which, when executed by at least one processor of a device for performing visual question answering (
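The confidence handling in claims 4 to 6 can be sketched as follows. The claims do not state any formulas, so the rules below are assumptions chosen only for illustration: an operation score is taken as the program score multiplied by one feature score, a low score is replaced by whatever a hypothetical `ask_user` callback returns, and the answer score is the minimum operation score. All function names are hypothetical.

```python
from typing import Callable, List


def operation_confidences(program_conf: float, feature_confs: List[float]) -> List[float]:
    """Assumed rule: operation score = program score * the feature score it uses."""
    return [program_conf * f for f in feature_confs]


def review_low_scores(
    scores: List[float],
    threshold: float,
    ask_user: Callable[[int, float], float],
) -> List[float]:
    """Replace each score below the threshold with a user-adjusted value."""
    return [ask_user(i, s) if s < threshold else s for i, s in enumerate(scores)]


def answer_confidence(op_scores: List[float]) -> float:
    """Assumed aggregation: the weakest operation bounds the answer score."""
    return min(op_scores)


# Example values picked as exact binary fractions to keep the arithmetic clean.
scores = operation_confidences(0.5, [1.0, 0.5, 0.75])

# Hypothetical user callback: the user confirms the flagged step, so it gets 1.0.
adjusted = review_low_scores(scores, threshold=0.3, ask_user=lambda i, s: 1.0)

print(scores)
print(adjusted)
print(answer_confidence(adjusted))
```

Other aggregations, such as a product or a mean of the operation scores, would satisfy the claim wording equally well. The minimum is used here only because it is easy to follow by hand.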

Assignees

Samsung Electronics Co Ltd

Inventors

Classifications

G06F16/24578
using ranking · CPC title
G06N5/01
Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound · CPC title
G06N3/006
based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO] · CPC title
G06N3/084
Backpropagation, e.g. using gradient descent · CPC title
G06N3/042
Knowledge-based neural networks; Logical representations of neural networks · CPC title

Patent family

Related publications grouped by family.

View patent family 91028058

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12561522B2 cover?: A method of performing visual question answering (VQA), including: obtaining an image and a question corresponding to the image; generating a plurality of feature predictions about at least one object included in the image by providing the image to an artificial intelligence (AI) scene perception model; generating a plurality of symbolic programs and a plurality of program confidence scores by …
Who is the assignee on this patent?: Samsung Electronics Co Ltd
What technology area does this patent fall under?: Primary CPC classification G06F40/205. Mapped technology areas include Physics.
When was this patent published?: Publication date Feb 24, 2026 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).