Multi-modal visual question answering system

US10949718B2 · US · B2

Patent metadata
Field	Value
Publication number	US-10949718-B2
Application number	US-201916406380-A
Country	US
Kind code	B2
Filing date	May 8, 2019
Priority date	May 8, 2019
Publication date	Mar 16, 2021
Grant date	Mar 16, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The systems and methods described herein may generate multi-modal embeddings with sub-symbolic features and symbolic features. The sub-symbolic embeddings may be generated with computer vision processing. The symbolic features may include mathematical representations of image content, which are enriched with information from background knowledge sources. The system may aggregate the sub-symbolic and symbolic features using aggregation techniques such as concatenation, averaging, summing, and/or maxing. The multi-modal embeddings may be included in a multi-modal embedding model and trained via supervised learning. Once the multi-modal embeddings are trained, the system may generate inferences based on linear algebra operations involving the multi-modal embeddings that are relevant to an inference response to the natural language question and input image.
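The abstract names four ways of aggregating symbolic and sub-symbolic features: concatenation, averaging, summing, and maxing. The following is a minimal sketch of what those operations could look like on plain Python lists. The function names and the toy vectors are assumptions made for illustration and do not come from the patent. The element-wise variants assume both vectors have the same length, which the abstract does not state.

```python
from typing import Callable, List

Vector = List[float]


def concatenate(symbolic: Vector, sub_symbolic: Vector) -> Vector:
    """Join the two vectors end to end; the lengths may differ."""
    return symbolic + sub_symbolic


def _elementwise(symbolic: Vector, sub_symbolic: Vector,
                 op: Callable[[float, float], float]) -> Vector:
    """Apply op to matching positions (assumes equal-length vectors)."""
    if len(symbolic) != len(sub_symbolic):
        raise ValueError("element-wise aggregation needs equal-length vectors")
    return [op(a, b) for a, b in zip(symbolic, sub_symbolic)]


def average(symbolic: Vector, sub_symbolic: Vector) -> Vector:
    return _elementwise(symbolic, sub_symbolic, lambda a, b: (a + b) / 2.0)


def summed(symbolic: Vector, sub_symbolic: Vector) -> Vector:
    return _elementwise(symbolic, sub_symbolic, lambda a, b: a + b)


def maxed(symbolic: Vector, sub_symbolic: Vector) -> Vector:
    return _elementwise(symbolic, sub_symbolic, max)


# Hypothetical toy embeddings for a single scene-graph node.
symbolic = [0.5, -1.0, 2.0]
weighted_sub_symbolic = [1.5, 3.0, -2.0]

multi_modal_concat = concatenate(symbolic, weighted_sub_symbolic)
multi_modal_avg = average(symbolic, weighted_sub_symbolic)
multi_modal_sum = summed(symbolic, weighted_sub_symbolic)
multi_modal_max = maxed(symbolic, weighted_sub_symbolic)
```

Concatenation keeps every input dimension and so doubles the vector length here, while the element-wise variants keep the length fixed but require the two embedding spaces to share a dimensionality.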

First claim

Opening claim text (preview).

What is claimed is:

1. A method for visual question inference, the method comprising: receiving an input image and a natural language query; determining content classifications for portions of the input image; generating a scene graph for the input image, the scene graph including the content classifications arranged in a graph data structure comprising nodes and edges, the nodes respectively representative of the content classifications for the input image and the edges representative of relationships between the content classifications; generating multi-modal embeddings based on the input image and the scene graph, the multi-modal embeddings being respectively associated with the nodes, the edges, or any combination thereof, wherein at least a portion of the multi-modal embeddings are generated by: determining symbolic embeddings for the content classifications of the input image, the symbolic embeddings representative of nodes of the scene graph, edges of the scene graph, or any combination thereof; determining a sub-symbolic embedding for the input image, the sub-symbolic embedding comprising an image feature vector for the input image; identifying separate portions of the image feature vector that are representative of the portions of the input image; generating weighted sub-symbolic embeddings for each of the content classifications by applying weight values to the separate portions of the image feature vector; aggregating the symbolic embeddings with the weighted sub-symbolic embeddings to form at least the portion of the multi-modal embeddings; generating a natural language response to the natural language query based on the multi-modal embeddings by: generating an inference query based on the natural language query, the inference query indicative of the at least one of the content classifications; selecting, from the multi-modal embeddings, particular multi-modal embeddings associated with at least one of the content classifications; determining an inference statement based on a distance measurement between the particular multi-modal embeddings; and determining the natural language response based on the inference statement; and displaying, in response to receipt of the natural language query and the input image, the natural language response.

2. The method of claim 1, wherein aggregating the symbolic embeddings with the weighted sub-symbolic embeddings to form the multi-modal embeddings further comprises: concatenating a first vector from the symbolic embeddings with a second vector from the weighted sub-symbolic embeddings to form a multi-modal vector.

3. The method of claim 1, wherein determining an inference statement based on a distance measurement between the particular multi-modal embeddings further comprises: generating a plurality of candidate statements, each of the candidate statements referencing at least one node and at least one edge of the scene graph; selecting, from the multi-modal embeddings, groups of multi-modal embeddings based on the at least one node and the at least one edge of the scene graph; determining respective scores for the plurality of candidate statements based on distance measurements between multi-modal embeddings in each of the groups; selecting, based on the respective scores, at least one of the candidate statements; and generating the natural language response based on the selected at least one of the selected candidate statements.

4. The method of claim 3, wherein selecting, based on the respective scores, at least one of the candidate statements further comprises: selecting a candidate statement associated with a highest one of the respective scores.

5. The method of claim 1, further comprising: enriching the scene graph by appending additional nodes to the scene graph with nodes being sourced from a background knowledge graph; and generating the multi-modal embeddings based on the input image and the enriched scene graph.

6. The method of claim 5, wherein enriching the scene graph by appending additional nodes to the scene graph with the additional nodes being sourced from a background knowledge graph further comprises: identifying, in the background knowledge graph, a first node of the scene graph that corresponds to a second node of the background knowledge graph; selecting further nodes of the background knowledge graph that are connected with the second node of the background knowledge graph, wherein the selected further nodes are not included in the non-enriched scene graph; and appending the selected further nodes to the scene graph.

7. The method of claim 1, further comprising: generating a graphical user interface, the graphical user interface comprising the input image and a text field; determining that the natural language query was inserted into the text field; and updating the graphical user interface to include the natural language response.

8. A system for visual question inference, the system comprising: a processor, the processor configured to: receive an input image and a natural language query; generate a scene graph for the input image, the scene graph comprising content classifications of image data for the input image, the content classifications being arranged in a graph data structure comprising nodes and edges, the nodes respectively representative of the content classifications for the input image and the edges representative of relationships between the content classifications; determine symbolic embeddings for the input image, the symbolic embeddings representative of nodes of the scene graph, edges of the scene graph, or any combination thereof; determine a sub-symbolic embeddings for the input image, the sub-symbolic embeddings comprising respective image feature vectors for the input image; aggregate the symbolic embeddings with the sub-symbolic embeddings to form multi-modal embeddings in a multi-modal embedding space; identify at least one node and at least one edge in the scene graph based on natural language included in the natural language query, the natural language text being indicative of at least one of the content classifications; select, from the multi-modal embeddings, particular multi-modal embeddings associated with the at least one of the content classifications; determine an inference statement based on a distance measurement between the selected multi-modal embeddings; generate a natural language response based on the inference statement; and display the natural language in response on a graphical user interface.

9. The system of claim 8, wherein to determine a sub-symbolic embeddings for the input image, the processor is further configured to: determine at least a portion of the input image that corresponds to the content classifications; generate an initial image feature vector for the input image; identify separate portions of the initial image feature vector, the separate portions of the initial image feature vector being representative of the at least the portion of the input image; apply weight values to the separate portions of the image feature vector; extract the separate weighted portions of the image feature vector; and generate the respective image feature vectors of the sub-symbolic embeddings, wherein each of the respective image feature vectors comprise a corresponding one of the separate weighted portions of the image feature vector.

10. The system of claim 8, wherein to aggregate the symbolic embeddings with the sub-symbolic embeddings to form multi-modal embeddings in a multi-modal embedding space, the processor is further configured to: concatenate a first feature vector from the symbolic embeddings with a second feature v
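Claims 3 and 4 describe scoring candidate statements by distance measurements between multi-modal embeddings and selecting the highest-scoring one. The sketch below illustrates that idea under loud assumptions: the claims do not name a specific distance or scoring formula, so this example swaps in a translation-style score (subject plus relation compared with object, using Euclidean distance) purely for illustration. The embedding values, node names, and function names are all hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Triple = Tuple[str, str, str]  # (subject node, edge, object node)


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def score(triple: Triple, embeddings: Dict[str, Vector]) -> float:
    """Assumed scoring rule: negative distance between (subject + edge)
    and object, so a smaller distance gives a higher score."""
    subject, relation, obj = (embeddings[name] for name in triple)
    translated = [s + r for s, r in zip(subject, relation)]
    return -euclidean(translated, obj)


def best_statement(candidates: List[Triple],
                   embeddings: Dict[str, Vector]) -> Triple:
    """Pick the candidate with the highest score, as in claim 4."""
    return max(candidates, key=lambda t: score(t, embeddings))


# Hypothetical 2-D multi-modal embeddings for nodes and one edge.
embeddings = {
    "dog": [1.0, 0.0],
    "frisbee": [1.0, 1.0],
    "grass": [3.0, 3.0],
    "holds": [0.0, 1.0],
}

# Candidate statements, each referencing two nodes and one edge.
candidates = [
    ("dog", "holds", "frisbee"),
    ("dog", "holds", "grass"),
]

best = best_statement(candidates, embeddings)
response = "The {} {} the {}.".format(*best)
```

Negating the distance is one simple way to turn "closer" into "higher score"; any monotonically decreasing function of distance would rank the candidates the same way.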

Assignees

Accenture Global Solutions Ltd

Inventors

Classifications

G06F40/284 · Primary
Lexical analysis, e.g. tokenisation or collocates · CPC title
G06V30/19173
Classification techniques · CPC title
G06V30/274
Syntactic or semantic context, e.g. balancing · CPC title
G06F18/24
Classification techniques · CPC title
G06F40/56
Natural language generation · CPC title

Patent family

Related publications grouped by family.

View patent family 73046090

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10949718B2 cover?: The systems and methods described herein may generate multi-modal embeddings with sub-symbolic features and symbolic features. The sub-symbolic embeddings may be generated with computer vision processing. The symbolic features may include mathematical representations of image content, which are enriched with information from background knowledge sources. The system may aggregate the sub-symboli…
Who is the assignee on this patent?: Accenture Global Solutions Ltd
What technology area does this patent fall under?: Primary CPC classification G06F40/284. Mapped technology areas include Physics.
When was this patent published?: Publication date Mar 16, 2021 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Related patents

Active learning method for temporal action localization in untrimmed videos

Determining explanations for predicted links in knowledge graphs

Deep compositional frameworks for human-like language acquisition in virtual environments

Predicting links in knowledge graphs using ontological knowledge

Systems and methods for attention-based configurable convolutional neural networks (ABC-CNN) for visual question answering
