Visual question answering using visual knowledge bases

US11663249B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11663249-B2
Application number	US-201816650853-A
Country	US
Kind code	B2
Filing date	Jan 30, 2018
Priority date	Jan 30, 2018
Publication date	May 30, 2023
Grant date	May 30, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An example apparatus for visual question answering includes a receiver to receive an input image and a question. The apparatus also includes an encoder to encode the input image and the question into a query representation including visual attention features. The apparatus includes a knowledge spotter to retrieve a knowledge entry from a visual knowledge base pre-built on a set of question-answer pairs. The apparatus further includes a joint embedder to jointly embed the visual attention features and the knowledge entry to generate visual-knowledge features. The apparatus also further includes an answer generator to generate an answer based on the query representation and the visual-knowledge features.
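The data flow described in the abstract can be sketched as four pluggable stages. This is a minimal illustration only, not the patented implementation: the class and type names (`VqaPipeline`, `QueryRepresentation`, `Triple`, `VKFeature`), the toy knowledge base, and the stand-in lambdas are all hypothetical, and the knowledge spotter returns a list of entries purely for convenience.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vector = List[float]
Triple = Tuple[str, str, str]       # hypothetical (subject, relation, object) entry
VKFeature = Tuple[Vector, Triple]   # stand-in for a learned joint embedding


@dataclass
class QueryRepresentation:
    question_vector: Vector
    visual_attention: Vector


@dataclass
class VqaPipeline:
    # Each stage is injected so a real model could replace a toy stand-in.
    encoder: Callable[[object, str], QueryRepresentation]
    knowledge_spotter: Callable[[str], List[Triple]]
    joint_embedder: Callable[[Vector, Triple], VKFeature]
    answer_generator: Callable[[QueryRepresentation, List[VKFeature]], str]

    def answer(self, image: object, question: str) -> str:
        # 1. Encode image and question into a query representation.
        query = self.encoder(image, question)
        # 2. Retrieve knowledge entries relevant to the question.
        entries = self.knowledge_spotter(question)
        # 3. Jointly embed the visual attention features with each entry.
        vk_features = [self.joint_embedder(query.visual_attention, e) for e in entries]
        # 4. Generate the answer from the query and the visual-knowledge features.
        return self.answer_generator(query, vk_features)


# Toy stand-ins so the sketch runs end to end; none of these are learned models.
TOY_KB: List[Triple] = [("umbrella", "used_for", "rain"), ("oven", "used_for", "baking")]

pipeline = VqaPipeline(
    encoder=lambda image, q: QueryRepresentation([float(len(q.split()))], [0.0]),
    knowledge_spotter=lambda q: [t for t in TOY_KB if t[0] in q.lower()],
    joint_embedder=lambda attention, entry: (attention, entry),
    answer_generator=lambda query, feats: feats[0][1][2] if feats else "unknown",
)

print(pipeline.answer(image=None, question="What is the umbrella used for?"))  # prints "rain"
```

Passing the stages in as callables keeps the sketch focused on the order of operations; a real system would presumably replace each lambda with a trained component.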

First claim

Opening claim text (preview).

What is claimed is:

1. An apparatus for visual question answering, comprising: an encoder to: encode an input image into an image vector based on a first model; encode a question into a question vector based on a second model; generate a visual attention feature with a multimodal low-rank bilinear attention network based on the image vector and the question vector; and generate a query representation that includes the question vector and the visual attention feature; a knowledge spotter to retrieve a knowledge entry from a visual knowledge base based on the question, the visual knowledge base pre-built based on question-answer pairs; a joint embedder to jointly embed the visual attention feature and the knowledge entry to generate visual-knowledge features; and an answer generator to generate an answer based on the query representation and the visual-knowledge features.
2. The apparatus of claim 1, wherein the knowledge entry includes a knowledge triple or a subset of a knowledge triple.
3. The apparatus of claim 1, wherein the knowledge spotter is to retrieve the knowledge entry from the visual knowledge base using subgraph hashing.
4. The apparatus of claim 1, wherein the encoder includes a convolutional neural network (CNN) model as the first model to encode the input image into the image vector, the image vector to include image embedding features.
5. The apparatus of claim 1, wherein the encoder includes a long short-term memory (LSTM) model as the second model to encode the question into the question vector, the question vector to include question embedding features.
6. The apparatus of claim 1, wherein the encoder is to jointly embed the image vector from a convolutional neural network (CNN) model and the question vector from a long short-term memory (LSTM) model to generate the query representation.
7. The apparatus of claim 1, wherein the encoder includes the multimodal low-rank bilinear attention network.
8. The apparatus of claim 1, wherein the answer generator includes a fully connected neural network, the fully connected neural network to: receive a plurality of values related to the query representation from a visual knowledge memory network; and output a single answer corresponding to a value with a higher score than other values in the plurality of values.
9. The apparatus of claim 1, wherein the answer generator includes a visual knowledge memory network, the visual knowledge memory network to: store the visual-knowledge features as key-value pairs, receive the query representation, and output a plurality of values related to the query representation.
10. The apparatus of claim 1, wherein the answer generator is to generate the answer by reading a key-value pair of the visual-knowledge features corresponding to the query representation and generate the answer based on the key-value pair.
11. A method for answering visual questions, comprising: encoding, by executing an instruction with a processor, an input image into an image vector using a first model and a question into a question vector using a second model; generating, by executing an instruction with the processor, a visual attention feature using a multimodal low-rank bilinear attention network based on the image vector and the question vector; generating, by executing an instruction with the processor, a query representation that includes the question vector and the visual attention feature; retrieving, by executing an instruction with the processor, a knowledge entry from a visual knowledge base based on the question, the visual knowledge base pre-built based on question-answer pairs; jointly embedding, by executing an instruction with the processor, the visual attention feature and the knowledge entry to generate visual-knowledge features; and generating, by executing an instruction with the processor, an answer based on the query representation and the visual-knowledge features.
12. The method of claim 11, wherein the encoding of the input image into the image vector using the first model includes encoding the input image into the image vector via a convolutional neural network (CNN) model, the image vector to include image embedding features.
13. The method of claim 11, wherein the encoding of the question into the question vector using the second model includes encoding the question into the question vector via a long short-term memory (LSTM) model, the question vector to include question embedding features.
14. The method of claim 11, wherein the generating of the query representation includes jointly embedding the image vector from a convolutional neural network (CNN) model and the question vector from a long short-term memory (LSTM) model to generate the query representation.
15. The method of claim 11, wherein the retrieving of the knowledge entry includes using subgraph hashing.
16. The method of claim 11, further including storing the visual-knowledge features as key-value pairs in a visual knowledge memory network.
17. The method of claim 11, wherein the generating of the answer includes reading a key-value pair of the visual-knowledge features corresponding to the query representation and generating the answer based on the key-value pair.
18. The method of claim 11, wherein the generating of the answer includes receiving a plurality of values related to the query representation at a fully connected layer from a visual knowledge memory network and outputting a single answer corresponding to a value with a higher score than other values in the plurality of values.
19. The method of claim 11, further including using multimodal low-rank bilinear pooling to generate the visual attention feature.
20. The method of claim 11, further including using multimodal low-rank bilinear pooling to extract the visual attention feature from the image vector and the question vector, the image vector from a convolutional neural network (CNN) model and the question vector from a long short-term memory (LSTM) model.
21. At least one storage device comprising instructions that, in response to being executed on a computing device, cause the computing device to at least: encode an input image into an image vector based on a first model; encode a question into a question vector based on a second model; generate a visual attention feature with a multimodal low-rank bilinear attention network based on the image vector and the question vector; and generate a query representation that includes the question vector and the visual attention feature; retrieve a knowledge entry from a visual knowledge base based on the question, the visual knowledge base pre-built based on question-answer pairs; jointly embed the visual attention features and knowledge entry to generate visual-knowledge features; and generate an answer based on the query representation and the visual-knowledge features.
22. The at least one storage device of claim 21, wherein the instructions are to cause the computing device to retrieve the knowledge entry from the visual knowledge base using subgraph hashing.
23. The at least one storage device of claim 21, wherein the instructions are to cause the computing device to store the visual-knowledge features as key-value pairs in a visual knowledge memory network.
24. The at least one storage device of claim 21, wherein the instructions are to cause the computing device to use multimodal low-rank bilinear pooling to gen
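The claims name a multimodal low-rank bilinear attention network and a key-value visual knowledge memory, but this page gives no formulas for either. The sketch below assumes the commonly published low-rank bilinear form, score = p · (tanh(U q) ∘ tanh(V v)) per image region, and a softmax-weighted key-value read. The function names, the matrices `U` and `V`, the vector `p`, and all numbers are hypothetical illustrations, not the patent's implementation.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]  # row-major: Matrix[i] is row i


def matvec(m: Matrix, v: Sequence[float]) -> Vector:
    """Multiply a matrix by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def low_rank_bilinear_attention(
    regions: List[Vector], question: Vector, U: Matrix, V: Matrix, p: Vector
) -> Tuple[Vector, Vector]:
    """Return (attention weights, attended image feature).

    Assumed scoring form: p . (tanh(U q) * tanh(V v_region)), elementwise product.
    """
    q_proj = [math.tanh(x) for x in matvec(U, question)]
    scores = []
    for region in regions:
        r_proj = [math.tanh(x) for x in matvec(V, region)]
        joint = [a * b for a, b in zip(q_proj, r_proj)]  # Hadamard product
        scores.append(sum(w * j for w, j in zip(p, joint)))
    weights = softmax(scores)
    dim = len(regions[0])
    attended = [sum(w * r[i] for w, r in zip(weights, regions)) for i in range(dim)]
    return weights, attended


def read_memory(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Soft key-value read: weight each value by softmax(query . key)."""
    weights = softmax([sum(a * b for a, b in zip(query, k)) for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


# Toy inputs: two image regions, identity projections, hand-picked numbers.
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, attended = low_rank_bilinear_attention(
    regions=[[1.0, 0.0], [0.0, 1.0]],
    question=[1.0, 0.0],
    U=identity,
    V=identity,
    p=[1.0, 1.0],
)
memory_out = read_memory([1.0, 0.0], keys=[[5.0, 0.0], [0.0, 5.0]], values=[[1.0], [0.0]])
```

In this toy setup the region aligned with the question vector receives the larger attention weight, and the memory read is dominated by the value whose key matches the query. In a trained system the projections and memory contents would be learned rather than hand-picked.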

Assignees

Intel Corp

Inventors

Classifications

Code	CPC title
G06N3/0455	Auto-encoder networks; Encoder-decoder networks
G06N3/09	Supervised learning
G06N3/0442	characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
G06N3/0464	Convolutional networks [CNN, ConvNet]
G06F16/3329 (Primary)	Natural language query formulation
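The classification symbols listed above follow the CPC layout of section letter, two-digit class, subclass letter, main group, and subgroup. As a minimal sketch of how such a symbol breaks down, the helper below splits one into its parts. The names `parse_cpc` and `CpcSymbol` and the regular expression are hypothetical and cover only the simple `G06F16/3329` shape seen on this page. Section G is titled "Physics", which matches the technology area stated in the FAQ.

```python
import re
from typing import NamedTuple

# Titles of CPC sections A-H (the Y tagging section is omitted from this sketch).
CPC_SECTIONS = {
    "A": "Human necessities",
    "B": "Performing operations; Transporting",
    "C": "Chemistry; Metallurgy",
    "D": "Textiles; Paper",
    "E": "Fixed constructions",
    "F": "Mechanical engineering; Lighting; Heating; Weapons; Blasting",
    "G": "Physics",
    "H": "Electricity",
}


class CpcSymbol(NamedTuple):
    section: str     # e.g. "G"
    cpc_class: str   # e.g. "G06"
    subclass: str    # e.g. "G06F"
    main_group: str  # e.g. "16"
    subgroup: str    # e.g. "3329"


# Assumed pattern: section letter, 2-digit class, subclass letter, group/subgroup.
_CPC_RE = re.compile(r"^([A-HY])(\d{2})([A-Z])(\d{1,4})/(\d{2,6})$")


def parse_cpc(symbol: str) -> CpcSymbol:
    """Split a CPC symbol such as 'G06F16/3329' into its hierarchy levels."""
    match = _CPC_RE.match(symbol.replace(" ", ""))
    if match is None:
        raise ValueError(f"not a recognised CPC symbol: {symbol!r}")
    section, cls, sub, main_group, subgroup = match.groups()
    return CpcSymbol(section, section + cls, section + cls + sub, main_group, subgroup)


primary = parse_cpc("G06F16/3329")
print(primary.subclass, CPC_SECTIONS[primary.section])  # prints "G06F Physics"
```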

Patent family

Related publications grouped by family.

View patent family 67477805

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11663249B2 cover?: An example apparatus for visual question answering includes a receiver to receive an input image and a question. The apparatus also includes an encoder to encode the input image and the question into a query representation including visual attention features. The apparatus includes a knowledge spotter to retrieve a knowledge entry from a visual knowledge base pre-built on a set of question-answ…
Who is the assignee on this patent?: Intel Corp
What technology area does this patent fall under?: Primary CPC classification G06F16/3329. Mapped technology areas include Physics.
When was this patent published?: Publication date May 30, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).