Method and device for detecting objects from scene images by using dynamic knowledge base
US-2019205706-A1 · Jul 4, 2019 · US
US11663249B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11663249-B2 |
| Application number | US-201816650853-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 30, 2018 |
| Priority date | Jan 30, 2018 |
| Publication date | May 30, 2023 |
| Grant date | May 30, 2023 |
Official abstract text for this publication.
An example apparatus for visual question answering includes a receiver to receive an input image and a question. The apparatus also includes an encoder to encode the input image and the question into a query representation including visual attention features. The apparatus includes a knowledge spotter to retrieve a knowledge entry from a visual knowledge base pre-built on a set of question-answer pairs. The apparatus further includes a joint embedder to jointly embed the visual attention features and the knowledge entry to generate visual-knowledge features. The apparatus also further includes an answer generator to generate an answer based on the query representation and the visual-knowledge features.
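To make the knowledge spotter step in the abstract more concrete, the following is a minimal sketch of retrieving knowledge entries for a question. It is an illustration under stated assumptions, not the patented method: the `Triple` class, the `build_index` and `spot_knowledge` helpers, the toy knowledge base, and the word-overlap lookup are all hypothetical. The claims mention subgraph hashing for retrieval; a plain dictionary index stands in for it here.

```python
from dataclasses import dataclass


# Hypothetical knowledge triple (subject, relation, object). The claims only
# state that an entry "includes a knowledge triple or a subset" of one.
@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    obj: str


def build_index(triples):
    """Index triples by lower-cased subject and object for direct lookup."""
    index = {}
    for t in triples:
        for term in (t.subject.lower(), t.obj.lower()):
            index.setdefault(term, []).append(t)
    return index


def spot_knowledge(question, index):
    """Return triples whose subject or object appears as a word in the question.

    Assumption: simple word overlap replaces the subgraph hashing named in
    the claims.
    """
    hits = []
    for word in question.lower().replace("?", " ").split():
        for t in index.get(word, []):
            if t not in hits:
                hits.append(t)
    return hits


# Toy knowledge base; in the patent it is pre-built from question-answer pairs.
kb = [
    Triple("cat", "is_on", "sofa"),
    Triple("dog", "plays_with", "ball"),
    Triple("umbrella", "used_for", "rain"),
]
index = build_index(kb)
entries = spot_knowledge("What is the dog playing with?", index)
print(entries)
```

In the described apparatus, the retrieved entries would then be jointly embedded with the visual attention features before the answer generator runs; that embedding step is not shown here.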
Opening claim text (preview).
What is claimed is:
1. An apparatus for visual question answering, comprising: an encoder to: encode an input image into an image vector based on a first model; encode a question into a question vector based on a second model; generate a visual attention feature with a multimodal low-rank bilinear attention network based on the image vector and the question vector; and generate a query representation that includes the question vector and the visual attention feature; a knowledge spotter to retrieve a knowledge entry from a visual knowledge base based on the question, the visual knowledge base pre-built based on question-answer pairs; a joint embedder to jointly embed the visual attention feature and the knowledge entry to generate visual-knowledge features; and an answer generator to generate an answer based on the query representation and the visual-knowledge features.
2. The apparatus of claim 1, wherein the knowledge entry includes a knowledge triple or a subset of a knowledge triple.
3. The apparatus of claim 1, wherein the knowledge spotter is to retrieve the knowledge entry from the visual knowledge base using subgraph hashing.
4. The apparatus of claim 1, wherein the encoder includes a convolutional neural network (CNN) model as the first model to encode the input image into the image vector, the image vector to include image embedding features.
5. The apparatus of claim 1, wherein the encoder includes a long short-term memory (LSTM) model as the second model to encode the question into the question vector, the question vector to include question embedding features.
6. The apparatus of claim 1, wherein the encoder is to jointly embed the image vector from a convolutional neural network (CNN) model and the question vector from a long short-term memory (LSTM) model to generate the query representation.
7. The apparatus of claim 1, wherein the encoder includes the multimodal low-rank bilinear attention network.
8. The apparatus of claim 1, wherein the answer generator includes a fully connected neural network, the fully connected neural network to: receive a plurality of values related to the query representation from a visual knowledge memory network; and output a single answer corresponding to a value with a higher score than other values in the plurality of values.
9. The apparatus of claim 1, wherein the answer generator includes a visual knowledge memory network, the visual knowledge memory network to: store the visual-knowledge features as key-value pairs, receive the query representation, and output a plurality of values related to the query representation.
10. The apparatus of claim 1, wherein the answer generator is to generate the answer by reading a key-value pair of the visual-knowledge features corresponding to the query representation and generate the answer based on the key-value pair.
11. A method for answering visual questions, comprising: encoding, by executing an instruction with a processor, an input image into an image vector using a first model and a question into a question vector using a second model; generating, by executing an instruction with the processor, a visual attention feature using a multimodal low-rank bilinear attention network based on the image vector and the question vector; generating, by executing an instruction with the processor, a query representation that includes the question vector and the visual attention feature; retrieving, by executing an instruction with the processor, a knowledge entry from a visual knowledge base based on the question, the visual knowledge base pre-built based on question-answer pairs; jointly embedding, by executing an instruction with the processor, the visual attention feature and the knowledge entry to generate visual-knowledge features; and generating, by executing an instruction with the processor, an answer based on the query representation and the visual-knowledge features.
12. The method of claim 11, wherein the encoding of the input image into the image vector using the first model includes encoding the input image into the image vector via a convolutional neural network (CNN) model, the image vector to include image embedding features.
13. The method of claim 11, wherein the encoding of the question into the question vector using the second model includes encoding the question into the question vector via a long short-term memory (LSTM) model, the question vector to include question embedding features.
14. The method of claim 11, wherein the generating of the query representation includes jointly embedding the image vector from a convolutional neural network (CNN) model and the question vector from a long short-term memory (LSTM) model to generate the query representation.
15. The method of claim 11, wherein the retrieving of the knowledge entry includes using subgraph hashing.
16. The method of claim 11, further including storing the visual-knowledge features as key-value pairs in a visual knowledge memory network.
17. The method of claim 11, wherein the generating of the answer includes reading a key-value pair of the visual-knowledge features corresponding to the query representation and generating the answer based on the key-value pair.
18. The method of claim 11, wherein the generating of the answer includes receiving a plurality of values related to the query representation at a fully connected layer from a visual knowledge memory network and outputting a single answer corresponding to a value with a higher score than other values in the plurality of values.
19. The method of claim 11, further including using multimodal low-rank bilinear pooling to generate the visual attention feature.
20. The method of claim 11, further including using multimodal low-rank bilinear pooling to extract the visual attention feature from the image vector and the question vector, the image vector from a convolutional neural network (CNN) model and the question vector from a long short-term memory (LSTM) model.
21. At least one storage device comprising instructions that, in response to being executed on a computing device, cause the computing device to at least: encode an input image into an image vector based on a first model; encode a question into a question vector based on a second model; generate a visual attention feature with a multimodal low-rank bilinear attention network based on the image vector and the question vector; and generate a query representation that includes the question vector and the visual attention feature; retrieve a knowledge entry from a visual knowledge base based on the question, the visual knowledge base pre-built based on question-answer pairs; jointly embed the visual attention features and knowledge entry to generate visual-knowledge features; and generate an answer based on the query representation and the visual-knowledge features.
22. The at least one storage device of claim 21, wherein the instructions are to cause the computing device to retrieve the knowledge entry from the visual knowledge base using subgraph hashing.
23. The at least one storage device of claim 21, wherein the instructions are to cause the computing device to store the visual-knowledge features as key-value pairs in a visual knowledge memory network.
24. The at least one storage device of claim 21, wherein the instructions are to cause the computing device to use multimodal low-rank bilinear pooling to gen
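The claims name a multimodal low-rank bilinear attention network but the text on this page does not give its equations. As a hedged illustration, the sketch below follows the commonly published low-rank bilinear formulation, in which each image region is scored as p · (tanh(Uᵀq) ∘ tanh(Vᵀv)) and the scores are normalized with a softmax. The function names, the projection matrices `U` and `V`, the vector `p`, and the toy inputs are all assumptions made for this example; in practice the projections would be learned, and the question and region vectors would come from the LSTM and CNN encoders.

```python
import math


def matvec_t(W, x):
    """Compute W^T x, where W is a list of len(x) rows with d columns each."""
    d = len(W[0])
    return [sum(W[i][j] * x[i] for i in range(len(x))) for j in range(d)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mlb_attention(q, regions, U, V, p):
    """Low-rank bilinear attention over image regions (illustrative only).

    score_i  = p . (tanh(U^T q) * tanh(V^T v_i))
    weights  = softmax(scores)
    attended = sum_i weights_i * v_i
    """
    q_proj = [math.tanh(a) for a in matvec_t(U, q)]
    scores = []
    for v in regions:
        v_proj = [math.tanh(a) for a in matvec_t(V, v)]
        # Element-wise (Hadamard) product, then projection by p.
        scores.append(sum(pk * a * b for pk, a, b in zip(p, q_proj, v_proj)))
    weights = softmax(scores)
    dim = len(regions[0])
    attended = [sum(w * v[k] for w, v in zip(weights, regions)) for k in range(dim)]
    return weights, attended


# Toy inputs: identity projections stand in for learned parameters.
identity = [[1.0, 0.0], [0.0, 1.0]]
question_vec = [1.0, 0.0]
region_vecs = [[1.0, 0.0], [0.0, 1.0]]
weights, attended = mlb_attention(question_vec, region_vecs, identity, identity, [1.0, 1.0])
print(weights, attended)
```

With these toy inputs the first region aligns with the question vector and therefore receives the larger attention weight. The attended vector plays the role of the visual attention feature that the claims combine with the question vector to form the query representation.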
Auto-encoder networks; Encoder-decoder networks · CPC title
Supervised learning · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Natural language query formulation · CPC title