Systems and methods for attention-based configurable convolutional neural networks (ABC-CNN) for visual question answering

US9965705B2 · US · B2

Patent metadata
Publication number: US-9965705-B2
Application number: US-201615184991-A
Country: US
Kind code: B2
Filing date: Jun 16, 2016
Priority date: Nov 3, 2015
Publication date: May 8, 2018
Grant date: May 8, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Described herein are systems and methods for generating and using attention-based deep learning architectures for visual question answering task (VQA) to automatically generate answers for image-related (still or video images) questions. To generate the correct answers, it is important for a model's attention to focus on the relevant regions of an image according to the question because different questions may ask about the attributes of different image regions. In embodiments, such question-guided attention is learned with a configurable convolutional neural network (ABC-CNN). Embodiments of the ABC-CNN models determine the attention maps by convolving image feature map with the configurable convolutional kernels determined by the questions semantics. In embodiments, the question-guided attention maps focus on the question-related regions and filters out noise in the unrelated regions.
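The attention step summarized in the abstract can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the patented implementation: the function names `project_kernel` and `attention_map`, the linear projection matrix `proj`, the 3x3 kernel size, the zero padding, and the random toy inputs are all hypothetical choices made here. The sketch projects a question embedding into a kernel with the same number of channels as the image feature map, slides that kernel over the feature map, and applies a softmax over all spatial positions.

```python
import math
import random


def project_kernel(question_emb, proj, channels, ksize):
    """Project a question embedding into a (ksize, ksize, channels) kernel.

    Assumption: a plain linear projection; `proj` has ksize*ksize*channels
    rows, each as long as the question embedding.
    """
    flat = [sum(w * q for w, q in zip(row, question_emb)) for row in proj]
    it = iter(flat)
    return [[[next(it) for _ in range(channels)] for _ in range(ksize)]
            for _ in range(ksize)]


def attention_map(feature_map, kernel):
    """Slide the kernel over the feature map (zero padding, same output
    size, CNN-style cross-correlation), then softmax over all positions."""
    h, w = len(feature_map), len(feature_map[0])
    k = len(kernel)
    pad = k // 2
    scores = []
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < h and 0 <= x < w:
                        total += sum(f * c for f, c in
                                     zip(feature_map[y][x], kernel[di][dj]))
            scores.append(total)
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [[exps[i * w + j] / norm for j in range(w)] for i in range(h)]


# Toy inputs (hypothetical sizes): 3x3 grid, 4 channels, 5-dim question embedding.
random.seed(0)
H, W, D, Q, K = 3, 3, 4, 5, 3
feature_map = [[[random.random() for _ in range(D)] for _ in range(W)]
               for _ in range(H)]
question_emb = [random.uniform(-1, 1) for _ in range(Q)]
proj = [[random.uniform(-1, 1) for _ in range(Q)] for _ in range(K * K * D)]

kernel = project_kernel(question_emb, proj, D, K)
att = attention_map(feature_map, kernel)
```

Because the kernel is computed from the question, a different question yields a different kernel and therefore a different attention map over the same image features.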

First claim

Opening claim text (preview).

The invention claimed is:

1. A computer-implemented method of improving accuracy in generating an answer to a question input related to an image input, the method comprising: receiving an image input; receiving a question input related to the image input; inputting the question input and the image input into an Attention-Based Configurable Convolutional Neural Networks (ABC-CNN) framework to generate an answer, the ABC-CNN framework comprising: an image feature map extraction component comprising a CNN that extracts an image feature map from the image input; a semantic question embedding component that obtains question embeddings from the question input; a question-guided attention map generation component that receives the image feature map and the question embeddings and that obtains a question-guided attention map focusing on a region or regions asked by question input; and an answer generation component that obtains an attention weighted image feature map by weighting image feature map using the question-guided attention map and generates answers based on a fusion of the image feature map, the question embeddings, and the attention weighted image feature map.

2. The computer-implemented method of claim 1 wherein the semantic question embedding part comprises a long short term memory (LSTM) layer to generate the question embeddings to characterize semantic meanings of the question input.

3. The computer-implemented method of claim 1 wherein the question-guided attention map generation part comprises configurable convolutional kernels produced by projecting the question embeddings from a semantic space into a visual space and utilized to convolve with the image feature map to produce the question-guided attention map.

4. The computer-implemented method of claim 3 wherein the convolutional kernels have the same number of channels as the image feature map.

5. The computer-implemented method of claim 3 wherein the question-guided attention map has the same size as the image feature map.

6. The computer-implemented method of claim 1 wherein the image feature map is extracted by dividing the image input into a plurality of grids, and extracting a D-dimension feature vector in each cell of the grids.

7. The computer-implemented method of claim 1 wherein the image feature map is spatially weighted by the question-guided attention map to obtain the attention weighted image feature map.

8. The computer-implemented method of claim 7 wherein the spatial weighting is achieved by element-wise production between each channel of the image feature map and the question-guided attention map.

9. The computer-implemented method of claim 8 wherein the spatial weighting is further defined by softmax normalization for a spatial attention distribution.

10. The computer-implemented method of claim 1 wherein the ABC-CNN framework is pre-trained in an end-to-end way with stochastic gradient descent.

11. The question-guided attention-based deep learning method of claim 10 wherein the ABC-CNN framework has initialization weights randomly adjusted to ensure that each dimension of the activations of all layers within the ABC-CNN framework has zero mean and one standard derivation during pre-training.

12. A computer-implemented method of generating an answer to a question related to an image, the method comprising steps of: extracting an image feature map from an input image comprising a plurality of pixels using a deep convolutional neural network; obtaining a dense question embedding from an input question related to the input image using a long short term memory (LSTM) layer; producing a plurality of question-configured kernels by projecting the dense question embedding from semantic space into visual space; convolving the question-configured kernels with the image feature map to generate a question-guided attention map; obtaining at a multi-class classifier an attention weighted image feature map by spatially weighting the image feature map using the question-guided attention map, the attention weighted feature map lowering weights of regions irrelevant to the question; and generating an answer to the question based on a fusion of the image feature map, the deep question embedding, and the attention weighted image feature map.

13. The method of claim 12 wherein the spatial weighting is achieved by element-wise production between each channel of the image feature map and the question-guided attention map.

14. The method of claim 12 wherein the question-guided attention map adaptively represents each pixel's degree of attention according to the input question.

15. The method of claim 12 wherein the question-guided attention map is obtained by applying the question-configured kernels on the image feature map.

16. The method of claim 12 wherein the image feature map, the deep question embedding, and the attention weighted image feature map are fused by a nonlinear projection.

17. The method of claim 16 wherein the nonlinear projection is an element-wise scaled hyperbolic tangent function.

18. A non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by one or more processors, causes the steps to be performed comprising: responsive to receiving a question input, extracting a dense question embedding of the question input; responsive to receiving an image input related to the question input, generating an image feature map; generating a question-guided attention map based on at least the image feature map and the dense question embedding, the question-guided attention map selectively focusing on areas queried by the question input; spatially weighting the image feature map using the question-guided attention map to obtain an attention weighted image; and fusing semantic information, the image feature map, and the attention weighted image to generate an answer to the question input.

19. The non-transitory computer-readable medium or media of claim 18 wherein generating a question-guided attention map further comprises softmax normalization a spatial attention distribution of the attention map.

20. The non-transitory computer-readable medium or media of claim 19 wherein generating a question-guided attention map comprises configuring a set of convolutional kernels according to the dense question embedding and applying the convolutional kernels on the image feature map to generate question-guided attention map.
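The spatial weighting and fusion steps recited in the claims (element-wise product of each channel with the attention map, then a nonlinear projection using a scaled hyperbolic tangent) can be illustrated with a small standard-library Python sketch. Everything concrete here is an assumption for illustration: the helper names, the flattening of the feature maps into vectors, the constants in `scaled_tanh` (the commonly used 1.7159 * tanh(2x/3) form, which the claims do not specify), the final softmax classifier, and all weight values.

```python
import math


def weight_feature_map(feature_map, att):
    """Multiply every channel at grid cell (i, j) by the attention weight att[i][j]."""
    return [[[v * att[i][j] for v in cell] for j, cell in enumerate(row)]
            for i, row in enumerate(feature_map)]


def flatten(fmap):
    """Flatten an H x W x D nested list into one vector (an assumption of this sketch)."""
    return [v for row in fmap for cell in row for v in cell]


def matvec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def scaled_tanh(x):
    """Scaled hyperbolic tangent; the constants are assumed, not taken from the claims."""
    return 1.7159 * math.tanh(2.0 * x / 3.0)


def fuse(image_vec, question_vec, attended_vec, w_i, w_q, w_a, bias):
    """Sum three linear projections plus a bias, then apply scaled tanh element-wise."""
    parts = zip(matvec(w_i, image_vec), matvec(w_q, question_vec),
                matvec(w_a, attended_vec), bias)
    return [scaled_tanh(a + b + c + d) for a, b, c, d in parts]


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Toy values (hypothetical): a 1x2 grid with 2 channels and a 2-dim question embedding.
feature_map = [[[1.0, 2.0], [3.0, 4.0]]]
att = [[0.25, 0.75]]  # attention weights summing to 1
question = [1.0, -1.0]

weighted = weight_feature_map(feature_map, att)

w_i = [[0.1] * 4, [-0.1] * 4]      # image projection
w_q = [[0.5, 0.0], [0.0, 0.5]]     # question projection
w_a = [[0.2] * 4, [0.0] * 4]       # attended-image projection
bias = [0.0, 0.0]
hidden = fuse(flatten(feature_map), question, flatten(weighted), w_i, w_q, w_a, bias)

w_out = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # three candidate answers
answer_probs = softmax(matvec(w_out, hidden))
```

In this toy run the second grid cell carries three times the attention weight of the first, so its features dominate the attended vector that enters the fusion step.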

Assignees

Baidu USA LLC

Inventors

Classifications

  • G06N5/04 (Primary)

    Inference or reasoning models · CPC title

  • Recurrent networks, e.g. Hopfield networks · CPC title

  • Classification techniques · CPC title

  • Combinations of networks · CPC title

  • Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9965705B2 cover?
Described herein are systems and methods for generating and using attention-based deep learning architectures for visual question answering task (VQA) to automatically generate answers for image-related (still or video images) questions. To generate the correct answers, it is important for a model's attention to focus on the relevant regions of an image according to the question because differe…
Who is the assignee on this patent?
Baidu USA LLC
What technology area does this patent fall under?
Primary CPC classification G06N5/04. Mapped technology areas include Physics.
When was this patent published?
Publication date May 8, 2018 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).