US9965705B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9965705-B2 |
| Application number | US-201615184991-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 16, 2016 |
| Priority date | Nov 3, 2015 |
| Publication date | May 8, 2018 |
| Grant date | May 8, 2018 |
Official abstract text for this publication.
Described herein are systems and methods for generating and using attention-based deep learning architectures for visual question answering (VQA) tasks, which automatically generate answers to questions about images (still or video). To generate correct answers, a model's attention must focus on the regions of the image that are relevant to the question, because different questions may ask about the attributes of different image regions. In embodiments, such question-guided attention is learned with an attention-based configurable convolutional neural network (ABC-CNN). Embodiments of the ABC-CNN models determine the attention maps by convolving an image feature map with configurable convolutional kernels determined by the question's semantics. In embodiments, the question-guided attention maps focus on the question-related regions and filter out noise in the unrelated regions.
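The attention step summarized in the abstract can be illustrated with a minimal sketch in plain Python. It assumes a single question-configured kernel, zero padding so that the attention map keeps the spatial size of the feature map, and a sigmoid on the projected kernel values; the function names `project_kernel` and `question_guided_attention`, the `proj_weights` matrix, and the sigmoid are illustrative assumptions and are not taken from the patent text.

```python
import math


def project_kernel(question_emb, proj_weights, channels, ksize):
    """Map a question embedding to one kernel of shape [channels][ksize][ksize].

    proj_weights is a hypothetical learned matrix with channels*ksize*ksize rows,
    one row per kernel entry. The sigmoid squashing is an assumption.
    """
    flat = [
        1.0 / (1.0 + math.exp(-sum(w * q for w, q in zip(row, question_emb))))
        for row in proj_weights
    ]
    return [
        [[flat[(c * ksize + i) * ksize + j] for j in range(ksize)] for i in range(ksize)]
        for c in range(channels)
    ]


def question_guided_attention(feature_map, kernel):
    """Convolve a [channels][H][W] feature map with the kernel, then softmax over space."""
    channels = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    ksize = len(kernel[0])
    pad = ksize // 2  # zero padding keeps the output at H x W
    scores = []
    for y in range(height):
        for x in range(width):
            total = 0.0
            for c in range(channels):
                for dy in range(ksize):
                    for dx in range(ksize):
                        yy, xx = y + dy - pad, x + dx - pad
                        if 0 <= yy < height and 0 <= xx < width:
                            total += kernel[c][dy][dx] * feature_map[c][yy][xx]
            scores.append(total)
    # Softmax over all spatial positions gives a spatial attention distribution.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    probs = [e / norm for e in exps]
    return [probs[r * width:(r + 1) * width] for r in range(height)]


# Toy usage: a 2-channel 3x3 feature map with one strongly activated cell.
toy_map = [[[0.0] * 3 for _ in range(3)] for _ in range(2)]
toy_map[0][1][1] = 5.0
toy_kernel = project_kernel([1.0, -1.0], [[0.5, 0.2], [0.1, 0.3]], channels=2, ksize=1)
toy_attention = question_guided_attention(toy_map, toy_kernel)
```

In this toy case the attention mass concentrates on the activated cell, which mirrors the described behavior of focusing on question-related regions and suppressing the rest.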
Claim text.
The invention claimed is:
1. A computer-implemented method of improving accuracy in generating an answer to a question input related to an image input, the method comprising: receiving an image input; receiving a question input related to the image input; inputting the question input and the image input into an Attention-Based Configurable Convolutional Neural Networks (ABC-CNN) framework to generate an answer, the ABC-CNN framework comprising: an image feature map extraction component comprising a CNN that extracts an image feature map from the image input; a semantic question embedding component that obtains question embeddings from the question input; a question-guided attention map generation component that receives the image feature map and the question embeddings and that obtains a question-guided attention map focusing on a region or regions asked by question input; and an answer generation component that obtains an attention weighted image feature map by weighting image feature map using the question-guided attention map and generates answers based on a fusion of the image feature map, the question embeddings, and the attention weighted image feature map.
2. The computer-implemented method of claim 1 wherein the semantic question embedding part comprises a long short term memory (LSTM) layer to generate the question embeddings to characterize semantic meanings of the question input.
3. The computer-implemented method of claim 1 wherein the question-guided attention map generation part comprises configurable convolutional kernels produced by projecting the question embeddings from a semantic space into a visual space and utilized to convolve with the image feature map to produce the question-guided attention map.
4. The computer-implemented method of claim 3 wherein the convolutional kernels have the same number of channels as the image feature map.
5. The computer-implemented method of claim 3 wherein the question-guided attention map has the same size as the image feature map.
6. The computer-implemented method of claim 1 wherein the image feature map is extracted by dividing the image input into a plurality of grids, and extracting a D-dimension feature vector in each cell of the grids.
7. The computer-implemented method of claim 1 wherein the image feature map is spatially weighted by the question-guided attention map to obtain the attention weighted image feature map.
8. The computer-implemented method of claim 7 wherein the spatial weighting is achieved by element-wise production between each channel of the image feature map and the question-guided attention map.
9. The computer-implemented method of claim 8 wherein the spatial weighting is further defined by softmax normalization for a spatial attention distribution.
10. The computer-implemented method of claim 1 wherein the ABC-CNN framework is pre-trained in an end-to-end way with stochastic gradient descent.
11. The question-guided attention-based deep learning method of claim 10 wherein the ABC-CNN framework has initialization weights randomly adjusted to ensure that each dimension of the activations of all layers within the ABC-CNN framework has zero mean and one standard derivation during pre-training.
12. A computer-implemented method of generating an answer to a question related to an image, the method comprising steps of: extracting an image feature map from an input image comprising a plurality of pixels using a deep convolutional neural network; obtaining a dense question embedding from an input question related to the input image using a long short term memory (LSTM) layer; producing a plurality of question-configured kernels by projecting the dense question embedding from semantic space into visual space; convolving the question-configured kernels with the image feature map to generate a question-guided attention map; obtaining at a multi-class classifier an attention weighted image feature map by spatially weighting the image feature map using the question-guided attention map, the attention weighted feature map lowering weights of regions irrelevant to the question; and generating an answer to the question based on a fusion of the image feature map, the deep question embedding, and the attention weighted image feature map.
13. The method of claim 12 wherein the spatial weighting is achieved by element-wise production between each channel of the image feature map and the question-guided attention map.
14. The method of claim 12 wherein the question-guided attention map adaptively represents each pixel's degree of attention according to the input question.
15. The method of claim 12 wherein the question-guided attention map is obtained by applying the question-configured kernels on the image feature map.
16. The method of claim 12 wherein the image feature map, the deep question embedding, and the attention weighted image feature map are fused by a nonlinear projection.
17. The method of claim 16 wherein the nonlinear projection is an element-wise scaled hyperbolic tangent function.
18. A non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by one or more processors, causes the steps to be performed comprising: responsive to receiving a question input, extracting a dense question embedding of the question input; responsive to receiving an image input related to the question input, generating an image feature map; generating a question-guided attention map based on at least the image feature map and the dense question embedding, the question-guided attention map selectively focusing on areas queried by the question input; spatially weighting the image feature map using the question-guided attention map to obtain an attention weighted image; and fusing semantic information, the image feature map, and the attention weighted image to generate an answer to the question input.
19. The non-transitory computer-readable medium or media of claim 18 wherein generating a question-guided attention map further comprises softmax normalization a spatial attention distribution of the attention map.
20. The non-transitory computer-readable medium or media of claim 19 wherein generating a question-guided attention map comprises configuring a set of convolutional kernels according to the dense question embedding and applying the convolutional kernels on the image feature map to generate question-guided attention map.
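The spatial weighting and fusion steps recited in the claims (a channel-wise element-wise product with the attention map, followed by a nonlinear projection using a scaled hyperbolic tangent and a multi-class classifier) might look roughly like the following plain-Python sketch. The weight matrices `w_img`, `w_q`, `w_att`, `w_out`, the bias vector, and all function names are hypothetical. The constants 1.7159 and 2/3 are the commonly used LeCun scaling for tanh and are an assumption here, since the claims only say "scaled".

```python
import math


def weight_feature_map(feature_map, attention):
    """Multiply every channel of a [channels][H][W] map element-wise by an [H][W] attention map."""
    return [
        [[v * a for v, a in zip(frow, arow)] for frow, arow in zip(channel, attention)]
        for channel in feature_map
    ]


def scaled_tanh(x):
    """Scaled tanh; the 1.7159 and 2/3 constants are assumed, not taken from the claims."""
    return 1.7159 * math.tanh(2.0 * x / 3.0)


def flatten(feature_map):
    """Flatten a [channels][H][W] map into one vector."""
    return [v for channel in feature_map for row in channel for v in row]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def fuse(feature_map, question_emb, weighted_map, w_img, w_q, w_att, bias):
    """Fuse the image features, question embedding, and attention-weighted features."""
    parts = zip(
        matvec(w_img, flatten(feature_map)),
        matvec(w_q, question_emb),
        matvec(w_att, flatten(weighted_map)),
        bias,
    )
    return [scaled_tanh(a + b + c + d) for a, b, c, d in parts]


def answer_distribution(hidden, w_out):
    """Softmax over candidate answers, standing in for the multi-class classifier."""
    logits = matvec(w_out, hidden)
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


# Toy usage: one channel, a 2x2 grid, two-dimensional question embedding.
toy_map = [[[1.0, 2.0], [3.0, 4.0]]]
toy_attention = [[0.0, 0.5], [0.5, 0.0]]
toy_weighted = weight_feature_map(toy_map, toy_attention)
toy_hidden = fuse(
    toy_map, [1.0, -1.0], toy_weighted,
    w_img=[[0.1] * 4, [0.0] * 4],
    w_q=[[0.2, 0.1], [0.0, 0.0]],
    w_att=[[0.1] * 4, [0.0] * 4],
    bias=[0.0, 0.0],
)
toy_answers = answer_distribution(toy_hidden, [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
```

Cells with zero attention contribute nothing to the weighted feature map, which is one way to read the claim language about lowering the weights of regions irrelevant to the question.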
Inference or reasoning models · CPC title
Recurrent networks, e.g. Hopfield networks · CPC title
Classification techniques · CPC title
Combinations of networks · CPC title
Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title