Method and device for visual question answering, computer apparatus and medium

US11768876B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11768876-B2
Application number  US-202117161466-A
Country             US
Kind code           B2
Filing date         Jan 28, 2021
Priority date       Jun 30, 2020
Publication date    Sep 26, 2023
Grant date          Sep 26, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present disclosure provides a method for visual question answering, which relates to a field of computer vision and natural language processing. The method includes: acquiring an input image and an input question; constructing a Visual Graph based on the input image, wherein the Visual Graph comprises a Node Feature and an Edge Feature; updating the Node Feature by using the Node Feature and the Edge Feature to obtain an updated Visual Graph; determining a question feature based on the input question; fusing the updated Visual Graph and the question feature to obtain a fused feature; and generating a predicted answer for the input image and the input question based on the fused feature. The present disclosure further provides an apparatus for visual question answering, a computer device and a non-transitory computer-readable storage medium.
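The graph construction and node update steps summarized in the abstract are spelled out in claims 1 to 5 on this page. The following is a minimal sketch of those steps in plain Python, not the patented implementation. Several details are assumptions made only for illustration: the `(x1, y1, x2, y2)` box format, the threshold value `0.3`, the combinatorial Laplacian `L = D - E`, the ReLU activation, bias-free layers, random weights, and all function names. Cosine similarity is used for the association layer because claim 5 lists it as one of two options.

```python
import math
import random

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def edge_feature(boxes, threshold=0.3):
    """Binary edge matrix: 1 if IoU > threshold, else 0 (threshold is assumed)."""
    return [[1 if iou(p, q) > threshold else 0 for q in boxes] for p in boxes]

def laplacian(edges):
    """Combinatorial Laplacian L = D - E (one common form; an assumption here)."""
    return [[(sum(row) if i == j else 0) - row[j] for j in range(len(row))]
            for i, row in enumerate(edges)]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

def cosine_matrix(feats):
    """Pairwise cosine similarity between rows (zero rows are given norm 1)."""
    norms = [math.sqrt(sum(v * v for v in f)) or 1.0 for f in feats]
    return [[sum(x * y for x, y in zip(fi, fj)) / (ni * nj)
             for fj, nj in zip(feats, norms)]
            for fi, ni in zip(feats, norms)]

def update_round(nodes, edges, w_fc, w_gcn1, w_gcn2):
    """One round of updating: FC layer, Laplacian-based GCN, relation-based GCN."""
    first = matmul(nodes, w_fc)                    # map to the predetermined dimension
    second = relu(matmul(matmul(laplacian(edges), first), w_gcn1))
    relation = cosine_matrix(second)               # association layer
    return relu(matmul(matmul(relation, second), w_gcn2))

def random_matrix(rows, cols, rng):
    """Stand-in for learned weights or detector features (illustration only)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
edges = edge_feature(boxes)
nodes = random_matrix(3, 6, rng)                   # 3 objects, 6-dim node features
updated = update_round(nodes, edges,
                       random_matrix(6, 4, rng),
                       random_matrix(4, 4, rng),
                       random_matrix(4, 4, rng))
```

Here the first two boxes overlap heavily and are linked, while the third box overlaps nothing. Under the assumed Laplacian, an object with no neighbours gets an all-zero row, so its feature is zeroed by the first graph convolution; a normalized Laplacian or a residual connection would behave differently.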

First claim

Opening claim text (preview).

We claim:

1. A method for visual question answering, comprising: acquiring an input image and an input question; constructing a Visual Graph based on the input image, wherein the Visual Graph comprises a Node Feature and an Edge Feature; updating the Node Feature by using the Node Feature and the Edge Feature to obtain an updated Visual Graph; determining a question feature based on the input question; fusing the updated Visual Graph and the question feature to obtain a fused feature; and generating a predicted answer for the input image and the input question based on the fused feature, wherein the updating the Node Feature by using the Node Feature and the Edge Feature comprises: performing at least one round of updating operation on the Node Feature of the Visual Graph by using a predetermined neural network, wherein the predetermined neural network comprises a Fully Connected Layer, a first Graph Convolutional Layer and a second Graph Convolutional Layer, wherein each of the at least one round of updating operation comprises: mapping the Node Feature of the Visual Graph to a first feature by using the Fully Connected Layer, wherein a number of spatial dimensions of the first feature equals to a predetermined number; processing the first feature by using the first Graph Convolutional Layer to obtain a second feature; processing the second feature by using the second Graph Convolutional Layer to obtain the updated Node Feature; and constructing the updated Visual Graph by the updated Node Feature and the Edge Feature, wherein each of the at least one round of updating operation further comprises: constructing a Graph Laplacians based on the Edge Feature; and wherein the processing the first feature by using the first Graph Convolutional Layer comprises: processing the first feature by using the first Graph Convolutional Layer based on the Graph Laplacians to obtain the second feature, wherein the second feature comprises a plurality of first sub-features.

2. The method of claim 1, wherein the constructing the Visual Graph based on the input image comprises: processing the input image by using an Object Detection Network to extract an appearance feature and a spatial feature of a plurality of target objects in the input image from a middle layer of the Object Detection Network; determining the Node Feature based on the appearance feature and the spatial feature; determining position information of each of the plurality of target objects based on a processing result output by an output layer of the Object Detection Network; determining a position relationship between any two of the plurality of target objects based on the position information of each of the plurality of target objects; determining the Edge Feature based on the position relationship between the any two target objects; and constructing the Visual Graph by the Node Feature and the Edge Feature.

3. The method of claim 2, wherein the determining the position relationship between any two of the plurality of target objects based on the position information of each of the plurality of target objects comprises: calculating an intersection and an union of position regions of the any two target objects according to position information of each of the any two target objects; calculating a ratio of the intersection and the union; indicating the position relationship between the any two target objects as 1, in response to the ratio being greater than a predetermined threshold; and indicating the position relationship between the any two target objects as 0, in response to the ratio being less than or equal to the predetermined threshold.

4. The method of claim 1, wherein the predetermined neural network further comprises an association layer; wherein each of the at least one round of updating operation further comprises: calculating an association relationship between any two of the plurality of first sub-features by using the association layer, and determining a relationship matrix based on the association relationship between the any two first sub-features; and wherein the processing the second feature by using the second Graph Convolutional Layer comprises: processing the second feature by using the second Graph Convolutional Layer based on the relationship matrix to obtain the updated Node Feature.

5. The method of claim 4, wherein the association relationship between the any two first sub-features comprises: a Euclidean distance between the any two first sub-features; or a cosine similarity between the any two first sub-features.

6. The method of claim 1, wherein the determining the question feature based on the input question comprises: encoding the input question successively by using a Word Embedding Algorithm and a feature embedding algorithm to obtain the question feature.

7. The method of claim 1, wherein the updated Visual Graph comprises the updated Node Feature, and the updated Node Feature comprise a plurality of second sub-features; and wherein the fusing the updated Visual Graph and the question feature comprises: determining an attention weight between each of the plurality of second sub-features and the question feature based on an attention mechanism; performing weighted sum on the plurality of second sub-features by using the attention weight between each of the second sub-features and the question feature to obtain an adaptive feature; and fusing the adaptive feature and the question feature to obtain the fused feature.

8. The method of claim 7, wherein the fusing the adaptive feature and the question feature comprises: performing an Element-wise dot product operation on the adaptive feature and the question feature to obtain the fused feature.

9. The method of claim 8, wherein the generating the predicted answer for the input image and the input question based on the fused feature comprises: processing the fused feature by using a Multi-Layer Perceptron to obtain the predicted answer for the fused feature.

10. An apparatus for visual question answering, comprising: an acquiring module, configured to acquire an input image and an input question; a graph constructing module, configured to construct a Visual Graph based on the input image, wherein the Visual Graph comprises a Node Feature and an Edge Feature; an updating module, configured to update the Node Feature by using the Node Feature and the Edge Feature to obtain an updated Visual Graph; a question feature extracting module, configured to determine a question feature based on the input question; a fusing module, configured to fuse the updated Visual Graph and the question feature to obtain a fused feature; and a predicting module, configured to generate a predicted answer for the input image and the input question based on the fused feature, wherein the updating module is further configured to perform at least one round of updating operation on the Node Feature of the Visual Graph by using a predetermined neural network, wherein the predetermined neural network comprises a Fully Connected Layer, a first Graph Convolutional Layer and a second Graph Convolutional Layer, wherein each of the at least one round of updating operation comprises: mapping the Node Feature of the Visual Graph to a first feature by using the Fully Connected Layer, wherein a number of spatial dimensions of the first feature equals to a predetermined number; processing the first feature by using the first Graph Convolutional Layer to obtain a second feature; processing the second feature by using the second Graph Convolutional Layer to obtain the updated Node Feature; and constructing the updated Visual Graph by the updated Node Feature
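
Claims 7 to 9 describe the fusion and prediction steps: attention weights between each node sub-feature and the question feature, a weighted sum giving an adaptive feature, an element-wise product with the question feature, and a Multi-Layer Perceptron. The sketch below illustrates that flow in plain Python under stated assumptions; it is not the patented implementation. The claims say only "an attention mechanism", so the dot-product scoring with softmax is an assumption, as are the two-layer bias-free MLP, the hand-picked weights, the candidate answer list, and all function names.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(node_feats, question):
    """Attention-weighted sum of node features, then element-wise product."""
    # Attention score per node: dot product with the question (assumed form).
    scores = [sum(n * q for n, q in zip(node, question)) for node in node_feats]
    weights = softmax(scores)
    # Adaptive feature: weighted sum of the node sub-features.
    adaptive = [sum(w * node[k] for w, node in zip(weights, node_feats))
                for k in range(len(question))]
    # Fused feature: element-wise product of adaptive and question features.
    fused = [a * q for a, q in zip(adaptive, question)]
    return weights, fused

def mlp(x, w1, w2):
    """Two-layer perceptron with ReLU and no biases (illustrative only)."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col))) for col in zip(*w1)]
    return [sum(h * w for h, w in zip(hidden, col)) for col in zip(*w2)]

def predict(node_feats, question, w1, w2, answers):
    """Pick the candidate answer with the highest MLP score."""
    _, fused = fuse(node_feats, question)
    logits = mlp(fused, w1, w2)
    return answers[logits.index(max(logits))]

# Toy inputs: two updated node features and a question feature of equal size.
node_feats = [[1.0, 0.0], [0.0, 1.0]]
question = [2.0, 0.0]
weights, fused = fuse(node_feats, question)

# Hypothetical weights and answer vocabulary, chosen by hand for the example.
w1 = [[1.0, 0.0], [0.0, 1.0]]
w2 = [[1.0, -1.0], [0.0, 0.0]]
answer = predict(node_feats, question, w1, w2, ["yes", "no"])
```

The element-wise product requires the adaptive feature and the question feature to have the same length, which is why both are two-dimensional here. In a trained system the weights would be learned and the answer list would be a fixed vocabulary.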

Assignees

Beijing Baidu Netcom Sci & Tech Co Ltd

Inventors

Classifications

  • Convolutional networks [CNN, ConvNet] · CPC title

  • Supervised learning · CPC title

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title

  • Learning methods · CPC title

  • Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11768876B2 cover?
The present disclosure provides a method for visual question answering, which relates to a field of computer vision and natural language processing. The method includes: acquiring an input image and an input question; constructing a Visual Graph based on the input image, wherein the Visual Graph comprises a Node Feature and an Edge Feature; updating the Node Feature by using the Node Feature an…
Who is the assignee on this patent?
Beijing Baidu Netcom Sci & Tech Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06F16/90332. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 26, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).