US11768876B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11768876-B2 |
| Application number | US-202117161466-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 28, 2021 |
| Priority date | Jun 30, 2020 |
| Publication date | Sep 26, 2023 |
| Grant date | Sep 26, 2023 |
Official abstract text for this publication.
The present disclosure provides a method for visual question answering, which relates to a field of computer vision and natural language processing. The method includes: acquiring an input image and an input question; constructing a Visual Graph based on the input image, wherein the Visual Graph comprises a Node Feature and an Edge Feature; updating the Node Feature by using the Node Feature and the Edge Feature to obtain an updated Visual Graph; determining a question feature based on the input question; fusing the updated Visual Graph and the question feature to obtain a fused feature; and generating a predicted answer for the input image and the input question based on the fused feature. The present disclosure further provides an apparatus for visual question answering, a computer device and a non-transitory computer-readable storage medium.
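The fusion step summarized in the abstract can be illustrated with a small sketch. It follows the outline recited in claims 7 and 8 (attention weights between each node sub-feature and the question feature, a weighted sum, then an element-wise product), but the scoring function is an assumption: the text only says "an attention mechanism", so plain dot-product scores with a softmax are used here. The helper names `softmax`, `dot`, and `fuse` are hypothetical, and the final answer classifier (the Multi-Layer Perceptron of claim 9) is omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def fuse(node_features, question):
    """Attention-weighted sum of node sub-features, then element-wise product.

    Assumption: attention scores are dot products between each node
    sub-feature and the question feature, normalized with a softmax.
    """
    weights = softmax([dot(x, question) for x in node_features])
    dim = len(question)
    # Adaptive feature: weighted sum of the node sub-features.
    adaptive = [sum(w * x[i] for w, x in zip(weights, node_features))
                for i in range(dim)]
    # Fused feature: element-wise product with the question feature.
    fused = [a * q for a, q in zip(adaptive, question)]
    return weights, adaptive, fused

# Toy data (made up for illustration): three nodes, 2-dimensional features.
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
question = [2.0, 0.0]
weights, adaptive, fused = fuse(nodes, question)
print(weights, adaptive, fused)
```

In this toy input the first and third nodes align with the question vector, so they receive equal and larger attention weights than the second node. A real system would feed `fused` into a trained classifier over candidate answers.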
Opening claim text (preview).
We claim:
1. A method for visual question answering, comprising: acquiring an input image and an input question; constructing a Visual Graph based on the input image, wherein the Visual Graph comprises a Node Feature and an Edge Feature; updating the Node Feature by using the Node Feature and the Edge Feature to obtain an updated Visual Graph; determining a question feature based on the input question; fusing the updated Visual Graph and the question feature to obtain a fused feature; and generating a predicted answer for the input image and the input question based on the fused feature, wherein the updating the Node Feature by using the Node Feature and the Edge Feature comprises: performing at least one round of updating operation on the Node Feature of the Visual Graph by using a predetermined neural network, wherein the predetermined neural network comprises a Fully Connected Layer, a first Graph Convolutional Layer and a second Graph Convolutional Layer, wherein each of the at least one round of updating operation comprises: mapping the Node Feature of the Visual Graph to a first feature by using the Fully Connected Layer, wherein a number of spatial dimensions of the first feature equals to a predetermined number; processing the first feature by using the first Graph Convolutional Layer to obtain a second feature; processing the second feature by using the second Graph Convolutional Layer to obtain the updated Node Feature; and constructing the updated Visual Graph by the updated Node Feature and the Edge Feature, wherein each of the at least one round of updating operation further comprises: constructing a Graph Laplacians based on the Edge Feature; and wherein the processing the first feature by using the first Graph Convolutional Layer comprises: processing the first feature by using the first Graph Convolutional Layer based on the Graph Laplacians to obtain the second feature, wherein the second feature comprises a plurality of first sub-features.
2. The method of claim 1, wherein the constructing the Visual Graph based on the input image comprises: processing the input image by using an Object Detection Network to extract an appearance feature and a spatial feature of a plurality of target objects in the input image from a middle layer of the Object Detection Network; determining the Node Feature based on the appearance feature and the spatial feature; determining position information of each of the plurality of target objects based on a processing result output by an output layer of the Object Detection Network; determining a position relationship between any two of the plurality of target objects based on the position information of each of the plurality of target objects; determining the Edge Feature based on the position relationship between the any two target objects; and constructing the Visual Graph by the Node Feature and the Edge Feature.
3. The method of claim 2, wherein the determining the position relationship between any two of the plurality of target objects based on the position information of each of the plurality of target objects comprises: calculating an intersection and an union of position regions of the any two target objects according to position information of each of the any two target objects; calculating a ratio of the intersection and the union; indicating the position relationship between the any two target objects as 1, in response to the ratio being greater than a predetermined threshold; and indicating the position relationship between the any two target objects as 0, in response to the ratio being less than or equal to the predetermined threshold.
4. The method of claim 1, wherein the predetermined neural network further comprises an association layer; wherein each of the at least one round of updating operation further comprises: calculating an association relationship between any two of the plurality of first sub-features by using the association layer, and determining a relationship matrix based on the association relationship between the any two first sub-features; and wherein the processing the second feature by using the second Graph Convolutional Layer comprises: processing the second feature by using the second Graph Convolutional Layer based on the relationship matrix to obtain the updated Node Feature.
5. The method of claim 4, wherein the association relationship between the any two first sub-features comprises: a Euclidean distance between the any two first sub-features; or a cosine similarity between the any two first sub-features.
6. The method of claim 1, wherein the determining the question feature based on the input question comprises: encoding the input question successively by using a Word Embedding Algorithm and a feature embedding algorithm to obtain the question feature.
7. The method of claim 1, wherein the updated Visual Graph comprises the updated Node Feature, and the updated Node Feature comprise a plurality of second sub-features; and wherein the fusing the updated Visual Graph and the question feature comprises: determining an attention weight between each of the plurality of second sub-features and the question feature based on an attention mechanism; performing weighted sum on the plurality of second sub-features by using the attention weight between each of the second sub-features and the question feature to obtain an adaptive feature; and fusing the adaptive feature and the question feature to obtain the fused feature.
8. The method of claim 7, wherein the fusing the adaptive feature and the question feature comprises: performing an Element-wise dot product operation on the adaptive feature and the question feature to obtain the fused feature.
9. The method of claim 8, wherein the generating the predicted answer for the input image and the input question based on the fused feature comprises: processing the fused feature by using a Multi-Layer Perceptron to obtain the predicted answer for the fused feature.
10. An apparatus for visual question answering, comprising: an acquiring module, configured to acquire an input image and an input question; a graph constructing module, configured to construct a Visual Graph based on the input image, wherein the Visual Graph comprises a Node Feature and an Edge Feature; an updating module, configured to update the Node Feature by using the Node Feature and the Edge Feature to obtain an updated Visual Graph; a question feature extracting module, configured to determine a question feature based on the input question; a fusing module, configured to fuse the updated Visual Graph and the question feature to obtain a fused feature; and a predicting module, configured to generate a predicted answer for the input image and the input question based on the fused feature, wherein the updating module is further configured to perform at least one round of updating operation on the Node Feature of the Visual Graph by using a predetermined neural network, wherein the predetermined neural network comprises a Fully Connected Layer, a first Graph Convolutional Layer and a second Graph Convolutional Layer, wherein each of the at least one round of updating operation comprises: mapping the Node Feature of the Visual Graph to a first feature by using the Fully Connected Layer, wherein a number of spatial dimensions of the first feature equals to a predetermined number; processing the first feature by using the first Graph Convolutional Layer to obtain a second feature; processing the second feature by using the second Graph Convolutional Layer to obtain the updated Node Feature; and constructing the updated Visual Graph by the updated Node Feature
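The edge construction and node update recited in claims 1 and 3 to 5 can be sketched roughly as follows. Several details are assumptions, because the claim text does not fix them: the Laplacian is taken as the combinatorial form L = D - E, each graph convolution is modeled as ReLU(M · H · W) with M being the Laplacian (first layer) or a cosine-similarity relationship matrix (second layer), and biases are omitted. All function names, weights, boxes, and the threshold are hypothetical toy values.

```python
import math

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def edge_matrix(boxes, threshold):
    """Binary Edge Feature: 1 if IoU exceeds the threshold, else 0 (claim 3)."""
    n = len(boxes)
    return [[1 if iou(boxes[i], boxes[j]) > threshold else 0
             for j in range(n)] for i in range(n)]

def laplacian(E):
    """Assumed combinatorial Laplacian L = D - E, with D = diag(row sums)."""
    n = len(E)
    return [[(sum(E[i]) if i == j else 0) - E[i][j]
             for j in range(n)] for i in range(n)]

def matmul(A, B):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def relu(M):
    return [[max(0.0, v) for v in row] for row in M]

def cosine_matrix(H):
    """Relationship matrix from pairwise cosine similarity (one claim 5 option)."""
    norms = [math.sqrt(sum(v * v for v in row)) or 1.0 for row in H]
    n = len(H)
    return [[sum(a * b for a, b in zip(H[i], H[j])) / (norms[i] * norms[j])
             for j in range(n)] for i in range(n)]

def update_nodes(X, E, W_fc, W1, W2):
    """One update round: fully connected map, then two graph convolutions."""
    L = laplacian(E)
    first = matmul(X, W_fc)                       # Fully Connected Layer
    second = relu(matmul(matmul(L, first), W1))   # first layer, uses L
    R = cosine_matrix(second)                     # association layer
    return relu(matmul(matmul(R, second), W2))    # second layer, uses R

# Toy data (made up): two overlapping boxes and one distant box.
boxes = [(0, 0, 2, 2), (1, 0, 3, 2), (10, 10, 12, 12)]
E = edge_matrix(boxes, threshold=0.3)
L = laplacian(E)
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
updated = update_nodes(X, E, identity, identity, identity)
print(E, L, updated)
```

Because a box always overlaps itself completely, the diagonal of `E` is 1 under this thresholding, and the isolated third box ends up with an all-zero Laplacian row. Trained weight matrices would replace the identity matrices used here.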
Convolutional networks [CNN, ConvNet] · CPC title
Supervised learning · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Learning methods · CPC title
Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods · CPC title