Who is the assignee on this patent?

Beijing Baidu Netcom Sci & Tech Co Ltd

What technology area does this patent fall under?

Primary CPC classification G06F16/367. Mapped technology areas include Physics.

When was this patent published?

Publication date Oct 3, 2023 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Method and apparatus for visual question answering, computer device and medium

US11775574B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11775574-B2
Application number	US-202117182987-A
Country	US
Kind code	B2
Filing date	Feb 23, 2021
Priority date	Jun 30, 2020
Publication date	Oct 3, 2023
Grant date	Oct 3, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method for visual question answering, a computer device implementing the method and a medium for storing instructions on performing the method are provided. The method includes: acquiring an input image and an input question; constructing a visual graph based on the input image, wherein the visual graph comprises a first node feature and a first edge feature; constructing a question graph based on the input question, wherein the question graph comprises a second node feature and a second edge feature; performing a multimodal fusion on the visual graph and the question graph to obtain an updated visual graph and an updated question graph; determining a question feature based on the input question; determining a fusion feature based on the updated visual graph, the updated question graph and the question feature; and generating a predicted answer for the input image and the input question.
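The question graph in the abstract pairs per-word node features with edge features, and claim 4 derives those edges from a dependency parse. The following is a minimal sketch of the edge-building step only, assuming the parse is already available as (head, dependent) word-index pairs. The function name `question_edges`, the example arcs, and the choice to treat arcs as undirected are illustrative assumptions, not taken from the patent.

```python
from typing import List, Tuple


def question_edges(num_words: int, arcs: List[Tuple[int, int]]) -> List[List[int]]:
    """Binary edge matrix: 1 where two words are linked by a dependency arc."""
    edges = [[0] * num_words for _ in range(num_words)]
    for head, dep in arcs:
        edges[head][dep] = 1
        edges[dep][head] = 1  # assumption: arcs are treated as undirected
    return edges


words = ["what", "color", "is", "the", "cat"]
# Hand-written arcs for illustration only, not the output of a real parser.
arcs = [(1, 0), (2, 1), (2, 4), (4, 3)]
edges = question_edges(len(words), arcs)

for word, row in zip(words, edges):
    print(f"{word:>5}: {row}")
```

In a full pipeline the arcs would come from a dependency parser, and the node features from the word embedding and feature embedding steps named in claim 4.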

First claim

Opening claim text (preview).

We claim:

1. A method for visual question answering, comprising: acquiring an input image and an input question; constructing a visual graph based on the input image, wherein the visual graph comprises a first node feature and a first edge feature; constructing a question graph based on the input question, wherein the question graph comprises a second node feature and a second edge feature; performing a multimodal fusion on the visual graph and the question graph to obtain an updated visual graph and an updated question graph; determining a question feature based on the input question; determining a fusion feature based on the updated visual graph, the updated question graph and the question feature; and generating a predicted answer for the input image and the input question based on the fusion feature; wherein the performing the multimodal fusion on the visual graph and the question graph comprises: performing at least one round of multimodal fusion operation, wherein each of the at least one round of multimodal fusion operation comprises: encoding the first node feature by using a first predetermined network based on the first node feature and the first edge feature, to obtain an encoded visual graph; encoding the second node feature by using a second predetermined network based on the second node feature and the second edge feature, to obtain an encoded question graph; and performing a multimodal fusion on the encoded visual graph and the encoded question graph by using a graph match algorithm, to obtain the updated visual graph and the updated question graph; wherein the first predetermined network comprises: a first fully connected layer, a first graph convolutional layer and a second graph convolutional layer, and the encoding the first node feature comprises: mapping the first node feature to a first feature by using the first fully connected layer, wherein a number of spatial dimensions of the first feature equals to a predetermined number; processing the first feature by using the first graph convolutional layer to obtain a second feature; processing the second feature by using the second graph convolutional layer to obtain the encoded first node feature; and constructing the encoded visual graph by using the encoded first node feature and the first edge feature.

2. The method of claim 1, wherein the constructing the visual graph based on the input image comprises: processing the input image by using an object detection network to extract an appearance feature and a spatial feature of a plurality of target objects in the input image from a middle layer of the object detection network, wherein the appearance feature comprises K1 sub-features for K1 target objects, with each sub-feature of the appearance feature being represented as a vector having a first number of spatial dimensions, and the spatial feature comprises K1 sub-features for K1 target objects, with each sub-feature of the spatial feature being represented as a vector having a second number of spatial dimensions, where K1 is an integer greater than one; determining the first node feature based on the appearance feature and the spatial feature; determining position information of each of the plurality of target objects respectively, based on a processing result output by an output layer of the object detection network; determining a position relationship between any two of the plurality of target objects based on the position information of each of the plurality of target objects; determining the first edge feature based on the position relationship between the any two target objects; and constructing the visual graph by using the first node feature and the first edge feature.

3. The method of claim 2, wherein the determining the position relationship between any two of the plurality of target objects respectively, based on the position information of each of the plurality of target objects comprises: calculating an intersection of position regions of the any two target objects and a union of the position regions of the any two target objects according to position information of each of the any two target objects; calculating a ratio between the intersection and the union; indicating the position relationship between the any two target objects as 1, in response to the ratio being greater than a predetermined threshold; and indicating the position relationship between the any two target objects as 0, in response to the ratio being less than or equal to the predetermined threshold.

4. The method of claim 1, wherein the constructing the question graph based on the input question comprises: processing the input question successively by using a word embedding algorithm and a feature embedding algorithm to extract a plurality of word node features from the input question, wherein the plurality of word node features are used to indicate feature information of each of a plurality of words in the input question; determining a dependency relationship between any two of the plurality of words by using a dependency parsing algorithm; determining the second edge feature based on the dependency relationship between the any two words; and constructing the second node feature by using the plurality of word node features, and constructing the question graph by using the second node feature and the second edge feature.

5. The method of claim 1, wherein the encoding the first node feature further comprises: constructing a first Graph Laplacians based on the first edge feature; and wherein, the processing the first feature by using the first graph convolutional layer comprises: processing the first feature by using the first graph convolutional layer based on the first Graph Laplacians to obtain the second feature, wherein the second feature comprises a plurality of first sub-features.

6. The method of claim 5, wherein the first predetermined network further comprises a first association layer; wherein the encoding the first node feature further comprises: calculating an association relationship between any two of the plurality of first sub-features by using the first association layer, and determining a first relationship matrix based on the association relationship between the any two first sub-features; and wherein the processing the second feature by using the second graph convolutional layer comprises: processing the second feature by using the second graph convolutional layer based on the first relationship matrix to obtain the encoded first node feature.

7. The method of claim 1, wherein the second predetermined network comprises: a second fully connected layer, a third graph convolutional layer and a fourth graph convolutional layer; and wherein the encoding the second node feature comprises: mapping the second node feature to a third feature by using the second fully connected layer, wherein a number of spatial dimensions of the third feature equals to a predetermined number; processing the third feature by using the third graph convolutional layer to obtain a fourth feature; processing the fourth feature by using the fourth graph convolutional layer to obtain the encoded second node feature; and constructing the encoded question graph by using the encoded second node feature and the second edge feature.

8. The method of claim 7, wherein the encoding the second node feature further comprises: constructing a second Graph Laplacians based on the second edge feature; and wherein, the processing the third feature by using the third graph convolutional layer comprises: processing the third feature by using the third graph convolutional layer based on the second Graph Laplacians to obtain the fourth feature, wherein the fourth feature
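Claim 3 sets the visual edge between two detected objects to 1 when the ratio of intersection to union of their position regions exceeds a predetermined threshold, and claim 5 builds a Graph Laplacian from those edges. The sketch below illustrates both steps under stated assumptions: boxes are given as (x1, y1, x2, y2) corners, the threshold of 0.3 is an arbitrary example value, and the Laplacian uses the common combinatorial form L = D - A, which the claim text does not specify. All function names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); corner format is assumed


def iou(a: Box, b: Box) -> float:
    """Ratio of intersection area to union area for two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def visual_edges(boxes: List[Box], threshold: float = 0.3) -> List[List[int]]:
    """Binary edge matrix: 1 if IoU > threshold, else 0 (threshold is an example value)."""
    k = len(boxes)
    return [
        [1 if iou(boxes[i], boxes[j]) > threshold else 0 for j in range(k)]
        for i in range(k)
    ]


def graph_laplacian(edges: List[List[int]]) -> List[List[int]]:
    """Combinatorial Laplacian L = D - A (an assumed definition, one of several in use)."""
    k = len(edges)
    lap = [[-edges[i][j] for j in range(k)] for i in range(k)]
    for i in range(k):
        # With D - A, a self-loop cancels, so the diagonal is the off-diagonal degree.
        lap[i][i] = sum(edges[i][j] for j in range(k) if j != i)
    return lap


boxes = [(0, 0, 10, 10), (5, 0, 15, 10), (20, 20, 30, 30)]
edges = visual_edges(boxes)
laplacian = graph_laplacian(edges)
print(edges)
print(laplacian)
```

Here the first two boxes overlap with a ratio of one third, so they are connected, while the third box is isolated. A graph convolutional layer such as the one in claim 5 would then propagate node features using a matrix derived from this Laplacian.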

Assignees

Beijing Baidu Netcom Sci & Tech Co Ltd

Inventors

Classifications

G06N3/0464
Convolutional networks [CNN, ConvNet] · CPC title
G06N3/09
Supervised learning · CPC title
G06N3/0442
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
G06F16/367 (Primary)
Ontology · CPC title
G06F18/253
of extracted features · CPC title

Patent family

Related publications grouped by family.

View patent family 72760431

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11775574B2 cover?: A method for visual question answering, a computer device implementing the method and a medium for storing instructions on performing the method are provided. The method includes: acquiring an input image and an input question; constructing a visual graph based on the input image, wherein the visual graph comprises a first node feature and a first edge feature; constructing a question graph bas…