Who is the assignee on this patent?

Beijing Baidu Netcom Sci & Tech Co Ltd

What technology area does this patent fall under?

Primary CPC classification G06V30/274. Mapped technology areas include Physics.

When was this patent published?

Publication date: Dec 26, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Method and apparatus for visual question answering, computer device and medium

US11854283B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11854283-B2
Application number	US-202117169112-A
Country	US
Kind code	B2
Filing date	Feb 5, 2021
Priority date	Jun 30, 2020
Publication date	Dec 26, 2023
Grant date	Dec 26, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present disclosure provides a method for visual question answering, which relates to fields of computer vision and natural language processing. The method includes: acquiring an input image and an input question; detecting visual information and position information of each of at least one text region in the input image; determining semantic information and attribute information of each of the at least one text region based on the visual information and the position information; determining a global feature of the input image based on the visual information, the position information, the semantic information, and the attribute information; determining a question feature based on the input question; and generating a predicted answer for the input image and the input question based on the global feature and the question feature. The present disclosure further provides a device for visual question answering, a computer device and a medium.

First claim

Opening claim text (preview).

We claim:

1. A method for visual question answering, comprising: acquiring an input image and an input question; detecting visual information and position information of each of at least one text region in the input image; determining semantic information and attribute information of each of the at least one text region based on the visual information and the position information; determining a global feature of the input image based on the visual information, the position information, the semantic information, and the attribute information; determining a question feature based on the input question; and generating a predicted answer for the input image and the input question based on the global feature and the question feature, wherein the generating a predicted answer for the input image and the input question comprises: generating M predicted answers for the input image and the input question, wherein M is an integer greater than 2, calculating edit distances between one of the M predicted answers and each of the other M−1 predicted answers except for the one predicted answer for each of the M predicted answers; summing the edit distances to obtain an evaluation for each predicted answer; and selecting a predicted answer having a highest evaluation from the M predicted answers as a preferred predicted answer.

2. The method of claim 1, wherein the detecting visual information and position information of each of at least one text region in the input image comprises: detecting the input image by using a text detection model to generate a bounding box for each of the at least one text region in the input image, wherein, image information in the bounding box for each of the at least one text region is for indicating the visual information of each text region, and position information of the bounding box for each text region is for indicating the position information of each text region.

3. The method of claim 1, wherein the determining semantic information and attribute information of each of the at least one text region based on the visual information and the position information comprises: for each of the at least one text region, recognizing the visual information of each text region by using a text recognition model to obtain the semantic information of each text region.

4. The method of claim 3, wherein the attribute information comprises: table attribute information; and wherein, the determining semantic information and attribute information of each of the at least one text region based on the visual information and the position information further comprises: detecting position information of at least one table region in the input image by using a table detection tool; and determining the table attribute information of each text region based on the position information of each text region and the position information of the at least one table region, wherein the table attribute information is for indicating whether each text region is located in the at least one table region or not.

5. The method of claim 4, wherein the determining the table attribute information of each text region based on the position information of each text region and the position information of the at least one table region comprises: calculating an intersection between each text region and each table region and a union between each text region and each table region according to the position information of each text region and the position information of each of the at least one table region; calculating a ratio between the intersection and the union; determining table attribute information of each text region with respect to each table region as 1, in response to the ratio being greater than a predetermined threshold; and determining the table attribute information of each text region with respect to each table region as 0, in response to the ratio being less than or equal to the predetermined threshold.

6. The method of claim 3, wherein the attribute information comprises: text attribute information; and wherein the determining semantic information and attribute information of each of the at least one text region based on the visual information and the position information further comprises: recognizing the visual information of each text region by using a handwritten-text recognition model to determine the text attribute information of each text region, wherein the text attribute information is for indicating whether the text region contains handwritten text or not.

7. The method of claim 1, wherein the determining a global feature of the input image based on the visual information, the position information, the semantic information, and the attribute information comprises: for each of the at least one text region, converting the visual information, the position information, the semantic information, and the attribute information of each text region into a first feature, a second feature, a third feature, and a fourth feature respectively, and merging the first feature, the second feature, the third feature, and the fourth feature into a feature of each text region; determining an arrangement order for the at least one text region according to position information of each of the at least one text region; and encoding the features of the at least one text region successively by using a predetermined encoding model according to the arrangement order, to obtain the global feature of the input image.

8. The method of claim 7, wherein the merging the first feature, the second feature, the third feature, and the fourth feature into a feature of each text region comprises: performing a concatenate mergence on the first feature, the second feature, the third feature, and the fourth feature to obtain the feature of each text region; or performing a vector addition on the first feature, the second feature, the third feature, and the fourth feature to obtain the feature of each text region.

9. The method of claim 1, wherein the determining the question feature based on the input question comprises: encoding the input question successively by using a word embedding algorithm and a feature embedding algorithm to obtain the question feature.

10. The method of claim 1, wherein the generating a predicted answer for the input image and the input question based on the global feature and the question feature comprises: merging the global feature and the question feature to obtain a fusion feature; and processing the fusion feature by using a first prediction model to obtain a predicted answer for the fusion feature, wherein the first prediction model is a model being trained based on a sample image, a sample question, and a first label, wherein the first label is for indicating a real answer for the sample image and the sample question.

11. The method of claim 1, wherein the generating a predicted answer for the input image and the input question based on the global feature and the question feature comprises: merging the global feature and the question feature to obtain a fusion feature; processing the fusion feature by using a second prediction model to obtain answer start position information for the fusion feature, wherein the second prediction model is a model being trained based on a sample image, a sample question, and a second label, wherein the second label is for indicating start position information of a real answer for the sample image and the sample question in the sample image; processing the fusion feature by using a third prediction model to obtain answer end position information for the fusion feature, wherein the third prediction model is a model being trained based on
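Illustrative sketch (not part of the patent text). The final limitation of claim 1 picks one answer out of M candidates by summing pairwise edit distances. The Python below is one possible reading, not the patentee's implementation. The function names are hypothetical, the edit distance is assumed to be the Levenshtein distance, and the step from "sum" to "evaluation" is an assumption: the claim preview does not say how the sum maps to a score, so the default here treats a smaller total distance as a higher evaluation (the answer closest to the others wins). Pass a different `evaluation` function to try another reading.

```python
from typing import Callable, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def summed_distances(answers: List[str]) -> List[int]:
    """For each answer, sum its edit distances to the other M-1 answers."""
    return [
        sum(edit_distance(a, b) for j, b in enumerate(answers) if j != i)
        for i, a in enumerate(answers)
    ]


def select_answer(
    answers: List[str],
    # Assumption: lower summed distance means higher evaluation.
    evaluation: Callable[[int], float] = lambda total: -total,
) -> str:
    """Return the candidate with the highest evaluation (first one on ties)."""
    if len(answers) <= 2:
        raise ValueError("claim 1 requires M > 2 predicted answers")
    scores = [evaluation(total) for total in summed_distances(answers)]
    best = max(range(len(answers)), key=scores.__getitem__)
    return answers[best]


candidates = ["invoice 2021", "invoice 2O21", "invoice 2021", "receipt"]
preferred = select_answer(candidates)
```

Under the default assumption the outlier "receipt" is rejected and the answer that agrees with most other candidates is kept. Using `evaluation=lambda total: total` instead scores the raw sum directly, which favours the outlier.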

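Illustrative sketch (not part of the patent text). Claim 5 assigns each text region a 0/1 table attribute from the ratio of intersection to union between the text region and each table region. The Python below shows that rule for axis-aligned boxes. The box layout `(x_min, y_min, x_max, y_max)`, the function names, and the threshold value are assumptions made for illustration; the claim only speaks of "position information" and a "predetermined threshold".

```python
from typing import List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes count as zero."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def intersection_over_union(text_box: Box, table_box: Box) -> float:
    """Ratio between the intersection and the union of two boxes."""
    overlap: Box = (
        max(text_box[0], table_box[0]),
        max(text_box[1], table_box[1]),
        min(text_box[2], table_box[2]),
        min(text_box[3], table_box[3]),
    )
    inter = area(overlap)
    union = area(text_box) + area(table_box) - inter
    return inter / union if union > 0 else 0.0


def table_attribute(text_box: Box, table_boxes: List[Box],
                    threshold: float) -> List[int]:
    """1 where the ratio exceeds the threshold, 0 where it is less or equal."""
    return [
        1 if intersection_over_union(text_box, table) > threshold else 0
        for table in table_boxes
    ]


tables = [(0, 0, 2, 2), (1, 1, 3, 3), (5, 5, 6, 6)]
flags = table_attribute((0, 0, 2, 2), tables, threshold=0.5)
```

Because the ratio uses the union, a small text box lying fully inside a large table still produces a small value, so the threshold would have to be chosen with that in mind.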
Assignees

Beijing Baidu Netcom Sci & Tech Co Ltd

Inventors

Classifications

G06N3/0442
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
G06N3/0455
Auto-encoder networks; Encoder-decoder networks · CPC title
G06N3/0464
Convolutional networks [CNN, ConvNet] · CPC title
G06N3/09
Supervised learning · CPC title
G06F16/5846
using extracted text · CPC title

Patent family

Related publications grouped by family.

View patent family 72761471
