Image question answering method, apparatus and system, and storage medium

US11222236B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11222236-B2
Application number  US-202016798359-A
Country             US
Kind code           B2
Filing date         Feb 22, 2020
Priority date       Oct 31, 2017
Publication date    Jan 11, 2022
Grant date          Jan 11, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An image question answering method includes: extracting a question feature representing a semantic meaning of a question, a global feature of an image, and a detection frame feature of a detection frame encircling an object in the image; obtaining a first weight of each of at least one area of the image and a second weight of each of at least one detection frame of the image according to the question feature, the global feature, and the detection frame feature; performing weighting processing on the global feature by using the first weight to obtain an area attention feature of the image; performing weighting processing on the detection frame feature by using the second weight to obtain a detection frame attention feature of the image; and predicting an answer to the question according to the question feature, the area attention feature, and the detection frame attention feature.
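
The two attention branches described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the scoring rule (a dot product with the question plus a dot product with the averaged feature of the other branch), the softmax normalization, and all function names are assumptions made for illustration, and the features are assumed to have already been brought to a common dimension.

```python
import math

def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def mean_vector(vectors):
    """Average a list of equal-length vectors into one vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def weighted_sum(vectors, weights):
    """Weighting processing: sum over i of weights[i] * vectors[i]."""
    return [sum(w * v for w, v in zip(weights, col)) for col in zip(*vectors)]

def dual_attention(question, area_feats, frame_feats):
    """Return (first_w, second_w, area_att, frame_att).

    question    : question feature vector
    area_feats  : one vector per image area (the global feature)
    frame_feats : one vector per detection frame
    """
    # Assumed scoring rule: each candidate is compared with the question
    # and with the averaged feature of the other branch.
    frame_ctx = mean_vector(frame_feats)
    area_ctx = mean_vector(area_feats)
    first_w = softmax([dot(a, question) + dot(a, frame_ctx) for a in area_feats])
    second_w = softmax([dot(f, question) + dot(f, area_ctx) for f in frame_feats])
    area_att = weighted_sum(area_feats, first_w)     # area attention feature
    frame_att = weighted_sum(frame_feats, second_w)  # detection frame attention feature
    return first_w, second_w, area_att, frame_att
```

Calling `dual_attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 0.0], [0.0, 2.0]])` gives the first area and the first detection frame the largest weights, because both align with the question vector.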

First claim

Opening claim text (preview).

The invention claimed is:

1. An image question answering method, comprising: extracting a question feature representing a semantic meaning of a question, a global feature of an image, and a detection frame feature of a detection frame encircling an object in the image; obtaining a first weight of each of at least one area of the image and a second weight of each of at least one detection frame of the image according to the question feature, the global feature, and the detection frame feature; performing weighting processing on the global feature by using the first weight to obtain an area attention feature of the image; performing weighting processing on the detection frame feature by using the second weight to obtain a detection frame attention feature of the image; and predicting an answer to the question according to the question feature, the area attention feature, and the detection frame attention feature, wherein the extracting the detection frame feature of the detection frame encircling the object in the image comprises: obtaining a plurality of detection frames encircling the object in the image by using a faster-region convolutional neural network; determining at least one detection frame according to a difference between the object encircled by the plurality of detection frames and a background of the image; extracting at least one detection frame sub-feature according to the at least one detection frame; and obtaining the detection frame feature according to the at least one detection frame sub-feature.

2. The image question answering method according to claim 1, wherein the extracting the question feature representing the semantic meaning of the question comprises: performing feature extraction on a context of words constituting the question by using a recurrent neural network to obtain the question feature.

3. The image question answering method according to claim 1, wherein the extracting the global feature of the image comprises: extracting the global feature by using a convolutional neural network, wherein the global feature comprises a plurality of area features associated with a plurality of areas of the image.

4. The image question answering method according to claim 3, wherein the obtaining the second weight of each of at least one area of the image according to the question feature, the global feature, and the detection frame feature comprises: unifying dimensions of the question feature, the global feature, and the detection frame feature; equalizing the dimension-unified global feature according to a number of the plurality of area features; and obtaining the second weight according to the dimension-unified question feature, the dimension-unified detection frame feature, and the dimension-unified and equalized global feature.

5. The image question answering method according to claim 1, wherein the obtaining the first weight of each of at least one area of the image according to the question feature, the global feature, and the detection frame feature comprises: unifying the dimensions of the question feature, the global feature, and the detection frame feature; equalizing the dimension-unified detection frame feature according to a number of the at least one detection frame sub-features; and obtaining the first weight according to the dimension-unified question feature, the dimension-unified global feature, and the dimension-unified and equalized detection frame feature.

6. The image question answering method according to claim 1, wherein the predicting the answer to the question according to the question feature, the area attention feature, and the detection frame attention feature comprises: fusing the question feature and the area attention feature to obtain a first predicted answer to the question; fusing the question feature and the detection frame attention feature to obtain a second predicted answer to the question; and obtaining the answer to the question by classifying the first predicted answer to the question and the second predicted answer to the question.

7. An electronic device, comprising: memory configured to store executable instructions; and a processor configured to communicate with the memory to execute the executable instructions, when the executable instructions are executed, the processor is configured to: extract a question feature representing a semantic meaning of a question, a global feature of an image, and a detection frame feature of a detection frame encircling an object in the image; obtain a first weight of each of at least one area of the image and a second weight of each of at least one detection frame of the image according to the question feature, the global feature, and the detection frame feature; perform weighting processing on the global feature by using the first weight to obtain an area attention feature of the image; perform weighting processing on the detection frame feature by using the second weight to obtain a detection frame attention feature of the image; and predict an answer to a question according to the question feature, the area attention feature, and the detection frame attention feature, wherein the processor is further configured to: obtain a plurality of detections frames encircling the object in the image by using a faster-region convolutional neural network; determine at least one detection frame according to a difference between the object encircled by the plurality of detection frames and a background of the image; extract at least one detection frame sub-feature according to the at least one detection frame; and obtaining the detection frame feature according to the at least one detection frame sub-feature.

8. The electronic device according to claim 7, wherein the processor is further configured to perform feature extraction on a context of words constituting the question by using the recurrent neural network to obtain the question feature.

9. The electronic device according to claim 7, wherein the processor is further configured to extract the global feature by using the convolutional neural network, wherein the global feature comprises a plurality of area features associated with a plurality of areas of the image.

10. The electronic device according to claim 9, wherein the processor is further configured to: unify dimensions of the question feature, the global feature, and the detection frame feature; equalize the dimension-unified global feature according to a number of the plurality of area features; and obtain the second weight according to the dimension-unified question feature, the dimension-unified detection frame feature, and the dimension-unified and equalized global feature.

11. The electronic device according to claim 7, wherein the processor is further configured to: unify the dimensions of the question feature, the global feature, and the detection frame feature; equalize the dimension-unified detection frame feature according to a number of the at least one detection frame sub-features; and obtain the first weight according to the dimension-unified question feature, the dimension-unified global feature, and the dimension-unified and equalized detection frame feature.

12. The electronic device according to claim 7, wherein the processor is further configured to: fuse the question feature and the area attention feature to obtain a first predicted answer to the question; fuse the question feature and the detection frame attention feature to obtain a second predicted answer to the question; and obtain the answer to the question by classifying the first predicted answer to the question and the second predicted
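
The prediction step recited in claims 6 and 12 (fuse the question feature with each attention feature, then classify the two results) might look roughly like the sketch below. The claims do not specify the fusion or classification operators, so the element-wise product, the summation of the two fused vectors, the linear classifier, and the names `fuse`, `predict_answer`, `classifier`, and `answers` are all hypothetical choices for illustration.

```python
import math

def softmax(scores):
    """Normalize raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(question, attention):
    """Assumed fusion operator: element-wise product of two equal-length vectors."""
    return [q * a for q, a in zip(question, attention)]

def predict_answer(question, area_att, frame_att, classifier, answers):
    """Pick one answer from a fixed candidate list.

    classifier : hypothetical weight matrix, one row per candidate answer
    answers    : candidate answer strings, aligned with the classifier rows
    """
    first_pred = fuse(question, area_att)    # first predicted answer (as a vector)
    second_pred = fuse(question, frame_att)  # second predicted answer (as a vector)
    # Assumed combination: add the two branch predictions before classifying.
    combined = [a + b for a, b in zip(first_pred, second_pred)]
    logits = [sum(w * x for w, x in zip(row, combined)) for row in classifier]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return answers[best], probs
```

In practice the classifier weights would be learned from question-image-answer training data; here they are passed in as plain lists so the sketch stays self-contained.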

Assignees

Beijing Sensetime Tech Development Co Ltd

Inventors

Classifications

  • G06F40/284 (Primary)

    Lexical analysis, e.g. tokenisation or collocates · CPC title

  • Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title

  • Combinations of networks · CPC title

  • based on distances to training or reference patterns · CPC title

  • Supervised learning · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11222236B2 cover?
An image question answering method includes: extracting a question feature representing a semantic meaning of a question, a global feature of an image, and a detection frame feature of a detection frame encircling an object in the image; obtaining a first weight of each of at least one area of the image and a second weight of each of at least one detection frame of the image according to questi…
Who is the assignee on this patent?
Beijing Sensetime Tech Development Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06F40/284. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 11, 2022 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).