US11854283B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11854283-B2 |
| Application number | US-202117169112-A |
| Country | US |
| Kind code | B2 |
| Filing date | Feb 5, 2021 |
| Priority date | Jun 30, 2020 |
| Publication date | Dec 26, 2023 |
| Grant date | Dec 26, 2023 |
Official abstract text for this publication.
The present disclosure provides a method for visual question answering, which relates to fields of computer vision and natural language processing. The method includes: acquiring an input image and an input question; detecting visual information and position information of each of at least one text region in the input image; determining semantic information and attribute information of each of the at least one text region based on the visual information and the position information; determining a global feature of the input image based on the visual information, the position information, the semantic information, and the attribute information; determining a question feature based on the input question; and generating a predicted answer for the input image and the input question based on the global feature and the question feature. The present disclosure further provides a device for visual question answering, a computer device and a medium.
Opening claim text (preview).
We claim:

1. A method for visual question answering, comprising: acquiring an input image and an input question; detecting visual information and position information of each of at least one text region in the input image; determining semantic information and attribute information of each of the at least one text region based on the visual information and the position information; determining a global feature of the input image based on the visual information, the position information, the semantic information, and the attribute information; determining a question feature based on the input question; and generating a predicted answer for the input image and the input question based on the global feature and the question feature, wherein the generating a predicted answer for the input image and the input question comprises: generating M predicted answers for the input image and the input question, wherein M is an integer greater than 2, calculating edit distances between one of the M predicted answers and each of the other M−1 predicted answers except for the one predicted answer for each of the M predicted answers; summing the edit distances to obtain an evaluation for each predicted answer; and selecting a predicted answer having a highest evaluation from the M predicted answers as a preferred predicted answer.
2. The method of claim 1, wherein the detecting visual information and position information of each of at least one text region in the input image comprises: detecting the input image by using a text detection model to generate a bounding box for each of the at least one text region in the input image, wherein, image information in the bounding box for each of the at least one text region is for indicating the visual information of each text region, and position information of the bounding box for each text region is for indicating the position information of each text region.
3. The method of claim 1, wherein the determining semantic information and attribute information of each of the at least one text region based on the visual information and the position information comprises: for each of the at least one text region, recognizing the visual information of each text region by using a text recognition model to obtain the semantic information of each text region.
4. The method of claim 3, wherein the attribute information comprises: table attribute information; and wherein, the determining semantic information and attribute information of each of the at least one text region based on the visual information and the position information further comprises: detecting position information of at least one table region in the input image by using a table detection tool; and determining the table attribute information of each text region based on the position information of each text region and the position information of the at least one table region, wherein the table attribute information is for indicating whether each text region is located in the at least one table region or not.
5. The method of claim 4, wherein the determining the table attribute information of each text region based on the position information of each text region and the position information of the at least one table region comprises: calculating an intersection between each text region and each table region and a union between each text region and each table region according to the position information of each text region and the position information of each of the at least one table region; calculating a ratio between the intersection and the union; determining table attribute information of each text region with respect to each table region as 1, in response to the ratio being greater than a predetermined threshold; and determining the table attribute information of each text region with respect to each table region as 0, in response to the ratio being less than or equal to the predetermined threshold.
6. The method of claim 3, wherein the attribute information comprises: text attribute information; and wherein the determining semantic information and attribute information of each of the at least one text region based on the visual information and the position information further comprises: recognizing the visual information of each text region by using a handwritten-text recognition model to determine the text attribute information of each text region, wherein the text attribute information is for indicating whether the text region contains handwritten text or not.
7. The method of claim 1, wherein the determining a global feature of the input image based on the visual information, the position information, the semantic information, and the attribute information comprises: for each of the at least one text region, converting the visual information, the position information, the semantic information, and the attribute information of each text region into a first feature, a second feature, a third feature, and a fourth feature respectively, and merging the first feature, the second feature, the third feature, and the fourth feature into a feature of each text region; determining an arrangement order for the at least one text region according to position information of each of the at least one text region; and encoding the features of the at least one text region successively by using a predetermined encoding model according to the arrangement order, to obtain the global feature of the input image.
8. The method of claim 7, wherein the merging the first feature, the second feature, the third feature, and the fourth feature into a feature of each text region comprises: performing a concatenate mergence on the first feature, the second feature, the third feature, and the fourth feature to obtain the feature of each text region; or performing a vector addition on the first feature, the second feature, the third feature, and the fourth feature to obtain the feature of each text region.
9. The method of claim 1, wherein the determining the question feature based on the input question comprises: encoding the input question successively by using a word embedding algorithm and a feature embedding algorithm to obtain the question feature.
10. The method of claim 1, wherein the generating a predicted answer for the input image and the input question based on the global feature and the question feature comprises: merging the global feature and the question feature to obtain a fusion feature; and processing the fusion feature by using a first prediction model to obtain a predicted answer for the fusion feature, wherein the first prediction model is a model being trained based on a sample image, a sample question, and a first label, wherein the first label is for indicating a real answer for the sample image and the sample question.
11. The method of claim 1, wherein the generating a predicted answer for the input image and the input question based on the global feature and the question feature comprises: merging the global feature and the question feature to obtain a fusion feature; processing the fusion feature by using a second prediction model to obtain answer start position information for the fusion feature, wherein the second prediction model is a model being trained based on a sample image, a sample question, and a second label, wherein the second label is for indicating start position information of a real answer for the sample image and the sample question in the sample image; processing the fusion feature by using a third prediction model to obtain answer end position information for the fusion feature, wherein the third prediction model is a model being trained based on
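The answer-selection step at the end of claim 1 (generate M answers, sum pairwise edit distances, keep the best-rated one) can be illustrated with a minimal sketch. The claim preview does not spell out how a summed distance maps to a "highest evaluation"; this sketch assumes that a smaller sum means a higher evaluation, so the answer closest to all the others wins. The function names and the use of Levenshtein distance as the edit distance are illustrative assumptions, not taken from the patent.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def summed_distances(answers: List[str]) -> List[int]:
    """For each answer, sum its edit distances to the other M-1 answers."""
    return [
        sum(edit_distance(a, b) for j, b in enumerate(answers) if j != i)
        for i, a in enumerate(answers)
    ]


def select_preferred(answers: List[str]) -> str:
    """Pick the preferred answer among M > 2 candidates."""
    if len(answers) <= 2:
        raise ValueError("claim 1 requires M to be an integer greater than 2")
    sums = summed_distances(answers)
    # ASSUMPTION: evaluation is the negated sum, so the answer most
    # similar to the rest gets the highest evaluation.
    evaluations = [-s for s in sums]
    best = max(range(len(answers)), key=evaluations.__getitem__)
    return answers[best]
```

With candidates such as `["42 kg", "42 kg", "42 kq", "4Z kg"]`, the two exact duplicates have the smallest summed distance, so one of them is returned. Ties are broken by position here, which is also an arbitrary choice.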
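The table-attribute rule of claim 5 is an intersection-over-union test against a threshold. The sketch below shows one way to compute it, assuming axis-aligned bounding boxes given as `(x_min, y_min, x_max, y_max)`; the box format, the function names, and the threshold value in the usage note are assumptions for illustration only.

```python
from typing import List, Tuple

# ASSUMPTION: boxes are axis-aligned, (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection_over_union(a: Box, b: Box) -> float:
    """Ratio between the intersection area and the union area of two boxes."""
    overlap: Box = (max(a[0], b[0]), max(a[1], b[1]),
                    min(a[2], b[2]), min(a[3], b[3]))
    inter = area(overlap)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def table_attributes(text_box: Box, table_boxes: List[Box],
                     threshold: float) -> List[int]:
    """Per table region: 1 if the ratio exceeds the threshold, else 0."""
    return [
        1 if intersection_over_union(text_box, table) > threshold else 0
        for table in table_boxes
    ]
```

For example, `table_attributes((0, 0, 2, 2), [(1, 0, 3, 2), (10, 10, 12, 12)], 0.3)` yields `[1, 0]`: the first table overlaps the text region with a ratio of 1/3, the second not at all. Because the ratio is taken over the union, a small text box fully inside a large table still scores low, so the threshold would need to be chosen with that in mind.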
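Claims 7 and 8 describe merging four per-region features either by concatenation or by vector addition, then ordering the regions by position before encoding. A small sketch of those two merge options and one possible ordering follows. Features are plain Python lists here, and the top-to-bottom, left-to-right ordering is an assumption; the claims only say the order is determined from position information.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
# ASSUMPTION: boxes are (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def merge_concat(features: Sequence[Vector]) -> Vector:
    """Concatenate the feature vectors end to end."""
    return [x for feature in features for x in feature]


def merge_add(features: Sequence[Vector]) -> Vector:
    """Element-wise sum; all vectors must share one length."""
    length = len(features[0])
    if any(len(feature) != length for feature in features):
        raise ValueError("vector addition needs equal-length features")
    return [sum(values) for values in zip(*features)]


def reading_order(boxes: Sequence[Box]) -> List[int]:
    """Indices of regions sorted top-to-bottom, then left-to-right.

    ASSUMPTION: this particular ordering is illustrative only.
    """
    return sorted(range(len(boxes)), key=lambda i: (boxes[i][1], boxes[i][0]))
```

Concatenation keeps the four sources separable at the cost of a longer vector, while addition keeps the dimension fixed but requires all four features to be projected to the same length beforehand.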
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Auto-encoder networks; Encoder-decoder networks · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Supervised learning · CPC title
using extracted text · CPC title