Visual question answering model, electronic device and storage medium
US-2020293921-A1 · Sep 17, 2020 · US
US12444062B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12444062-B2 |
| Application number | US-202117163268-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 29, 2021 |
| Priority date | Jan 29, 2021 |
| Publication date | Oct 14, 2025 |
| Grant date | Oct 14, 2025 |
Official abstract text for this publication.
A computer-implemented method for visual question generation includes training an alignment module to analyze an image, an answer hint, and a visual hint with respect to the image. A k-nearest neighbors (KNN) graph is constructed by performing an aligned embedding for each region of the image. A node embedding component is generated by using a graph embedding component of the KNN graph. A visual question is generated by sequence decoding each image and graph of the image.
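The KNN graph step in the abstract can be illustrated with a minimal sketch. It assumes each image region has already been mapped to an aligned embedding vector, and it uses cosine similarity to pick neighbors. The similarity measure, the function names `cosine` and `knn_graph`, and the toy vectors are illustrative assumptions, not details taken from the patent.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_graph(embeddings: Sequence[Sequence[float]], k: int) -> Dict[int, List[int]]:
    """Build a directed KNN graph: node i points to its k most similar regions.

    Assumption: similarity is cosine similarity over region embeddings.
    Ties are broken by the lower node index.
    """
    n = len(embeddings)
    graph: Dict[int, List[int]] = {}
    for i in range(n):
        scored = [(cosine(embeddings[i], embeddings[j]), j) for j in range(n) if j != i]
        scored.sort(key=lambda s: (-s[0], s[1]))
        graph[i] = [j for _, j in scored[:k]]
    return graph


# Hypothetical 2-D embeddings for four image regions.
region_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
adjacency = knn_graph(region_embeddings, k=1)
```

In this toy input, regions 0 and 1 point roughly in the same direction, as do regions 2 and 3, so each region's single nearest neighbor is its partner. The resulting adjacency lists are what a graph neural network would consume to produce node embeddings.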
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method for visual question generation, the method comprising: training an alignment module to analyze an image, an answer hint, and a visual hint with respect to the image; constructing a graph of nodes representing objects in the image by performing an aligned embedding for each region of the image, the graph including edges connecting the nodes, the edges indicating relationships amongst the objects of the image, the constructing the graph of nodes comprising performing a double hint cross-modal operation: with a gating function to weaken the answer hint irrelevant regions and to enhance alignment between visual and textual hints, wherein the double hint cross-modal operation comprises: performing a visual hint alignment by aligning one or more regions R of the image with one or more region hints R_gt; calculating an intersection over union between R and R_gt; determining whether a particular region r_i is positive based on its intersection over union among any of the one or more region hints R_gt being larger than a predetermined threshold θ; and projecting the particular region r_i to an embedding space, wherein the constructing the graph of nodes further comprises inputting the aligned embeddings into a graph neural network; and generating a visual question by sequence decoding the image and the graph of nodes.
2. The computer-implemented method of claim 1, further comprising generating the visual question, the answer hint, and the visual hint using a Graph2Seq model.
3. The computer-implemented method of claim 2, further comprising: applying by a residual network (ResNet) the image attention; and applying the graph attention to the graph by an object detection model using a mask regional convolutional neural network (Mask RCNN).
4. The computer-implemented method of claim 1, wherein the sequence decoding comprises applying an image attention to the image and a graph attention to the graph, respectively.
5. The computer-implemented method of claim 1, wherein performing the double hint cross-modal alignment operation encodes a component that is output to an embedding space.
6. The computer-implemented method of claim 1, wherein the generating of the visual question further includes generating a side information comprising an answer to the visual question.
7. The computer-implemented method of claim 1, further comprising: using a natural language processing tool to identify a noun-phrase in the visual question and an answer; and aligning the noun-phrase with an object in the image.
8. The computer-implemented method of claim 1, further comprising providing the generated visual question to a machine learning model.
9. The computer-implemented method of claim 1, further comprising using feature aggregation to construct the graph.
10. The computer-implemented method of claim 1, wherein the graph is a k-nearest neighbor (KNN) graph.
11. A computing device for visual question generation, comprising: a processor; a memory coupled to the processor, the memory storing instructions to cause the processor to perform acts comprising: training an alignment module to analyze an image, an answer hint, and a visual hint with respect to the image; constructing a graph of nodes representing objects in the image by performing an aligned embedding for each region of the image, the graph including edges connecting the nodes, the edges indicating relationships amongst the objects of the image, the constructing the graph of nodes comprising performing a double hint cross-modal alignment operation: with a gating function to weaken the answer hint irrelevant regions and to enhance alignment between visual and textual hints, wherein the double hint cross-modal operation comprises: performing a visual hint alignment by aligning one or more regions R of the image with one or more region hints R_gt; calculating an intersection over union between R and R_gt; determining whether a particular region r_i is positive based on its intersection over union among any of the one or more region hints R_gt being larger than a predetermined threshold θ; and projecting the particular region r_i to an embedding space, wherein the constructing the graph of nodes further comprises inputting the aligned embeddings into a graph neural network; and generating a visual question by sequence decoding the image and the graph of nodes.
12. The computing device of claim 11, wherein the instructions cause the processor to perform an additional act comprising: generating the visual question, the answer hint, and the visual hint using a Graph2Seq model.
13. The computing device of claim 11, wherein the sequence decoding comprises applying an image attention to the image, and a graph attention to the graph, respectively.
14. The computing device of claim 11, wherein the double hint cross-modal alignment operation encodes a component that is output to an embedding space.
15. The computing device of claim 11, wherein the instructions cause the processor to perform additional acts comprising: using a natural language processing tool to identify a noun-phrase in the visual question and an answer; and aligning the noun-phrase with an object in the image.
16. The computing device of claim 11, wherein the instructions cause the processor to perform an additional act comprising: instructing a multi-task decoder to process one or more images using the generated visual question.
17. The computing device of claim 16, wherein the one or more images processed by the multi-task decoder includes raw images.
18. A non-transitory computer-readable storage medium tangibly embodying a computer-readable program code having computer-readable instructions that, when executed, causes a computer device to perform a method for visual question generation, the method comprising: training an alignment module to analyze an image, an answer hint, and a visual hint with respect to the image; constructing a graph of nodes representing objects in the image by performing an aligned embedding for each region of the image, the graph including edges connecting the nodes, the edges indicating relationships amongst the objects of the image, the constructing the graph of nodes comprising a double hint cross-modal alignment operation: with a gating function to weaken the answer hint irrelevant regions and to enhance alignment between visual and textual hints, wherein the double hint cross-modal operation comprises: performing a visual hint alignment by aligning one or more regions R of the image with one or more region hints R_gt; calculating an intersection over union between R and R_gt; determining whether a particular region r_i is positive based on its intersection over union among any of the one or more region hints R_gt being larger than a predetermined threshold θ; and projecting the particular region r_i to an embedding space, wherein the constructing the graph of nodes further comprises inputting the aligned embeddings into a graph neural network; and generating a visual question by sequence decoding the image and the graph of nodes.
19. The non-transitory computer-readable storage medium of claim 18, further comprising computer-readable instructions that, when executed, cause the computer device to perform an additional act comprising: generating the visual question, the answer hint, and the visual hint using a Graph2Seq model.
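The visual hint alignment recited in the independent claims (computing an intersection over union between candidate regions R and region hints R_gt, then marking a region r_i positive when its IoU with any hint exceeds θ) can be sketched as follows. This is an illustrative reading only: the corner-format boxes, the default θ = 0.5, and the names `iou` and `label_positive_regions` are assumptions, since the claims state only that θ is predetermined.

```python
from typing import List, Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0.0 else 0.0


def label_positive_regions(
    regions: Sequence[Box], region_hints: Sequence[Box], theta: float = 0.5
) -> List[bool]:
    """Mark region r_i positive if IoU(r_i, g) > theta for any hint g in R_gt.

    theta = 0.5 is a placeholder value chosen for this sketch.
    """
    return [any(iou(r, g) > theta for g in region_hints) for r in regions]


# Hypothetical candidate regions and a single region hint.
regions = [(0.0, 0.0, 10.0, 10.0), (20.0, 20.0, 30.0, 30.0)]
hints = [(0.0, 0.0, 10.0, 8.0)]
labels = label_positive_regions(regions, hints)
```

Here the first region overlaps the hint with an IoU of 0.8 and is labeled positive, while the second does not overlap at all and is labeled negative. The claimed method then projects regions to an embedding space; that projection and the gating function are not shown here.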
Combinations of networks · CPC title
Graphical models, e.g. Bayesian networks · CPC title
Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title
of the type wherein the student is expected to construct an answer to the question which is presented or wherein the machine gives an answer to the question presented by a student · CPC title
Phrasal analysis, e.g. finite state techniques or chunking · CPC title