Visual question generation with answer-awareness and region-reference

US12444062B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12444062-B2
Application number: US-202117163268-A
Country: US
Kind code: B2
Filing date: Jan 29, 2021
Priority date: Jan 29, 2021
Publication date: Oct 14, 2025
Grant date: Oct 14, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer-implemented method for visual question generation includes training an alignment module to analyze an image, an answer hint, and a visual hint with respect to the image. A k-nearest neighbors (KNN) graph is constructed by performing an aligned embedding for each region of the image. A node embedding component is generated by using a graph embedding component of the KNN graph. A visual question is generated by sequence decoding each image and graph of the image.
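As a rough illustration of the graph-construction step described in the abstract, the sketch below builds a k-nearest-neighbors graph over per-region embeddings. It is not taken from the patent: the function names `cosine` and `knn_graph`, the use of cosine similarity as the neighbor metric, and the toy two-dimensional embeddings are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_graph(embeddings, k):
    """Map each region index to the indices of its k most similar regions."""
    graph = {}
    for i, u in enumerate(embeddings):
        # Score every other region, then keep the k best (ties broken by index).
        scored = [(cosine(u, v), j) for j, v in enumerate(embeddings) if j != i]
        scored.sort(key=lambda s: (-s[0], s[1]))
        graph[i] = [j for _, j in scored[:k]]
    return graph


# Hypothetical aligned embeddings for four image regions.
regions = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
graph = knn_graph(regions, k=1)
print(graph)
```

In this toy input, regions 0 and 1 point in nearly the same direction, as do regions 2 and 3, so each region is linked to its near-duplicate. A real system would pass the resulting nodes and edges to a graph neural network, as the claims describe.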

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for visual question generation, the method comprising: training an alignment module to analyze an image, an answer hint, and a visual hint with respect to the image; constructing a graph of nodes representing objects in the image by performing an aligned embedding for each region of the image, the graph including edges connecting the nodes, the edges indicating relationships amongst the objects of the image, the constructing the graph of nodes comprising performing a double hint cross-modal operation: with a gating function to weaken the answer hint irrelevant regions and to enhance alignment between visual and textual hints, wherein the double hint cross-modal operation comprises: performing a visual hint alignment by aligning one or more regions R of the image with one or more region hints R_gt; calculating an intersection over union between R and R_gt; determining whether a particular region r_i is positive based on its intersection over union among any of the one or more region hints R_gt being larger than a predetermined threshold θ; and projecting the particular region r_i to an embedding space, wherein the constructing the graph of nodes further comprises inputting the aligned embeddings into a graph neural network; and generating a visual question by sequence decoding the image and the graph of nodes.

2. The computer-implemented method of claim 1, further comprising generating the visual question, the answer hint, and the visual hint using a Graph2Seq model.

3. The computer-implemented method of claim 2, further comprising: applying by a residual network (ResNet) the image attention; and applying the graph attention to the graph by an object detection model using a mask regional convolutional neural network (Mask RCNN).

4. The computer-implemented method of claim 1, wherein the sequence decoding comprises applying an image attention to the image and a graph attention to the graph, respectively.

5. The computer-implemented method of claim 1, wherein performing the double hint cross-modal alignment operation encodes a component that is output to an embedding space.

6. The computer-implemented method of claim 1, wherein the generating of the visual question further includes generating a side information comprising an answer to the visual question.

7. The computer-implemented method of claim 1, further comprising: using a natural language processing tool to identify a noun-phrase in the visual question and an answer; and aligning the noun-phrase with an object in the image.

8. The computer-implemented method of claim 1, further comprising providing the generated visual question to a machine learning model.

9. The computer-implemented method of claim 1, further comprising using feature aggregation to construct the graph.

10. The computer-implemented method of claim 1, wherein the graph is a k-nearest neighbor (KNN) graph.

11. A computing device for visual question generation, comprising: a processor; a memory coupled to the processor, the memory storing instructions to cause the processor to perform acts comprising: training an alignment module to analyze an image, an answer hint, and a visual hint with respect to the image; constructing a graph of nodes representing objects in the image by performing an aligned embedding for each region of the image, the graph including edges connecting the nodes, the edges indicating relationships amongst the objects of the image, the constructing the graph of nodes comprising performing a double hint cross-modal alignment operation: with a gating function to weaken the answer hint irrelevant regions and to enhance alignment between visual and textual hints, wherein the double hint cross-modal operation comprises: performing a visual hint alignment by aligning one or more regions R of the image with one or more region hints R_gt; calculating an intersection over union between R and R_gt; determining whether a particular region r_i is positive based on its intersection over union among any of the one or more region hints R_gt being larger than a predetermined threshold θ; and projecting the particular region r_i to an embedding space, wherein the constructing the graph of nodes further comprises inputting the aligned embeddings into a graph neural network; and generating a visual question by sequence decoding the image and the graph of nodes.

12. The computing device of claim 11, wherein the instructions cause the processor to perform an additional act comprising: generating the visual question, the answer hint, and the visual hint using a Graph2Seq model.

13. The computing device of claim 11, wherein the sequence decoding comprises applying an image attention to the image, and a graph attention to the graph, respectively.

14. The computing device of claim 11, wherein the double hint cross-modal alignment operation encodes a component that is output to an embedding space.

15. The computing device of claim 11, wherein the instructions cause the processor to perform additional acts comprising: using a natural language processing tool to identify a noun-phrase in the visual question and an answer; and aligning the noun-phrase with an object in the image.

16. The computing device of claim 11, wherein the instructions cause the processor to perform an additional act comprising: instructing a multi-task decoder to process one or more images using the generated visual question.

17. The computing device of claim 16, wherein the one or more images processed by the multi-task decoder includes raw images.

18. A non-transitory computer-readable storage medium tangibly embodying a computer-readable program code having computer-readable instructions that, when executed, causes a computer device to perform a method for visual question generation, the method comprising: training an alignment module to analyze an image, an answer hint, and a visual hint with respect to the image; constructing a graph of nodes representing objects in the image by performing an aligned embedding for each region of the image, the graph including edges connecting the nodes, the edges indicating relationships amongst the objects of the image, the constructing the graph of nodes comprising a double hint cross-modal alignment operation: with a gating function to weaken the answer hint irrelevant regions and to enhance alignment between visual and textual hints, wherein the double hint cross-modal operation comprises: performing a visual hint alignment by aligning one or more regions R of the image with one or more region hints R_gt; calculating an intersection over union between R and R_gt; determining whether a particular region r_i is positive based on its intersection over union among any of the one or more region hints R_gt being larger than a predetermined threshold θ; and projecting the particular region r_i to an embedding space, wherein the constructing the graph of nodes further comprises inputting the aligned embeddings into a graph neural network; and generating a visual question by sequence decoding the image and the graph of nodes.

19. The non-transitory computer-readable storage medium of claim 18, further comprising computer-readable instructions that, when executed, cause the computer device to perform an additional act comprising: generating the visual question, the answer hint, and the visual hint using a Graph2Seq model.
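The visual hint alignment recited in the independent claims marks a region r_i as positive when its intersection over union with any region hint in R_gt exceeds the threshold θ. The sketch below shows one way that test could look. It is illustrative only: the `(x1, y1, x2, y2)` box format, the function names `iou` and `positive_regions`, the sample boxes, and the value θ = 0.5 are assumptions, not details taken from the claims.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so non-overlapping boxes contribute no intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def positive_regions(regions, region_hints, theta):
    """Flag each region whose IoU with any region hint is larger than theta."""
    return [any(iou(r, g) > theta for g in region_hints) for r in regions]


# Hypothetical detected regions R and a single region hint R_gt.
R = [(0, 0, 10, 10), (20, 20, 30, 30), (5, 0, 15, 10)]
R_gt = [(0, 0, 10, 10)]
labels = positive_regions(R, R_gt, theta=0.5)  # theta is an assumed value
print(labels)
```

The first region coincides with the hint (IoU 1.0), the second does not overlap it (IoU 0.0), and the third overlaps by half its width (IoU 50/150, about 0.33), so only the first is flagged positive at this threshold.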

Assignees

IBM

Inventors

Classifications

  • Combinations of networks · CPC title

  • Graphical models, e.g. Bayesian networks · CPC title

  • Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title

  • of the type wherein the student is expected to construct an answer to the question which is presented or wherein the machine gives an answer to the question presented by a student · CPC title

  • Phrasal analysis, e.g. finite state techniques or chunking · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12444062B2 cover?
A computer-implemented method for visual question generation includes training an alignment module to analyze an image, an answer hint, and a visual hint with respect to the image. A k-nearest neighbors (KNN) graph is constructed by performing an aligned embedding for each region of the image. A node embedding component is generated by using a graph embedding component of the KNN graph. A visua…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06T7/33. Mapped technology areas include Physics.
When was this patent published?
Publication date: Oct 14, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).