Method, apparatus, device and medium for generating captioning information of multimedia data
US-2022014807-A1 · Jan 13, 2022 · US
US12475384B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12475384-B2 |
| Application number | US-202017093185-A |
| Country | US |
| Kind code | B2 |
| Filing date | Nov 9, 2020 |
| Priority date | Nov 9, 2020 |
| Publication date | Nov 18, 2025 |
| Grant date | Nov 18, 2025 |
Official abstract text for this publication.
Methods and systems disclosed herein relate generally to systems and methods for generating visual relationship graphs that identify relationships between objects depicted in an image. A vision-language application uses transformer encoders to generate a graph structure, in which the graph structure represents a dependency between a first region and a second region of an image. The dependency indicates that a contextual representation of the first region was derived, at least in part, by processing the second region. The contextual representation identifies a predicted identity of an image object depicted in the first region. The predicted identity is determined at least in part by identifying a relationship between the first region and other data objects associated with various modalities.
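The abstract's notion of a dependency, where the contextual representation of one region is derived in part by processing another region, can be pictured with a generic self-attention step. The sketch below is an illustration under stated assumptions, not the encoders described in this publication: it uses a single attention head with identity projections (no learned weights), and the names `softmax`, `self_attention`, and `regions` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the input vectors themselves
    (identity projections), which keeps the example free of learned weights.
    Returns the contextual vectors and the attention weights per input.
    """
    dim = len(vectors[0])
    contextual, weights = [], []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted mix of ALL inputs, so region 0's
        # representation depends partly on region 1, and vice versa.
        contextual.append(
            [sum(row[j] * vectors[j][d] for j in range(len(vectors)))
             for d in range(dim)]
        )
    return contextual, weights


# Hypothetical 2-D features for a "first region" and a "second region".
regions = [[1.0, 0.0], [0.0, 1.0]]
contextual, weights = self_attention(regions)
```

Here `weights[0][1]` is the share of the first region's contextual vector that comes from the second region; a non-zero value is the kind of dependency the abstract refers to.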
Opening claim text (preview).
What is claimed is:

1. A method comprising:
receiving an image;
receiving an indication of a vision-language (VL) operation relating to the image, wherein the VL operation includes at least one of a VL understanding task or a VL generation task,
generating, by a vision-language modeling application, an input embedding that identifies a visual characteristic of a first region within the image and a position of the first region within the image;
encoding, with a first transformer encoder of the vision-language modeling application, the input embedding into an intra-modality representation of the first region, wherein the intra-modality representation identifies an image object depicted in the first region based on analyzing a second region within the image and the intra-modality representation is a first feature vector;
encoding, with a second transformer encoder of the vision-language modeling application, the intra-modality representation into an inter-modality representation of the first region, wherein the inter-modality representation is a second feature vector based on one or more visual feature vectors representing the image object and one or more textual feature vectors corresponding to a token that describes the image object, wherein the token is included in a plurality of tokens that are derived from a text sequence;
generating, by the vision-language modeling application and from the inter-modality representation, a graph structure that represents a dependency between the first region and the second region, wherein the dependency indicates that the inter-modality representation of the first region was derived, at least in part, by processing the second region and comprising:
computing pairwise distances between the one or more visual feature vectors and the one or more textual feature vectors of the inter-modality representations of the first region, wherein the pairwise distances represent relationships between the visual feature vectors and between the textual feature vectors, respectively; and
constructing the graph structure based using the pairwise distances, wherein the relationship between the first region and the second region are based on the pairwise distances;
executing the VL operation using the image and based on the dependency of the graph structure; and
outputting a result, comprising information about the image based on an output of the execution of the VL operation.

2. The method of claim 1, wherein the VL operation further comprises at least one of: using the graph structure to identify another image that depicts a second image object that shares the visual characteristic and the position identified by the input embedding of the first region, or using the dependency of the graph structure to determine whether the text sequence characterizes a plurality of image objects depicted in the image.

3. The method of claim 1, wherein the graph structure includes a set of edges connecting the first region and one or more other regions, and wherein a length of an edge of the set of edges identifies a degree of relatedness between the first region and another region to which the edge is connected.

4. The method of claim 1, wherein encoding, with the second transformer encoder of the vision-language modeling application, the intra-modality representation into the inter-modality representation of the first region includes:
executing, by the vision-language modeling application, a shared self-attention sub-layer of the second transformer encoder to process a plurality of regions and generate a first output;
executing, by the vision-language modeling application, the shared self-attention sub-layer to process the plurality of tokens and generate a second output; and
generating, by the vision-language modeling application, the inter-modality representation for the first region based on the first output and the second output.

5. The method of claim 4, further comprising:
executing, by the vision-language modeling application, a cross-attention sub-layer of the second transformer encoder to process the plurality of regions with the plurality of tokens and generate a third output; and
generating, by the vision-language modeling application, the inter-modality representation for the first region based on the second output and the third output.

6. The method of claim 1, further comprising overlaying the graph structure over the image.

7. The method of claim 1, further comprising generating a heat map that represents the graph structure, wherein the heat map includes a set of heat-map elements, and wherein a color of a particular heat-map element identifies a degree of relatedness between the first region and a region of one or more other regions.

8. A system comprising:
a processor;
an input-embedding module configured to generate an input embedding for a token of a set of tokens, wherein the input embedding encodes a position of the token within a text sequence from which the set of tokens were derived;
a first transformer encoding module configured to encode the input embedding that represents the token into an intra-modality representation of the token, wherein the intra-modality representation identifies a definition of the token based on an analysis of one or more other tokens from the set of tokens and the intra-modality representation is a first feature vector; and
a second transformer encoding module configured to encode the intra-modality representation into an inter-modality representation of the token, wherein the inter-modality representation is a second feature vector based on one or more textual feature vectors including the token defining a region of an image depicting an image object and one or more visual feature vectors representing the image object; and
a relationship-probing module configured to generate, from the inter-modality representation, a graph structure that represents one or more dependencies between the token and the one or more other tokens by:
computing pairwise distances between the one or more visual feature vectors and between the one or more textual feature vectors of the inter-modality representations, respectively, wherein the pairwise distances represent relationships between the visual feature vectors and the textual feature vectors; and
constructing the graph structure based using the pairwise distances, wherein the relationship between the region of the image and other regions of the image are based on the pairwise distances; and
a non-transitory computer-readable medium having stored thereon instructions that, when executed by the processor, cause the processor to perform operations including:
receiving the image;
receiving an indication of a vision-language (VL) operation relating to the image, wherein the VL operation includes at least one of a VL understanding task or a VL generation task;
outputting the image to the input-embedding module;
receiving, from the relationship-probing module, the graph structure;
executing the VL operation using the image and based on the dependency of the graph structure; and
outputting a result, comprising information about the image based on an output of the execution of the VL operation.

9. The system of claim 8, wherein the instructions further cause the processor to: generate another graph structure that represents one or more second dependencies between a plurality of regions of the image, wherein the one or more second dependencies between the plurality of regions are derived by processing the set of tokens.

10. The system of claim 8, wherein the graph structure includes a set of edges connecting the token with the one or more other tokens, and wherein a
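
The graph-construction step recited in the claims (compute pairwise distances between feature vectors, then build a graph from them) can be sketched as follows. This is a minimal illustration under assumptions that the claim text does not specify: Euclidean distance is used as the metric, an edge is kept only when the distance is below a cutoff, and the names `pairwise_distances`, `build_graph`, `max_distance`, and the sample vectors are all hypothetical.

```python
import math
from itertools import combinations


def pairwise_distances(vectors):
    """Return {(i, j): distance} for every unordered pair with i < j.

    Assumption: Euclidean distance; the claims do not name a metric.
    """
    return {
        (i, j): math.dist(vectors[i], vectors[j])
        for i, j in combinations(range(len(vectors)), 2)
    }


def build_graph(labels, vectors, max_distance):
    """Build an undirected weighted graph as a dict of adjacency dicts.

    Assumption: two nodes are connected only when their feature vectors
    lie within max_distance. The edge weight is the distance itself, so
    a shorter edge stands for a closer relationship between two nodes.
    """
    graph = {label: {} for label in labels}
    for (i, j), distance in pairwise_distances(vectors).items():
        if distance <= max_distance:
            graph[labels[i]][labels[j]] = distance
            graph[labels[j]][labels[i]] = distance
    return graph


# Hypothetical 2-D feature vectors for three image regions.
region_labels = ["person", "horse", "fence"]
region_vectors = [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0)]
region_graph = build_graph(region_labels, region_vectors, max_distance=6.0)
```

With these sample vectors only "person" and "horse" fall within the cutoff, so the graph has a single edge of length 5.0 and "fence" stays unconnected. The same functions could be run separately over visual vectors and over textual vectors, which matches the claims' wording that distances are computed within each modality "respectively".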
Computing arrangements based on specific mathematical models · CPC title
Determination of colour characteristics · CPC title
Learning methods · CPC title
Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title
Convolutional networks [CNN, ConvNet] · CPC title