Self-supervised visual-relationship probing

US12475384B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12475384-B2
Application number	US-202017093185-A
Country	US
Kind code	B2
Filing date	Nov 9, 2020
Priority date	Nov 9, 2020
Publication date	Nov 18, 2025
Grant date	Nov 18, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods and systems disclosed herein relate generally to systems and methods for generating visual relationship graphs that identify relationships between objects depicted in an image. A vision-language application uses transformer encoders to generate a graph structure, in which the graph structure represents a dependency between a first region and a second region of an image. The dependency indicates that a contextual representation of the first region was derived, at least in part, by processing the second region. The contextual representation identifies a predicted identity of an image object depicted in the first region. The predicted identity is determined at least in part by identifying a relationship between the first region and other data objects associated with various modalities.
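The abstract describes a graph structure that links image regions, and claim 1 on this page says the graph is constructed from pairwise distances between feature vectors. The following is a minimal sketch of that idea, not the patented implementation. The Euclidean metric, the use of a minimum spanning tree, and the function names `pairwise_distances` and `minimum_spanning_tree` are all assumptions made for illustration; the text shown here does not name a metric or a graph-construction algorithm.

```python
import math
from typing import List, Tuple

Vector = List[float]
Edge = Tuple[int, int, float]  # (region index, region index, distance)


def pairwise_distances(vectors: List[Vector]) -> List[List[float]]:
    """Return the symmetric matrix of distances between feature vectors.

    Assumption: Euclidean distance. The page does not specify the metric.
    """
    n = len(vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(vectors[i], vectors[j])
            dist[i][j] = dist[j][i] = d
    return dist


def minimum_spanning_tree(dist: List[List[float]]) -> List[Edge]:
    """Build a tree over the regions with Prim's algorithm.

    Assumption: a minimum spanning tree is one plausible way to turn
    distances into a graph; shorter edges mean more closely related regions.
    """
    n = len(dist)
    if n == 0:
        return []
    in_tree = {0}
    edges: List[Edge] = []
    while len(in_tree) < n:
        # Pick the shortest edge that connects the tree to a new region.
        d, i, j = min(
            (dist[i][j], i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        edges.append((i, j, d))
        in_tree.add(j)
    return edges


if __name__ == "__main__":
    # Hypothetical 2-D "contextual representations" for three image regions.
    regions = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]]
    for i, j, d in minimum_spanning_tree(pairwise_distances(regions)):
        print(f"region {i} -- region {j}: distance {d}")
```

In this sketch the edge length plays the role that claim 3 assigns to it: it indicates how closely two regions are related.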

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: receiving an image; receiving an indication of a vision-language (VL) operation relating to the image, wherein the VL operation includes at least one of a VL understanding task or a VL generation task, generating, by a vision-language modeling application, an input embedding that identifies a visual characteristic of a first region within the image and a position of the first region within the image; encoding, with a first transformer encoder of the vision-language modeling application, the input embedding into an intra-modality representation of the first region, wherein the intra-modality representation identifies an image object depicted in the first region based on analyzing a second region within the image and the intra-modality representation is a first feature vector; encoding, with a second transformer encoder of the vision-language modeling application, the intra-modality representation into an inter-modality representation of the first region, wherein the inter-modality representation is a second feature vector based on one or more visual feature vectors representing the image object and one or more textual feature vectors corresponding to a token that describes the image object, wherein the token is included in a plurality of tokens that are derived from a text sequence; generating, by the vision-language modeling application and from the inter-modality representation, a graph structure that represents a dependency between the first region and the second region, wherein the dependency indicates that the inter-modality representation of the first region was derived, at least in part, by processing the second region and comprising: computing pairwise distances between the one or more visual feature vectors and the one or more textual feature vectors of the inter-modality representations of the first region, wherein the pairwise distances represent relationships between the visual feature vectors and between the textual feature vectors, respectively; and constructing the graph structure based using the pairwise distances, wherein the relationship between the first region and the second region are based on the pairwise distances; executing the VL operation using the image and based on the dependency of the graph structure; and outputting a result, comprising information about the image based on an output of the execution of the VL operation.

2. The method of claim 1, wherein the VL operation further comprises at least one of: using the graph structure to identify another image that depicts a second image object that shares the visual characteristic and the position identified by the input embedding of the first region, or using the dependency of the graph structure to determine whether the text sequence characterizes a plurality of image objects depicted in the image.

3. The method of claim 1, wherein the graph structure includes a set of edges connecting the first region and one or more other regions, and wherein a length of an edge of the set of edges identifies a degree of relatedness between the first region and another region to which the edge is connected.

4. The method of claim 1, wherein encoding, with the second transformer encoder of the vision-language modeling application, the intra-modality representation into the inter-modality representation of the first region includes: executing, by the vision-language modeling application, a shared self-attention sub-layer of the second transformer encoder to process a plurality of regions and generate a first output; executing, by the vision-language modeling application, the shared self-attention sub-layer to process the plurality of tokens and generate a second output; and generating, by the vision-language modeling application, the inter-modality representation for the first region based on the first output and the second output.

5. The method of claim 4, further comprising: executing, by the vision-language modeling application, a cross-attention sub-layer of the second transformer encoder to process the plurality of regions with the plurality of tokens and generate a third output; and generating, by the vision-language modeling application, the inter-modality representation for the first region based on the second output and the third output.

6. The method of claim 1, further comprising overlaying the graph structure over the image.

7. The method of claim 1, further comprising generating a heat map that represents the graph structure, wherein the heat map includes a set of heat-map elements, and wherein a color of a particular heat-map element identifies a degree of relatedness between the first region and a region of one or more other regions.

8. A system comprising: a processor; an input-embedding module configured to generate an input embedding for a token of a set of tokens, wherein the input embedding encodes a position of the token within a text sequence from which the set of tokens were derived; a first transformer encoding module configured to encode the input embedding that represents the token into an intra-modality representation of the token, wherein the intra-modality representation identifies a definition of the token based on an analysis of one or more other tokens from the set of tokens and the intra-modality representation is a first feature vector; and a second transformer encoding module configured to encode the intra-modality representation into an inter-modality representation of the token, wherein the inter-modality representation is a second feature vector based on one or more textual feature vectors including the token defining a region of an image depicting an image object and one or more visual feature vectors representing the image object; and a relationship-probing module configured to generate, from the inter-modality representation, a graph structure that represents one or more dependencies between the token and the one or more other tokens by: computing pairwise distances between the one or more visual feature vectors and between the one or more textual feature vectors of the inter-modality representations, respectively, wherein the pairwise distances represent relationships between the visual feature vectors and the textual feature vectors; and constructing the graph structure based using the pairwise distances, wherein the relationship between the region of the image and other regions of the image are based on the pairwise distances; and a non-transitory computer-readable medium having stored thereon instructions that, when executed by the processor, cause the processor to perform operations including: receiving the image; receiving an indication of a vision-language (VL) operation relating to the image, wherein the VL operation includes at least one of a VL understanding task or a VL generation task; outputting the image to the input-embedding module; receiving, from the relationship-probing module, the graph structure; executing the VL operation using the image and based on the dependency of the graph structure; and outputting a result, comprising information about the image based on an output of the execution of the VL operation.

9. The system of claim 8, wherein the instructions further cause the processor to: generate another graph structure that represents one or more second dependencies between a plurality of regions of the image, wherein the one or more second dependencies between the plurality of regions are derived by processing the set of tokens.

10. The system of claim 8, wherein the graph structure includes a set of edges connecting the token with the one or more other tokens, and wherein a
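Claims 4 and 5 above refer to a shared self-attention sub-layer applied to both regions and tokens, plus a cross-attention sub-layer that processes regions with tokens. The sketch below illustrates those three outputs with plain scaled dot-product attention. It is a simplified illustration under stated assumptions: there are no learned projection weights, so "shared" here only means the same function is applied to both modalities, and the names `attention` and `cross_modality_outputs` are hypothetical. How the outputs are combined into the inter-modality representation is not specified in the text shown here, so the sketch stops at returning them.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention without learned projections (assumption)."""
    scale = math.sqrt(len(keys[0]))
    width = len(values[0])
    out: Matrix = []
    for q in queries:
        # One weight per key, based on its similarity to the query.
        weights = softmax(
            [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        )
        # Each output row is a weighted average of the value rows.
        out.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(width)]
        )
    return out


def cross_modality_outputs(
    regions: Matrix, tokens: Matrix
) -> Tuple[Matrix, Matrix, Matrix]:
    """Return (first, second, third) outputs in the sense of claims 4 and 5."""
    first = attention(regions, regions, regions)   # self-attention over regions
    second = attention(tokens, tokens, tokens)     # same sub-layer over tokens
    third = attention(regions, tokens, tokens)     # regions attend to tokens
    return first, second, third


if __name__ == "__main__":
    # Hypothetical 2-D feature vectors: two image regions, three text tokens.
    regions = [[1.0, 0.0], [0.0, 1.0]]
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    first, second, third = cross_modality_outputs(regions, tokens)
    print(len(first), len(second), len(third))
```

A real transformer encoder would add learned query, key, and value projections, multiple heads, residual connections, and normalization; those are omitted to keep the data flow visible.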

Assignees

Adobe Inc

Inventors

Classifications

G06N7/00
Computing arrangements based on specific mathematical models · CPC title
G06T7/90
Determination of colour characteristics · CPC title
G06N3/08 (Primary)
Learning methods · CPC title
G06N3/0895
Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title
G06N3/0464
Convolutional networks [CNN, ConvNet] · CPC title

Patent family

Related publications grouped by family.

View patent family 81454436

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12475384B2 cover?: Methods and systems disclosed herein relate generally to systems and methods for generating visual relationship graphs that identify relationships between objects depicted in an image. A vision-language application uses transformer encoders to generate a graph structure, in which the graph structure represents a dependency between a first region and a second region of an image. The dependency i…
Who is the assignee on this patent?: Adobe Inc
What technology area does this patent fall under?: Primary CPC classification G06N3/08. Mapped technology areas include Physics.
When was this patent published?: Publication date Nov 18, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Related patents

Method, apparatus, device and medium for generating captioning information of multimedia data

Selecting annotations for training images using a neural network

Unsupervised learning of scene structure for synthetic data generation

Contextual grounding of natural language phrases in images

Learning to generate synthetic datasets for training neural networks

Structured Knowledge Modeling and Extraction from Images
