What technology area does this patent fall under?

Primary CPC classification G06N3/08. Mapped technology areas include Physics.

When was this patent published?

Publication date Feb 10, 2026 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Performing visual relational reasoning

US12547893B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12547893-B2
Application number	US-202217893026-A
Country	US
Kind code	B2
Filing date	Aug 22, 2022
Priority date	Aug 22, 2022
Publication date	Feb 10, 2026
Grant date	Feb 10, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A vision transformer (ViT) is a deep learning model that performs one or more vision processing tasks. ViTs may be modified to include a global task that clusters images with the same concept together to produce semantically consistent relational representations, as well as a local task that guides the ViT to discover object-centric semantic correspondence across images. A database of concepts and associated features may be created and used to train the global and local tasks, which may then enable the ViT to perform visual relational reasoning faster, without supervision, and outside of a synthetic domain.
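The global and local tasks named in the abstract can be illustrated with a small contrastive-loss sketch. The abstract does not say which loss is used; the InfoNCE form, cosine similarity, the temperature value, the "best matching token" rule in the local task, and every function name below are assumptions made for illustration only, not the patented implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (assumed): small when the query is closer to the
    positive feature than to any of the negative features."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]

def global_loss(image_embedding, concept_feature, other_features):
    """Global task (sketch): contrast a whole-image embedding against the
    feature retrieved for its concept and features of other concepts."""
    return info_nce(image_embedding, concept_feature, other_features)

def local_loss(token_embeddings, concept_feature, other_features):
    """Local task (sketch): contrast the image token that best matches the
    retrieved feature. Picking the single best token is an assumption."""
    best = max(token_embeddings, key=lambda t: cosine(t, concept_feature))
    return info_nce(best, concept_feature, other_features)

# Toy 2-D features standing in for real embeddings.
same_concept = [1.0, 0.0]
other_concept = [0.0, 1.0]
image = [0.9, 0.1]
```

With these toy vectors, `global_loss(image, same_concept, [other_concept])` is far smaller than `global_loss(image, other_concept, [same_concept])`, which is the behavior a loss needs in order to pull images of the same concept together.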

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising, at a device: accessing a training image labeled with information defining a concept depicted in the training image and a concept-feature dictionary that correlates image features with image concepts, each of the image concepts indicating a relationship between at least two depicted objects; and training a machine learning environment from the training image and the concept-feature dictionary to be able to infer a concept depicted in a given image, the training including: (a) performing a global training operation within the machine learning environment by: retrieving an image feature from the concept-feature dictionary, utilizing the concept depicted in the training image; and performing contrastive learning using the training image or an augmented version of training image, and the retrieved image feature; and (b) performing a local training operation within the machine learning environment by: retrieving the image feature from the concept-feature dictionary, utilizing the concept depicted in the training image; tokenizing the training image or an augmented version of the training image to create image tokens; and performing contrastive learning using the image tokens and the retrieved image feature.

2. The method of claim 1, wherein the machine learning environment includes a vision transformer (ViT).

3. The method of claim 1, wherein the machine learning environment includes a convolutional neural network (CNN).

4. The method of claim 1, wherein each of the image concepts within the concept-feature dictionary is represented by a key, the key including a tuple that defines the at least two objects and an associated action, and wherein each of the image features within the concept-feature dictionary is represented by a value.

5. The method of claim 4, wherein each key within the dictionary is linked to a value.

6. The method of claim 1, wherein training the machine learning environment includes performing the global training operation.

7. The method of claim 6, wherein the global training operation trains a global task within the machine learning environment that clusters images with the same concept together to produce semantically consistent relational representations.

8. The method of claim 1, wherein training the machine learning environment includes performing the local training operation.

9. The method of claim 8, wherein the local training operation trains a local task within the machine learning environment that guides the machine learning environment to discover object-centric semantic correspondence across images.

10. The method of claim 1, further comprising, at the device: updating the concept-feature dictionary while the machine learning environment is being trained.

11. A system comprising: a hardware processor of a device that is configured to: access a training image labeled with information defining a concept depicted in the training image and a concept-feature dictionary that correlates image features with image concepts, each of the image concepts indicating a relationship between at least two depicted objects; and train a machine learning environment from the training image and the concept-feature dictionary to be able to infer a concept depicted in a given image, the training including: (a) performing a global training operation within the machine learning environment by: retrieving an image feature from the concept-feature dictionary, utilizing the concept depicted in the training image; and performing contrastive learning using the training image or an augmented version of training image, and the retrieved image feature; and (b) performing a local training operation within the machine learning environment by: retrieving the image feature from the concept-feature dictionary, utilizing the concept depicted in the training image; tokenizing the training image or an augmented version of the training image to create image tokens; and performing contrastive learning using the image tokens and the retrieved image feature.

12. The system of claim 11, wherein the machine learning environment includes a vision transformer (ViT).

13. The system of claim 11, wherein the machine learning environment includes a convolutional neural network (CNN).

14. The system of claim 11, wherein each of the image concepts within the concept-feature dictionary is represented by a key, the key including a tuple that defines the at least two objects and an associated action, and wherein each of the image features within the concept-feature dictionary is represented by a value.

15. A non-transitory computer-readable storage medium storing instructions that, when executed by a processor of a device, causes the processor to cause the device to: access a training image labeled with information defining a concept depicted in the training image and a concept-feature dictionary that correlates image features with image concepts, each of the image concepts indicating a relationship between at least two depicted objects; and train a machine learning environment from the training image and the concept-feature dictionary to be able to infer a concept depicted in a given image, the training including: (a) performing a global training operation within the machine learning environment by: retrieving an image feature from the concept-feature dictionary, utilizing the concept depicted in the training image; and performing contrastive learning using the training image or an augmented version of training image, and the retrieved image feature; and (b) performing a local training operation within the machine learning environment by: retrieving the image feature from the concept-feature dictionary, utilizing the concept depicted in the training image; tokenizing the training image or an augmented version of the training image to create image tokens; and performing contrastive learning using the image tokens and the retrieved image feature.
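Claims 4, 5, and 10 describe the concept-feature dictionary as a key/value store whose keys are tuples naming at least two objects and an action, and which is updated during training. A minimal sketch of such a structure follows. The (subject, action, object) ordering, the exponential-moving-average update, the momentum value, and the class and method names are all hypothetical; the claims state only that keys link to values and that the dictionary is updated.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical key layout: (subject, action, object),
# e.g. ("person", "riding", "horse").
ConceptKey = Tuple[str, str, str]

class ConceptFeatureDictionary:
    """Maps a concept tuple (key) to one image feature vector (value)."""

    def __init__(self, momentum: float = 0.9) -> None:
        # Weight given to the stored feature on each update (assumed scheme).
        self.momentum = momentum
        self._store: Dict[ConceptKey, List[float]] = {}

    def retrieve(self, concept: ConceptKey) -> Optional[List[float]]:
        """Return a copy of the feature linked to this concept, or None."""
        feature = self._store.get(concept)
        return list(feature) if feature is not None else None

    def update(self, concept: ConceptKey, feature: List[float]) -> None:
        """Blend a new feature into the stored value during training."""
        old = self._store.get(concept)
        if old is None:
            # First time this concept is seen: store the feature as-is.
            self._store[concept] = list(feature)
            return
        m = self.momentum
        self._store[concept] = [m * o + (1.0 - m) * f for o, f in zip(old, feature)]

concepts = ConceptFeatureDictionary(momentum=0.5)
key = ("person", "riding", "horse")
concepts.update(key, [1.0, 0.0])
concepts.update(key, [0.0, 1.0])
print(concepts.retrieve(key))  # [0.5, 0.5]
```

A moving average is used here only because it keeps one value per key, matching claim 5; a queue of recent features per key would be an equally plausible reading.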

Assignees

Nvidia Corp

Inventors

Classifications

G06N3/04
Architecture, e.g. interconnection topology · CPC title
G06F16/55
Clustering; Classification · CPC title
G06N3/08 (Primary)
Learning methods · CPC title
G06N3/045 (Primary)
Combinations of networks · CPC title

Patent family

Related publications grouped by family.

View patent family 90060675

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12547893B2 cover?: A vision transformer (ViT) is a deep learning model that performs one or more vision processing tasks. ViTs may be modified to include a global task that clusters images with the same concept together to produce semantically consistent relational representations, as well as a local task that guides the ViT to discover object-centric semantic correspondence across images. A database of concepts …
Who is the assignee on this patent?: Nvidia Corp