Performing visual relational reasoning

US12547893B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12547893-B2
Application number  US-202217893026-A
Country             US
Kind code           B2
Filing date         Aug 22, 2022
Priority date       Aug 22, 2022
Publication date    Feb 10, 2026
Grant date          Feb 10, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A vision transformer (ViT) is a deep learning model that performs one or more vision processing tasks. ViTs may be modified to include a global task that clusters images with the same concept together to produce semantically consistent relational representations, as well as a local task that guides the ViT to discover object-centric semantic correspondence across images. A database of concepts and associated features may be created and used to train the global and local tasks, which may then enable the ViT to perform visual relational reasoning faster, without supervision, and outside of a synthetic domain.
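
The two tasks named in the abstract can be illustrated with a small sketch. This is not the patented implementation: it assumes an InfoNCE-style contrastive loss over cosine similarity, and the function names (`info_nce`, `global_loss`, `local_loss`), the temperature value, and the "best-matching token" rule for the local task are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (assumed form): small when `query` is closer
    to `positive` than to every vector in `negatives`."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all logits.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[0]


def global_loss(image_embedding, same_concept_feature, other_features):
    """Global task sketch: pull the whole-image embedding toward a stored
    feature of the same concept and away from other concepts."""
    return info_nce(image_embedding, same_concept_feature, other_features)


def local_loss(image_tokens, same_concept_feature, other_features):
    """Local task sketch: use the single token that best matches the stored
    feature (an assumption) as the query for the contrastive loss."""
    best = max(image_tokens, key=lambda t: cosine(t, same_concept_feature))
    return info_nce(best, same_concept_feature, other_features)
```

In this sketch the global loss operates on one embedding per image, while the local loss operates on per-token embeddings, which is one plausible reading of "object-centric semantic correspondence".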

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising, at a device:
    accessing a training image labeled with information defining a concept depicted in the training image and a concept-feature dictionary that correlates image features with image concepts, each of the image concepts indicating a relationship between at least two depicted objects; and
    training a machine learning environment from the training image and the concept-feature dictionary to be able to infer a concept depicted in a given image, the training including:
    (a) performing a global training operation within the machine learning environment by:
        retrieving an image feature from the concept-feature dictionary, utilizing the concept depicted in the training image; and
        performing contrastive learning using the training image or an augmented version of training image, and the retrieved image feature; and
    (b) performing a local training operation within the machine learning environment by:
        retrieving the image feature from the concept-feature dictionary, utilizing the concept depicted in the training image;
        tokenizing the training image or an augmented version of the training image to create image tokens; and
        performing contrastive learning using the image tokens and the retrieved image feature.

2. The method of claim 1, wherein the machine learning environment includes a vision transformer (ViT).

3. The method of claim 1, wherein the machine learning environment includes a convolutional neural network (CNN).

4. The method of claim 1, wherein each of the image concepts within the concept-feature dictionary is represented by a key, the key including a tuple that defines the at least two objects and an associated action, and wherein each of the image features within the concept-feature dictionary is represented by a value.

5. The method of claim 4, wherein each key within the dictionary is linked to a value.

6. The method of claim 1, wherein training the machine learning environment includes performing the global training operation.

7. The method of claim 6, wherein the global training operation trains a global task within the machine learning environment that clusters images with the same concept together to produce semantically consistent relational representations.

8. The method of claim 1, wherein training the machine learning environment includes performing the local training operation.

9. The method of claim 8, wherein the local training operation trains a local task within the machine learning environment that guides the machine learning environment to discover object-centric semantic correspondence across images.

10. The method of claim 1, further comprising, at the device: updating the concept-feature dictionary while the machine learning environment is being trained.

11. A system comprising: a hardware processor of a device that is configured to:
    access a training image labeled with information defining a concept depicted in the training image and a concept-feature dictionary that correlates image features with image concepts, each of the image concepts indicating a relationship between at least two depicted objects; and
    train a machine learning environment from the training image and the concept-feature dictionary to be able to infer a concept depicted in a given image, the training including:
    (a) performing a global training operation within the machine learning environment by:
        retrieving an image feature from the concept-feature dictionary, utilizing the concept depicted in the training image; and
        performing contrastive learning using the training image or an augmented version of training image, and the retrieved image feature; and
    (b) performing a local training operation within the machine learning environment by:
        retrieving the image feature from the concept-feature dictionary, utilizing the concept depicted in the training image;
        tokenizing the training image or an augmented version of the training image to create image tokens; and
        performing contrastive learning using the image tokens and the retrieved image feature.

12. The system of claim 11, wherein the machine learning environment includes a vision transformer (ViT).

13. The system of claim 11, wherein the machine learning environment includes a convolutional neural network (CNN).

14. The system of claim 11, wherein each of the image concepts within the concept-feature dictionary is represented by a key, the key including a tuple that defines the at least two objects and an associated action, and wherein each of the image features within the concept-feature dictionary is represented by a value.

15. A non-transitory computer-readable storage medium storing instructions that, when executed by a processor of a device, causes the processor to cause the device to:
    access a training image labeled with information defining a concept depicted in the training image and a concept-feature dictionary that correlates image features with image concepts, each of the image concepts indicating a relationship between at least two depicted objects; and
    train a machine learning environment from the training image and the concept-feature dictionary to be able to infer a concept depicted in a given image, the training including:
    (a) performing a global training operation within the machine learning environment by:
        retrieving an image feature from the concept-feature dictionary, utilizing the concept depicted in the training image; and
        performing contrastive learning using the training image or an augmented version of training image, and the retrieved image feature; and
    (b) performing a local training operation within the machine learning environment by:
        retrieving the image feature from the concept-feature dictionary, utilizing the concept depicted in the training image;
        tokenizing the training image or an augmented version of the training image to create image tokens; and
        performing contrastive learning using the image tokens and the retrieved image feature.
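
The concept-feature dictionary described in claims 4, 5, and 10 can be pictured with a small data-structure sketch. It is only an illustration under stated assumptions: the class name `ConceptFeatureDictionary`, the bounded per-concept queue, the random sampling in `retrieve`, and the `(subject, action, object)` ordering of the key tuple are hypothetical and are not specified by the claims.

```python
import random
from collections import defaultdict, deque


class ConceptFeatureDictionary:
    """Maps a concept key, e.g. ("person", "riding", "horse"), to recently
    stored image features. The bounded queue is an assumed design choice."""

    def __init__(self, max_per_concept=4):
        self._store = defaultdict(lambda: deque(maxlen=max_per_concept))

    def update(self, concept, feature):
        """Store a feature under its concept; the oldest entry is evicted
        once the per-concept queue is full."""
        self._store[concept].append(list(feature))

    def retrieve(self, concept, rng=random):
        """Return one stored feature for the concept, or None if the concept
        has not been seen yet. Uses .get so lookups never create entries."""
        queue = self._store.get(concept)
        if not queue:
            return None
        return rng.choice(list(queue))

    def size(self, concept):
        """Number of features currently stored for the concept."""
        queue = self._store.get(concept)
        return len(queue) if queue else 0


def training_step(dictionary, concept, image_embedding):
    """One hypothetical step: retrieve a feature for the labeled concept
    (to feed the contrastive losses), then update the dictionary with the
    current embedding so it evolves during training, as in claim 10."""
    retrieved = dictionary.retrieve(concept)
    dictionary.update(concept, image_embedding)
    return retrieved
```

A usage sketch: calling `training_step(d, ("person", "riding", "horse"), embedding)` on a fresh dictionary returns `None` the first time, since nothing has been stored for that key, and returns a previously stored feature on later calls.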

Assignees

Nvidia Corp

Inventors

Classifications

  • Architecture, e.g. interconnection topology · CPC title

  • Clustering; Classification · CPC title

  • G06N3/08 · Primary

    Learning methods · CPC title

  • G06N3/045 · Primary

    Combinations of networks · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12547893B2 cover?
A vision transformer (ViT) is a deep learning model that performs one or more vision processing tasks. ViTs may be modified to include a global task that clusters images with the same concept together to produce semantically consistent relational representations, as well as a local task that guides the ViT to discover object-centric semantic correspondence across images. A database of concepts …
Who is the assignee on this patent?
Nvidia Corp
What technology area does this patent fall under?
Primary CPC classification G06N3/08. Mapped technology areas include Physics.
When was this patent published?
Publication date: Feb 10, 2026 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).