Performing visual relational reasoning
US-2024062534-A1 · Feb 22, 2024 · US
US12547893B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12547893-B2 |
| Application number | US-202217893026-A |
| Country | US |
| Kind code | B2 |
| Filing date | Aug 22, 2022 |
| Priority date | Aug 22, 2022 |
| Publication date | Feb 10, 2026 |
| Grant date | Feb 10, 2026 |
Official abstract text for this publication.
A vision transformer (ViT) is a deep learning model that performs one or more vision processing tasks. ViTs may be modified to include a global task that clusters images with the same concept together to produce semantically consistent relational representations, as well as a local task that guides the ViT to discover object-centric semantic correspondence across images. A database of concepts and associated features may be created and used to train the global and local tasks, which may then enable the ViT to perform visual relational reasoning faster, without supervision, and outside of a synthetic domain.
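The global task described in the abstract can be pictured as a contrastive objective: an image embedding is pulled toward the stored feature for its own concept and pushed away from features of other concepts. The sketch below is a minimal illustration, assuming an InfoNCE-style loss with cosine similarity. The abstract does not name a specific loss, so the function names, the temperature value, and the toy vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (an assumption, not taken from the patent text).

    The loss is small when `query` is more similar to `positive`
    than to any vector in `negatives`.
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all logits.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability of picking the positive.
    return log_sum - logits[0]

# Hypothetical toy data: one image embedding and three stored concept features.
image_embedding = [0.9, 0.1, 0.0]
same_concept_feature = [1.0, 0.0, 0.0]
other_concept_features = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Loss when the retrieved feature matches the image's concept.
aligned_loss = info_nce(image_embedding, same_concept_feature,
                        other_concept_features)

# Loss when a feature from a different concept is treated as the positive.
misaligned_loss = info_nce(image_embedding, other_concept_features[0],
                           [same_concept_feature, other_concept_features[1]])
```

Minimizing a loss of this shape over many images is one way images that share a concept could end up clustered together. The same function could in principle be applied per image token for the local task, although the abstract does not specify that detail.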
Opening claim text (preview).
What is claimed is:

1. A method comprising, at a device:
accessing a training image labeled with information defining a concept depicted in the training image and a concept-feature dictionary that correlates image features with image concepts, each of the image concepts indicating a relationship between at least two depicted objects; and
training a machine learning environment from the training image and the concept-feature dictionary to be able to infer a concept depicted in a given image, the training including:
(a) performing a global training operation within the machine learning environment by: retrieving an image feature from the concept-feature dictionary, utilizing the concept depicted in the training image; and performing contrastive learning using the training image or an augmented version of training image, and the retrieved image feature; and
(b) performing a local training operation within the machine learning environment by: retrieving the image feature from the concept-feature dictionary, utilizing the concept depicted in the training image; tokenizing the training image or an augmented version of the training image to create image tokens; and performing contrastive learning using the image tokens and the retrieved image feature.

2. The method of claim 1, wherein the machine learning environment includes a vision transformer (ViT).

3. The method of claim 1, wherein the machine learning environment includes a convolutional neural network (CNN).

4. The method of claim 1, wherein each of the image concepts within the concept-feature dictionary is represented by a key, the key including a tuple that defines the at least two objects and an associated action, and wherein each of the image features within the concept-feature dictionary is represented by a value.

5. The method of claim 4, wherein each key within the dictionary is linked to a value.

6. The method of claim 1, wherein training the machine learning environment includes performing the global training operation.

7. The method of claim 6, wherein the global training operation trains a global task within the machine learning environment that clusters images with the same concept together to produce semantically consistent relational representations.

8. The method of claim 1, wherein training the machine learning environment includes performing the local training operation.

9. The method of claim 8, wherein the local training operation trains a local task within the machine learning environment that guides the machine learning environment to discover object-centric semantic correspondence across images.

10. The method of claim 1, further comprising, at the device: updating the concept-feature dictionary while the machine learning environment is being trained.

11. A system comprising: a hardware processor of a device that is configured to:
access a training image labeled with information defining a concept depicted in the training image and a concept-feature dictionary that correlates image features with image concepts, each of the image concepts indicating a relationship between at least two depicted objects; and
train a machine learning environment from the training image and the concept-feature dictionary to be able to infer a concept depicted in a given image, the training including:
(a) performing a global training operation within the machine learning environment by: retrieving an image feature from the concept-feature dictionary, utilizing the concept depicted in the training image; and performing contrastive learning using the training image or an augmented version of training image, and the retrieved image feature; and
(b) performing a local training operation within the machine learning environment by: retrieving the image feature from the concept-feature dictionary, utilizing the concept depicted in the training image; tokenizing the training image or an augmented version of the training image to create image tokens; and performing contrastive learning using the image tokens and the retrieved image feature.

12. The system of claim 11, wherein the machine learning environment includes a vision transformer (ViT).

13. The system of claim 11, wherein the machine learning environment includes a convolutional neural network (CNN).

14. The system of claim 11, wherein each of the image concepts within the concept-feature dictionary is represented by a key, the key including a tuple that defines the at least two objects and an associated action, and wherein each of the image features within the concept-feature dictionary is represented by a value.

15. A non-transitory computer-readable storage medium storing instructions that, when executed by a processor of a device, causes the processor to cause the device to:
access a training image labeled with information defining a concept depicted in the training image and a concept-feature dictionary that correlates image features with image concepts, each of the image concepts indicating a relationship between at least two depicted objects; and
train a machine learning environment from the training image and the concept-feature dictionary to be able to infer a concept depicted in a given image, the training including:
(a) performing a global training operation within the machine learning environment by: retrieving an image feature from the concept-feature dictionary, utilizing the concept depicted in the training image; and performing contrastive learning using the training image or an augmented version of training image, and the retrieved image feature; and
(b) performing a local training operation within the machine learning environment by: retrieving the image feature from the concept-feature dictionary, utilizing the concept depicted in the training image; tokenizing the training image or an augmented version of the training image to create image tokens; and performing contrastive learning using the image tokens and the retrieved image feature.
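The data structures named in the claims can be sketched in a few lines: a dictionary keyed by a tuple of two objects and an action (claims 4 and 5), an update step that runs during training (claim 10), and a tokenization step that splits an image into tokens (step (b) of claim 1). The following is a minimal sketch under stated assumptions. The class name `ConceptFeatureDictionary`, the `(subject, action, object)` key order, the running-average update rule, and the non-overlapping patch tokenizer are all hypothetical choices; the claims require only that keys link to values, that the dictionary can be updated, and that the image is tokenized.

```python
from typing import Dict, List, Optional, Tuple

Concept = Tuple[str, str, str]   # hypothetical layout: (subject, action, object)
Feature = List[float]

class ConceptFeatureDictionary:
    """Maps a relational concept (key) to an image feature (value)."""

    def __init__(self, momentum: float = 0.9) -> None:
        self.momentum = momentum
        self._store: Dict[Concept, Feature] = {}

    def retrieve(self, concept: Concept) -> Optional[Feature]:
        """Return the stored feature for a concept, or None if absent."""
        return self._store.get(concept)

    def update(self, concept: Concept, feature: Feature) -> None:
        """Blend a new feature into the stored one.

        The running-average rule is an assumption made for illustration;
        the claims say only that the dictionary is updated during training.
        """
        old = self._store.get(concept)
        if old is None:
            self._store[concept] = list(feature)
        else:
            m = self.momentum
            self._store[concept] = [m * o + (1.0 - m) * f
                                    for o, f in zip(old, feature)]

def tokenize(image: List[List[float]], patch: int) -> List[List[float]]:
    """Split a 2-D pixel grid into flattened, non-overlapping square patches."""
    height, width = len(image), len(image[0])
    tokens = []
    for r in range(0, height - patch + 1, patch):
        for c in range(0, width - patch + 1, patch):
            tokens.append([image[r + i][c + j]
                           for i in range(patch) for j in range(patch)])
    return tokens

# Usage with hypothetical toy values.
store = ConceptFeatureDictionary(momentum=0.5)
key = ("person", "riding", "horse")
store.update(key, [1.0, 0.0])     # first feature seen for this concept
store.update(key, [0.0, 1.0])     # blended with the stored feature

image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
tokens = tokenize(image, patch=2)  # a 4x4 grid yields four 2x2 tokens
```

In a real pipeline the tokens would be embedded by the model before being compared with the retrieved feature; the raw pixel patches here only show where the tokenization step sits relative to the dictionary lookup.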