What technology area does this patent fall under?

Primary CPC classification B25J9/163. Mapped technology areas include Operations & Transport.

When was this patent published?

Publication date: Dec 18, 2025 (kind code A1). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Techniques for vision-based robot control using multi-view pretraining

US2025381667A1 · US · A1

Patent metadata
Field	Value
Publication number	US-2025381667-A1
Application number	US-202519173679-A
Country	US
Kind code	A1
Filing date	Apr 8, 2025
Priority date	Jun 18, 2024
Publication date	Dec 18, 2025
Grant date	—

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The disclosed method for training a robot control model includes performing, based on a plurality of multi-view images that have been masked, one or more operations to train a first untrained machine learning model to generate a first trained machine learning model that comprises a trained encoder, where the first trained machine learning model is trained to generate a plurality of reconstructions of the plurality of multi-view images prior to being masked; and performing, based on robot demonstration data, one or more operations to train a second untrained machine learning model that comprises the trained encoder to generate a second trained machine learning model, where the second trained machine learning model is trained to control a robot to perform at least part of a task.
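The pretraining stage described in the abstract can be illustrated with a small sketch: hide a random subset of visual tokens in each view, then score a reconstruction against the unmasked images pixel by pixel. This is only an illustration under stated assumptions, not the applicant's implementation. The helper names (`mask_tokens`, `apply_mask`, `pixel_mse`), the 50% mask ratio, the zero fill value, and the tiny two-view data are all hypothetical, and no encoder or decoder is trained here.

```python
import random


def mask_tokens(num_tokens, mask_ratio, rng):
    """Pick token indices to hide, uniformly at random (hypothetical helper)."""
    num_masked = int(round(num_tokens * mask_ratio))
    return sorted(rng.sample(range(num_tokens), num_masked))


def apply_mask(tokens, masked_idx, fill=0.0):
    """Overwrite every pixel of each masked token with a fill value."""
    hidden = set(masked_idx)
    return [
        [fill] * len(tok) if i in hidden else list(tok)
        for i, tok in enumerate(tokens)
    ]


def pixel_mse(reconstruction, original):
    """Mean squared difference over all pixels of all tokens in one view."""
    total, count = 0.0, 0
    for rec_tok, org_tok in zip(reconstruction, original):
        for r, o in zip(rec_tok, org_tok):
            total += (r - o) ** 2
            count += 1
    return total / count


# Assumed toy data: two views, each split into 4 tokens of 2 pixels.
rng = random.Random(0)
views = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]],
    [[0.9, 0.8], [0.7, 0.6], [0.5, 0.4], [0.3, 0.2]],
]

masked_views = []
for view in views:
    idx = mask_tokens(len(view), mask_ratio=0.5, rng=rng)
    masked_views.append(apply_mask(view, idx))

# A "model" that echoes its masked input is penalized; a perfect one is not.
loss_identity = sum(pixel_mse(m, v) for m, v in zip(masked_views, views)) / len(views)
loss_perfect = sum(pixel_mse(v, v) for v in views) / len(views)
```

In an actual training loop, the loss would be computed on the output of a decoder fed with encoder embeddings, and its gradient would update the model parameters. The sketch only shows what is masked and what the loss compares.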

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for training a robot control model, the method comprising: performing, based on a plurality of multi-view images that have been masked, one or more operations to train a first untrained machine learning model to generate a first trained machine learning model that comprises a trained encoder, wherein the first trained machine learning model is trained to generate a plurality of reconstructions of the plurality of multi-view images prior to being masked; and performing, based on robot demonstration data, one or more operations to train a second untrained machine learning model that comprises the trained encoder to generate a second trained machine learning model, wherein the second trained machine learning model is trained to control a robot to perform at least part of a task.
2. The computer-implemented method of claim 1, further comprising: generating, based on object geometry data, the plurality of multi-view images; and masking out at least one portion of each image included in the plurality of multi-view images.
3. The computer-implemented method of claim 2, wherein generating the plurality of multi-view images comprises: generating, based on the object geometry data, a point cloud; and rendering the point cloud using a plurality of virtual cameras to generate the plurality of multi-view images.
4. The computer-implemented method of claim 2, wherein masking out at least one portion of each image comprises randomly masking out one or more visual tokens of the image.
5. The computer-implemented method of claim 1, wherein performing one or more operations to train the first untrained machine learning model comprises: generating, based on the plurality of multi-view images that have been masked, one or more multi-view embeddings using an untrained encoder included in the untrained machine learning model; generating, based on the one or more multi-view embeddings, another plurality of reconstructions of the multi-view images using a decoder included in the untrained machine learning model; calculating, based on the another plurality of reconstructions and the plurality of multi-view images, a loss; and updating, based on the loss, one or more parameters of the first untrained machine learning model.
6. The computer-implemented method of claim 5, wherein the loss is a pixel-wise reconstruction loss that measures differences between pixels in the another plurality of reconstructions and pixels in the plurality of multi-view images.
7. The computer-implemented method of claim 5, wherein the decoder comprises a masked autoencoder.
8. The computer-implemented method of claim 1, wherein the robot demonstration data comprises another plurality of multi-view images, one or more language goals, and one or more ground truth robot actions.
9. The computer-implemented method of claim 8, wherein performing one or more operations to train the second untrained machine learning model comprises generating, based on the another plurality of multi-view images, one or more multi-view embeddings using the trained encoder; generating, based on the one or more multi-view embeddings and the one or more language goals, one or more robot actions using a decoder included in the second untrained machine learning model; calculating, based on the one or more robot actions and the one or more ground truth robot actions, a loss; and updating, based on the loss, one or more parameters of the second untrained machine learning model.
10. The computer-implemented method of claim 1, further comprising: receiving sensor data from one or more sensors and one or more language goals; generating, based on the sensor data, another plurality of multi-view images; generating, based on the another plurality of multi-view images and the one or more language goals, one or more robot actions using the second trained machine learning model; generating, based on the one or more robot actions, one or more controls; and causing the robot to move based on the one or more controls.
11. One or more non-transitory computer readable media storing instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of: performing, based on a plurality of multi-view images that have been masked, one or more operations to train a first untrained machine learning model to generate a first trained machine learning model that comprises a trained encoder, wherein the first trained machine learning model is trained to generate a plurality of reconstructions of the plurality of multi-view images prior to being masked; and performing, based on robot demonstration data, one or more operations to train a second untrained machine learning model that comprises the trained encoder to generate a second trained machine learning model, wherein the second trained machine learning model is trained to control a robot to perform at least part of a task.
12. The one or more non-transitory computer-readable media of claim 11, wherein the instructions, when executed by the one or more processors, further cause the one or more processors to perform the step of: generating, based on object geometry data, the plurality of multi-view images; and masking out at least one portion of each image included in the plurality of multi-view images.
13. The one or more non-transitory computer-readable media of claim 12, wherein the plurality of multi-view images are rendered using a plurality of virtual cameras at predefined viewpoints around the object geometry data.
14. The one or more non-transitory computer-readable media of claim 11, wherein performing one or more operations to train the first untrained machine learning model comprises: generating, based on the plurality of multi-view images that have been masked, one or more multi-view embeddings using an untrained encoder included in the untrained machine learning model; generating, based on the one or more multi-view embeddings, another plurality of reconstructions of the multi-view images using a decoder included in the untrained machine learning model; calculating, based on the another plurality of reconstructions and the plurality of multi-view images, a loss; and updating, based on the loss, one or more parameters of the first untrained machine learning model.
15. The one or more non-transitory computer-readable media of claim 11, wherein the robot demonstration data comprises another plurality of multi-view images, one or more language goals, and one or more ground truth robot actions.
16. The one or more non-transitory computer-readable media of claim 15, wherein performing one or more operations to train the second untrained machine learning model comprises: generating, based on the another plurality of multi-view images, one or more multi-view embeddings using the trained encoder; generating, based on the one or more multi-view embeddings and the one or more language goals, one or more robot actions using a decoder included in the second untrained machine learning model; calculating, based on the one or more robot actions and the one or more ground truth robot actions, a loss; and updating, based on the loss, one or more parameters of the second untrained machine learning model.
17. The one or more non-transitory computer-readable media of claim 11, wherein the instructions, when executed by the one or more processors, further cause the one or more processors to perform the steps of: receiving sensor data

Assignees

Nvidia Corp

Inventors

Classifications

B25J9/1697
Vision controlled systems · CPC title
B25J9/163 · Primary
learning, adaptive, model based, rule based expert control · CPC title
B25J19/023 · Primary
including video camera means · CPC title

Patent family

Related publications grouped by family.

View patent family 98013988

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2025381667A1 cover?: The disclosed method for training a robot control model includes performing, based on a plurality of multi-view images that have been masked, one or more operations to train a first untrained machine learning model to generate a first trained machine learning model that comprises a trained encoder, where the first trained machine learning model is trained to generate a plurality of reconstructi…
Who is the assignee on this patent?: Nvidia Corp