Mitigating reality gap through training a simulation-to-real model using a vision-based robot task model
US-2024118667-A1 · Apr 11, 2024 · US
US12333787B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12333787-B2 |
| Application number | US-202217986428-A |
| Country | US |
| Kind code | B2 |
| Filing date | Nov 14, 2022 |
| Priority date | Nov 16, 2021 |
| Publication date | Jun 17, 2025 |
| Grant date | Jun 17, 2025 |
Official abstract text for this publication.
Implementations disclosed herein relate to mitigating the reality gap through feature-level domain adaptation in training of a vision-based robotic action machine learning (ML) model. Implementations mitigate the reality gap through utilization of embedding consistency losses and/or action consistency losses during training of the action ML model.
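As an illustration of the two loss types named in the abstract, the following is a minimal sketch, not the patented implementation. It assumes embeddings and action outputs are plain lists of floats and that each comparison is a mean squared difference. The source only says each loss is "a function of comparison", so that choice and the function names `mean_squared_difference`, `embedding_consistency_loss`, and `action_consistency_loss` are hypothetical.

```python
from typing import Sequence


def mean_squared_difference(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared difference between two equal-length vectors.

    Assumption: the source does not name a comparison function; this is
    one common choice used here purely for illustration.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def embedding_consistency_loss(
    sim_embedding: Sequence[float],
    pred_real_embedding: Sequence[float],
) -> float:
    """Penalize disagreement between the embeddings of a paired
    simulated image and predicted real image."""
    return mean_squared_difference(sim_embedding, pred_real_embedding)


def action_consistency_loss(
    sim_actions: Sequence[Sequence[float]],
    pred_real_actions: Sequence[Sequence[float]],
) -> float:
    """Sum of per-output comparisons: each simulated image predicted
    action output is compared with its predicted real image counterpart."""
    if len(sim_actions) != len(pred_real_actions):
        raise ValueError("both inputs need the same number of action outputs")
    return sum(
        mean_squared_difference(s, r)
        for s, r in zip(sim_actions, pred_real_actions)
    )


# Hypothetical paired values for one simulated / predicted real image pair.
emb_loss = embedding_consistency_loss([0.5, 1.0], [0.0, 1.0])
act_loss = action_consistency_loss(
    [[1.0, 2.0], [0.0]],  # outputs for the simulated image, one list per output
    [[1.0, 0.0], [1.0]],  # outputs for the predicted real image
)
print(emb_loss, act_loss)
```

Both losses fall to zero when the model treats the two images of a pair identically, which is the sense in which they push the model toward domain-invariant features.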
Opening claim text (preview).
What is claimed is:
1. A method implemented by one or more processors, the method comprising: generating a predicted real image based on processing a simulated image using a simulation-to-real generator model, wherein the simulated image is generated by a robotic simulator during performance of a robotic task by a simulated robot of the robotic simulator; in response to the predicted real image being generated based on processing the simulated image using the simulation-to-real generator model: pairing the simulated image with the predicted real image; processing the simulated image, using an action machine learning model being trained for use in controlling a robot to perform the robotic task, to generate one or more simulated image predicted action outputs, wherein processing the simulated image comprises: generating a simulated image embedding by processing the simulated image using vision feature layers of the action machine learning model; and processing the simulated image embedding using additional layers of the action machine learning model to generate the simulated image predicted action outputs; processing the predicted real image, using the action machine learning model, to generate one or more predicted real image predicted action outputs, wherein processing the predicted real image comprises: generating a predicted real image embedding by processing the predicted real image using the vision feature layers; and processing the predicted real image embedding using the additional layers to generate the real image predicted action outputs; in response to the pairing of the simulated image with the predicted real image: generating an embedding consistency loss as a function of comparison of the simulated image embedding and the predicted real image embedding; and updating the vision feature layers based on the generated embedding consistency loss.
2. The method of claim 1, wherein updating the vision feature layers based on the generated embedding consistency loss comprises: backpropagating the loss across the vision feature layers without backpropagating the loss across the additional layers.
3. The method of claim 1, further comprising, in response to the pairing of the simulated image with the predicted real image: generating one or more action consistency losses as a function of one or more action output comparisons, each of the action output comparisons being between a corresponding one of the simulated image predicted action outputs and a corresponding one of the predicted real image predicted action outputs; and updating the vision feature layers further based on the one or more action consistency losses.
4. The method of claim 3, wherein the additional layers comprise a first control head and a second control head, wherein the simulated image predicted action outputs comprise a first simulated image predicted action output generated using the first control head and a second simulated image predicted action output generated using the second control head, and wherein the predicted real image predicted action outputs comprise a first predicted real image predicted action output generated using the first control head and a second predicted real image predicted action output generated using the second control head.
5. The method of claim 4, wherein generating the action consistency losses comprises: generating a first action consistency loss based on comparison of the first simulated image predicted action output and the first predicted real image predicted action output; generating a second action consistency loss based on comparison of the second simulated image predicted action output and the second predicted real image predicted action output; and generating the action consistency loss as a function of the first action consistency loss and the second action consistency loss.
6. The method of claim 5, further comprising, in response to the pairing of the simulated image with the predicted real image: backpropagating the first action consistency loss across the first control head; and backpropagating the second action consistency loss across the second control head; wherein updating the vision feature layers further based on the one or more action consistency losses comprises: backpropagating residuals, of the first action consistency loss and the second action consistency loss, across the vision feature layers.
7. The method of claim 5, wherein the first simulated image predicted action output and the first predicted real image predicted action output each define a corresponding first set of values for controlling a first robotic component; and wherein the second simulated image predicted action output and the second predicted real image predicted action output each define a corresponding second set of values for controlling a second robotic component.
8. The method of claim 7, wherein the first robotic component is one of a robot arm, a robot end effector, a robot base, or a robot head; and wherein the second robotic component is another one of the robot arm, the robot end effector, the robot base, or the robot head.
9. The method of claim 1, further comprising: distorting the simulated image, using one or more distortion techniques, to generate a distorted simulated image; pairing the distorted simulated image with the predicted real image; processing the distorted simulated image, using the action machine learning model, to generate one or more distorted simulated image predicted action outputs, wherein processing the distorted simulated image comprises: generating a distorted simulated image embedding by processing the distorted simulated image using the vision feature layers; and processing the distorted simulated image embedding using the additional layers to generate the distorted simulated image predicted action outputs; in response to the pairing of the distorted simulated image with the predicted real image: generating an additional embedding consistency loss as a function of comparison of the distorted simulated image embedding and the predicted real embedding; and updating the vision feature layers based on the generated additional embedding consistency loss.
10. The method of claim 1, further comprising: distorting the simulated image, using one or more distortion techniques, to generate a distorted simulated image; pairing the distorted simulated image with the simulated image; processing the distorted simulated image, using the action machine learning model, to generate one or more distorted simulated image predicted action outputs, wherein processing the distorted simulated image comprises: generating a distorted simulated image embedding by processing the distorted simulated image using the vision feature layers; and processing the distorted simulated image embedding using the additional layers to generate the distorted simulated image predicted action outputs; in response to the pairing of the distorted simulated image with the simulated image: generating an additional embedding consistency loss as a function of comparison of the distorted simulated image embedding and the simulated image embedding; and updating the vision feature layers based on the generated additional embedding consistency loss.
11. The method of claim 1, wherein generating the predicted real image comprises: processing the simulated image using the simulation-to-real generator model to generate, as direct output from the simulation-to-real generator model, an original predicted real image; and distorting the original predicted real image, using one or more distortion techniques, to generate the predicted
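The selective update recited in claim 2 (backpropagating the embedding consistency loss across the vision feature layers but not the additional layers) can be sketched as below. This is a toy illustration under stated assumptions, not the claimed implementation: images are flattened to short float vectors, the vision feature layers are a single linear map `vision_w`, the additional layers are a single linear head `head_w`, the comparison is a mean squared difference, and the gradient is written out by hand. All of these names and choices are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matvec(w: Matrix, x: List[float]) -> List[float]:
    """Apply a linear layer (no bias) to an input vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def embedding_loss(w: Matrix, x_sim: List[float], x_real: List[float]) -> float:
    """Mean squared difference between the two embeddings (assumed loss)."""
    e_sim, e_real = matvec(w, x_sim), matvec(w, x_real)
    return sum((a - b) ** 2 for a, b in zip(e_sim, e_real)) / len(e_sim)


def update_vision_layers(
    w: Matrix, x_sim: List[float], x_real: List[float], lr: float
) -> Matrix:
    """One gradient-descent step on the vision weights only.

    For L = (1/n) * sum_i (e_sim[i] - e_real[i])**2 with e = W x,
    dL/dW[i][j] = (2/n) * (e_sim[i] - e_real[i]) * (x_sim[j] - x_real[j]).
    """
    e_sim, e_real = matvec(w, x_sim), matvec(w, x_real)
    n = len(e_sim)
    diff = [a - b for a, b in zip(e_sim, e_real)]
    delta = [a - b for a, b in zip(x_sim, x_real)]
    return [
        [wij - lr * (2.0 / n) * diff[i] * delta[j] for j, wij in enumerate(row)]
        for i, row in enumerate(w)
    ]


vision_w = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for the vision feature layers
head_w = [[0.5, -0.5]]               # stand-in for the additional layers
x_sim = [1.0, 2.0]                   # flattened simulated image (toy values)
x_pred_real = [0.5, 2.5]             # flattened predicted real image (toy values)

loss_before = embedding_loss(vision_w, x_sim, x_pred_real)
new_vision_w = update_vision_layers(vision_w, x_sim, x_pred_real, lr=0.1)
loss_after = embedding_loss(new_vision_w, x_sim, x_pred_real)

# head_w is never passed to the update, so the additional layers are unchanged.
print(loss_before, loss_after, head_w)
```

Because the embedding is produced before the additional layers, the loss has no dependence on `head_w`, so leaving the head out of the update is the natural way to express the restriction in this sketch. In an autodiff framework the same effect would usually come from limiting which parameters the optimizer receives.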
using two or more images, e.g. averaging or subtraction · CPC title
Training; Learning · CPC title
using machine learning, e.g. neural networks · CPC title
exterior to a vehicle by using sensors mounted on the vehicle · CPC title
Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title