Viewpoint invariant visual servoing of robot end effector using recurrent neural network

US11701773B2 · US · B2

Patent metadata
Publication number: US-11701773-B2
Application number: US-201816622181-A
Country: US
Kind code: B2
Filing date: Dec 4, 2018
Priority date: Dec 5, 2017
Publication date: Jul 18, 2023
Grant date: Jul 18, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Training and/or using a recurrent neural network model for visual servoing of an end effector of a robot. In visual servoing, the model can be utilized to generate, at each of a plurality of time steps, an action prediction that represents a prediction of how the end effector should be moved to cause the end effector to move toward a target object. The model can be viewpoint invariant in that it can be utilized across a variety of robots having vision components at a variety of viewpoints and/or can be utilized for a single robot even when a viewpoint, of a vision component of the robot, is drastically altered. Moreover, the model can be trained based on a large quantity of simulated data that is based on simulator(s) performing simulated episode(s) in view of the model. One or more portions of the model can be further trained based on a relatively smaller quantity of real training data.
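The per-time-step loop the abstract describes can be sketched as follows. This is a minimal illustration, not the patented implementation: `servo_loop`, `ToyWorld`, and `toy_policy_step` are hypothetical names, the zero initial action is an assumption, and the toy policy stands in for the trained recurrent model by reading positions directly from a toy observation instead of processing pixels.

```python
from typing import Any, Callable, List, Tuple

Action = Tuple[float, float, float]


def servo_loop(
    policy_step: Callable[[Any, Any, Action, Any], Tuple[Action, Any]],
    query_image: Any,
    get_scene_image: Callable[[], Any],
    apply_action: Callable[[Action], None],
    num_steps: int,
) -> List[Action]:
    """Run the closed loop: observe, predict an action, move, repeat."""
    prev_action: Action = (0.0, 0.0, 0.0)  # assumption: zero action before step 1
    state = None  # recurrent memory carried across time steps
    history: List[Action] = []
    for _ in range(num_steps):
        scene_image = get_scene_image()  # new image after the previous move
        action, state = policy_step(query_image, scene_image, prev_action, state)
        apply_action(action)
        history.append(action)
        prev_action = action  # fed back as the previous action representation
    return history


class ToyWorld:
    """Hypothetical stand-in for a robot and camera (no real images)."""

    def __init__(self, target: Action) -> None:
        self.effector = [0.0, 0.0, 0.0]
        self.target = target

    def observe(self) -> dict:
        return {"effector": tuple(self.effector), "target": self.target}

    def move(self, action: Action) -> None:
        self.effector = [p + a for p, a in zip(self.effector, action)]


def toy_policy_step(query_image, scene_image, prev_action, state):
    """Hypothetical policy: step halfway toward the target each time."""
    steps_seen = 0 if state is None else state
    action = tuple(
        0.5 * (t - e)
        for t, e in zip(scene_image["target"], scene_image["effector"])
    )
    return action, steps_seen + 1


world = ToyWorld(target=(1.0, -0.5, 0.25))
actions = servo_loop(toy_policy_step, "query", world.observe, world.move, 10)
distance = sum((t - e) ** 2 for t, e in zip(world.target, world.effector)) ** 0.5
```

Passing the previous action and the recurrent state back into each call mirrors the abstract's description of a prediction generated at each of a plurality of time steps. A real system would replace the toy policy with the trained model and the toy world with the robot's vision component and controller.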

First claim

Opening claim text (preview).

What is claimed is:

1. A method of servoing an end effector of a robot, comprising: determining a query image, the query image including a target object to be interacted with by an end effector of the robot; at a first time step, generating an action prediction based on processing the query image, a scene image, and a previous action representation using a neural network model, wherein the scene image is captured by a vision component associated with the robot and captures the target object and the end effector of the robot, wherein the previous action representation is a previous action prediction of a previous time step, and wherein the neural network model includes one or more recurrent layers each including a plurality of memory units; controlling the end effector of the robot based on the action prediction of the first time step; at a second time step, generating an additional action prediction immediately subsequent to generating the action prediction of the first time step, the immediately subsequent action prediction generated based on processing the query image, an additional scene image, and the action prediction using the neural network model, wherein the additional scene image is captured by the vision component after controlling the end effector based on the action prediction of the first time step and captures the target object and the end effector; and controlling the end effector of the robot based on the additional action prediction.

2. The method of claim 1, wherein generating the action prediction of the first time step based on processing the query image, the scene image, and the previous action representation using the neural network model comprises: processing the query image and the scene image using a plurality of visual layers of a visual portion of the neural network model to generate visual layers output; processing the previous action representation using one or more action layers of an action portion of the neural network model to generate action output; and combining the visual layers output and the action output and processing the combined visual layers output and action output using a plurality of policy layers of the neural network model, the policy layers including the one or more recurrent layers.

3. The method of claim 2, wherein the plurality of memory units of the one or more recurrent layers comprise long short-term memory units.

4. The method of claim 2, wherein processing the query image and the scene image using the plurality of visual layers of the visual portion of the neural network model to generate visual layers output comprises: processing the query image over a first convolutional neural network portion of the visual layers to generate a query image embedding; processing the scene image over a second convolutional neural network portion of the visual layers to generate a scene image embedding; and generating the visual layers output based on the query image embedding and the scene image embedding.

5. The method of claim 4, wherein generating the visual layers output based on the query image embedding and the scene image embedding comprises processing the query image embedding and the scene image embedding over one or more additional layers of the visual layers.

6. The method of claim 1, wherein the action prediction of the first time step represents a velocity vector for displacement of the end effector in a robot frame of the robot.

7. The method of claim 1, wherein the determining the query image is based on user interface input from a user.

8. The method of claim 7, wherein the user interface input is typed or spoken user interface input, and wherein determining the query image based on user interface input from the user comprises: selecting the query image, from a plurality of stock images, based on data, associated with the selected query image, matching one or more terms determined based on the user interface input.

9. The method of claim 1, wherein determining the query image based on user interface input from the user comprises: causing the scene image or a previous scene image to be presented to the user via a computing device; wherein the user interface input is received via the computing device and indicates a subset of the presented scene image or previous scene image; and generating the query image based on a crop of the scene image or the previous scene image, wherein the crop is determined based on the user interface input.

10. The method of claim 1, wherein the query image is generated based on an image captured by the vision component of the robot.

11. The method of claim 1, wherein the query image, the scene image, and the additional scene image are each two dimensional images.

12. A real robot comprising: an end effector; a vision component; memory storing instructions and a neural network model; one or more processors operable to execute the instructions to: determine a query image, the query image including a target object to be interacted with by an end effector of the robot; at a first time step, generate an action prediction based on processing the query image, a scene image, and a previous action representation using the neural network model, wherein the scene image is captured by the vision component and captures the target object and the end effector of the robot, wherein the previous action representation is a previous action prediction of a previous time step; control the end effector of the robot based on the action prediction of the first time step; at a second time step, generate an additional action prediction immediately subsequent to generating the action prediction of the first time step, the immediately subsequent action prediction generated based on processing the query image, an additional scene image, and the action prediction using the neural network model, wherein the additional scene image is captured by the vision component after controlling the end effector based on the action prediction of the first time step and captures the target object and the end effector; and control the end effector of the robot based on the additional action prediction.

13. The real robot of claim 12, wherein in generating the action prediction of the first time step based on processing the query image, the scene image, and the previous action representation using the neural network mode, one or more of the processors are to: process the query image and the scene image using a plurality of visual layers of a visual portion of the neural network model to generate visual layers output; processing the previous action representation using one or more action layers of an action portion of the neural network model to generate action output; and combine the visual layers output and the action output and process the combined visual layers output and action output using a plurality of policy layers of the neural network model, the policy layers including one or more recurrent layers.

14. The real robot of claim 13, wherein in processing the query image and the scene image using the plurality of visual layers of the visual portion of the neural network model to generate visual layers output, one or more of the processors are to: process the query image over a first convolutional neural network portion of the visual layers to generate a query image embedding; process the scene image over a second convolutional neural network portion of the visual layers to generate a scene image embedding; and generate the visual layers output based on the query image embedding and the scene image embedding.

15. The real robot o
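The layer arrangement recited in claims 2 through 5 (two image towers, an action branch, and recurrent policy layers) can be sketched in plain Python as a single forward step. This is a hedged illustration only: `TinyServoNet` and its helpers are hypothetical, each "tower" is a single dense layer standing in for a convolutional network, images are flattened lists of floats, and the weights are random and untrained, so the outputs carry no meaning beyond showing the data flow.

```python
import math
import random


def init_linear(rng, n_out, n_in):
    """Random dense-layer weights and zero biases (untrained)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(weights, bias, x):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def tanh_layer(params, x):
    return [math.tanh(v) for v in linear(*params, x)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


class TinyServoNet:
    """Hypothetical toy model following the claimed data flow."""

    def __init__(self, num_pixels, hidden=4, seed=0):
        rng = random.Random(seed)
        self.hidden = hidden
        self.query_tower = init_linear(rng, hidden, num_pixels)  # stands in for first CNN portion
        self.scene_tower = init_linear(rng, hidden, num_pixels)  # stands in for second CNN portion
        self.visual_merge = init_linear(rng, hidden, 2 * hidden)  # additional visual layer
        self.action_layer = init_linear(rng, hidden, 3)  # action portion
        self.gates = init_linear(rng, 4 * hidden, 3 * hidden)  # LSTM cell gates
        self.head = init_linear(rng, 3, hidden)  # maps memory output to a 3-D action

    def step(self, query_image, scene_image, prev_action, state=None):
        n = self.hidden
        h, c = state if state is not None else ([0.0] * n, [0.0] * n)
        query_emb = tanh_layer(self.query_tower, query_image)
        scene_emb = tanh_layer(self.scene_tower, scene_image)
        visual_out = tanh_layer(self.visual_merge, query_emb + scene_emb)
        action_out = tanh_layer(self.action_layer, list(prev_action))
        # Policy layer: one LSTM cell over the combined visual and action output.
        z = linear(*self.gates, visual_out + action_out + h)
        i = [sigmoid(v) for v in z[0:n]]
        f = [sigmoid(v) for v in z[n:2 * n]]
        o = [sigmoid(v) for v in z[2 * n:3 * n]]
        g = [math.tanh(v) for v in z[3 * n:4 * n]]
        c = [fi * ci + ii * gi for fi, ci, ii, gi in zip(f, c, i, g)]
        h = [oi * math.tanh(ci) for oi, ci in zip(o, c)]
        return tuple(linear(*self.head, h)), (h, c)


net = TinyServoNet(num_pixels=4)
query = [0.1, 0.9, 0.3, 0.5]
scene = [0.2, 0.8, 0.4, 0.6]
action1, state = net.step(query, scene, (0.0, 0.0, 0.0))
action2, state = net.step(query, scene, action1, state)
```

Even with an identical scene image, the second call can return a different action because both the previous action and the LSTM memory `(h, c)` have changed. That dependence on history is what the recurrent policy layers contribute in this sketch.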

Assignees

Google LLC

Inventors

Classifications

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title

  • Supervised learning · CPC title

  • Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title

  • Convolutional networks [CNN, ConvNet] · CPC title

  • Reinforcement learning · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11701773B2 cover?
Training and/or using a recurrent neural network model for visual servoing of an end effector of a robot. In visual servoing, the model can be utilized to generate, at each of a plurality of time steps, an action prediction that represents a prediction of how the end effector should be moved to cause the end effector to move toward a target object. The model can be viewpoint invariant in that i…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification B25J9/163. Mapped technology areas include Operations & Transport.
When was this patent published?
Publication date Jul 18, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).