What is claimed is:
1. A method of servoing an end effector of a robot, comprising:
determining a query image, the query image including a target object to be interacted with by an end effector of the robot;
at a first time step, generating an action prediction based on processing the query image, a scene image, and a previous action representation using a neural network model, wherein the scene image is captured by a vision component associated with the robot and captures the target object and the end effector of the robot, wherein the previous action representation is a previous action prediction of a previous time step, and wherein the neural network model includes one or more recurrent layers each including a plurality of memory units;
controlling the end effector of the robot based on the action prediction of the first time step;
at a second time step, generating an additional action prediction immediately subsequent to generating the action prediction of the first time step, the immediately subsequent action prediction generated based on processing the query image, an additional scene image, and the action prediction using the neural network model, wherein the additional scene image is captured by the vision component after controlling the end effector based on the action prediction of the first time step and captures the target object and the end effector; and
controlling the end effector of the robot based on the additional action prediction.
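By way of non-limiting illustration only, the closed-loop servoing recited in claim 1 can be sketched as follows. The names `servo`, `policy`, `capture_scene`, and `apply_action` are hypothetical and do not appear in the claims; the policy is assumed to be a callable that wraps the neural network model and carries its recurrent state between time steps, and the all-zero initial action is likewise an assumption.

```python
from typing import Any, Callable, List, Sequence, Tuple

Action = Tuple[float, float, float]   # assumed: 3-D end effector command
Image = Sequence[Sequence[float]]     # assumed: 2-D image as nested sequences

# Hypothetical policy signature:
# (query image, scene image, previous action, recurrent state) -> (action, new state)
Policy = Callable[[Image, Image, Action, Any], Tuple[Action, Any]]


def servo(
    policy: Policy,
    capture_scene: Callable[[], Image],
    apply_action: Callable[[Action], None],
    query_image: Image,
    num_steps: int,
    initial_action: Action = (0.0, 0.0, 0.0),
) -> List[Action]:
    """Run num_steps of servoing and return the actions that were applied."""
    previous_action = initial_action
    state = None  # recurrent state; None stands in for "no history yet"
    history: List[Action] = []
    for _ in range(num_steps):
        # A fresh scene image is captured after the previous action was applied.
        scene_image = capture_scene()
        action, state = policy(query_image, scene_image, previous_action, state)
        apply_action(action)
        history.append(action)
        # The action prediction of this time step is fed back at the next one.
        previous_action = action
    return history
```

In this sketch the query image is held fixed across time steps while the scene image and the previous action change at every iteration.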
2. The method of claim 1, wherein generating the action prediction of the first time step based on processing the query image, the scene image, and the previous action representation using the neural network model comprises:
processing the query image and the scene image using a plurality of visual layers of a visual portion of the neural network model to generate visual layers output;
processing the previous action representation using one or more action layers of an action portion of the neural network model to generate action output; and
combining the visual layers output and the action output and processing the combined visual layers output and action output using a plurality of policy layers of the neural network model, the policy layers including the one or more recurrent layers.
3. The method of claim 2, wherein the plurality of memory units of the one or more recurrent layers comprise long short-term memory units.
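By way of non-limiting illustration only, a single time step of a standard long short-term memory layer can be sketched as below. This is the textbook LSTM formulation, not necessarily the one used in any embodiment; the function names and the parameter keys (`W_i`, `b_i`, and so on) are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def affine(weights: List[Vector], bias: Vector, inputs: Vector) -> Vector:
    """Compute weights @ inputs + bias using plain lists."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def lstm_step(x: Vector, h_prev: Vector, c_prev: Vector,
              params: Dict[str, list]) -> Tuple[Vector, Vector]:
    """One LSTM time step; each weight matrix acts on the concatenation [x, h_prev]."""
    z = x + h_prev  # list concatenation
    i = [sigmoid(v) for v in affine(params["W_i"], params["b_i"], z)]    # input gate
    f = [sigmoid(v) for v in affine(params["W_f"], params["b_f"], z)]    # forget gate
    o = [sigmoid(v) for v in affine(params["W_o"], params["b_o"], z)]    # output gate
    g = [math.tanh(v) for v in affine(params["W_g"], params["b_g"], z)]  # candidate
    # The memory cell keeps part of its old value and adds gated new content.
    c = [fv * cp + iv * gv for fv, cp, iv, gv in zip(f, c_prev, i, g)]
    h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h, c
```

The cell state `c` is what allows such a layer to retain information from earlier time steps, such as earlier scene images and actions.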
4. The method of claim 2, wherein processing the query image and the scene image using the plurality of visual layers of the visual portion of the neural network model to generate visual layers output comprises:
processing the query image over a first convolutional neural network portion of the visual layers to generate a query image embedding;
processing the scene image over a second convolutional neural network portion of the visual layers to generate a scene image embedding; and
generating the visual layers output based on the query image embedding and the scene image embedding.
5. The method of claim 4, wherein generating the visual layers output based on the query image embedding and the scene image embedding comprises processing the query image embedding and the scene image embedding over one or more additional layers of the visual layers.
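By way of non-limiting illustration only, the two-branch arrangement of claims 4 and 5 can be sketched with toy stand-ins: each branch applies its own convolution kernels followed by global average pooling to form an embedding, and one additional fully connected layer processes the two embeddings together. All function names, the pooling choice, the concatenation of embeddings, and the single fusion layer are assumptions made for this sketch.

```python
from typing import List

Image = List[List[float]]
Vector = List[float]


def conv2d_valid(image: Image, kernel: Image) -> Image:
    """'Valid' 2-D cross-correlation, as commonly used in convolutional layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def embed(image: Image, kernels: List[Image]) -> Vector:
    """Toy convolutional portion: convolution, ReLU, then global average pooling."""
    embedding = []
    for kernel in kernels:
        feature_map = conv2d_valid(image, kernel)
        values = [max(0.0, v) for row in feature_map for v in row]
        embedding.append(sum(values) / len(values))
    return embedding


def dense_relu(weights: List[Vector], bias: Vector, inputs: Vector) -> Vector:
    return [max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, bias)]


def visual_layers_output(query_image: Image, scene_image: Image,
                         query_kernels: List[Image], scene_kernels: List[Image],
                         fusion_weights: List[Vector], fusion_bias: Vector) -> Vector:
    query_embedding = embed(query_image, query_kernels)   # first CNN portion
    scene_embedding = embed(scene_image, scene_kernels)   # second CNN portion
    # Additional layer(s) process the two embeddings together (here: concatenation).
    return dense_relu(fusion_weights, fusion_bias, query_embedding + scene_embedding)
```

Keeping separate kernels for the two branches reflects the separate first and second convolutional neural network portions; whether the portions share any weights is not specified here.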
6. The method of claim 1, wherein the action prediction of the first time step represents a velocity vector for displacement of the end effector in a robot frame of the robot.
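By way of non-limiting illustration only, a velocity vector expressed in the robot frame can be turned into a displacement by integrating it over one control period. The function name `integrate_velocity` and the fixed control period `dt` are assumptions made for this sketch.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def integrate_velocity(position: Vec3, velocity: Vec3, dt: float) -> Vec3:
    """Return the end effector position after applying `velocity` for `dt` seconds.

    Both vectors are assumed to be expressed in the same robot frame.
    """
    x, y, z = (p + v * dt for p, v in zip(position, velocity))
    return (x, y, z)
```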
7. The method of claim 1, wherein determining the query image is based on user interface input from a user.
8. The method of claim 7, wherein the user interface input is typed or spoken user interface input, and wherein determining the query image based on user interface input from the user comprises:
selecting the query image, from a plurality of stock images, based on data, associated with the selected query image, matching one or more terms determined based on the user interface input.
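By way of non-limiting illustration only, selecting a stock image whose associated data matches terms from the user interface input can be sketched as a label-overlap lookup. The function name, the mapping from image identifiers to label sets, and the "largest overlap wins" rule are all assumptions made for this sketch.

```python
from typing import Dict, Iterable, Optional, Set


def select_query_image(stock_images: Dict[str, Set[str]],
                       terms: Iterable[str]) -> Optional[str]:
    """Return the identifier of the stock image whose labels match the most terms.

    Returns None when no stock image matches any term.
    """
    wanted = {term.lower() for term in terms}
    best_id: Optional[str] = None
    best_overlap = 0
    for image_id, labels in stock_images.items():
        overlap = len(wanted & {label.lower() for label in labels})
        if overlap > best_overlap:
            best_id, best_overlap = image_id, overlap
    return best_id
```

For example, terms derived from a spoken request such as "grab the red ball" could match a stock image labeled with "red" and "ball".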
9. The method of claim 1, wherein determining the query image based on user interface input from the user comprises:
causing the scene image or a previous scene image to be presented to the user via a computing device;
wherein the user interface input is received via the computing device and indicates a subset of the presented scene image or previous scene image; and
generating the query image based on a crop of the scene image or the previous scene image, wherein the crop is determined based on the user interface input.
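By way of non-limiting illustration only, generating a query image from a user-indicated subset of a presented scene image can be sketched as a rectangular crop. The function name and the representation of the user interface input as two corner points in pixel coordinates are assumptions made for this sketch.

```python
from typing import List

Image = List[List[float]]


def crop_query_image(scene_image: Image, x0: int, y0: int, x1: int, y1: int) -> Image:
    """Crop the rectangle spanned by corners (x0, y0) and (x1, y1).

    Corners may be given in either order; the rectangle is clamped to the image
    bounds, with the right and bottom edges exclusive.
    """
    height, width = len(scene_image), len(scene_image[0])
    left, right = max(0, min(x0, x1)), min(width, max(x0, x1))
    top, bottom = max(0, min(y0, y1)), min(height, max(y0, y1))
    if left >= right or top >= bottom:
        raise ValueError("selected region is empty")
    return [row[left:right] for row in scene_image[top:bottom]]
```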
10. The method of claim 1, wherein the query image is generated based on an image captured by the vision component of the robot.
11. The method of claim 1, wherein the query image, the scene image, and the additional scene image are each two dimensional images.
12. A real robot comprising:
an end effector;
a vision component;
memory storing instructions and a neural network model;
one or more processors operable to execute the instructions to:
determine a query image, the query image including a target object to be interacted with by an end effector of the robot;
at a first time step, generate an action prediction based on processing the query image, a scene image, and a previous action representation using the neural network model, wherein the scene image is captured by the vision component and captures the target object and the end effector of the robot, wherein the previous action representation is a previous action prediction of a previous time step;
control the end effector of the robot based on the action prediction of the first time step;
at a second time step, generate an additional action prediction immediately subsequent to generating the action prediction of the first time step, the immediately subsequent action prediction generated based on processing the query image, an additional scene image, and the action prediction using the neural network model, wherein the additional scene image is captured by the vision component after controlling the end effector based on the action prediction of the first time step and captures the target object and the end effector; and
control the end effector of the robot based on the additional action prediction.
13. The real robot of claim 12, wherein in generating the action prediction of the first time step based on processing the query image, the scene image, and the previous action representation using the neural network model, one or more of the processors are to:
process the query image and the scene image using a plurality of visual layers of a visual portion of the neural network model to generate visual layers output;
process the previous action representation using one or more action layers of an action portion of the neural network model to generate action output; and
combine the visual layers output and the action output and process the combined visual layers output and action output using a plurality of policy layers of the neural network model, the policy layers including one or more recurrent layers.
14. The real robot of claim 13, wherein in processing the query image and the scene image using the plurality of visual layers of the visual portion of the neural network model to generate visual layers output, one or more of the processors are to:
process the query image over a first convolutional neural network portion of the visual layers to generate a query image embedding;
process the scene image over a second convolutional neural network portion of the visual layers to generate a scene image embedding; and
generate the visual layers output based on the query image embedding and the scene image embedding.
15. The real robot o