Deep reinforcement learning for robotic manipulation

US2021237266A1 · US · A1

Patent metadata
Publication number: US-2021237266-A1
Application number: US-201917052679-A
Country: US
Kind code: A1
Filing date: Jun 14, 2019
Priority date: Jun 15, 2018
Publication date: Aug 5, 2021
Grant date: not listed

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Using large-scale reinforcement learning to train a policy model that can be utilized by a robot in performing a robotic task in which the robot interacts with one or more environmental objects. In various implementations, off-policy deep reinforcement learning is used to train the policy model, and the off-policy deep reinforcement learning is based on self-supervised data collection. The policy model can be a neural network model. Implementations of the reinforcement learning utilized in training the neural network model utilize a continuous-action variant of Q-learning. Through techniques disclosed herein, implementations can learn policies that generalize effectively to previously unseen objects, previously unseen environments, etc.
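In a continuous action space, the maximum over next actions in the Q-learning target cannot be enumerated, so it has to be approximated by scoring sampled candidates. The following is a minimal sketch of that idea, not the patent's implementation: `q_value` is a toy stand-in for the neural network, the discount factor `gamma` is the usual Q-learning assumption (the abstract does not mention one), and uniform random search is swapped in as the simplest possible stochastic optimizer. All names and constants are hypothetical.

```python
import numpy as np

def q_value(state, action, weights):
    """Toy stand-in for the Q-network: highest (zero) when action == weights @ state."""
    best_action = weights @ state
    return -float(np.sum((action - best_action) ** 2))

def bellman_target(reward, next_state, weights, rng,
                   gamma=0.9, num_candidates=256, done=False):
    """Target = reward + gamma * max_a' Q(next_state, a').

    The max over continuous actions is approximated by scoring
    uniformly sampled candidate actions (an illustrative choice).
    """
    if done:
        # No future value after a terminal transition.
        return reward
    candidates = rng.uniform(-1.0, 1.0, size=(num_candidates, weights.shape[0]))
    max_q = max(q_value(next_state, a, weights) for a in candidates)
    return reward + gamma * max_q

rng = np.random.default_rng(0)
weights = np.array([[0.5, 0.0], [0.0, -0.5]])  # hypothetical model parameters
state = np.array([0.2, 0.4])
action = np.array([0.1, -0.2])
next_state = np.array([0.3, 0.1])

target = bellman_target(1.0, next_state, weights, rng)
predicted = q_value(state, action, weights)
loss = (predicted - target) ** 2  # squared error between prediction and target
```

In a real system the loss would drive a gradient update of the network; here it is only computed to show how the predicted value and the target relate.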

First claim

Opening claim text (preview).

1. A method of training a neural network model that represents a Q-function, the method implemented by a plurality of processors, and the method comprising: retrieving a robotic transition, the robotic transition generated based on data from an episode of a robot performing a robotic task, and the robotic transition including: state data that comprises vision data captured by a vision component at a state of the robot during the episode, next state data that comprises next vision data captured by the vision component at a next state of the robot during the episode, the next state being transitioned to from the state, an action executed to transition from the state to the next state, and a reward for the robotic transition; determining a target Q-value for the robotic transition, wherein determining the target Q-value comprises: performing an optimization over candidate robotic actions using, as an objective function, a version of a neural network model that represents the Q-function, wherein performing the optimization comprises generating Q-values for a subset of the candidate robotic actions that are considered in the optimization, wherein generating each of the Q-values is based on processing of the next state data and a corresponding one of the candidate robotic actions of the subset using the version of the neural network model, selecting, from the generated Q-values, a maximum Q-value, and determining the target Q-value based on the maximum Q-value and the reward; storing, in a training buffer: the state data, the action, and the target Q-value; retrieving, from the training buffer: the state data, the action, and the target Q-value; generating a predicted Q-value, wherein generating the predicted Q-value comprises processing the retrieved state data and the retrieved action using a current version of the neural network model, wherein the current version of the neural network model is updated relative to the version; generating a loss based on the predicted Q-value and the target Q-value; and updating the current version of the neural network model based on the loss.

2. The method of claim 1, wherein the robotic transition is generated based on offline data and is retrieved from an offline buffer.

3. The method of claim 2, wherein retrieving the robotic transition from the offline buffer is based on a dynamic offline sampling rate for sampling from the offline buffer, wherein the dynamic offline sampling rate decreases as a duration of training the neural network model increases.

4. The method of claim 3, further comprising generating the robotic transition by accessing an offline database that stores offline episodes.

5. The method of claim 1, wherein the robotic transition is generated based on online data and is retrieved from an online buffer, wherein the online data is generated by a robot performing episodes of the robotic task using a robot version of the neural network model.

6. The method of claim 5, wherein retrieving the robotic transition from the online buffer is based on a dynamic online sampling rate for sampling from the online buffer, wherein the dynamic online sampling rate increases as a duration of training the neural network model increases.

7. The method of claim 5, further comprising updating the robot version of the neural network model based on the loss.

8. The method of claim 1, wherein the action comprises a pose change for a component of the robot, wherein the pose change defines a difference between a pose of the component at the state and a next pose of the component at the next state.

9. The method of claim 8, wherein the component is an end effector and the pose change defines a translation difference for the end effector and a rotation difference for the end effector.

10. The method of claim 9, wherein the end effector is a gripper and the robotic task is a grasping task.

11. The method of claim 8, wherein the action further comprises a termination command when the next state is a terminal state of the episode.

12. The method of claim 8, wherein the action further comprises a component action command that defines a dynamic state, of the component, in the next state of the episode the dynamic state being in addition to translation and rotation of the component.

13. The method of claim 12, wherein the component is a gripper and wherein the dynamic state dictated by the component action command indicates that the gripper is to be closed.

14. The method of claim 1, wherein the state data further comprises a current status of a component of the robot.

15. The method of claim 14, wherein the component of the robot is a gripper and the current status indicates whether the gripper is opened or closed.

16. The method of claim 1, wherein the optimization is a stochastic optimization or is a cross-entropy method (CEM).

17. The method of claim 1, wherein performing the optimization over the candidate robotic actions comprises: selecting an initial batch of the candidate robotic actions; generating a corresponding one of the Q-values for each of the candidate robotic actions in the initial batch; selecting an initial subset of the candidate robotic actions in the initial batch based on the Q-values for the candidate robotic actions in the initial batch; fitting a Gaussian distribution to the selected initial subset of the candidate robotic actions; selecting a next batch of the candidate robotic actions based on the Gaussian distribution; and generating a corresponding one of the Q-values for each of the candidate robotic actions in the next batch.

18. The method of claim 17, wherein the maximum Q-value is one of the Q-values of the candidate robotic actions in the next batch and wherein selecting the maximum Q-value is based on the maximum Q-value being the maximum Q-value of the corresponding Q-values of the next batch.

19. A method of training a neural network model that represents a policy, the method implemented by a plurality of processors, and the method comprising: retrieving a robotic transition, the robotic transition generated based on data from an episode of a robot performing a robotic task, and the robotic transition including state data and an action; determining a target value for the robotic transition, wherein determining the target value comprises: performing an optimization over candidate robotic actions using, as an objective function, a version of a neural network model that represents the policy; storing, in a training buffer: the state data, the action, and the target value; retrieving, from the training buffer: the state data, the action data, and the target value; generating a predicted value, wherein generating the predicted value comprises processing the retrieved state data and the retrieved action data using a current version of the neural network model, wherein the current version of the neural network model is updated relative to the version; generating a loss based on the predicted value and the target value; and updating the current version of the neural network model based on the loss.

20. A method implemented by one or more processors of a robot during performance of a robotic task, the method comprising: receiving current state data for the robot, the current state data comprising current sensor data of the robot; selecting a robotic action to be performed for the robotic task, wherein selecting the robotic action comprises: performing an optimization over
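The batch-and-refit optimization recited in claims 17 and 18 can be sketched roughly as follows. This is an illustrative reading of the claim language, not the patent's implementation: `toy_q` is a hypothetical stand-in for the Q-network, the initial batch is assumed to be drawn from a standard normal distribution, the Gaussian is assumed to have a diagonal covariance, and the batch size, elite count, and number of refits are arbitrary placeholder values.

```python
import numpy as np

def toy_q(next_state, action):
    """Hypothetical stand-in for the Q-network; peaks where action == next_state."""
    return -float(np.sum((action - next_state) ** 2))

def cem_max_q(q_fn, next_state, action_dim, rng,
              batch_size=64, elite_count=6, extra_batches=1):
    """Approximate max_a Q(next_state, a) with a cross-entropy-method loop."""
    # Select an initial batch of candidate actions (assumed: standard normal).
    batch = rng.normal(0.0, 1.0, size=(batch_size, action_dim))
    # Generate a Q-value for each candidate in the initial batch.
    q_values = np.array([q_fn(next_state, a) for a in batch])
    for _ in range(extra_batches):
        # Select the subset of candidates with the highest Q-values.
        elites = batch[np.argsort(q_values)[-elite_count:]]
        # Fit a Gaussian (diagonal covariance, an assumption) to that subset.
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6  # keep the spread strictly positive
        # Select the next batch from the fitted Gaussian and score it.
        batch = rng.normal(mean, std, size=(batch_size, action_dim))
        q_values = np.array([q_fn(next_state, a) for a in batch])
    # As in claim 18, the maximum is taken over the final batch only.
    best = int(np.argmax(q_values))
    return batch[best], float(q_values[best])

rng = np.random.default_rng(0)
next_state = np.array([0.3, -0.2])
best_action, max_q = cem_max_q(toy_q, next_state, action_dim=2, rng=rng)
```

Per claim 1, the returned maximum Q-value would then be combined with the transition's reward to form the target Q-value. The same loop could in principle serve action selection at run time, with the current state in place of the next state.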

Assignees

Inventors

Classifications

  • Weakly supervised learning, e.g. semi-supervised or self-supervised learning

  • Reinforcement learning

  • Distributed learning, e.g. federated learning

  • Convolutional networks [CNN, ConvNet]

  • Machine learning

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2021237266A1 cover?
Using large-scale reinforcement learning to train a policy model that can be utilized by a robot in performing a robotic task in which the robot interacts with one or more environmental objects. In various implementations, off-policy deep reinforcement learning is used to train the policy model, and the off-policy deep reinforcement learning is based on self-supervised data collection. The poli…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification B25J9/163. Mapped technology areas include Operations & Transport.
When was this patent published?
Published Aug 5, 2021 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 10 related publications on this page (citations in our corpus or others sharing the same primary CPC).