Deep reinforcement learning for robotic manipulation

US11400587B2 · US · B2

Patent metadata
Field                Value
Publication number   US-11400587-B2
Application number   US-201716333482-A
Country              US
Kind code            B2
Filing date          Sep 14, 2017
Priority date        Sep 15, 2016
Publication date     Aug 2, 2022
Grant date           Aug 2, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection. Read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Implementations utilize deep reinforcement learning to train a policy neural network that parameterizes a policy for determining a robotic action based on a current state. Some of those implementations collect experience data from multiple robots that operate simultaneously. Each robot generates instances of experience data during iterative performance of episodes that are each explorations of performing a task, and that are each guided based on the policy network and the current policy parameters for the policy network during the episode. The collected experience data is generated during the episodes and is used to train the policy network by iteratively updating policy parameters of the policy network based on a batch of collected experience data. Further, prior to performance of each of a plurality of episodes performed by the robots, the current updated policy parameters can be provided (or retrieved) for utilization in performance of the episode.
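
The collection-and-training loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the class and function names (`ParameterServer`, `ReplayBuffer`, `robot_worker`, `trainer`, `run`), the one-dimensional toy task, and the linear policy are all assumptions made for the example. The parameter update is a placeholder gradient step through known toy dynamics; it stands in for whatever off-policy learning rule is actually used.

```python
import random
import threading
import time
from collections import deque


class ParameterServer:
    """Thread-safe holder for the latest policy parameter (hypothetical helper)."""

    def __init__(self, w):
        self._w, self._version = w, 0
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            return self._w, self._version

    def update(self, w):
        with self._lock:
            self._w, self._version = w, self._version + 1


class ReplayBuffer:
    """Thread-safe buffer of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity=100_000):
        self._data = deque(maxlen=capacity)
        self._lock = threading.Lock()

    def add(self, transition):
        with self._lock:
            self._data.append(transition)

    def sample(self, n):
        with self._lock:
            return random.sample(list(self._data), n) if len(self._data) >= n else []

    def __len__(self):
        with self._lock:
            return len(self._data)


def robot_worker(server, buffer, episodes, steps, versions_used):
    """One simulated robot: syncs parameters at each episode start, then explores."""
    rng = random.Random()
    for _ in range(episodes):
        w, version = server.get()  # fetch the most recent parameters before the episode
        versions_used.append(version)
        state = rng.uniform(-1.0, 1.0)
        for _ in range(steps):
            action = w * state + rng.gauss(0.0, 0.1)  # policy output plus exploration noise
            next_state = state + action               # assumed toy dynamics
            reward = -next_state ** 2                 # assumed toy reward
            buffer.add((state, action, reward, next_state))
            state = next_state
            time.sleep(0.001)  # robots act slowly relative to training


def trainer(server, buffer, stop, batch_size=8, lr=0.05):
    """Training thread: repeatedly samples a batch and updates the parameter."""
    while not stop.is_set():
        batch = buffer.sample(batch_size)
        if batch:
            w, _ = server.get()
            # Placeholder update (assumption): gradient ascent on the toy reward.
            grad = sum(-2.0 * (s + w * s) * s for s, _, _, _ in batch) / len(batch)
            server.update(w + lr * grad)
        time.sleep(0.0001)  # yield so robot threads are not starved


def run(num_robots=3, episodes=5, steps=10):
    server, buffer, stop = ParameterServer(0.0), ReplayBuffer(), threading.Event()
    versions = [[] for _ in range(num_robots)]
    train_thread = threading.Thread(target=trainer, args=(server, buffer, stop))
    workers = [
        threading.Thread(target=robot_worker, args=(server, buffer, episodes, steps, v))
        for v in versions
    ]
    train_thread.start()
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    stop.set()
    train_thread.join()
    return server, buffer, versions
```

Calling `run()` returns the parameter holder, the filled buffer, and, per robot, the parameter version it picked up at the start of each episode. Keeping collection and training in separate threads lets training proceed at its own pace while each robot only synchronizes parameters between episodes.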

First claim

Opening claim text (preview).

What is claimed is:

1. A method implemented by one or more processors, comprising: during performance of a plurality of episodes by each of a plurality of robots, each of the episodes being an exploration of performing a task based on a policy neural network representing a reinforcement learning policy for the task: storing, in a buffer, instances of robot experience data generated during the episodes by the robots, each of the instances of the robot experience data being generated during a corresponding one of the episodes, and being generated at least in part on corresponding output generated using the policy neural network with corresponding policy parameters for the policy neural network for the corresponding episode, wherein the instances of the experience data for a given robot of the plurality of robots are stored in the buffer at a first frequency; iteratively generating updated policy parameters of the policy neural network at a second frequency greater than the first frequency, wherein each of multiple iterations of the iteratively generating comprises generating the updated policy parameters using a group of one or more of the instances of the robot experience data in the buffer during the iteration; and by each of the robots in conjunction with a start of each of a plurality of the episodes performed by the robot, updating the policy neural network to be used by the robot in the episode, wherein updating the policy neural network comprises using the updated policy parameters that are of a most recent iteration of the iteratively generating the updated policy parameters.

2. The method of claim 1, wherein each of the updated policy parameters defines a corresponding value for a corresponding node of a corresponding layer of the policy neural network.

3. The method of claim 1, wherein the instances of the robot experience data for each of the robots are stored in the buffer at corresponding frequencies that are each lower than the second frequency.

4. The method of claim 1, wherein storing, in the buffer, the instances of the robot experience data is performed by one more of the processors in a first thread and wherein the iteratively generating is performed by one or more of the processors in a second thread that is separate from the first thread.

5. The method of claim 4, wherein the first thread is performed by a first group of one or more of the processors and the second thread is performed by a second group of one or more of the processors, the second group being non-overlapping with the first group.

6. The method of claim 1, wherein each of the iterations of the iteratively generating comprise generating the updated policy parameters based on minimizing a loss function in view of a group of one or more of the instances of the robot experience data in the buffer during the generating iteration.

7. The method of claim 1, wherein each of the iterations of the iteratively generating comprises off-policy learning in view of a group of one or more of the instances of the robot experience data in the buffer during the generating iteration.

8. The method of claim 7, wherein the off-policy learning is Q-learning.

9. The method of claim 8, wherein the Q-learning utilizes a normalized advantage function (NAF) algorithm or a deep deterministic policy gradient (DDPG) algorithm.

10. The method of claim 1, wherein each of the instances of the experience data indicates a corresponding: beginning robot state, subsequent robot state transitioned to from the beginning robot state, action executed to transition from the beginning robot state to the subsequent robot state, and reward for the action; wherein the action executed to transition from the beginning robot state to the subsequent robot state is generated based on processing of the beginning robot state using the policy neural network with the updated policy parameters for the corresponding episode, and wherein the reward for the action is generated based on a reward function for the reinforcement learning policy.

11. The method of claim 1, further comprising: based on one or more criteria, ceasing the performance of the plurality of episodes and ceasing the iteratively generating; providing, for use by one or more additional robots, the policy neural network with a most recently generated version of the updated policy parameters.

12. A method comprising: by one or more processors of a given robot of a plurality of robots: performing a given episode of explorations of performing a task based on a policy network having a first group of policy parameters; providing, in one iteration of a plurality of experience data iterations of providing experience data from the given robot, first instances of robot experience data generated based on the policy network during the given episode, wherein the plurality of experience data iterations occur at a first frequency; prior to performance, by the given robot, of a subsequent episode of performing the task based on the policy network: replacing one or more of the policy parameters of the first group with updated policy parameters, wherein the updated policy parameters are generated based on training of the policy network based on additional instances of robot experience data, generated by an additional robot during an additional robot episode of explorations of performing the task by the additional robot, wherein the performing the task by the additional robot is based on the policy network, and wherein the training of the policy network comprises a plurality of training iterations occurring at a second frequency that is greater than the first frequency, the plurality of training iterations including; a first training iteration of training of the policy network based at least in part on the first instances and the additional instances; and one or more additional training iterations of the policy network based on yet further instances of experience data from the plurality of the robots; wherein the subsequent episode immediately follows the given episode, and wherein performing the task based on the policy network in the subsequent episode comprises using the updated policy parameters in lieu of the replaced policy parameters.

13. The method of claim 12, further comprising: generating, by one or more additional processors and during the performance of the subsequent episode, further updated policy parameters, wherein generating the further updated policy parameters is based on one or more of the first instances of robot experience data generated during the given episode; and providing the further updated policy parameters for use by the additional robot in performance of a corresponding episode by the additional robot.

14. The method of claim 13, wherein the additional robot starts performance of the corresponding episode during performance of the subsequent episode by the given robot.

15. The method of claim 13, wherein the further updated policy parameters are not utilized by the given robot in performance of any episodes by the given robot.

16. The method of claim 13, further comprising: generating, by one or more of the additional processors, yet further updated policy parameters, wherein the yet further updated policy parameters are generated during the performance of the subsequent episode and are generated subsequent to generation of the further updated policy parameters; and providing the yet further updated policy parameters for use by the given robot in performance of a further subsequent episode, by the given robot, of performing the task based on the policy network;
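
The experience record of claim 10 and the loss-minimizing Q-learning step of claims 6 to 9 can be illustrated with a small sketch. The claims do not fix a parameterization, so everything here is an assumption made for illustration: the names `Transition`, `NafParams`, `value`, `q_value`, and `td_loss` are hypothetical, states and actions are scalars, and the Q-function uses the quadratic NAF-style form Q(s, a) = V(s) - 0.5 * p * (a - mu(s))^2, whose maximum over actions is V(s).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Transition:
    """One instance of experience data: (state, action, reward, next state)."""
    state: float
    action: float
    reward: float
    next_state: float


@dataclass(frozen=True)
class NafParams:
    """Toy scalar parameters (assumed forms, not taken from the patent)."""
    mu_gain: float  # greedy action: mu(s) = mu_gain * s
    v_gain: float   # state value:   V(s)  = v_gain * s**2
    p: float        # positive curvature of the advantage term


def value(params: NafParams, s: float) -> float:
    return params.v_gain * s * s


def q_value(params: NafParams, s: float, a: float) -> float:
    # Quadratic advantage: zero at a == mu(s), negative elsewhere when p > 0.
    mu = params.mu_gain * s
    return value(params, s) - 0.5 * params.p * (a - mu) ** 2


def td_loss(params: NafParams, batch: Sequence[Transition], gamma: float = 0.99) -> float:
    """Mean squared one-step temporal-difference error over a batch."""
    total = 0.0
    for t in batch:
        # max over a' of Q(s', a') equals V(s') for this quadratic form.
        target = t.reward + gamma * value(params, t.next_state)
        total += (q_value(params, t.state, t.action) - target) ** 2
    return total / len(batch)
```

A training iteration would adjust the parameters to reduce `td_loss` on a batch drawn from the buffer. Because the target depends only on stored transitions, the batch may come from episodes run under older parameters, which is what makes the update off-policy.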

Assignees

  • Google LLC

Inventors

Classifications

  • Combinations of networks

  • Feedforward networks

  • Reinforcement learning

  • characterised by motion, path, trajectory planning

  • using neural networks only

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11400587B2 cover?
Implementations utilize deep reinforcement learning to train a policy neural network that parameterizes a policy for determining a robotic action based on a current state. Some of those implementations collect experience data from multiple robots that operate simultaneously. Each robot generates instances of experience data during iterative performance of episodes that are each explorations of …
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G06N3/008. Mapped technology areas include Physics.
When was this patent published?
Published Aug 2, 2022 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).