Deep reinforcement learning for robotic manipulation

US11897133B2 · US · B2

Patent metadata
Field · Value
Publication number · US-11897133-B2
Application number · US-202217878186-A
Country · US
Kind code · B2
Filing date · Aug 1, 2022
Priority date · Sep 15, 2016
Publication date · Feb 13, 2024
Grant date · Feb 13, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Implementations utilize deep reinforcement learning to train a policy neural network that parameterizes a policy for determining a robotic action based on a current state. Some of those implementations collect experience data from multiple robots that operate simultaneously. Each robot generates instances of experience data during iterative performance of episodes that are each explorations of performing a task, and that are each guided based on the policy network and the current policy parameters for the policy network during the episode. The collected experience data is generated during the episodes and is used to train the policy network by iteratively updating policy parameters of the policy network based on a batch of collected experience data. Further, prior to performance of each of a plurality of episodes performed by the robots, the current updated policy parameters can be provided (or retrieved) for utilization in performance of the episode.
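The loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the names `ParameterServer`, `Robot`, `train_step`, and `run` are hypothetical, the "policy" is a single scalar weight rather than a neural network, the task (drive a 1-D state toward zero) is invented for the example, and the robots are interleaved in one thread instead of operating simultaneously.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ParameterServer:
    """Holds the current policy parameters and a version counter (hypothetical)."""
    params: list
    version: int = 0

    def publish(self, new_params):
        self.params = list(new_params)
        self.version += 1

    def latest(self):
        return self.version, list(self.params)


@dataclass
class Robot:
    """Stand-in for a robot; its policy is a toy linear map, not a neural net."""
    name: str
    version: int = -1
    params: list = field(default_factory=list)

    def run_episode(self, server, buffer, steps, rng):
        # Pull the most recent parameters before the episode starts.
        self.version, self.params = server.latest()
        state = rng.uniform(-1.0, 1.0)
        for _ in range(steps):
            # Exploration: add Gaussian noise to the policy's action.
            action = self.params[0] * state + rng.gauss(0.0, 0.1)
            next_state = state + action
            reward = -abs(next_state)
            buffer.append((state, action, reward, next_state, self.version))
            state = next_state


def train_step(params, batch, lr=0.1):
    # Toy placeholder for a real RL update: one gradient step on the mean
    # squared next state under the assumed dynamics s' = s + w * s.
    w = params[0]
    grad = sum(2.0 * (1.0 + w) * s * s for s, _, _, _, _ in batch) / len(batch)
    return [w - lr * grad]


def run(num_robots=3, rounds=20, steps=5, batch_size=8, seed=0):
    rng = random.Random(seed)
    server = ParameterServer(params=[0.0])
    robots = [Robot(f"robot-{i}") for i in range(num_robots)]
    buffer = []  # experience pooled from all robots
    for _ in range(rounds):
        for robot in robots:
            robot.run_episode(server, buffer, steps, rng)
            # Train on a batch of pooled experience, then publish new parameters.
            batch = rng.sample(buffer, min(batch_size, len(buffer)))
            server.publish(train_step(server.params, batch))
    return server, robots, buffer


if __name__ == "__main__":
    server, robots, buffer = run()
    print("final version:", server.version, "instances:", len(buffer))
```

Each stored instance carries the parameter version that produced it, so the buffer mixes experience from several policy versions. That is why a real system along these lines would need a training method that tolerates off-policy data.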

First claim

Opening claim text (preview).

What is claimed is:

1. A method implemented by one or more processors, comprising: receiving a given instance of robot experience data generated by a given robot of a plurality of robots, wherein the given instance of the robot experience data is generated during a given episode of explorations of performing a task based on a given version of policy parameters of a policy network utilized by the given robot in generating the given instance; receiving additional instances of robot experience data from additional robots of the plurality of robots, the additional instances generated during episodes, by the additional robots, of explorations of performing the task based on the policy network; while the given robot and the additional robots continue the episodes of explorations of performing the task, generating a new version of the policy parameters of the policy network based on training of the policy network based at least in part on the given instance and the additional instances; providing the new version of the policy parameters to the given robot for performing of an immediately subsequent episode of explorations of performing the task by the given robot based on the new version of the policy parameters.

2. The method of claim 1, wherein receiving the given instance occurs in one iteration of a plurality of experience data iterations of receiving instances of experience data from the given robot, the plurality of experience data iterations occurring at a first frequency.

3. The method of claim 2, wherein training the policy network to generate the new version of the policy parameters comprises performing a plurality of training iterations including: a first training iteration of training of the policy network based at least in part on the given instance and the additional instances; and one or more additional training iterations of training of the policy network based on yet further instances of experience data from a plurality of the robots; wherein the training iterations occur at a second frequency that is a greater frequency than the first frequency of the experience data iterations.

4. The method of claim 1, further comprising: performing, by the given robot, the immediately subsequent episode of explorations of performing the task by the given robot based on the new version of the policy parameters.

5. The method of claim 4, further comprising: performing, at the additional robots, said episodes of task explorations on which the additional instances of experience data are based.

6. A system, comprising: memory storing instructions; one or more processors operable to execute the instructions, stored in the memory to: receive a given instance of robot experience data generated by a given robot of a plurality of robots, wherein the given instance of the robot experience data is generated during a given episode of explorations of performing a task based on a given version of policy parameters of a policy network utilized by the given robot in generating the given instance; receive additional instances of robot experience data from additional robots of the plurality of robots, the additional instances generated during episodes, by the additional robots, of explorations of performing the task based on the policy network; while the given robot and the additional robots continue the episodes of explorations of performing the task, generate a new version of the policy parameters of the policy network based on training of the policy network based at least in part on the given instance and the additional instances; provide the new version of the policy parameters to the given robot for performing of an immediately subsequent episode of explorations of performing the task by the given robot based on the new version of the policy parameters.

7. The system of claim 6, wherein receiving the given instance occurs in one iteration of a plurality of experience data iterations of receiving instances of experience data from the given robot, the plurality of experience data iterations occurring at a first frequency.

8. The system of claim 7, wherein in training the policy network to generate the new version of the policy parameters one or more of the processors are to perform a plurality of training iterations including: a first training iteration of training of the policy network based at least in part on the given instance and the additional instances; and one or more additional training iterations of training of the policy network based on yet further instances of experience data from a plurality of the robots; wherein the training iterations occur at a second frequency that is a greater frequency than the first frequency of the experience data iterations.

9. The system of claim 6, wherein the system further comprises the given robot.

10. The system of claim 9, wherein the given robot comprises one or more robot processors to: perform, based on the new version of the policy parameters, the immediately subsequent episode of explorations of performing the task by the given robot.

11. The system of claim 10, wherein the system further comprises the additional robots.

12. The system of claim 11, where the additional robots comprise one or more additional robot processors to: perform, at the additional robots, said episodes of task explorations on which the additional instances of experience data are based.

13. A method implemented by one or more processors, comprising: during performance of a plurality of episodes by each of a plurality of agents, each of the episodes including performing a task based on a policy neural network representing a reinforcement learning policy for the task: storing, in a buffer, instances of experience data generated during the episodes by the plurality of agents, each of the instances of the experience data being generated during a corresponding one of the episodes, and being generated at least in part on corresponding output generated using the policy neural network with corresponding policy parameters for the policy neural network for the corresponding episode; iteratively generating updated policy parameters of the policy neural network, wherein each of the iterations of the iteratively generating comprises generating the updated policy parameters using a group of one or more of the instances of the experience data in the buffer during the iteration; and by each of the agents in conjunction with a start of each of a plurality of the episodes performed by the agents, updating the policy neural network to be used by the agents in the episode, wherein updating the policy neural network comprises using the updated policy parameters of a most recent iteration of the iteratively generating the updated policy parameters.

14. The method of claim 13, wherein each of the updated policy parameters defines a corresponding value for a corresponding node of a corresponding layer of the policy neural network.

15. The method of claim 13, wherein the instances of the experience data for each of the agents are stored in the buffer at corresponding frequencies that are each lower than a second frequency of iteratively generating the updated policy parameters.

16. The method of claim 13, wherein storing, in the buffer, the instances of the experience data is performed by one more of the processors in a first thread and wherein the iteratively generating is performed by one or more of the processors in a second thread that is separate from the first thread.

17. The method of claim 16, wherein the first thread is performed by a first grou
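Claims 13, 15, and 16 describe storing experience in a buffer from one thread while a separate thread iteratively trains on it. A minimal sketch of that separation using Python's standard `threading` module follows. All names here (`ReplayBuffer`, `collect`, `train`, `run_threads`) are hypothetical, the experience instances are dummy dictionaries, and the training step is a placeholder counter rather than a real parameter update.

```python
import random
import threading
import time


class ReplayBuffer:
    """Thread-safe list of experience instances (hypothetical helper)."""

    def __init__(self):
        self._items = []
        self._lock = threading.Lock()

    def add(self, item):
        with self._lock:
            self._items.append(item)

    def sample(self, k, rng):
        with self._lock:
            if not self._items:
                return []
            return rng.sample(self._items, min(k, len(self._items)))

    def __len__(self):
        with self._lock:
            return len(self._items)


def collect(buffer, num_instances, period_s):
    # First thread: store experience instances at a fixed, slow period.
    for step in range(num_instances):
        buffer.add({"step": step, "reward": -float(step % 3)})
        time.sleep(period_s)


def train(buffer, stop, stats, batch_size=4, seed=0):
    # Second thread: run training iterations on whatever is buffered,
    # without waiting for new experience to arrive.
    rng = random.Random(seed)
    while True:
        batch = buffer.sample(batch_size, rng)
        if batch:
            # Placeholder for a real update of the policy parameters.
            stats["iterations"] += 1
            stats["last_batch_size"] = len(batch)
        if stop.is_set():
            break


def run_threads(num_instances=20, period_s=0.002):
    buffer = ReplayBuffer()
    stop = threading.Event()
    stats = {"iterations": 0, "last_batch_size": 0}
    collector = threading.Thread(target=collect, args=(buffer, num_instances, period_s))
    trainer = threading.Thread(target=train, args=(buffer, stop, stats))
    collector.start()
    trainer.start()
    collector.join()   # wait until all experience has been stored
    stop.set()         # then ask the trainer to finish
    trainer.join()
    return len(buffer), stats


if __name__ == "__main__":
    stored, stats = run_threads()
    print("stored instances:", stored, "training iterations:", stats["iterations"])
```

Because the trainer never sleeps while the collector pauses between instances, the training iterations will usually far outnumber the stored instances, which loosely mirrors the frequency relationship in claim 15. The sketch does not guarantee that ratio; it depends on thread scheduling.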

Assignees

Inventors

Classifications

  • Feedforward networks · CPC title

  • Reinforcement learning · CPC title

  • B25J9/161 · Primary

    Hardware, e.g. neural networks, fuzzy logic, interfaces, processor · CPC title

  • learning, adaptive, model based, rule based expert control · CPC title

  • characterised by motion, path, trajectory planning · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11897133B2 cover?
Implementations utilize deep reinforcement learning to train a policy neural network that parameterizes a policy for determining a robotic action based on a current state. Some of those implementations collect experience data from multiple robots that operate simultaneously. Each robot generates instances of experience data during iterative performance of episodes that are each explorations of …
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification B25J9/161. Mapped technology areas include Operations & Transport.
When was this patent published?
Publication date Feb 13, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).