Reinforcement learning using distributed prioritized replay

US12277497B2 · US · B2

Patent metadata
Publication number: US-12277497-B2
Application number: US-202318131753-A
Country: US
Kind code: B2
Filing date: Apr 6, 2023
Priority date: Oct 27, 2017
Publication date: Apr 15, 2025
Grant date: Apr 15, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for training an action selection neural network used to select actions to be performed by an agent interacting with an environment. One of the systems includes (i) a plurality of actor computing units, in which each of the actor computing units is configured to maintain a respective replica of the action selection neural network and to perform a plurality of actor operations, and (ii) one or more learner computing units, in which each of the one or more learner computing units is configured to perform a plurality of learner operations.
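The split between actor computing units and learner computing units described in the abstract can be illustrated with a small sketch. This is not the patented implementation: a lookup table stands in for the action selection neural network, `toy_env_step` is a made-up environment, the actors run one after another instead of in parallel, and every name here (`Actor`, `Learner`, `run`, `sync_every`) is hypothetical.

```python
import copy
import random

N_ACTIONS = 2
GAMMA = 0.9


def q_values(params, obs):
    # Stand-in for the action selection neural network: a lookup table.
    return [params.get((obs, a), 0.0) for a in range(N_ACTIONS)]


def toy_env_step(obs, action):
    # Hypothetical environment: four states in a cycle, action 1 pays reward 1.
    return (obs + 1) % 4, float(action == 1)


def td_error(params, obs, action, reward, next_obs):
    # One-step Q-learning error, used here as the "learning error".
    target = reward + GAMMA * max(q_values(params, next_obs))
    return target - q_values(params, obs)[action]


class Actor:
    def __init__(self, params, epsilon, seed):
        self.params = copy.deepcopy(params)  # this actor's own replica
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.obs = 0

    def step(self, memory):
        q = q_values(self.params, self.obs)
        if self.rng.random() < self.epsilon:
            action = self.rng.randrange(N_ACTIONS)
        else:
            action = max(range(N_ACTIONS), key=q.__getitem__)
        next_obs, reward = toy_env_step(self.obs, action)
        experience = (self.obs, action, reward, next_obs)
        priority = abs(td_error(self.params, *experience))
        memory.append((experience, priority))
        self.obs = next_obs

    def sync(self, params):
        self.params = copy.deepcopy(params)


class Learner:
    def __init__(self, lr=0.5, seed=0):
        self.params = {}
        self.lr = lr
        self.rng = random.Random(seed)

    def step(self, memory, batch_size=4):
        # Sample in proportion to priority; the small constant keeps
        # zero-priority tuples reachable (an assumption of this sketch).
        weights = [priority + 1e-6 for _, priority in memory]
        batch = self.rng.choices(memory, weights=weights, k=batch_size)
        for experience, _ in batch:
            obs, action = experience[0], experience[1]
            error = td_error(self.params, *experience)
            old = self.params.get((obs, action), 0.0)
            self.params[(obs, action)] = old + self.lr * error


def run(num_actors=3, rounds=50, sync_every=5):
    memory = []
    learner = Learner()
    actors = [Actor(learner.params, epsilon=0.4 ** (i + 1), seed=i)
              for i in range(num_actors)]
    for r in range(rounds):
        for actor in actors:  # sequential here; parallel in the described system
            actor.step(memory)
        learner.step(memory)
        if r % sync_every == 0:
            for actor in actors:
                actor.sync(learner.params)
    return learner, actors, memory
```

Calling `run()` returns the learner, the actors, and the shared memory. Giving each actor a different `epsilon` and refreshing the replicas only every few rounds are choices made for this sketch; the abstract itself only states that actors hold replicas and that learners perform learner operations.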

First claim

Opening claim text (preview).

What is claimed is:

1. A system for training an action selection neural network having a plurality of network parameters and used to select actions to be performed by an agent interacting with an environment, the system being implemented using one or more computers and comprising: a plurality of actor units, each of the actor units configured to maintain a respective replica of the action selection neural network and to perform actor operations in parallel with other actor units, the actor operations comprising: receiving an observation characterizing a current state of an instance of the environment, selecting an action to be performed by the agent using the action selection neural network replica and in accordance with current values of the network parameters, obtaining transition data characterizing the environment instance subsequent to the agent performing the selected action, generating a new experience tuple from the observation, the selected action, and the transition data, determining an initial priority for the new experience tuple, comprising: determining a learning error for the new experience tuple according to a reinforcement learning technique, and determining the initial priority from the learning error; and storing the new experience tuple and the initial priority that is determined for the new experience tuple based on the learning error.

2. The system of claim 1, wherein the new experience tuple and the initial priority are stored in a shared memory.

3. The system of claim 2, further comprising one or more learner computing units, wherein each of the one or more learner computing units is configured to perform learner operations comprising: sampling a batch of experience tuples from the shared memory based on the priorities for the experience tuples in the shared memory; and determining, using the sampled experience tuples, an update to the network parameters using the reinforcement learning technique.

4. The system of claim 3, wherein the learner operations further comprise: determining for each sampled experience tuple a respective updated priority; and updating the shared memory to associate the updated priorities with the sampled experience tuples.

5. The system of claim 3, wherein the learner operations further comprise: determining whether criteria for removing any experience tuples from the shared memory are satisfied; and when the criteria are satisfied, updating the shared memory to remove one or more of the tuples.

6. The system of claim 3, wherein the learner operations further comprise: determining whether criteria for updating the actor units are satisfied; and when the criteria are satisfied, transmitting updated parameter values to the actor units.

7. The system of claim 1, wherein the initial priority is an absolute value of the learning error.

8. The system of claim 1, wherein two or more of the actor units select actions using different exploration policies.

9. The system of claim 8, wherein the different exploration policies are epsilon-greedy policies with different values of epsilon.

10. The system of claim 1, wherein the reinforcement learning technique is an n-step Q learning technique or an actor-critic technique.

11. The system of claim 1, wherein obtaining transition data characterizing the environment instance subsequent to the agent performing the selected action comprises: selecting additional actions to be performed by the agent in response to subsequent observations using the action selection neural network replica to generate an n-step transition.

12. One or more non-transitory computer readable storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations for training an action selection neural network having a plurality of network parameters and used to select actions to be performed by an agent interacting with an environment, the operations comprising: maintaining a plurality of actor units, each of the actor units configured to maintain a respective replica of the action selection neural network and to perform actor operations in parallel with other actor units; and for each of the plurality of actor units, performing actor operations using the actor unit, the actor operations comprising: receiving an observation characterizing a current state of an instance of the environment, selecting an action to be performed by the agent using the action selection neural network replica and in accordance with current values of the network parameters, obtaining transition data characterizing the environment instance subsequent to the agent performing the selected action, generating a new experience tuple from the observation, the selected action, and the transition data, determining an initial priority for the new experience tuple, comprising: determining a learning error for the new experience tuple according to a reinforcement learning technique, and determining the initial priority from the learning error; and storing the new experience tuple and the initial priority that is determined for the new experience tuple based on the learning error.

13. A computer-implemented method for training an action selection neural network having a plurality of network parameters and used to select actions to be performed by an agent interacting with an environment, the method comprising: maintaining a plurality of actor units, each of the actor units configured to maintain a respective replica of the action selection neural network and to perform actor operations in parallel with other actor units; and for each of the plurality of actor units, performing actor operations using the actor unit, the actor operations comprising: receiving an observation characterizing a current state of an instance of the environment, selecting an action to be performed by the agent using the action selection neural network replica and in accordance with current values of the network parameters, obtaining transition data characterizing the environment instance subsequent to the agent performing the selected action, generating a new experience tuple from the observation, the selected action, and the transition data, determining an initial priority for the new experience tuple, comprising: determining a learning error for the new experience tuple according to a reinforcement learning technique, and determining the initial priority from the learning error; and storing the new experience tuple and the initial priority that is determined for the new experience tuple based on the learning error.

14. The method of claim 13, wherein the new experience tuple and the initial priority are stored in a shared memory.

15. The method of claim 14, further comprising: maintaining one or more learner computing units; and for each of the one or more learner computing units: sampling, using the learner computing unit, a batch of experience tuples from the shared memory based on the priorities for the experience tuples in the shared memory; and determining, using the sampled experience tuples, an update to the network parameters using the reinforcement learning technique.

16. The method of claim 15, wherein for each of the one or more learner computing units, the method further comprises: determining for each sampled experience tuple a respective updated priority; and updating, using the learner computing unit, the shared memory to associate the updated priorities with the sampled experience tuples.

17. The method of claim 15 , wherein for
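The priority handling recited in the claims (an initial priority taken from the learning error, an n-step transition, priority-based sampling, updated priorities, and removal of tuples) can be sketched with a few helper functions. This is a minimal illustration under stated assumptions, not the claimed implementation: the function names are hypothetical, sampling is taken to be directly proportional to priority, and removal is assumed to drop the oldest tuples once a capacity is exceeded, since the claim text only refers to unspecified "criteria".

```python
import random


def n_step_td_error(rewards, q_start, q_bootstrap, gamma):
    """Learning error for an n-step Q-learning transition.

    rewards     -- the n rewards collected along the transition
    q_start     -- Q(s_t, a_t) under the actor's replica
    q_bootstrap -- max over actions of Q(s_{t+n}, a)
    """
    n_step_return = sum(gamma ** k * r for k, r in enumerate(rewards))
    target = n_step_return + gamma ** len(rewards) * q_bootstrap
    return target - q_start


def initial_priority(td_error):
    # Claim 7: the initial priority is the absolute value of the learning error.
    return abs(td_error)


def sample_indices(priorities, batch_size, rng):
    # Assumption: sampling probability is proportional to priority; the small
    # constant keeps zero-priority tuples from becoming unreachable.
    weights = [p + 1e-6 for p in priorities]
    return rng.choices(range(len(priorities)), weights=weights, k=batch_size)


def update_priorities(priorities, indices, new_td_errors):
    # The learner writes back a fresh priority for each sampled tuple.
    for i, td in zip(indices, new_td_errors):
        priorities[i] = abs(td)


def evict_oldest(tuples, priorities, capacity):
    # Assumed removal criterion: drop the oldest entries above a fixed capacity.
    overflow = max(0, len(tuples) - capacity)
    del tuples[:overflow]
    del priorities[:overflow]
    return overflow
```

For example, with rewards `[1.0, 0.0, 2.0]`, `gamma=0.5`, `q_start=0.5`, and `q_bootstrap=1.0`, the n-step return is 1.5, the bootstrapped target is 1.625, and the resulting initial priority is 1.125.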

Assignees

Inventors

Classifications

  • Feedforward networks · CPC title

  • G06N3/092 · Primary

    Reinforcement learning · CPC title

  • Distributed learning, e.g. federated learning · CPC title

  • Architecture, e.g. interconnection topology · CPC title

  • Non-supervised learning, e.g. competitive learning · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12277497B2 cover?
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for training an action selection neural network used to select actions to be performed by an agent interacting with an environment. One of the systems includes (i) a plurality of actor computing units, in which each of the actor computing units is configured to maintain a respective replica of the ac…
Who is the assignee on this patent?
Deepmind Tech Ltd
What technology area does this patent fall under?
Primary CPC classification G06N3/092. Mapped technology areas include Physics.
When was this patent published?
Publication date: Apr 15, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).