Methods and apparatus for reinforcement learning
US-2017278018-A1 · Sep 28, 2017 · US
US10445641B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10445641-B2 |
| Application number | US-201615016173-A |
| Country | US |
| Kind code | B2 |
| Filing date | Feb 4, 2016 |
| Priority date | Feb 6, 2015 |
| Publication date | Oct 15, 2019 |
| Grant date | Oct 15, 2019 |
Official abstract text for this publication.
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for distributed training of reinforcement learning systems. One of the methods includes receiving, by a learner, current values of the parameters of the Q network from a parameter server, wherein each learner maintains a respective learner Q network replica and a respective target Q network replica; updating, by the learner, the parameters of the learner Q network replica maintained by the learner using the current values; selecting, by the learner, an experience tuple from a respective replay memory; computing, by the learner, a gradient from the experience tuple using the learner Q network replica maintained by the learner and the target Q network replica maintained by the learner; and providing, by the learner, the computed gradient to the parameter server.
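The learner loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the Q network is replaced by a linear function per action, the loss is assumed to be the squared temporal-difference error, and the discount factor `GAMMA`, the learning rate, and the class names `ParameterServer` and `Learner` are all hypothetical choices that do not appear in the abstract.

```python
import random

GAMMA = 0.99  # discount factor (assumption; the abstract does not give one)


def q_value(params, obs, action):
    """Linear stand-in for the Q network: Q(obs, action) = params[action] . obs."""
    return sum(w * x for w, x in zip(params[action], obs))


class ParameterServer:
    """Toy in-process parameter server holding the shared Q network parameters."""

    def __init__(self, n_actions, obs_dim, lr=0.1):
        self.params = [[0.0] * obs_dim for _ in range(n_actions)]
        self.lr = lr  # learning rate (assumption)

    def get_params(self):
        # Return a copy so each replica owns its own values.
        return [row[:] for row in self.params]

    def apply_gradient(self, grad):
        # Plain gradient descent; the update rule is an assumption.
        for a, row in enumerate(grad):
            for i, g in enumerate(row):
                self.params[a][i] -= self.lr * g


class Learner:
    """Keeps a learner Q replica and a target Q replica, as in the abstract."""

    def __init__(self, server, replay_memory, n_actions):
        self.server = server
        self.replay = replay_memory
        self.n_actions = n_actions
        self.q = server.get_params()
        self.target_q = server.get_params()

    def step(self, rng=random):
        # Receive current parameter values and update the learner replica.
        self.q = self.server.get_params()
        # Select an experience tuple from the replay memory.
        obs, action, reward, next_obs = rng.choice(self.replay)
        # Largest target-replica output over all actions for the next observation.
        best_next = max(
            q_value(self.target_q, next_obs, a) for a in range(self.n_actions)
        )
        td_error = reward + GAMMA * best_next - q_value(self.q, obs, action)
        # Gradient of 0.5 * td_error**2 with respect to the learner parameters.
        grad = [[0.0] * len(obs) for _ in range(self.n_actions)]
        grad[action] = [-td_error * x for x in obs]
        # Provide the computed gradient to the parameter server.
        self.server.apply_gradient(grad)
        return grad
```

In this sketch the target replica is only initialized once; when and how it is refreshed is left out here.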
Opening claim text (preview).
What is claimed is:

1. A system for training a reinforcement learning system, the reinforcement learning system comprising an agent that interacts with an environment by receiving observations characterizing a current state of the environment and selecting an action to be performed from a predetermined set of actions, wherein the agent selects an action to be performed using a Q network, wherein the Q network is a deep neural network that is configured to receive as input an observation and an action and to generate a neural network output from the input in accordance with a set of parameters, wherein training the reinforcement learning system comprises adjusting the values of the set of parameters of the Q network, and wherein the system comprises: a plurality of computers configured to implement a plurality of learners, wherein each learner executes on a respective computing unit, wherein each learner is configured to operate independently of each other learner, wherein each learner maintains a respective learner Q network replica and a respective target Q network replica, and wherein each learner is further configured to repeatedly perform operations comprising: receiving, from a parameter server, current values of the parameters of the Q network; updating the parameters of the learner Q network replica maintained by the learner using the current values; selecting an experience tuple from a respective replay memory; computing a gradient from the experience tuple using the learner Q network replica maintained by the learner and the target Q network replica maintained by the learner; and providing the computed gradient to the parameter server.

2. The system of claim 1, wherein the one or more computers are further configured to implement one or more actors, wherein each actor executes on a respective computing unit, wherein each actor is configured to operate independently of each other actor, wherein each actor interacts with a respective replica of the environment, wherein each actor maintains a respective actor Q network replica, and wherein each actor is further configured to repeatedly perform operations comprising: receiving, from the parameter server, current values of the parameters of the Q network; updating the values of the parameters of the actor Q network replica maintained by the actor using the current values; receiving an observation characterizing a current state of the environment replica interacted with by the actor; selecting an action to be performed in response to the observation using the actor Q network replica maintained by the actor; receiving a reward in response to the action being performed and a next observation characterizing a next state of the environment replica interacted with by the actor; generating an experience tuple that comprises the current observation, the action selected, the reward, and the next observation; and storing the experience tuple in a respective replay memory.

3. The system of claim 2, further comprising: the parameter server, wherein the parameter server is configured to repeatedly perform operations comprising: receiving a succession of gradients from the plurality of learners; computing updates to the values of the parameters of the Q network using the gradients; updating the values of the parameters of the Q network using the computed updates; and providing the updated values of the parameters to the one or more actors and the plurality of learners.

4. The system of claim 3, wherein the parameter server comprises a plurality of parameter server shards, wherein each shard is configured to maintain values of a respective disjoint partition of the parameters of the Q network, and wherein each shard is configured to operate asynchronously with respect to every other shard.

5. The system of claim 3, wherein the operations that the parameter server is configured to perform further comprise: determining whether criteria are satisfied for updating the parameters of the target Q network replicas maintained by the learners; and when the criteria are satisfied, providing data to the learners indicating that the updated parameter values are to be used to update the parameters of the target Q network replicas.

6. The system of claim 5, wherein the operations that each of the learners is configured to perform further comprise: receiving data indicating that the updated parameter values are to be used to update the parameters of the target Q network replica maintained by the learner; and updating the parameters of the target Q network replica maintained by the learner using the updated parameter values.

7. The system of claim 2, wherein each of the learners is bundled with a respective one of the actors and a respective replay memory, wherein each bundle of an actor, a learner, and a replay memory is implemented on a respective computing unit, wherein each bundle is configured to operate independently from each other bundle, and wherein, for each bundle, the learner in the bundle selects from among experience tuples generated by the actor in the bundle.

8. The system of claim 7, wherein, for each bundle, the current values of the parameters of the actor Q network replica maintained by the actor in the bundle are synchronized with the current values of the parameters of the learner Q network replica maintained by the learner in the bundle.

9. The system of claim 2, wherein selecting an action to be performed in response to the observation using the actor Q network replica maintained by the actor comprises: determining an action from the predetermined set of actions that, when provided as input to the actor Q network replica maintained by the actor with the current observation, generates a largest actor Q network replica output.

10. The system of claim 9, wherein selecting an action to be performed in response to the observation using the actor Q network replica maintained by the actor further comprises: selecting a random action from the set of predetermined actions with probability ε and selecting the determined action with probability 1−ε.

11. The system of claim 1, wherein computing a gradient from the experience tuple using the learner Q network replica maintained by the learner and the target Q network replica maintained by the learner comprises: processing the action from experience tuple and the current observation from the experience tuple using the learner Q network replica maintained by the learner to determine a learner Q network replica output; determining a largest target Q network replica output that is generated by processing any of the actions in the predetermined set of actions with the next observation from the experience tuple using the target Q network replica maintained by the learner; and computing the gradient using the learner Q network replica output, the largest target Q network replica output, and the reward from the experience tuple.

12. A method for training a reinforcement learning system, the reinforcement learning system comprising an agent that interacts with an environment by receiving observations characterizing a current state of the environment and selecting an action to be performed from a predetermined set of actions, wherein the agent selects an action to be performed using a Q network, wherein the Q network is a deep neural network that is configured to receive as input an observation and an action and to generate a neural network output from the input in accordance with a set of parameters, wherein training the reinforcement learning system comprises adjusting the values of the set of parameters of the Q network, wherein the method comprises: rece
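The actor behavior recited in claims 2, 9, and 10 (pick the highest-valued action, but act randomly with probability ε, then store the resulting experience tuple) might look roughly like the sketch below. The function names `select_action` and `actor_step`, the callable `q_fn` standing in for the actor Q network replica, and the callable `env_step` standing in for the environment replica are all hypothetical and are not taken from the claims.

```python
import random


def select_action(q_fn, obs, actions, epsilon, rng):
    """Epsilon-greedy selection over a predetermined set of actions.

    q_fn(obs, action) is a stand-in for the actor Q network replica.
    """
    if rng.random() < epsilon:
        # Explore: random action with probability epsilon.
        return rng.choice(actions)
    # Exploit: the action with the largest Q output for this observation.
    return max(actions, key=lambda a: q_fn(obs, a))


def actor_step(q_fn, env_step, obs, actions, epsilon, replay_memory, rng):
    """One actor iteration: act, observe the outcome, store the experience tuple.

    env_step(obs, action) is a hypothetical environment replica that
    returns (reward, next_obs).
    """
    action = select_action(q_fn, obs, actions, epsilon, rng)
    reward, next_obs = env_step(obs, action)
    # Experience tuple: current observation, action, reward, next observation.
    replay_memory.append((obs, action, reward, next_obs))
    return next_obs


if __name__ == "__main__":
    rng = random.Random(0)
    memory = []
    # Toy Q function that prefers action 1, and a toy environment.
    q = lambda obs, a: 1.0 if a == 1 else 0.0
    env = lambda obs, a: (1.0 if a == 1 else 0.0, obs + 1)
    state = 0
    for _ in range(5):
        state = actor_step(q, env, state, [0, 1], 0.1, memory, rng)
    print(len(memory), "experience tuples stored")
```

Passing the random generator in explicitly keeps each actor independent of the others, which matches the claims' requirement that actors operate independently; how ε is chosen or scheduled is not specified in the claim text shown here.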