Distributed training of reinforcement learning systems

US10445641B2 · US · B2

Patent metadata
Field	Value
Publication number	US-10445641-B2
Application number	US-201615016173-A
Country	US
Kind code	B2
Filing date	Feb 4, 2016
Priority date	Feb 6, 2015
Publication date	Oct 15, 2019
Grant date	Oct 15, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for distributed training of reinforcement learning systems. One of the methods includes receiving, by a learner, current values of the parameters of the Q network from a parameter server, wherein each learner maintains a respective learner Q network replica and a respective target Q network replica; updating, by the learner, the parameters of the learner Q network replica maintained by the learner using the current values; selecting, by the learner, an experience tuple from a respective replay memory; computing, by the learner, a gradient from the experience tuple using the learner Q network replica maintained by the learner and the target Q network replica maintained by the learner; and providing, by the learner, the computed gradient to the parameter server.
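The learner loop in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: the linear `q_value` function stands in for the deep Q network, and the names `ParameterServer`, `Learner`, `features`, the discount factor `GAMMA`, the squared-error loss, and the plain SGD update are all assumptions that do not appear in the abstract.

```python
import random
from dataclasses import dataclass, field

# All names and constants below are hypothetical, chosen for illustration.
ACTIONS = [0, 1]   # the predetermined set of actions
OBS_DIM = 2        # length of an observation vector
GAMMA = 0.9        # assumed discount factor (not stated in the abstract)


def features(obs, action):
    """One feature block per action, so Q takes (observation, action) as input."""
    phi = [0.0] * (OBS_DIM * len(ACTIONS))
    for i, x in enumerate(obs):
        phi[action * OBS_DIM + i] = x
    return phi


def q_value(params, obs, action):
    """Linear stand-in for the deep Q network: Q(s, a) = params . phi(s, a)."""
    return sum(w * f for w, f in zip(params, features(obs, action)))


@dataclass
class ParameterServer:
    params: list
    lr: float = 0.1

    def get_params(self):
        return list(self.params)

    def apply_gradient(self, grad):
        # Plain SGD step; the choice of optimizer is an assumption.
        self.params = [w - self.lr * g for w, g in zip(self.params, grad)]


@dataclass
class Learner:
    server: ParameterServer
    replay: list          # replay memory of (obs, action, reward, next_obs)
    target_params: list   # target Q network replica
    params: list = field(default_factory=list)  # learner Q network replica

    def step(self, rng):
        # Receive current values and update the learner replica.
        self.params = self.server.get_params()
        # Select an experience tuple from the replay memory.
        obs, action, reward, next_obs = rng.choice(self.replay)
        # Learner replica output and largest target replica output.
        q = q_value(self.params, obs, action)
        best_next = max(q_value(self.target_params, next_obs, a) for a in ACTIONS)
        td_error = reward + GAMMA * best_next - q
        # Gradient of 0.5 * td_error**2 w.r.t. the learner parameters,
        # treating the target replica output as a constant.
        grad = [-td_error * f for f in features(obs, action)]
        # Provide the computed gradient to the parameter server.
        self.server.apply_gradient(grad)
        return grad


server = ParameterServer(params=[0.0] * 4)
learner = Learner(server, replay=[([1.0, 0.0], 1, 1.0, [0.0, 1.0])],
                  target_params=[0.0] * 4)
learner.step(random.Random(0))
```

In this sketch the target replica's parameters are passed in and never refreshed; how and when they are updated is left out.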

First claim

Opening claim text (preview).

What is claimed is:

1. A system for training a reinforcement learning system, the reinforcement learning system comprising an agent that interacts with an environment by receiving observations characterizing a current state of the environment and selecting an action to be performed from a predetermined set of actions, wherein the agent selects an action to be performed using a Q network, wherein the Q network is a deep neural network that is configured to receive as input an observation and an action and to generate a neural network output from the input in accordance with a set of parameters, wherein training the reinforcement learning system comprises adjusting the values of the set of parameters of the Q network, and wherein the system comprises: a plurality of computers configured to implement a plurality of learners, wherein each learner executes on a respective computing unit, wherein each learner is configured to operate independently of each other learner, wherein each learner maintains a respective learner Q network replica and a respective target Q network replica, and wherein each learner is further configured to repeatedly perform operations comprising: receiving, from a parameter server, current values of the parameters of the Q network; updating the parameters of the learner Q network replica maintained by the learner using the current values; selecting an experience tuple from a respective replay memory; computing a gradient from the experience tuple using the learner Q network replica maintained by the learner and the target Q network replica maintained by the learner; and providing the computed gradient to the parameter server.

2. The system of claim 1, wherein the one or more computers are further configured to implement one or more actors, wherein each actor executes on a respective computing unit, wherein each actor is configured to operate independently of each other actor, wherein each actor interacts with a respective replica of the environment, wherein each actor maintains a respective actor Q network replica, and wherein each actor is further configured to repeatedly perform operations comprising: receiving, from the parameter server, current values of the parameters of the Q network; updating the values of the parameters of the actor Q network replica maintained by the actor using the current values; receiving an observation characterizing a current state of the environment replica interacted with by the actor; selecting an action to be performed in response to the observation using the actor Q network replica maintained by the actor; receiving a reward in response to the action being performed and a next observation characterizing a next state of the environment replica interacted with by the actor; generating an experience tuple that comprises the current observation, the action selected, the reward, and the next observation; and storing the experience tuple in a respective replay memory.

3. The system of claim 2, further comprising: the parameter server, wherein the parameter server is configured to repeatedly perform operations comprising: receiving a succession of gradients from the plurality of learners; computing updates to the values of the parameters of the Q network using the gradients; updating the values of the parameters of the Q network using the computed updates; and providing the updated values of the parameters to the one or more actors and the plurality of learners.

4. The system of claim 3, wherein the parameter server comprises a plurality of parameter server shards, wherein each shard is configured to maintain values of a respective disjoint partition of the parameters of the Q network, and wherein each shard is configured to operate asynchronously with respect to every other shard.

5. The system of claim 3, wherein the operations that the parameter server is configured to perform further comprise: determining whether criteria are satisfied for updating the parameters of the target Q network replicas maintained by the learners; and when the criteria are satisfied, providing data to the learners indicating that the updated parameter values are to be used to update the parameters of the target Q network replicas.

6. The system of claim 5, wherein the operations that each of the learners is configured to perform further comprise: receiving data indicating that the updated parameter values are to be used to update the parameters of the target Q network replica maintained by the learner; and updating the parameters of the target Q network replica maintained by the learner using the updated parameter values.

7. The system of claim 2, wherein each of the learners is bundled with a respective one of the actors and a respective replay memory, wherein each bundle of an actor, a learner, and a replay memory is implemented on a respective computing unit, wherein each bundle is configured to operate independently from each other bundle, and wherein, for each bundle, the learner in the bundle selects from among experience tuples generated by the actor in the bundle.

8. The system of claim 7, wherein, for each bundle, the current values of the parameters of the actor Q network replica maintained by the actor in the bundle are synchronized with the current values of the parameters of the learner Q network replica maintained by the learner in the bundle.

9. The system of claim 2, wherein selecting an action to be performed in response to the observation using the actor Q network replica maintained by the actor comprises: determining an action from the predetermined set of actions that, when provided as input to the actor Q network replica maintained by the actor with the current observation, generates a largest actor Q network replica output.

10. The system of claim 9, wherein selecting an action to be performed in response to the observation using the actor Q network replica maintained by the actor further comprises: selecting a random action from the set of predetermined actions with probability ε and selecting the determined action with probability 1−ε.

11. The system of claim 1, wherein computing a gradient from the experience tuple using the learner Q network replica maintained by the learner and the target Q network replica maintained by the learner comprises: processing the action from experience tuple and the current observation from the experience tuple using the learner Q network replica maintained by the learner to determine a learner Q network replica output; determining a largest target Q network replica output that is generated by processing any of the actions in the predetermined set of actions with the next observation from the experience tuple using the target Q network replica maintained by the learner; and computing the gradient using the learner Q network replica output, the largest target Q network replica output, and the reward from the experience tuple.

12. A method for training a reinforcement learning system, the reinforcement learning system comprising an agent that interacts with an environment by receiving observations characterizing a current state of the environment and selecting an action to be performed from a predetermined set of actions, wherein the agent selects an action to be performed using a Q network, wherein the Q network is a deep neural network that is configured to receive as input an observation and an action and to generate a neural network output from the input in accordance with a set of parameters, wherein training the reinforcement learning system comprises adjusting the values of the set of parameters of the Q network, wherein the method comprises: rece
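
The actor behaviour described in claims 2, 9, and 10 (choose the action with the largest Q output, or a random action with probability ε, then store the resulting experience tuple) can be illustrated roughly as follows. This is a sketch under stated assumptions, not the claimed system: `select_action`, `actor_step`, `toy_q`, and `toy_env` are hypothetical names, the toy Q function and environment are invented for the example, and fetching parameters from the parameter server is omitted.

```python
import random

ACTIONS = [0, 1, 2]  # hypothetical predetermined set of actions


def select_action(q_fn, obs, epsilon, rng):
    """Random action with probability epsilon, otherwise the argmax of Q(obs, a)."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_fn(obs, a))


def actor_step(q_fn, env_step, obs, replay_memory, epsilon, rng):
    """One actor iteration: act, receive reward and next observation, store the tuple."""
    action = select_action(q_fn, obs, epsilon, rng)
    reward, next_obs = env_step(obs, action)
    replay_memory.append((obs, action, reward, next_obs))
    return next_obs


def toy_q(obs, action):
    # Invented stand-in for the actor Q network replica: prefers action == obs.
    return -abs(action - obs)


def toy_env(obs, action):
    # Invented environment replica: reward 1.0 when the action matches the state.
    reward = 1.0 if action == obs else 0.0
    return reward, (obs + 1) % len(ACTIONS)


replay_memory = []
rng = random.Random(0)
obs = 0
for _ in range(5):
    obs = actor_step(toy_q, toy_env, obs, replay_memory, epsilon=0.1, rng=rng)
```

Setting `epsilon` to 0 gives the purely greedy selection of claim 9, and any value above 0 adds the random exploration of claim 10.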

Assignees

Deepmind Tech Ltd

Inventors

Classifications

G06N3/045 · Primary
Combinations of networks · CPC title
G06N3/08 · Primary
Learning methods · CPC title
G06N3/0472
Physics · mapped topic
G06N3/0454
Physics · mapped topic
G06N3/098 · Primary
Distributed learning, e.g. federated learning · CPC title

Patent family

Related publications grouped by family.

View patent family 55650664

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10445641B2 cover?: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for distributed training of reinforcement learning systems. One of the methods includes receiving, by a learner, current values of the parameters of the Q network from a parameter server, wherein each learner maintains a respective learner Q network replica and a respective target Q network replica; …
Who is the assignee on this patent?: Deepmind Tech Ltd
What technology area does this patent fall under?: Primary CPC classification G06N3/045. Mapped technology areas include Physics.
When was this patent published?: Publication date Oct 15, 2019 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).