Deep reinforcement learning with fast updating recurrent neural networks and slow updating recurrent neural networks

US10872293B2 · US · B2

Patent metadata
Publication number: US-10872293-B2
Application number: US-201916425717-A
Country: US
Kind code: B2
Filing date: May 29, 2019
Priority date: May 29, 2018
Publication date: Dec 22, 2020
Grant date: Dec 22, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for reinforcement learning. One of the methods includes selecting an action to be performed by the agent using both a slow updating recurrent neural network and a fast updating recurrent neural network that receives a fast updating input that includes the hidden state of the slow updating recurrent neural network.
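The two-timescale arrangement in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the plain tanh cell, the layer sizes, the fixed update period `slow_period`, and the names `TanhRNNCell` and `run_episode` are all assumptions made for illustration. The sketch only shows the data flow in which the fast network reads the slow network's hidden state at every step, while the slow network is updated less often from the fast network's hidden state.

```python
import numpy as np


class TanhRNNCell:
    """Plain tanh recurrent cell (hypothetical stand-in for the recurrent core)."""

    def __init__(self, input_size, hidden_size, rng):
        scale = 1.0 / np.sqrt(input_size + hidden_size)
        self.w = rng.normal(0.0, scale, size=(input_size + hidden_size, hidden_size))
        self.b = np.zeros(hidden_size)

    def step(self, x, h):
        # New hidden state from the concatenated input and previous hidden state.
        return np.tanh(np.concatenate([x, h]) @ self.w + self.b)


def run_episode(observations, slow_period=4, fast_size=8, slow_size=5, seed=0):
    """Run both networks over a sequence of observations.

    Assumption: the slow network is updated every `slow_period` steps;
    other update criteria are possible.
    """
    rng = np.random.default_rng(seed)
    obs_size = observations.shape[1]
    fast = TanhRNNCell(obs_size + slow_size, fast_size, rng)
    slow = TanhRNNCell(fast_size, slow_size, rng)
    h_fast = np.zeros(fast_size)
    h_slow = np.zeros(slow_size)
    slow_updates = 0
    for t, obs in enumerate(observations):
        # Slow network: updated only when the criterion is met,
        # and its input is the fast network's hidden state.
        if t % slow_period == 0:
            h_slow = slow.step(h_fast, h_slow)
            slow_updates += 1
        # Fast network: updated at every step from the observation
        # and the (possibly stale) slow hidden state.
        h_fast = fast.step(np.concatenate([obs, h_slow]), h_fast)
    return h_fast, h_slow, slow_updates


if __name__ == "__main__":
    obs = np.ones((10, 3))
    h_fast, h_slow, n = run_episode(obs, slow_period=4)
    print("slow updates:", n, "fast hidden size:", h_fast.shape[0])
```

With ten observations and `slow_period=4`, the slow network is updated at steps 0, 4, and 8, while the fast network is updated at all ten steps.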

First claim

Opening claim text (preview).

What is claimed is:

1. A method of selecting actions to be performed by an agent interacting with an environment, the method comprising, at each of a plurality of time steps:

    receiving an observation characterizing a current state of the environment at the time step; and

    selecting an action to be performed by the agent in response to the observation based on (i) the observation, (ii) a first prior distribution over possible latent variables generated from a hidden state of a slow updating recurrent neural network that is updated at less than every time step, and (iii) a posterior distribution over the possible latent variables generated from a hidden state of a fast updating recurrent neural network that is updated at every time step, the selecting comprising:

    determining whether criteria for updating the hidden state of the slow updating recurrent neural network that is updated at less than every time step are satisfied at the time step;

    when the criteria are satisfied, processing a slow updating input, comprising the hidden state of the fast updating recurrent neural network that is updated at every time step, to update the hidden state of the slow updating recurrent neural network;

    processing a fast updating input for the time step comprising (i) the observation, (ii) the hidden state of the slow updating recurrent neural network using the fast updating recurrent neural network, and (iii) data defining the first prior distribution over possible latent variables to update the hidden state of the fast updating recurrent neural network to generate an updated hidden state of the fast updating recurrent neural network, wherein the first prior distribution is generated from the hidden state of the slow updating recurrent neural network as of the preceding time step;

    after updating the hidden state of the fast updating recurrent neural network: generating, from the updated hidden state of the fast updating recurrent neural network, the posterior distribution over possible latent variables; sampling a latent variable from the posterior distribution; and selecting the action to be performed by the agent based on the latent variable sampled from the posterior distribution; and

    updating parameters of the fast updating recurrent neural network, comprising: generating, from the hidden state of the slow updating recurrent neural network, a second prior distribution over possible latent variables; and updating the parameters of the fast updating recurrent neural network based at least in part on a divergence between (i) the second prior distribution generated from the hidden state of the slow updating recurrent neural network and (ii) the posterior distribution generated from the updated hidden state of the fast updating recurrent neural network.

2. The method of claim 1, further comprising, at each of the plurality of time steps: when the criteria are not satisfied, refraining from updating the hidden state of the slow updating recurrent neural network before the hidden state is used as part of the input to the fast updating recurrent neural network for the time step.

3. The method of claim 1, wherein the criteria are satisfied at less than all of the plurality of time steps.

4. The method of claim 1, wherein the criteria are satisfied every N time steps, and wherein N is a fixed integer great than one.

5. The method of claim 1, further comprising, at each of the plurality of time steps: determining a difference measure between the observation at the time step and the observation at the last time step at which the hidden state of the slow updating recurrent neural network was updated, wherein the criteria are satisfied at the time step only when the difference measure satisfies a threshold.

6. The method of claim 1, wherein generating the posterior distribution, from the updated hidden state of the fast updating recurrent neural network, over possible latent variables, comprises: generating, from the updated hidden state of the fast updating recurrent neural network, posterior parameters of the posterior distribution over possible latent variables.

7. The method of claim 1, wherein selecting the action comprises: processing the sampled latent variable using a policy neural network to generate a policy output, and selecting the action using the policy output.

8. The method of claim 1, wherein the fast updating input further comprises the latent variable sampled at the preceding time step.

9. The method of claim 1, wherein generating the second prior distribution, from the hidden state of the slow updating recurrent neural network, comprises: generating, from the hidden state of the slow updating recurrent neural network, prior parameters of the second prior distribution over possible latent variables.

10. The method of claim 1, wherein the data defining the first prior distribution are the prior parameters of the first prior distribution generated at the preceding time step.

11. The method of claim 1, further comprising: obtaining a reward in response to the agent performing the selected action; and wherein updating the parameters of the fast updating recurrent neural network comprises updating the parameters of the fast updating recurrent neural network based on the selected action and the reward using a reinforcement learning technique.

12. The method of claim 1, wherein updating the parameters of the fast updating recurrent neural network comprises minimizing the divergence between the second prior distribution and the posterior distribution to regularize the latent variable.

13. The method of claim 11, further comprising, at each of the plurality of time steps, generating, using the updated hidden state of the fast updating recurrent neural network, a value output that is an estimate of a return resulting from the environment being in the current state, and wherein the updating comprises updating the parameters of the fast updating recurrent neural network based on the reward, the selected action, and the value output using an actor-critic technique.

14. The method of claim 11, further comprising, at each of the plurality of time steps: generating a respective auxiliary output for each of one or more auxiliary tasks using the updated hidden state of the fast updating recurrent neural network; and for each of the one or more auxiliary tasks, training the fast and slow updating recurrent neural networks on the auxiliary task based on the auxiliary output for the auxiliary task.

15. The method of claim 11, wherein obtaining the reward comprises: obtaining data extracted from the environment after the selected action is performed, and mapping the obtained data to the reward using a reward mapping.

16. The method of claim 1, wherein the slow updating recurrent neural network and the fast updating recurrent neural network share an external memory.

17. The method of claim 1, further comprising, at each of the plurality of time steps: causing the agent to perform the selected action.

18. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations for selecting actions to be performed by an agent interacting with an environment, the operations comprising, at each of a plurality of time steps: receiving an observation characterizing a current state of the environment at the time step; and selecting an action to be performed by the agent in response to the observation based on (i) the obser
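Claim 1 generates a prior over latent variables from the slow network's hidden state and a posterior from the fast network's updated hidden state, samples a latent variable from the posterior, and uses a divergence between the two distributions in the parameter update. The sketch below shows one way these pieces could fit together. It rests on assumptions the claims do not state: both distributions are taken to be diagonal Gaussians, the divergence is taken to be KL(posterior || prior), and the linear heads and the names `gaussian_head`, `sample_latent`, and `kl_diag_gaussians` are hypothetical.

```python
import numpy as np


def gaussian_head(hidden, w_mean, w_log_std):
    """Map a hidden state to (mean, log_std) of a diagonal Gaussian.

    Assumption: a linear head; the claims only say the distribution
    parameters are generated from the hidden state.
    """
    return hidden @ w_mean, hidden @ w_log_std


def sample_latent(mean, log_std, rng):
    """Draw one latent sample using the reparameterisation z = mean + std * eps."""
    return mean + np.exp(log_std) * rng.standard_normal(mean.shape)


def kl_diag_gaussians(post_mean, post_log_std, prior_mean, prior_log_std):
    """KL(posterior || prior) for diagonal Gaussians, summed over dimensions."""
    post_var = np.exp(2.0 * post_log_std)
    prior_var = np.exp(2.0 * prior_log_std)
    per_dim = (
        prior_log_std
        - post_log_std
        + (post_var + (post_mean - prior_mean) ** 2) / (2.0 * prior_var)
        - 0.5
    )
    return float(np.sum(per_dim))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    slow_size, fast_size, latent_size = 5, 8, 4
    h_slow = rng.standard_normal(slow_size)   # slow hidden state (placeholder values)
    h_fast = rng.standard_normal(fast_size)   # updated fast hidden state (placeholder values)

    # Hypothetical head weights, kept small so log_std stays near zero.
    prior_w = 0.1 * rng.standard_normal((2, slow_size, latent_size))
    post_w = 0.1 * rng.standard_normal((2, fast_size, latent_size))

    prior_mean, prior_log_std = gaussian_head(h_slow, prior_w[0], prior_w[1])
    post_mean, post_log_std = gaussian_head(h_fast, post_w[0], post_w[1])

    z = sample_latent(post_mean, post_log_std, rng)  # latent that would feed a policy
    kl = kl_diag_gaussians(post_mean, post_log_std, prior_mean, prior_log_std)
    print("latent sample:", z)
    print("KL(posterior || prior):", kl)
```

In a training setup along the lines of claims 11 and 12, a term like `kl` would be added to the reinforcement learning loss so that minimizing it regularizes the latent variable; the weighting of that term is not specified in the claim text shown here.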

Assignees

Deepmind Tech Ltd

Inventors

Classifications

  • Probabilistic or stochastic networks · CPC title

  • G06N3/045 · Primary

    Combinations of networks · CPC title

  • Auto-encoder networks; Encoder-decoder networks · CPC title

  • Convolutional networks [CNN, ConvNet] · CPC title

  • G06N3/092 · Primary

    Reinforcement learning · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10872293B2 cover?
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for reinforcement learning. One of the methods includes selecting an action to be performed by the agent using both a slow updating recurrent neural network and a fast updating recurrent neural network that receives a fast updating input that includes the hidden state of the slow updating recurrent n…
Who is the assignee on this patent?
Deepmind Tech Ltd
What technology area does this patent fall under?
Primary CPC classification G06N3/045. Mapped technology areas include Physics.
When was this patent published?
Publication date: Dec 22, 2020 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).