Distributional reinforcement learning

US10860920B2 · US · B2

Patent metadata
Field: Value
Publication number: US-10860920-B2
Application number: US-201916508046-A
Country: US
Kind code: B2
Filing date: Jul 10, 2019
Priority date: Apr 14, 2017
Publication date: Dec 8, 2020
Grant date: Dec 8, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for selecting an action to be performed by a reinforcement learning agent interacting with an environment. A current observation characterizing a current state of the environment is received. For each action in a set of multiple actions that can be performed by the agent to interact with the environment, a probability distribution is determined over possible Q returns for the action-current observation pair. For each action, a measure of central tendency of the possible Q returns with respect to the probability distributions for the action-current observation pair is determined. An action to be performed by the agent in response to the current observation is selected using the measures of central tendency.
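The selection step in the abstract can be illustrated with a minimal sketch. It assumes the network output is one raw score per possible Q return, that a softmax turns those scores into probabilities, and that the measure of central tendency is the mean; the abstract fixes none of these. The names `softmax`, `expected_return`, `select_action`, and `atoms` (the fixed list of possible Q returns) are hypothetical, and the optional `epsilon` exploration parameter follows the variant in claim 3.

```python
import math
import random
from typing import List, Optional, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Convert raw per-return scores into probabilities (assumed normalisation)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def expected_return(scores: Sequence[float], atoms: Sequence[float]) -> float:
    """Mean of the possible Q returns under the distribution defined by the scores."""
    probs = softmax(scores)
    return sum(p * z for p, z in zip(probs, atoms))


def select_action(
    scores_per_action: Sequence[Sequence[float]],
    atoms: Sequence[float],
    epsilon: float = 0.0,
    rng: Optional[random.Random] = None,
) -> int:
    """Pick the action with the highest mean return, exploring with probability epsilon."""
    if rng is None:
        rng = random.Random()
    if rng.random() < epsilon:
        return rng.randrange(len(scores_per_action))
    means = [expected_return(s, atoms) for s in scores_per_action]
    return max(range(len(means)), key=means.__getitem__)


# Hypothetical network outputs for three actions over three possible returns.
atoms = [-1.0, 0.0, 1.0]
scores = [
    [0.0, 0.0, 0.0],  # uniform distribution, mean 0
    [0.0, 0.0, 2.0],  # mass shifted toward +1
    [2.0, 0.0, 0.0],  # mass shifted toward -1
]
best = select_action(scores, atoms)  # greedy choice, epsilon = 0
```

With these illustrative scores the greedy choice is action 1, because its distribution places the most probability on the largest return. A real system would obtain the scores from the distributional Q network rather than from hand-written lists.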

First claim

Opening claim text (preview).

What is claimed is:

1. A method performed by one or more data processing apparatus for selecting an action to be performed by a reinforcement learning agent interacting with an environment, the method comprising: receiving a current observation characterizing a current state of the environment; for each action of a plurality of actions that can be performed by the agent to interact with the environment: processing the action and the current observation using a distributional Q network having a plurality of network parameters, wherein the distributional Q network is a deep neural network that is configured to process the action and the current observation in accordance with current values of the network parameters to generate a network output comprising a plurality of numerical values that collectively define a probability distribution over possible Q returns for the action-current observation pair, wherein the network output comprises: (i) a respective score for each of a plurality of possible Q returns for the action-current observation pair, or (ii) a respective value for each of a plurality of parameters of a parametric probability distribution over possible Q returns for the action-current observation pair, and wherein each possible Q return is an estimate of a return that would result from the agent performing the action in response to the current observation, and determining a measure of central tendency of the possible Q returns with respect to the probability distribution for the action-current observation pair; and selecting an action from the plurality of possible actions to be performed by the agent in response to the current observation using the measures of central tendency for the actions.

2. The method of claim 1, wherein selecting an action to be performed by the agent comprises: selecting an action having the highest measure of central tendency.

3. The method of claim 1, wherein selecting an action to be performed by the agent comprises: selecting an action having the highest measure of central tendency with probability 1−ε and selecting an action randomly from the plurality of actions with probability ε.

4. The method of claim 1, wherein the measure of central tendency is a mean of the possible Q returns.

5. The method of claim 4, wherein determining the mean of the possible Q returns with respect to the probability distribution comprises: determining a respective probability for each of the plurality of possible Q returns from the network output; weighting each possible Q return by the probability for the possible Q return; and determining the mean by summing the weighted possible Q returns.

6. A method performed by one or more data processing apparatus for training a distributional Q network, the method comprising: obtaining an experience tuple that includes (i) a current training observation, (ii) a current action performed by an agent in response to the current training observation, (iii) a current reward received in response to the agent performing the current action, and (iv) a next training observation characterizing a state that an environment transitioned into as a result of the agent performing the current action; determining a respective current probability for each Q return of a plurality of possible Q returns, comprising: processing the current training observation and the current action using the distributional Q network and in accordance with current values of network parameters to generate a current network output comprising a plurality of numerical values that collectively define a current probability distribution over possible Q returns for the current action-current training observation pair, wherein the network output comprises: (i) a respective current score for each of a plurality of possible Q returns for the current action-current training observation pair, or (ii) a respective current value for each of a plurality of parameters of a parametric probability distribution over possible Q returns for the current action-current training observation pair; for each action in a plurality of actions: processing the action and the next training observation using a target distributional Q network and in accordance with current values of target network parameters of the distributional Q network to generate a next network output for the action-next training observation pair comprising a plurality of numerical values that collectively define a next probability distribution over possible Q returns for the action-next training observation pair, wherein the network output comprises: (i) a respective next score for each of a plurality of possible Q returns for the action-next training observation pair, or (ii) a respective next value for each of a plurality of parameters of a parametric probability distribution over possible Q returns for the action-next training observation pair, wherein the target distributional Q network has the same neural network architecture as the distributional Q network but the current values of the target network parameters are different from the current values of the network parameters; and determining a measure of central tendency of the possible Q returns with respect to the respective next probability distribution for the action-next training observation pair; determining an argmax action, wherein the argmax action is an action from the plurality of actions for which the measure of central tendency of the possible Q returns is highest; determining a respective projected sample update for each of the possible Q returns using the current reward and the argmax action; determining a gradient with respect to the network parameters of a loss function that depends on the projected sample updates for the possible Q returns and the current probabilities for the possible Q returns; and updating the current values of the network parameters using the gradient.

7. The method of claim 6, wherein determining a respective projected sample update for each of the possible Q returns using the current reward and the argmax action comprises: determining a respective sample update for each of the possible Q returns from the current reward; and determining the respective projected sample update for each of the possible Q returns from the respective sample updates and the probabilities in the next probability distribution for the argmax action-next training observation pair.

8. The method of claim 7, wherein the respective sample update for each of the possible Q returns is equal to the current reward plus a product of a discount factor and the possible Q return subject to a constraint that the respective sample update not be less than a smallest possible Q return of the plurality of possible Q returns and not be greater than a largest possible Q return of the plurality of possible Q returns.

9. The method of claim 7, wherein determining the respective projected sample update for each of the possible Q returns from the respective sample updates and the probabilities in the next probability distribution for the argmax action-next training observation pair comprises, for each possible Q return: distributing the probability for the possible Q return in the next probability distribution for the argmax action-next training observation pair to at least some of the projected sample updates with a strength that is based on, for each projected sample update, the distance between the sample update for the possible Q return and the corresponding possible Q return for the projected sample update.

10. The method of claim 8, wherein the loss function is a Kullback-Leibler divergence between (i) the respective projected sample updates and (ii) the current probability
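The projected sample update of claims 7 to 9 and the divergence named in claim 10 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the possible Q returns (`atoms`) are taken to be evenly spaced and sorted in ascending order, probability is split linearly between the two nearest atoms as one way of realising the distance-based "strength" in claim 9, and the divergence is computed in the target-to-current direction because the preview of claim 10 is truncated. The function names are hypothetical, and `next_probs` stands for the next probability distribution of the argmax action.

```python
import math
from typing import List, Sequence


def project_sample_updates(
    next_probs: Sequence[float],
    atoms: Sequence[float],
    reward: float,
    discount: float,
) -> List[float]:
    """Project shifted returns back onto the fixed set of possible Q returns."""
    v_min, v_max = atoms[0], atoms[-1]
    delta = (v_max - v_min) / (len(atoms) - 1)  # assumed uniform spacing
    projected = [0.0] * len(atoms)
    for p, z in zip(next_probs, atoms):
        # Sample update (claim 8): reward + discount * z, clipped to [v_min, v_max].
        tz = min(max(reward + discount * z, v_min), v_max)
        # Fractional index of the sample update on the atom grid.
        b = min(max((tz - v_min) / delta, 0.0), float(len(atoms) - 1))
        lower, upper = int(math.floor(b)), int(math.ceil(b))
        if lower == upper:
            projected[lower] += p  # lands exactly on an atom
        else:
            # Split the probability by distance to the two neighbouring atoms.
            projected[lower] += p * (upper - b)
            projected[upper] += p * (b - lower)
    return projected


def kl_divergence(target: Sequence[float], current: Sequence[float], eps: float = 1e-12) -> float:
    """KL(target || current); terms with zero target probability contribute nothing."""
    return sum(t * math.log(t / max(c, eps)) for t, c in zip(target, current) if t > 0.0)


# Hypothetical example: three possible returns, all next-state mass on the middle one.
atoms = [0.0, 1.0, 2.0]
next_probs = [0.0, 1.0, 0.0]
target = project_sample_updates(next_probs, atoms, reward=0.5, discount=1.0)
loss = kl_divergence(target, [1 / 3, 1 / 3, 1 / 3])
```

In this example the sample update for the middle return is 0.5 + 1.0 × 1.0 = 1.5, which falls halfway between the atoms 1.0 and 2.0, so `target` is `[0.0, 0.5, 0.5]`. In training, the gradient of `loss` with respect to the network parameters would come from an automatic differentiation library, which this sketch omits.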

Assignees

Deepmind Tech Ltd

Inventors

Classifications

  • Reinforcement learning · CPC title

  • G06N3/047Primary

    Probabilistic or stochastic networks · CPC title

  • Feedforward networks · CPC title

  • Backpropagation, e.g. using gradient descent · CPC title

  • G06N3/045Primary

    Combinations of networks · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10860920B2 cover?
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for selecting an action to be performed by a reinforcement learning agent interacting with an environment. A current observation characterizing a current state of the environment is received. For each action in a set of multiple actions that can be performed by the agent to interact with the envir…
Who is the assignee on this patent?
Deepmind Tech Ltd
What technology area does this patent fall under?
Primary CPC classification G06N3/047. Mapped technology areas include Physics.
When was this patent published?
Publication date Dec 8, 2020 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).