Reinforcement learning using quantile credit assignment

US2024256883A1 · US · A1

Patent metadata
Publication number: US-2024256883-A1
Application number: US-202418424561-A
Country: US
Kind code: A1
Filing date: Jan 26, 2024
Priority date: Jan 26, 2023
Publication date: Aug 1, 2024
Grant date: (none listed)

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for training a neural network used to select actions to be performed by an agent interacting with an environment. Implementations of the system can take into account a level of luck in the environment, and hence whilst learning can account for outcomes that were caused by external factors as well as those dependent on the actions of the agent.

First claim

Opening claim text (preview).

What is claimed is:

1. A method, implemented by one or more computers, of training an action selection neural network to control an agent to take actions in an environment, in response to observations characterizing states of the environment, to perform one or more tasks, the method comprising:

    maintaining a state action value distribution neural network, wherein the state action value distribution neural network is configured to process an observation at a time step to generate a state action value distribution over estimated returns from a state of the environment represented by the observation and for possible actions of a plurality of possible actions at the time step, wherein the state action value distribution defines a state action value for each of a plurality of quantile levels of the state action value distribution; and

    obtaining training data comprising, for each of a plurality of time steps, a tuple defining: an observation characterizing a state of an environment at a time step, an action taken by an agent at the time step, a reward received in response to the action; and, for each of a plurality of the tuples:

    processing the observation in the tuple representing a state of the environment, using the state action value distribution neural network, to determine the state action value distribution for the observation and for the action in the tuple;

    identifying a quantile level of the state action value distribution that is closest to a return based on the reward in the tuple;

    determining a training target based on at least a difference between the reward in the tuple and a value for the state of the environment determined i) from the state action value distribution at the identified quantile level, or ii) from a value distribution for the observation over estimated returns from the state of the environment, at the identified quantile level;

    training the action selection neural network using the training target, wherein the action selection neural network is configured to process an observation to generate an action selection output for controlling the agent to perform the task.

2. The method of claim 1, wherein the value for the state of the environment is determined from an average over the possible actions of state action values of the state action value distribution for the observation, at the identified quantile level.

3. The method of claim 2, comprising determining the value for the state of the environment from the state action value distribution at the identified quantile level, $Q(x_t, \pi, \hat{\tau})$, as $Q(x_t, \pi, \hat{\tau}) = \sum_a \pi(a \mid x_t)\, Q(x_t, a, \hat{\tau})$, where $x_t$ is the observation in the tuple, $a$ is a possible action, $\pi(a \mid x_t)$ denotes a probability of taking action $a$ given observation $x_t$ determined from an action selection output of the action selection neural network, $Q(x_t, a, \hat{\tau})$ defines the state action value for the observation $x_t$ and for action $a$, and $\hat{\tau}$ denotes the identified quantile level.

4. The method of claim 1, wherein identifying the quantile level of the state action value distribution that is closest to the return based on the reward in the tuple, further comprises determining an estimate of the quantile level, $\hat{\tau}$, for which $Z = Q(x_t, a_t, \hat{\tau})$, where $Q(x_t, a_t, \tau)$ defines the state action value for the observation $x_t$ in the tuple and for the action $a_t$ in the tuple, and where $Z$ denotes the return for the observation $x_t$ in the tuple and for the action $a_t$.

5. The method of claim 4, wherein determining an estimate of the quantile level, $\hat{\tau}$, for which $Z = Q(x_t, a_t, \hat{\tau})$ comprises determining $\hat{\tau}$ from a linear interpolation between quantile levels of the state action value distribution to either side of the return based on the reward in the tuple.

6. The method of claim 1, wherein the return is based on a discounted sum of the reward in the tuple and the rewards in a succession of subsequent tuples.

7. The method of claim 1, wherein the state action value distribution neural network comprises a state action value quantile neural network that is configured to generate an output value for each of the plurality of quantile levels of the state action value distribution.

8. The method of claim 1, further comprising: training the state action value distribution neural network using the tuples in the training data and based on a quantile regression loss that defines a quantile regression target for each quantile level based on a difference between the reward in the tuple and the state action value defined by the state action value distribution for the quantile level and for the action in the tuple.

9. The method of claim 1, wherein generating the state action value distribution comprises: summing a value estimate for each of the quantile levels of the state action value distribution and an advantage estimate for each of the quantile levels of the state action value distribution.

10. The method of claim 9, wherein the state action value distribution generated by the state action value distribution neural network comprises a distribution of advantage values that defines the advantage estimate for each of the plurality of quantile levels of the state action value distribution.

11. The method of claim 1, wherein generating the state action value distribution comprises: generating a state action value for a first of the plurality of quantile levels in an ordered sequence of the quantile levels, and generating state action difference values for subsequent ones of the plurality of quantile levels.

12. The method of claim 1, wherein obtaining training data comprises: maintaining a buffer memory storing the tuples; and adding tuples into the buffer memory based on observations of the environment, selected actions, and rewards obtained as the agent is controlled to take actions in the environment to perform the task.

13. The method of claim 1, further comprising training the state action value distribution neural network in hindsight, wherein the training in hindsight comprises: training the state action value distribution neural network using the tuple for a time step and a future trajectory from the time step, wherein the future trajectory from the time step is defined by the observations, actions, and rewards in and one or more tuples for time steps subsequent to the time step.

14. The method of claim 1, further comprising: maintaining a quantile predictor neural network configured to process the observation in a tuple for a current time step and data in one or more tuples for subsequent time steps to generate a quantile prediction that predicts the identified quantile for the current time step; and wherein identifying a quantile level of the state action value distribution that is closest to a return based on the reward in the tuple comprises: processing the observation in the tuple and data from one or more subsequent tuples using the quantile predictor neural network, to generate the quantile prediction; and determining the identified quantile level using the quantile prediction.

15. The method of claim 1, comprising: training the action selection neural network using a policy gradient reinforcement learning technique by updating parameters of the action selection neural network based on a product of the training target and gradient of a logarithm of the action selection output.

16. One or more computer-reada
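
The quantile identification and training target described in claims 1 and 3 to 5 can be illustrated with a small plain-Python sketch. This is an illustration under stated assumptions, not the implementation disclosed in the patent: the function names are hypothetical, the quantile outputs are passed in as plain lists rather than produced by a neural network, returns outside the covered range are clamped to the first or last quantile level, and the state action values at the identified level are obtained by linear interpolation between the neighbouring quantile outputs.

```python
from bisect import bisect_left


def estimate_quantile_level(values, levels, ret):
    """Estimate tau_hat with Q(x_t, a_t, tau_hat) approximately equal to ret.

    `values` holds one quantile value per entry of `levels`, sorted in
    non-decreasing order. Out-of-range returns are clamped (an assumption).
    """
    if ret <= values[0]:
        return levels[0]
    if ret >= values[-1]:
        return levels[-1]
    hi = bisect_left(values, ret)  # first index whose value is >= ret
    lo = hi - 1                    # values[lo] < ret <= values[hi]
    weight = (ret - values[lo]) / (values[hi] - values[lo])
    return levels[lo] + weight * (levels[hi] - levels[lo])


def interpolate_at_level(values, levels, tau):
    """Linearly interpolate the quantile values at level tau (assumption)."""
    if tau <= levels[0]:
        return values[0]
    if tau >= levels[-1]:
        return values[-1]
    hi = bisect_left(levels, tau)
    lo = hi - 1
    weight = (tau - levels[lo]) / (levels[hi] - levels[lo])
    return values[lo] + weight * (values[hi] - values[lo])


def training_target(q_quantiles, policy_probs, action, ret, levels):
    """Return (target, tau_hat) for one tuple.

    q_quantiles[a] lists the quantile values of Q(x_t, a, .) for action a.
    The target is the return minus the policy-weighted state value at the
    identified quantile level, as in claim 3.
    """
    tau_hat = estimate_quantile_level(q_quantiles[action], levels, ret)
    baseline = sum(
        prob * interpolate_at_level(values, levels, tau_hat)
        for prob, values in zip(policy_probs, q_quantiles)
    )
    return ret - baseline, tau_hat


# Illustrative numbers only: two actions, four quantile levels.
levels = [0.125, 0.375, 0.625, 0.875]
q_quantiles = [[0.0, 1.0, 2.0, 3.0], [1.0, 2.0, 3.0, 4.0]]
target, tau_hat = training_target(q_quantiles, [0.5, 0.5], 0, 1.5, levels)
print(tau_hat, target)  # prints: 0.5 -0.5
```

In this toy case the observed return sits at the median of the distribution for the taken action, so the baseline is the policy-weighted median value over both actions. The negative target reflects that, at the same level of luck, the taken action is worth less than the policy average.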

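Claims 6 and 8 refer to a discounted return and a quantile regression loss. The sketch below shows the standard pinball form of quantile regression, again as a hedged illustration rather than the patent's implementation. The function names and the discount factor `gamma` are assumptions, and the regression target here is a plain discounted return computed from a list of rewards, whereas the claim only requires a target based on the reward in the tuple.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of a reward sequence, starting at the current step."""
    ret = 0.0
    for reward in reversed(rewards):
        ret = reward + gamma * ret
    return ret


def quantile_regression_loss(values, levels, ret):
    """Mean pinball loss of the quantile values against a scalar target.

    For each level tau the error u = ret - value is weighted by tau when the
    value underestimates the target and by (1 - tau) when it overestimates.
    """
    total = 0.0
    for value, tau in zip(values, levels):
        error = ret - value
        indicator = 1.0 if error < 0 else 0.0
        total += error * (tau - indicator)
    return total / len(levels)


# Illustrative numbers only.
ret = discounted_return([1.0, 1.0, 1.0], 0.5)
loss = quantile_regression_loss([0.0, 2.0], [0.25, 0.75], 1.0)
print(ret, loss)  # prints: 1.75 0.25
```

Minimizing this loss over many sampled returns pushes each output toward the corresponding quantile of the return distribution, which is what makes the quantile lookup in claim 4 meaningful.
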
Assignees

Deepmind Tech Ltd

Inventors

Classifications

  • Probabilistic graphical models, e.g. probabilistic networks · CPC title

  • Learning methods · CPC title

  • Combinations of networks · CPC title

  • based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO] · CPC title

  • G06N3/092 · Primary

    Reinforcement learning · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2024256883A1 cover?
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for training a neural network used to select actions to be performed by an agent interacting with an environment. Implementations of the system can take into account a level of luck in the environment, and hence whilst learning can account for outcomes that were caused by external factors as well as …
Who is the assignee on this patent?
Deepmind Tech Ltd
What technology area does this patent fall under?
Primary CPC classification G06N3/092. Mapped technology areas include Physics.
When was this patent published?
Publication date: Aug 1, 2024 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).