Modulating agent behavior to optimize learning progress

US12061964B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12061964-B2
Application number: US-202017032562-A
Country: US
Kind code: B2
Filing date: Sep 25, 2020
Priority date: Sep 25, 2019
Publication date: Aug 13, 2024
Grant date: Aug 13, 2024

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for controlling an agent. One of the methods includes sampling a behavior modulation in accordance with a current probability distribution; for each of one or more time steps: processing an input comprising an observation characterizing a current state of the environment at the time step using an action selection neural network to generate a respective action score for each action in a set of possible actions that can be performed by the agent; modifying the action scores using the sampled behavior modulation; and selecting the action to be performed by the agent at the time step based on the modified action scores; determining a fitness measure corresponding to the sampled behavior modulation; and updating the current probability distribution over the set of possible behavior modulations using the fitness measure corresponding to the behavior modulation.
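
The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class ModulationDistribution, the functions run_iteration, toy_scores, toy_modify and toy_env, the softmax parameterization of the distribution, and the fixed step size are all hypothetical names and choices made for this example. The fitness here is the sum of rewards, and the update compares it with the mean of earlier fitness values; the patent text only requires that the distribution be updated using the fitness measure.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


class ModulationDistribution:
    """Hypothetical distribution over a finite set of behavior modulations."""

    def __init__(self, modulations, step_size=0.5):
        self.modulations = list(modulations)
        self.prefs = [0.0] * len(self.modulations)  # softmax preferences
        self.history = []                           # earlier fitness values
        self.step_size = step_size

    def probs(self):
        return softmax(self.prefs)

    def sample(self, rng):
        idx = rng.choices(range(len(self.prefs)), weights=self.probs(), k=1)[0]
        return idx, self.modulations[idx]

    def update(self, idx, fitness):
        # Assumed rule: compare fitness with the mean of earlier fitness values
        # and nudge the sampled modulation's preference up or down.
        if self.history:
            mean = sum(self.history) / len(self.history)
            if fitness > mean:
                self.prefs[idx] += self.step_size
            elif fitness < mean:
                self.prefs[idx] -= self.step_size
        self.history.append(fitness)


def run_iteration(dist, score_fn, modify_fn, env_step, num_steps, rng):
    """Sample a modulation, act for num_steps, then update the distribution."""
    idx, modulation = dist.sample(rng)
    observation, fitness = 0, 0.0
    for _ in range(num_steps):
        scores = score_fn(observation)            # stand-in for the neural network
        modified = modify_fn(scores, modulation)  # apply the sampled modulation
        action = rng.choices(range(len(modified)), weights=softmax(modified), k=1)[0]
        observation, reward = env_step(action)
        fitness += reward                         # fitness = sum of rewards
    dist.update(idx, fitness)
    return idx, fitness


# Toy stand-ins (all hypothetical): fixed scores, a temperature modulation,
# and an environment that pays 1.0 whenever action 2 is chosen.
def toy_scores(observation):
    return [0.0, 0.0, 1.0]


def toy_modify(scores, temperature):
    return [s / temperature for s in scores]


def toy_env(action):
    return 0, (1.0 if action == 2 else 0.0)


if __name__ == "__main__":
    rng = random.Random(0)
    dist = ModulationDistribution([0.1, 1.0, 10.0])
    for _ in range(200):
        run_iteration(dist, toy_scores, toy_modify, toy_env, 20, rng)
    print(dist.probs())
```

In this toy setup the three modulations are temperatures. Lower temperatures concentrate the sampling on the rewarded action, so they tend to earn higher fitness and should gain probability over time.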

First claim

Opening claim text (preview).

What is claimed is:

1. A method performed by one or more data processing apparatus for selecting actions to be performed by an agent interacting with an environment to accomplish a task, wherein the method comprises repeatedly performing operations including: sampling a behavior modulation from a set of possible behavior modulations in accordance with a current probability distribution over the set of possible behavior modulations, wherein: the behavior modulation includes a respective value for each of one or more modulation factors; and the behavior modulation defines an exploration policy for modifying a set of action scores that includes a respective action score for each action in a set of possible actions that can be performed by the agent to interact with the environment; for each of one or more time steps: processing an input comprising an observation characterizing a current state of the environment at the time step using an action selection neural network to generate a respective action score for each action in the set of possible actions that can be performed by the agent; modifying the action scores using the sampled behavior modulation; and selecting the action to be performed by the agent at the time step based on the modified action scores; determining a fitness measure corresponding to the sampled behavior modulation that characterizes an estimated progress in improving a performance of the agent in accomplishing the task that would result from training the action selection neural network on training data characterizing interaction of the agent with the environment over the one or more time steps where the action scores were modified using the sampled behavior modulation; and updating the current probability distribution over the set of possible behavior modulations using the fitness measure corresponding to the behavior modulation.

2. The method of claim 1, wherein the one or more behavior modulation factors include a behavior modulation factor that specifies a respective bias value for each action in the set of possible actions, and modifying the action scores using the sampled behavior modulation comprises adding the corresponding bias value to each action score.

3. The method of claim 1, wherein the one or more behavior modulation factors include a behavior modulation factor that specifies a temperature parameter, and modifying the action scores using the behavior modulation factor comprises dividing each action score by the temperature parameter.

4. The method of claim 1, wherein the one or more behavior modulation factors include a behavior modulation factor that specifies an ϵ probability value, and modifying the action scores using the behavior modulation comprises modifying the action scores such that, with probability given by the ϵ probability value, each action has an equal likelihood of being selected to be performed by the agent.

5. The method of claim 1, wherein the one or more behavior modulation factors include a behavior modulation factor that specifies an action-repeat probability value, and modifying the action scores using the behavior modulation factor comprises modifying the action scores such that, with probability given by the action-repeat probability value, the action selected to be performed by the agent at the time step is the action performed by the agent at a previous time step.

6. The method of claim 1, wherein: the action score for each action is specified by data characterizing a probability distribution over possible returns that would result from the agent performing the action; wherein the one or more behavior modulation factors include a behavior modulation factor that specifies one or more parameters of a distortion function; and modifying the action scores using the behavior modulation factor comprises, for each action, applying the distortion function to the data characterizing the probability distribution over possible returns that would result from the agent performing the action.

7. The method of claim 1, wherein the progress in improving the performance of the agent in accomplishing the task characterizes an expected difference between a cumulative measure of rewards received by the agent as a result of interacting with the environment by performing actions selected using the action selection neural network: (i) after the action selection neural network is trained on the training data characterizing the interaction of the agent with the environment over the one or more time steps where the action scores were modified using the behavior modulation, and (ii) before the action selection neural network is trained on the training data characterizing the interaction of the agent with the environment over the one or more time steps where the action scores were modified using the behavior modulation.

8. The method of claim 1, wherein determining the fitness measure comprises determining the fitness measure based on a reward received by the agent at each of the one or more time steps.

9. The method of claim 8, wherein the fitness measure is determined as a sum of the rewards received by the agent at each of the one or more time steps.

10. The method of claim 1, wherein updating the current probability distribution over the set of possible behavior modulations using the fitness measure corresponding to the behavior modulation comprises: determining a measure of central tendency of a plurality of given fitness measures that have previously been determined; and determining an updated probability distribution based at least in part on whether the fitness measure corresponding to the behavior modulation exceeds the measure of central tendency of the plurality of given fitness measures.

11. The method of claim 10, wherein the measure of central tendency of the plurality of given fitness measures is a mean of the plurality of given fitness measures.

12. The method of claim 10, further comprising adjusting a length of a time window specifying which given fitness measures are eligible for inclusion in the plurality of given fitness measures.

13. The method of claim 1, further comprising updating current values of parameters of the action selection neural network using reinforcement learning techniques.

14. The method of claim 1, wherein selecting the action to be performed by the agent at the time step based on the modified action scores comprises: sampling the action to be performed by the agent at the time step based on a probability distribution over the set of possible actions that is specified by the modified action scores.

15. A system for selecting actions to be performed by an agent interacting with an environment to accomplish a task, the system comprising one or more computers and one or more storage devices that when executed by the one or more computers cause the one or more computers to repeatedly perform operations comprising: sampling a behavior modulation from a set of possible behavior modulations in accordance with a current probability distribution over the set of possible behavior modulations, wherein: the behavior modulation includes a respective value for each of one or more modulation factors; and the behavior modulation defines an exploration policy for modifying a set of action scores that includes a respective action score for each action in a set of possible actions that can be performed by the agent to interact with the environment; for each of one or more time steps: processing an input comprising an observation characterizing a current state of the environment at the time step using an action selection neural network t
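
The score modifications named in claims 2 through 5, and the sampling step of claim 14, might look something like the sketch below. It is illustrative only. The dictionary keys ("bias", "temperature", "epsilon", "repeat_prob"), the function names, and the order in which the factors are combined are assumptions for this example, not details taken from the patent. The epsilon and action-repeat factors are applied here at selection time, which has the same effect on which action is chosen as modifying the scores would.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def apply_bias(scores, biases):
    """Claim 2: add the corresponding bias value to each action score."""
    return [s + b for s, b in zip(scores, biases)]


def apply_temperature(scores, temperature):
    """Claim 3: divide each action score by the temperature parameter."""
    return [s / temperature for s in scores]


def select_action(scores, modulation, prev_action, rng):
    """Choose an action index from scores under a (hypothetical) modulation dict."""
    n = len(scores)

    # Claim 5: with the action-repeat probability, repeat the previous action.
    if prev_action is not None and rng.random() < modulation.get("repeat_prob", 0.0):
        return prev_action

    # Claim 4: with probability epsilon, every action is equally likely.
    if rng.random() < modulation.get("epsilon", 0.0):
        return rng.randrange(n)

    # Claims 2 and 3: bias first, then temperature (this order is an assumption).
    modified = apply_bias(scores, modulation.get("bias", [0.0] * n))
    modified = apply_temperature(modified, modulation.get("temperature", 1.0))

    # Claim 14: sample from the distribution specified by the modified scores.
    return rng.choices(range(n), weights=softmax(modified), k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    modulation = {"temperature": 0.5, "epsilon": 0.1, "repeat_prob": 0.2}
    action = None
    for _ in range(5):
        action = select_action([0.2, 1.0, -0.3], modulation, action, rng)
        print(action)
```

Leaving a key out of the dictionary disables that factor, so a single modulation can set any subset of the factors.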

Assignees

Deepmind Tech Ltd

Inventors

Classifications

  • Convolutional networks [CNN, ConvNet] · CPC title

  • Reinforcement learning · CPC title

  • Movements or behaviour, e.g. gesture recognition (recognition of facial expressions G06V40/16) · CPC title

  • using neural networks · CPC title

  • using classification, e.g. of video objects · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12061964B2 cover?
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for controlling an agent. One of the methods includes sampling a behavior modulation in accordance with a current probability distribution; for each of one or more time steps: processing an input comprising an observation characterizing a current state of the environment at the time step using an act…
Who is the assignee on this patent?
Deepmind Tech Ltd
What technology area does this patent fall under?
Primary CPC classification G06N3/006. Mapped technology areas include Physics.
When was this patent published?
Publication date: Aug 13, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).