What technology area does this patent fall under?

Primary CPC classification G06N3/006. Mapped technology areas include Physics.

When was this patent published?

Publication date Aug 13, 2024 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Modulating agent behavior to optimize learning progress

US12061964B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12061964-B2
Application number	US-202017032562-A
Country	US
Kind code	B2
Filing date	Sep 25, 2020
Priority date	Sep 25, 2019
Publication date	Aug 13, 2024
Grant date	Aug 13, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection. Read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for controlling an agent. One of the methods includes sampling a behavior modulation in accordance with a current probability distribution; for each of one or more time steps: processing an input comprising an observation characterizing a current state of the environment at the time step using an action selection neural network to generate a respective action score for each action in a set of possible actions that can be performed by the agent; modifying the action scores using the sampled behavior modulation; and selecting the action to be performed by the agent at the time step based on the modified action scores; determining a fitness measure corresponding to the sampled behavior modulation; and updating the current probability distribution over the set of possible behavior modulations using the fitness measure corresponding to the behavior modulation.
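The loop described in the abstract (sample a modulation, act with modified scores, measure fitness, update the distribution) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the class `ModulationBandit`, the function `run_episode`, the preference step size, and the window length are all hypothetical names and values. The modulation is assumed to be a single temperature (as in claim 3), the fitness is assumed to be the sum of rewards (as in claims 8 and 9), and the update compares fitness against the mean of recent fitness values (the pattern in claims 10 and 11). The callables `score_fn` and `env_step` stand in for the action selection neural network and the environment.

```python
import math
import random
from collections import deque


def softmax(scores):
    """Turn a list of scores into a probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class ModulationBandit:
    """Hypothetical bandit over a discrete set of behavior modulations."""

    def __init__(self, modulations, window=50, step=0.1, seed=0):
        self.modulations = list(modulations)
        self.prefs = [0.0] * len(self.modulations)  # unnormalized preferences
        self.history = deque(maxlen=window)         # recent fitness values
        self.step = step                            # assumed step size
        self.rng = random.Random(seed)

    def probabilities(self):
        """Current probability distribution over the modulations."""
        return softmax(self.prefs)

    def sample(self):
        """Sample a modulation index and value from the current distribution."""
        indices = range(len(self.modulations))
        idx = self.rng.choices(indices, weights=self.probabilities())[0]
        return idx, self.modulations[idx]

    def update(self, idx, fitness):
        """Raise or lower a preference by comparing fitness to the recent mean."""
        if self.history:
            baseline = sum(self.history) / len(self.history)
            if fitness > baseline:
                self.prefs[idx] += self.step
            elif fitness < baseline:
                self.prefs[idx] -= self.step
        self.history.append(fitness)


def run_episode(bandit, score_fn, env_step, observation, num_steps, rng):
    """One sample / act / measure-fitness / update cycle."""
    idx, temperature = bandit.sample()
    total_reward = 0.0
    for _ in range(num_steps):
        scores = score_fn(observation)                # stand-in for the network
        modified = [s / temperature for s in scores]  # temperature modulation
        probs = softmax(modified)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        observation, reward = env_step(observation, action)
        total_reward += reward
    bandit.update(idx, total_reward)                  # return used as fitness
    return idx, total_reward
```

With a toy `score_fn` that always prefers one action and an `env_step` that rewards only that action, repeated calls to `run_episode` shift probability toward the lower temperature, because greedier behavior earns a higher return in that toy setting.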

First claim

Opening claim text (preview).

What is claimed is:

1. A method performed by one or more data processing apparatus for selecting actions to be performed by an agent interacting with an environment to accomplish a task, wherein the method comprises repeatedly performing operations including: sampling a behavior modulation from a set of possible behavior modulations in accordance with a current probability distribution over the set of possible behavior modulations, wherein: the behavior modulation includes a respective value for each of one or more modulation factors; and the behavior modulation defines an exploration policy for modifying a set of action scores that includes a respective action score for each action in a set of possible actions that can be performed by the agent to interact with the environment; for each of one or more time steps: processing an input comprising an observation characterizing a current state of the environment at the time step using an action selection neural network to generate a respective action score for each action in the set of possible actions that can be performed by the agent; modifying the action scores using the sampled behavior modulation; and selecting the action to be performed by the agent at the time step based on the modified action scores; determining a fitness measure corresponding to the sampled behavior modulation that characterizes an estimated progress in improving a performance of the agent in accomplishing the task that would result from training the action selection neural network on training data characterizing interaction of the agent with the environment over the one or more time steps where the action scores were modified using the sampled behavior modulation; and updating the current probability distribution over the set of possible behavior modulations using the fitness measure corresponding to the behavior modulation.

2. The method of claim 1, wherein the one or more behavior modulation factors include a behavior modulation factor that specifies a respective bias value for each action in the set of possible actions, and modifying the action scores using the sampled behavior modulation comprises adding the corresponding bias value to each action score.

3. The method of claim 1, wherein the one or more behavior modulation factors include a behavior modulation factor that specifies a temperature parameter, and modifying the action scores using the behavior modulation factor comprises dividing each action score by the temperature parameter.

4. The method of claim 1, wherein the one or more behavior modulation factors include a behavior modulation factor that specifies an ϵ probability value, and modifying the action scores using the behavior modulation comprises modifying the action scores such that, with probability given by the ϵ probability value, each action has an equal likelihood of being selected to be performed by the agent.

5. The method of claim 1, wherein the one or more behavior modulation factors include a behavior modulation factor that specifies an action-repeat probability value, and modifying the action scores using the behavior modulation factor comprises modifying the action scores such that, with probability given by the action-repeat probability value, the action selected to be performed by the agent at the time step is the action performed by the agent at a previous time step.

6. The method of claim 1, wherein: the action score for each action is specified by data characterizing a probability distribution over possible returns that would result from the agent performing the action; wherein the one or more behavior modulation factors include a behavior modulation factor that specifies one or more parameters of a distortion function; and modifying the action scores using the behavior modulation factor comprises, for each action, applying the distortion function to the data characterizing the probability distribution over possible returns that would result from the agent performing the action.

7. The method of claim 1, wherein the progress in improving the performance of the agent in accomplishing the task characterizes an expected difference between a cumulative measure of rewards received by the agent as a result of interacting with the environment by performing actions selected using the action selection neural network: (i) after the action selection neural network is trained on the training data characterizing the interaction of the agent with the environment over the one or more time steps where the action scores were modified using the behavior modulation, and (ii) before the action selection neural network is trained on the training data characterizing the interaction of the agent with the environment over the one or more time steps where the action scores were modified using the behavior modulation.

8. The method of claim 1, wherein determining the fitness measure comprises determining the fitness measure based on a reward received by the agent at each of the one or more time steps.

9. The method of claim 8, wherein the fitness measure is determined as a sum of the rewards received by the agent at each of the one or more time steps.

10. The method of claim 1, wherein updating the current probability distribution over the set of possible behavior modulations using the fitness measure corresponding to the behavior modulation comprises: determining a measure of central tendency of a plurality of given fitness measures that have previously been determined; and determining an updated probability distribution based at least in part on whether the fitness measure corresponding to the behavior modulation exceeds the measure of central tendency of the plurality of given fitness measures.

11. The method of claim 10, wherein the measure of central tendency of the plurality of given fitness measures is a mean of the plurality of given fitness measures.

12. The method of claim 10, further comprising adjusting a length of a time window specifying which given fitness measures are eligible for inclusion in the plurality of given fitness measures.

13. The method of claim 1, further comprising updating current values of parameters of the action selection neural network using reinforcement learning techniques.

14. The method of claim 1, wherein selecting the action to be performed by the agent at the time step based on the modified action scores comprises: sampling the action to be performed by the agent at the time step based on a probability distribution over the set of possible actions that is specified by the modified action scores.

15. A system for selecting actions to be performed by an agent interacting with an environment to accomplish a task, the system comprising one or more computers and one or more storage devices that when executed by the one or more computers cause the one or more computers to repeatedly perform operations comprising: sampling a behavior modulation from a set of possible behavior modulations in accordance with a current probability distribution over the set of possible behavior modulations, wherein: the behavior modulation includes a respective value for each of one or more modulation factors; and the behavior modulation defines an exploration policy for modifying a set of action scores that includes a respective action score for each action in a set of possible actions that can be performed by the agent to interact with the environment; for each of one or more time steps: processing an input comprising an observation characterizing a current state of the environment at the time step using an action selection neural network t
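The modulation factors named in claims 2 through 5 (per-action bias, temperature, ϵ, and action-repeat probability) can be illustrated with a short Python sketch. This is an assumption-laden illustration, not the method as claimed: the function `select_action`, the dictionary keys, and the order in which the factors are combined are all hypothetical, since the claims describe each factor separately and do not fix how they interact. The ϵ and action-repeat factors are realized here as branches at selection time, which has the same effect on which action is chosen as rewriting the scores would. The final sampling step follows the pattern of claim 14.

```python
import math
import random


def softmax(scores):
    """Turn a list of scores into a probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(scores, modulation, prev_action, rng):
    """Pick an action from raw scores under a sampled behavior modulation.

    `modulation` is a hypothetical dict with optional keys
    "bias", "temperature", "epsilon", and "repeat_prob".
    """
    n = len(scores)

    # Action-repeat factor: with the given probability, repeat the last action.
    if prev_action is not None and rng.random() < modulation.get("repeat_prob", 0.0):
        return prev_action

    # Epsilon factor: with probability epsilon, every action is equally likely.
    if rng.random() < modulation.get("epsilon", 0.0):
        return rng.randrange(n)

    # Bias factor: add a per-action bias to each score.
    bias = modulation.get("bias", [0.0] * n)
    # Temperature factor: divide each score by the temperature.
    temperature = modulation.get("temperature", 1.0)
    modified = [(s + b) / temperature for s, b in zip(scores, bias)]

    # Sample from the distribution specified by the modified scores.
    probs = softmax(modified)
    return rng.choices(range(n), weights=probs)[0]


# Example: a modulation that explores a little and sometimes repeats itself.
rng = random.Random(0)
modulation = {"temperature": 0.5, "epsilon": 0.05, "repeat_prob": 0.1}
action = select_action([1.0, 0.2, -0.3], modulation, prev_action=None, rng=rng)
```

A temperature below 1 sharpens the distribution toward the highest-scoring action, while a temperature above 1 flattens it, so the sampled modulation controls how exploratory the agent's behavior is.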

Assignees

Deepmind Tech Ltd

Inventors

Classifications

G06N3/0464: Convolutional networks [CNN, ConvNet]
G06N3/092: Reinforcement learning
G06V40/20: Movements or behaviour, e.g. gesture recognition (recognition of facial expressions G06V40/16)
G06V10/82: using neural networks
G06V10/764: using classification, e.g. of video objects

Patent family

Related publications grouped by family.

View patent family 74880995

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12061964B2 cover?: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for controlling an agent. One of the methods includes sampling a behavior modulation in accordance with a current probability distribution; for each of one or more time steps: processing an input comprising an observation characterizing a current state of the environment at the time step using an act…
Who is the assignee on this patent?: Deepmind Tech Ltd

Related patents

Intelligent compute resource selection for machine learning training jobs

Method and apparatus for obtaining training sample of first model based on second model

Multi-task neural networks with task-specific paths

Method for adaptive exploration to accelerate deep reinforcement learning

Method of performing multi-modal dialogue between a humanoid robot and user, computer program product and humanoid robot for implementing said method

Cooperative neural network reinforcement learning

Learning apparatus and method for learning a model corresponding to a function changing in time series
