What technology area does this patent fall under?

Primary CPC classification G06N3/092. Mapped technology areas include Physics.

When was this patent published?

Publication date May 16, 2023 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).

Training action selection neural networks using a differentiable credit function

US11651208B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11651208-B2
Application number	US-201816615042-A
Country	US
Kind code	B2
Filing date	May 22, 2018
Priority date	May 19, 2017
Publication date	May 16, 2023
Grant date	May 16, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for reinforcement learning. A reinforcement learning neural network selects actions to be performed by an agent interacting with an environment to perform a task in an attempt to achieve a specified result. The reinforcement learning neural network has at least one input to receive an input observation characterizing a state of the environment and at least one output for determining an action to be performed by the agent in response to the input observation. The system includes a reward function network coupled to the reinforcement learning neural network. The reward function network has an input to receive reward data characterizing a reward provided by one or more states of the environment and is configured to determine a reward function to provide one or more target values for training the reinforcement learning neural network.

First claim

Opening claim text (preview).

What is claimed is:
1. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations for reinforcement learning, the operations comprising: at each of a sequence of time steps, controlling an agent using a reinforcement learning neural network to select an action to be performed by the agent at the time step to interact with an environment to perform a task given a state of the environment at the time step, the reinforcement learning neural network having at least one input to receive an input observation characterizing a state of the environment and at least one output for determining (i) an action to be performed by the agent in response to the input observation and (ii) a value estimate for the state characterized by the input observation; for each time step, receiving, as a result of the agent performing the action at the time step, a respective reward value; for each particular time step in the sequence of time steps, generating a respective target value for training the reinforcement learning neural network using a reward function network that is configured to map (i) the respective rewards at each of a plurality of time steps after the particular time step in the sequence and (ii) the respective value estimates generated at each of the plurality of time steps after the particular time step to a target value for the particular time step in accordance with one or more learned parameters of the reward function network; and training the reinforcement learning neural network using the respective target values for the particular time steps determined using the reward function network.
2. A system as claimed in claim 1 wherein the target network computes the target value as a weighted sum of reward predictions for n future time steps, where n is an integer greater than 1.
3. A system as claimed in claim 2 wherein the weighted sum comprises an exponentially weighted sum with decay parameter λ such that a target value $g_t$ for time step $t$ satisfies: $g_t = r_{t+1} + \gamma(1-\lambda)\,v(s_{t+1}, \theta_t) + \gamma\lambda\,g_{t+1}$ where γ is a discount factor in a range [0,1] and $v(s_{t+1}, \theta_t)$ is a value estimate for state $s_{t+1}$ at time step $t+1$ as determined by value function parameters $\theta_t$ of the reinforcement learning neural network, and $g_{t+1}$ is the weighted sum for time step $t+1$.
4. A system as claimed in claim 3 wherein the one or more learnable parameters of the reward function network define a respective value for λ for each of the plurality of time steps.
5. A system as claimed in claim 4 wherein λ is a function of the time step.
6. A system as claimed in claim 1 wherein the reward function network includes one or more learnable parameters to determine a λ-value; and wherein the target values for the time steps are dependent upon the λ-value.
7. A system as claimed in claim 6 wherein the reward function network includes a λ-network coupled to the reinforcement learning neural network to determine a respective λ-value for each time step from a state of the reinforcement learning neural network at the time step.
8. A system as claimed in claim 1 further comprising: generating, using a reward function target generator, reward function targets for training the one or more learnable parameters of the reward function network.
9. A system as claimed in claim 8 wherein the reward function targets comprise alternate λ-return values generated independently of the target values for the time steps from the reward function network.
10. A system as claimed in claim 9 wherein the reward function target generator is configured to perform an alternate rollout from a state of the environment at a first time step of the plurality of time steps to determine the reward function targets.
11. A system as claimed in claim 9 wherein the reward function target generator is configured to retrieve stored target values that are stored in a memory to provide the reward function targets.
12. A system as claimed in claim 1 wherein the reinforcement learning neural network comprises a recurrent neural network to provide a representation of a sequence of states of the environment comprising a sequence of state-dependent values, and wherein the reward function network is configured to generate the target values for the time steps from the sequence of state-dependent values.
13. A system as claimed in claim 12 wherein the reward function network has an input to receive state-dependent reward value data for the sequence of states of the environment.
14. A system as claimed in claim 12 including episodic memory to store state and reward data from previous states of the system, and wherein the reward function network is configured to receive reward data from the episodic memory.
15. A method performed by one or more computers, the method comprising: at each of a sequence of time steps, controlling an agent using a reinforcement learning neural network to select an action to be performed by the agent at the time step to interact with an environment to perform a task given a state of the environment at the time step, the reinforcement learning neural network having at least one input to receive an input observation characterizing a state of the environment and at least one output for determining (i) an action to be performed by the agent in response to the input observation and (ii) a value estimate for the state characterized by the input observation; for each time step, receiving, as a result of the agent performing the action at the time step, a respective reward value; for each particular time step in the sequence of time steps, generating a respective target value for training the reinforcement learning neural network using a reward function network that is configured to map (i) the respective rewards at each of a plurality of time steps after the particular time step in the sequence and (ii) the respective value estimates generated at each of the plurality of time steps after the particular time step to a target value for the particular time step in accordance with one or more learned parameters of the reward function network; and training the reinforcement learning neural network using the respective target values for the particular time steps determined using the reward function network.
16. A system as claimed in claim 1 wherein the target network computes the target value as a weighted sum of reward predictions for n future time steps, where n is an integer greater than 1.
17. A system as claimed in claim 2 wherein the weighted sum comprises an exponentially weighted sum with decay parameter λ such that a target value $g_t$ for time step $t$ satisfies: $g_t = r_{t+1} + \gamma(1-\lambda)\,v(s_{t+1}, \theta_t) + \gamma\lambda\,g_{t+1}$ where γ is a discount factor in a range [0,1] and $v(s_{t+1}, \theta_t)$ is a value estimate for state $s_{t+1}$ at time step $t+1$ as determined by value function parameters $\theta_t$ of the reinforcement learning neural network, and $g_{t+1}$ is the weighted sum for time step $t+1$.
18. A system as claimed in claim 3 wherein the one or more learnable parameters of the reward function network define a respective value for λ for each of the plurality of time steps.
19. A system as claimed in claim 4 wherein λ is a function of the time step.
20. A system as claimed in claim 1 wherein the reward function network includes one or more learnable parameters to determine a λ-value; and wherein the target values for the time steps are…
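The recursion recited in claim 3, with a separate λ per time step as in claims 4 and 5, can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the patented implementation: the function name `lambda_returns`, its argument names, and the `bootstrap` argument (a stand-in for the return beyond the last recorded step) are hypothetical, and the λ values are passed in as fixed numbers rather than produced by a learned reward function network.

```python
from typing import List, Sequence


def lambda_returns(
    rewards: Sequence[float],
    next_values: Sequence[float],
    lambdas: Sequence[float],
    gamma: float,
    bootstrap: float = 0.0,
) -> List[float]:
    """Evaluate g_t = r_{t+1} + gamma*(1 - lam_t)*v(s_{t+1}) + gamma*lam_t*g_{t+1}.

    rewards[t]     -- r_{t+1}, the reward received after the action at step t
    next_values[t] -- v(s_{t+1}), the value estimate for the following state
    lambdas[t]     -- lambda applied at step t (fixed inputs here; an assumption)
    bootstrap      -- assumed value of g beyond the final step (hypothetical)
    """
    if not (len(rewards) == len(next_values) == len(lambdas)):
        raise ValueError("all per-step sequences must have the same length")

    targets = [0.0] * len(rewards)
    g_next = bootstrap
    # Iterate backwards so that g_{t+1} is known when g_t is computed.
    for t in reversed(range(len(rewards))):
        g_next = (
            rewards[t]
            + gamma * (1.0 - lambdas[t]) * next_values[t]
            + gamma * lambdas[t] * g_next
        )
        targets[t] = g_next
    return targets


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]
    next_values = [0.5, 0.5, 0.0]
    # lambda = 0 at every step: each target is reward plus next value estimate.
    print(lambda_returns(rewards, next_values, [0.0, 0.0, 0.0], gamma=1.0))
    # lambda = 1 at every step: each target is the sum of remaining rewards.
    print(lambda_returns(rewards, next_values, [1.0, 1.0, 1.0], gamma=1.0))
```

Setting every λ to 0 reduces the target to the one-step bootstrapped value r_{t+1} + γ v(s_{t+1}), while setting every λ to 1 reduces it to the discounted sum of the remaining rewards plus the discounted bootstrap. Intermediate values blend the two, which is the quantity the claims describe as being governed by learned parameters.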

Assignees

Deepmind Tech Ltd

Inventors

Classifications

G06N3/044
Recurrent networks, e.g. Hopfield networks · CPC title
G06N3/04
Architecture, e.g. interconnection topology · CPC title
G06N3/092 (Primary)
Reinforcement learning · CPC title
G06N3/0464
Convolutional networks [CNN, ConvNet] · CPC title
G06N3/0442
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title

Patent family

Related publications grouped by family.

View patent family 62217992

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11651208B2 cover?: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for reinforcement learning. A reinforcement learning neural network selects actions to be performed by an agent interacting with an environment to perform a task in an attempt to achieve a specified result. The reinforcement learning neural network has at least one input to receive an input observati…
Who is the assignee on this patent?: Deepmind Tech Ltd