Selecting actions to be performed by a reinforcement learning agent using tree search
US-2018032864-A1 · Feb 1, 2018 · US
US11651208B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11651208-B2 |
| Application number | US-201816615042-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 22, 2018 |
| Priority date | May 19, 2017 |
| Publication date | May 16, 2023 |
| Grant date | May 16, 2023 |
Official abstract text for this publication.
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for reinforcement learning. A reinforcement learning neural network selects actions to be performed by an agent interacting with an environment to perform a task in an attempt to achieve a specified result. The reinforcement learning neural network has at least one input to receive an input observation characterizing a state of the environment and at least one output for determining an action to be performed by the agent in response to the input observation. The system includes a reward function network coupled to the reinforcement learning neural network. The reward function network has an input to receive reward data characterizing a reward provided by one or more states of the environment and is configured to determine a reward function to provide one or more target values for training the reinforcement learning neural network.
Opening claim text (preview).
What is claimed is:

1. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations for reinforcement learning, the operations comprising: at each of a sequence of time steps, controlling an agent using a reinforcement learning neural network to select an action to be performed by the agent at the time step to interact with an environment to perform a task given a state of the environment at the time step, the reinforcement learning neural network having at least one input to receive an input observation characterizing a state of the environment and at least one output for determining (i) an action to be performed by the agent in response to the input observation and (ii) a value estimate for the state characterized by the input observation; for each time step, receiving, as a result of the agent performing the action at the time step, a respective reward value; for each particular time step in the sequence of time steps, generating a respective target value for training the reinforcement learning neural network using a reward function network that is configured to map (i) the respective rewards at each of a plurality of time steps after the particular time step in the sequence and (ii) the respective value estimates generated at each of the plurality of time steps after the particular time step to a target value for the particular time step in accordance with one or more learned parameters of the reward function network; and training the reinforcement learning neural network using the respective target values for the particular time steps determined using the reward function network.

2. A system as claimed in claim 1 wherein the target network computes the target value as a weighted sum of reward predictions for n future time steps, where n is an integer greater than 1.

3. A system as claimed in claim 2 wherein the weighted sum comprises an exponentially weighted sum with decay parameter λ such that a target value $g_t$ for time step $t$ satisfies: $g_t = r_{t+1} + \gamma(1-\lambda)\,v(s_{t+1}, \theta_t) + \gamma\lambda\,g_{t+1}$, where $\gamma$ is a discount factor in a range [0,1] and $v(s_{t+1}, \theta_t)$ is a value estimate for state $s_{t+1}$ at time step $t+1$ as determined by value function parameters $\theta_t$ of the reinforcement learning neural network, and $g_{t+1}$ is the weighted sum for time step $t+1$.

4. A system as claimed in claim 3 wherein the one or more learnable parameters of the reward function network define a respective value for λ for each of the plurality of time steps.

5. A system as claimed in claim 4 wherein λ is a function of the time step.

6. A system as claimed in claim 1 wherein the reward function network includes one or more learnable parameters to determine a λ-value; and wherein the target values for the time steps are dependent upon the λ-value.

7. A system as claimed in claim 6 wherein the reward function network includes a λ-network coupled to the reinforcement learning neural network to determine a respective λ-value for each time step from a state of the reinforcement learning neural network at the time step.

8. A system as claimed in claim 1 further comprising: generating, using a reward function target generator, reward function targets for training the one or more learnable parameters of the reward function network.

9. A system as claimed in claim 8 wherein the reward function targets comprise alternate λ-return values generated independently of the target values for the time steps from the reward function network.

10. A system as claimed in claim 9 wherein the reward function target generator is configured to perform an alternate rollout from a state of the environment at a first time step of the plurality of time steps to determine the reward function targets.

11. A system as claimed in claim 9 wherein the reward function target generator is configured to retrieve stored target values that are stored in a memory to provide the reward function targets.

12. A system as claimed in claim 1 wherein the reinforcement learning neural network comprises a recurrent neural network to provide a representation of a sequence of states of the environment comprising a sequence of state-dependent values, and wherein the reward function network is configured to generate the target values for the time steps from the sequence of state-dependent values.

13. A system as claimed in claim 12 wherein the reward function network has an input to receive state-dependent reward value data for the sequence of states of the environment.

14. A system as claimed in claim 12 including episodic memory to store state and reward data from previous states of the system, and wherein the reward function network is configured to receive reward data from the episodic memory.

15. A method performed by one or more computers, the method comprising: at each of a sequence of time steps, controlling an agent using a reinforcement learning neural network to select an action to be performed by the agent at the time step to interact with an environment to perform a task given a state of the environment at the time step, the reinforcement learning neural network having at least one input to receive an input observation characterizing a state of the environment and at least one output for determining (i) an action to be performed by the agent in response to the input observation and (ii) a value estimate for the state characterized by the input observation; for each time step, receiving, as a result of the agent performing the action at the time step, a respective reward value; for each particular time step in the sequence of time steps, generating a respective target value for training the reinforcement learning neural network using a reward function network that is configured to map (i) the respective rewards at each of a plurality of time steps after the particular time step in the sequence and (ii) the respective value estimates generated at each of the plurality of time steps after the particular time step to a target value for the particular time step in accordance with one or more learned parameters of the reward function network; and training the reinforcement learning neural network using the respective target values for the particular time steps determined using the reward function network.

16. A system as claimed in claim 1 wherein the target network computes the target value as a weighted sum of reward predictions for n future time steps, where n is an integer greater than 1.

17. A system as claimed in claim 2 wherein the weighted sum comprises an exponentially weighted sum with decay parameter λ such that a target value $g_t$ for time step $t$ satisfies: $g_t = r_{t+1} + \gamma(1-\lambda)\,v(s_{t+1}, \theta_t) + \gamma\lambda\,g_{t+1}$, where $\gamma$ is a discount factor in a range [0,1] and $v(s_{t+1}, \theta_t)$ is a value estimate for state $s_{t+1}$ at time step $t+1$ as determined by value function parameters $\theta_t$ of the reinforcement learning neural network, and $g_{t+1}$ is the weighted sum for time step $t+1$.

18. A system as claimed in claim 3 wherein the one or more learnable parameters of the reward function network define a respective value for λ for each of the plurality of time steps.

19. A system as claimed in claim 4 wherein λ is a function of the time step.

20. A system as claimed in claim 1 wherein the reward function network includes one or more learnable parameters to determine a λ-value; and wherein the target values for the time steps are
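
To make the recursion recited in claim 3 concrete, the following is a minimal Python sketch that evaluates it backward over a finite sequence of time steps, with a separate λ per step as in claims 4 and 5. It is an illustration only, not the patented implementation: the function name `lambda_returns`, the argument layout, the choice to index λ by the step `t`, and the `bootstrap` value standing in for the return beyond the final step are all assumptions of this sketch. In the claimed system the λ-values would come from learned parameters of the reward function network; here they are simply passed in as numbers.

```python
from typing import List, Sequence


def lambda_returns(
    rewards: Sequence[float],
    next_values: Sequence[float],
    lambdas: Sequence[float],
    gamma: float,
    bootstrap: float = 0.0,
) -> List[float]:
    """Evaluate g_t = r_{t+1} + gamma*(1 - lambda_t)*v(s_{t+1}) + gamma*lambda_t*g_{t+1}.

    rewards[t]     -- r_{t+1}, the reward received after the action at step t
    next_values[t] -- v(s_{t+1}), the value estimate for the following state
    lambdas[t]     -- decay parameter used at step t (hypothetical per-step input)
    bootstrap      -- assumed stand-in for the return after the last step
    """
    if not (len(rewards) == len(next_values) == len(lambdas)):
        raise ValueError("rewards, next_values and lambdas must have equal length")

    targets = [0.0] * len(rewards)
    g_next = bootstrap
    # The recursion depends on g_{t+1}, so iterate from the last step to the first.
    for t in reversed(range(len(rewards))):
        lam = lambdas[t]
        g_next = (
            rewards[t]
            + gamma * (1.0 - lam) * next_values[t]
            + gamma * lam * g_next
        )
        targets[t] = g_next
    return targets


if __name__ == "__main__":
    rewards = [1.0, 0.0, 2.0]
    next_values = [0.5, 0.5, 0.5]
    print(lambda_returns(rewards, next_values, [0.0, 0.0, 0.0], gamma=0.5))
    print(lambda_returns(rewards, next_values, [1.0, 1.0, 1.0], gamma=0.5))
```

With every λ set to 0 the target reduces to the one-step bootstrapped target, reward plus discounted value estimate; with every λ set to 1 it becomes the discounted sum of the observed rewards plus the discounted `bootstrap` term. Intermediate or per-step values blend the two.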
Recurrent networks, e.g. Hopfield networks · CPC title
Architecture, e.g. interconnection topology · CPC title
Reinforcement learning · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title