Training action selection neural networks using look-ahead search

US11449750B2 · US · B2

Patent metadata
Publication number: US-11449750-B2
Application number: US-201816617478-A
Country: US
Kind code: B2
Filing date: May 28, 2018
Priority date: May 26, 2017
Publication date: Sep 20, 2022
Grant date: Sep 20, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems and apparatus, including computer programs encoded on computer storage media, for training an action selection neural network. One of the methods includes receiving an observation characterizing a current state of the environment; determining a target network output for the observation by performing a look ahead search of possible future states of the environment starting from the current state until the environment reaches a possible future state that satisfies one or more termination criteria, wherein the look ahead search is guided by the neural network in accordance with current values of the network parameters; selecting an action to be performed by the agent in response to the observation using the target network output generated by performing the look ahead search; and storing, in an exploration history data store, the target network output in association with the observation for use in updating the current values of the network parameters.
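
The acting step described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the toy environment (reach position 3 within 6 steps), the stub `network`, and every function name are hypothetical, and the search shown is a simple policy-guided rollout search standing in for the look ahead search.

```python
import math
import random

# Hypothetical toy task: walk from position 0 to GOAL within MAX_STEPS steps.
GOAL, MAX_STEPS = 3, 6
ACTIONS = (-1, +1)


def is_terminal(state):
    """Termination criteria: goal reached or step budget exhausted."""
    pos, steps = state
    return pos == GOAL or steps >= MAX_STEPS


def step(state, action):
    """Deterministic transition; position is clamped at 0."""
    pos, steps = state
    return (max(0, pos + action), steps + 1)


def outcome(state):
    """+1 if the specified result (reaching GOAL) is achieved, else -1."""
    return 1.0 if state[0] == GOAL else -1.0


def network(state, params):
    """Stub for the neural network: returns (action probabilities, value)."""
    logits = params["bias"]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps], 0.0


def look_ahead_search(state, params, rng, rollouts_per_action=50):
    """Roll out to terminal states, sampling actions from the network's policy."""
    mean_returns = []
    for first in ACTIONS:
        total = 0.0
        for _ in range(rollouts_per_action):
            s = step(state, first)
            while not is_terminal(s):
                probs, _ = network(s, params)
                s = step(s, rng.choices(ACTIONS, weights=probs)[0])
            total += outcome(s)
        mean_returns.append(total / rollouts_per_action)
    # Assumption: turn mean returns into a target distribution with a softmax.
    exps = [math.exp(r) for r in mean_returns]
    norm = sum(exps)
    return [e / norm for e in exps]


def act_and_record(state, params, history, rng):
    """Search, pick an action from the search result, and store the target."""
    target = look_ahead_search(state, params, rng)
    best = max(range(len(ACTIONS)), key=target.__getitem__)
    history.append((state, target))  # exploration history data store
    return ACTIONS[best]
```

Calling `act_and_record((0, 0), {"bias": [0.0, 0.0]}, [], random.Random(0))` returns one of the two actions and appends a single (observation, target) pair to the history list, which a later training step could sample from.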

First claim

Opening claim text (preview).

What is claimed is:

1. A method of training a neural network having a plurality of network parameters, wherein the neural network is used to select actions to be performed by an agent interacting with an environment to perform a task in an attempt to achieve a specified result, wherein the neural network is configured to receive an input observation characterizing a state of the environment and to process the input observation in accordance with the network parameters to generate a network output that comprises an action selection output that defines an action selection policy for selecting an action to be performed by the agent in response to the input observation, and wherein the method comprises: receiving a current observation characterizing a current state of the environment; determining a target network output for the current observation by performing a look ahead search of possible future states of the environment starting from the current state until the environment reaches a possible future state that satisfies one or more termination criteria, wherein the look ahead search is guided by the neural network in accordance with current values of the network parameters; selecting an action to be performed by the agent in response to the current observation using the target network output generated by performing the look ahead search; and storing, in an exploration history data store, the target network output in association with the current observation for use in updating the current values of the network parameters.

2. The method of claim 1, wherein the action selection output defines a probability distribution over possible actions to be performed by the agent.

3. The method of claim 1, wherein the action selection output comprises a respective Q value for each of a plurality of possible actions that represents an expected return to be received if the agent performs the possible action in response to the observation.

4. The method of claim 1, wherein the action selection output identifies an optimal action to be performed by the agent in response to the observation.

5. The method of claim 1, wherein the network output further comprises a predicted expected return output that is an estimate of a return resulting from the environment being in the state, and wherein determining the target network output comprises: determining a target return based on evaluating a progress of the task as of a terminal state of a current episode of interaction.

6. The method of claim 5, wherein the return is dependent on whether the specified result is achieved as of the terminal state.

7. The method of claim 1, wherein the look ahead search is a tree search of a state tree having nodes representing states of the environment starting from a root node that represents the current state.

8. The method of claim 7, wherein performing the look ahead search comprises adding noise to prior probabilities for the root node that are used to traverse from the root node to other nodes in the state tree.

9. The method of claim 7, wherein performing the look ahead search comprises evaluating leaf nodes of the state tree encountered during the look ahead search using the neural network and in accordance with current values of the network parameters.

10. The method of claim 1, further comprising: obtaining, from the exploration history store, a training observation and a training target network output associated with the training observation; processing the training observation using the neural network and in accordance with the current values of the network parameters to generate a training network output; determining a gradient with respect to the network parameters of an objective function that encourages the training network output to match the training target network output; and determining an update to the current values of the network parameters from the gradient.

11. The method of claim 10, wherein the network output includes an action selection output that defines a probability distribution over possible actions to be performed by the agent and a predicted expected return output that is an estimate of a return resulting from the environment being in the state, and wherein the objective function is a weighted sum between (i) a difference between the probability distribution in the training target network output and the probability distribution in the training network output and (ii) a difference between the predicted expected return output in the training target network output and the predicted expected return output in the training network output.

12. A trained neural network system comprising a trained neural network having a plurality of trained network parameters and that is implemented by one or more computers, wherein the neural network system is configured to select actions to be performed by an agent interacting with an environment to perform a task in an attempt to achieve a specified result, wherein the trained neural network is configured to receive an input observation characterizing a state of the environment and to process the input observation in accordance with the trained network parameters to generate a network output that comprises an action selection output that defines an action selection policy for selecting an action to be performed by the agent in response to the input observation, and wherein the neural network system comprises: an input to receive a current observation characterizing a current state of the environment; and an output for selecting an action to be performed by the agent in response to the current observation according to the action selection output; and wherein the neural network system is configured to provide the output for selecting the action by performing a look ahead search, wherein the look ahead search comprises a search of possible future states of the environment starting from the current state until the environment reaches a possible future state that satisfies one or more termination criteria, and wherein the look ahead search is guided by the trained neural network in accordance with values of the network parameters.

13. A trained neural network system as claimed in claim 12 wherein the look ahead search is guided such that the search is dependent upon the action selection output from the trained neural network.

14. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations for training a neural network having a plurality of network parameters, wherein the neural network is used to select actions to be performed by an agent interacting with an environment to perform a task in an attempt to achieve a specified result, wherein the neural network is configured to receive an input observation characterizing a state of the environment and to process the input observation in accordance with the network parameters to generate a network output that comprises an action selection output that defines an action selection policy for selecting an action to be performed by the agent in response to the input observation, and wherein the operations comprise: receiving a current observation characterizing a current state of the environment; determining a target network output for the current observation by performing a look ahead search of possible future states of the environment starting from the current state until the environment reaches a possible future state that satisfies one or more termination criteria, wherein the look ahead search is guided by the neural
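
The parameter update in claims 10 and 11 can be illustrated with a small sketch. The claims only require "a difference" for each term of the weighted sum; the cross-entropy and squared-error choices below, the tabular stand-in for the network (logits and a value stored directly as parameters), and all function names are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def objective(params, target_policy, target_return, value_weight=1.0):
    """Weighted sum of a policy term and a value term.

    Assumption: cross-entropy for the policy difference and squared error
    for the return difference; the claims do not fix these choices.
    """
    probs = softmax(params["logits"])
    policy_term = -sum(t * math.log(p) for t, p in zip(target_policy, probs))
    value_term = (params["value"] - target_return) ** 2
    return policy_term + value_weight * value_term


def gradient(params, target_policy, target_return, value_weight=1.0):
    """Analytic gradient of the objective for this tabular stand-in."""
    probs = softmax(params["logits"])
    return {
        # d(cross-entropy)/d(logits) = probs - target when target sums to 1
        "logits": [p - t for p, t in zip(probs, target_policy)],
        "value": 2.0 * value_weight * (params["value"] - target_return),
    }


def sgd_update(params, grad, lr=0.1):
    """One gradient-descent step; returns new parameter values."""
    return {
        "logits": [w - lr * g for w, g in zip(params["logits"], grad["logits"])],
        "value": params["value"] - lr * grad["value"],
    }
```

Starting from `{"logits": [0.0, 0.0], "value": 0.0}` with a stored target of `[0.2, 0.8]` and a target return of `1.0`, a single `sgd_update` step lowers the objective, which is the behavior an objective that "encourages the training network output to match the training target network output" should show.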

Assignees

Deepmind Tech Ltd

Inventors

Classifications

  • Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound · CPC title

  • Probabilistic or stochastic networks · CPC title

  • Probabilistic graphical models, e.g. probabilistic networks · CPC title

  • Reinforcement learning · CPC title

  • G06N3/08 (Primary): Learning methods · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11449750B2 cover?
Methods, systems and apparatus, including computer programs encoded on computer storage media, for training an action selection neural network. One of the methods includes receiving an observation characterizing a current state of the environment; determining a target network output for the observation by performing a look ahead search of possible future states of the environment starting from …
Who is the assignee on this patent?
Deepmind Tech Ltd
What technology area does this patent fall under?
Primary CPC classification G06N3/08. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 20, 2022 (kind code B2). Legal status and post-grant events are not shown on this page.