Training action selection neural networks using look-ahead search

US12147899B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12147899-B2
Application number: US-202318528640-A
Country: US
Kind code: B2
Filing date: Dec 4, 2023
Priority date: May 26, 2017
Publication date: Nov 19, 2024
Grant date: Nov 19, 2024

Abstract

Official abstract text for this publication.

Methods, systems and apparatus, including computer programs encoded on computer storage media, for training an action selection neural network. One of the methods includes receiving an observation characterizing a current state of the environment; determining a target network output for the observation by performing a look ahead search of possible future states of the environment starting from the current state until the environment reaches a possible future state that satisfies one or more termination criteria, wherein the look ahead search is guided by the neural network in accordance with current values of the network parameters; selecting an action to be performed by the agent in response to the observation using the target network output generated by performing the look ahead search; and storing, in an exploration history data store, the target network output in association with the observation for use in updating the current values of the network parameters.
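The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: `act_and_record`, `toy_search`, the list used as the exploration history data store, and the choice to sample the action in proportion to the target probabilities are all assumptions made for the example. The look-ahead search itself is replaced by a hypothetical stub.

```python
import random
from typing import Callable, Dict, List, Tuple

Observation = Tuple[int, ...]
Policy = Dict[int, float]  # maps each action to a probability


def act_and_record(
    observation: Observation,
    search: Callable[[Observation], Policy],
    history: List[Tuple[Observation, Policy]],
    rng: random.Random,
) -> int:
    """Run the search, pick an action from its output, and store the target."""
    # The target network output comes from the look-ahead search.
    target = search(observation)
    # Assumption: sample the action in proportion to the target probabilities.
    actions = list(target)
    weights = [target[a] for a in actions]
    action = rng.choices(actions, weights=weights, k=1)[0]
    # Store the target with its observation for later parameter updates.
    history.append((observation, target))
    return action


def toy_search(observation: Observation) -> Policy:
    """Hypothetical stand-in for a network-guided look-ahead search."""
    return {0: 0.7, 1: 0.2, 2: 0.1}


history: List[Tuple[Observation, Policy]] = []
rng = random.Random(0)
action = act_and_record((1, 2, 3), toy_search, history, rng)
```

In this sketch the stored pairs would later serve as training examples, with the search output acting as the target for the network's own output on the same observation.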

First claim

Opening claim text (preview).

What is claimed is:

1. A method of selecting, using a neural network, actions to be performed by an agent interacting with an environment to perform a task in an attempt to achieve a specified result, wherein the neural network has a plurality of network parameters and is configured to receive an input observation characterizing a state of the environment and to process the input observation in accordance with the network parameters to generate a network output that comprises an action selection output that defines an action selection policy for selecting an action to be performed by the agent in response to the input observation, and wherein the method comprises: receiving a current observation characterizing a current state of the environment; determining a target action selection output for the current observation by performing, using the neural network and in accordance with current values of the network parameters, a look ahead search of possible future states of the environment starting from the current state until the environment reaches a possible future state that satisfies one or more termination criteria, wherein the look ahead search is a tree search of a state tree having nodes representing states of the environment starting from a root node that represents the current state, and wherein performing the look ahead search comprises adding noise to prior probabilities for the root node that are used to traverse from the root node to other nodes in the state tree; and selecting an action to be performed by the agent in response to the current observation using the target action selection output generated by performing the look ahead search.

2. The method of claim 1, wherein the action selection output defines a probability distribution over possible actions to be performed by the agent.

3. The method of claim 1, wherein the action selection output comprises a respective Q value for each of a plurality of possible actions that represents an expected return to be received if the agent performs the possible action in response to the observation.

4. The method of claim 1, wherein the action selection output identifies an optimal action to be performed by the agent in response to the observation.

5. The method of claim 1, wherein performing the look ahead search comprises evaluating leaf nodes of the state tree encountered during the look ahead search using the neural network and in accordance with current values of the network parameters.

6. The method of claim 5, wherein evaluating leaf nodes of the state tree encountered during the look ahead search using the trained neural network and in accordance with current values of the network parameters comprises, for each leaf node: adding one or more new edges from the leaf node of the state tree; processing a new observation characterizing a new state of the environment that is characterized by the leaf node using the trained neural network and in accordance with the current values of the network parameters to generate a new action selection output; and generating, using the new action selection output, a respective prior probability for each new edge.

7. The method of claim 1, wherein the current values of the network parameters have been determined by training the neural network using target network outputs determined by performing look ahead searches using the neural network.

8. The method of claim 1, wherein performing the look ahead search comprises determining a respective visit count for each of a plurality of outgoing edges from the root node, each outgoing edge representing a respective action to be performed by the agent.

9. The method of claim 8, wherein the target action selection output comprises a respective probability for each action that is represented by an outgoing edge from the root node, and wherein determining the target action selection output comprises determining the target action selection output from the respective visit counts for the outgoing edges.

10. The method of claim 1, wherein performing the look ahead search comprises traversing the state tree starting from the root node until encountering a leaf node by selecting edges to be traversed using adjusted action scores for edges in the state tree.

11. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations for selecting, using a trained neural network, actions to be performed by an agent interacting with an environment to perform a task in an attempt to achieve a specified result, wherein the trained neural network has a plurality of network parameters and is configured to receive an input observation characterizing a state of the environment and to process the input observation in accordance with the network parameters to generate a network output that comprises an action selection output that defines an action selection policy for selecting an action to be performed by the agent in response to the input observation, and wherein the operations comprise: receiving a current observation characterizing a current state of the environment; determining a target action selection output for the current observation by performing, using the neural network and in accordance with current values of the network parameters, a look ahead search of possible future states of the environment starting from the current state until the environment reaches a possible future state that satisfies one or more termination criteria, wherein the look ahead search is a tree search of a state tree having nodes representing states of the environment starting from a root node that represents the current state, and wherein performing the look ahead search comprises adding noise to prior probabilities for the root node that are used to traverse from the root node to other nodes in the state tree; and selecting an action to be performed by the agent in response to the current observation using the target action selection output generated by performing the look ahead search.

12. The system of claim 11, wherein the action selection output defines a probability distribution over possible actions to be performed by the agent.

13. The system of claim 11, wherein the action selection output comprises a respective Q value for each of a plurality of possible actions that represents an expected return to be received if the agent performs the possible action in response to the observation.

14. The system of claim 11, wherein the action selection output identifies an optimal action to be performed by the agent in response to the observation.

15. The system of claim 11, wherein performing the look ahead search comprises evaluating leaf nodes of the state tree encountered during the look ahead search using the neural network and in accordance with current values of the network parameters.

16. The system of claim 15, wherein evaluating leaf nodes of the state tree encountered during the look ahead search using the trained neural network and in accordance with current values of the network parameters comprises, for each leaf node: adding one or more new edges from the leaf node of the state tree; processing a new observation characterizing a new state of the environment that is characterized by the leaf node using the trained neural network and in accordance with the current values of the network parameters to generate a new action selection output; and generating, using the new action selection output, a respective prior probability for each new edge.
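Three steps named in the claims (adding noise to the root prior probabilities, selecting edges by adjusted action scores, and deriving the target output from root visit counts) can be illustrated with a small sketch. The claims do not fix the formulas, so everything specific here is an assumption: Dirichlet noise with `alpha` and `epsilon`, a PUCT-style score with constant `c_puct`, the `temperature` exponent, and the one-level tree with fixed hypothetical leaf values in place of neural network evaluations.

```python
import math
import random
from dataclasses import dataclass
from typing import Dict


@dataclass
class Edge:
    prior: float              # prior probability for this action
    visits: int = 0           # visit count accumulated during search
    total_value: float = 0.0  # sum of leaf evaluations backed up

    @property
    def mean_value(self) -> float:
        return self.total_value / self.visits if self.visits else 0.0


def add_root_noise(edges: Dict[int, Edge], rng: random.Random,
                   alpha: float = 0.3, epsilon: float = 0.25) -> None:
    """Mix noise into the root priors (Dirichlet noise is an assumption)."""
    gammas = [rng.gammavariate(alpha, 1.0) for _ in edges]
    total = sum(gammas)
    if total == 0.0:
        return  # degenerate sample: leave the priors unchanged
    for edge, g in zip(edges.values(), gammas):
        edge.prior = (1 - epsilon) * edge.prior + epsilon * g / total


def select_action(edges: Dict[int, Edge], c_puct: float = 1.5) -> int:
    """Pick the edge with the highest adjusted score (assumed PUCT-style)."""
    parent_visits = sum(e.visits for e in edges.values())

    def score(e: Edge) -> float:
        bonus = c_puct * e.prior * math.sqrt(parent_visits + 1) / (1 + e.visits)
        return e.mean_value + bonus

    return max(edges, key=lambda a: score(edges[a]))


def visit_count_target(edges: Dict[int, Edge],
                       temperature: float = 1.0) -> Dict[int, float]:
    """Convert root visit counts into one probability per action."""
    weights = {a: e.visits ** (1.0 / temperature) for a, e in edges.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


rng = random.Random(0)
root = {0: Edge(0.5), 1: Edge(0.3), 2: Edge(0.2)}
add_root_noise(root, rng)
leaf_values = {0: 0.1, 1: 0.8, 2: -0.2}  # hypothetical leaf evaluations
for _ in range(200):
    a = select_action(root)
    root[a].visits += 1
    root[a].total_value += leaf_values[a]
target = visit_count_target(root)
```

In this toy run the action with the best hypothetical leaf value ends up with most of the visits, so the visit-count target concentrates on it even though its prior was not the largest. A full search would descend several levels and expand leaves using the network, which this sketch omits.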

Assignees

Deepmind Tech Ltd

Inventors

Classifications

  • Reinforcement learning

  • Probabilistic graphical models, e.g. probabilistic networks

  • Probabilistic or stochastic networks

  • Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound

  • G06N3/08 (Primary): Learning methods

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12147899B2 cover?
Methods, systems and apparatus, including computer programs encoded on computer storage media, for training an action selection neural network. One of the methods includes receiving an observation characterizing a current state of the environment; determining a target network output for the observation by performing a look ahead search of possible future states of the environment starting from …
Who is the assignee on this patent?
Deepmind Tech Ltd
What technology area does this patent fall under?
Primary CPC classification G06N3/08. Mapped technology areas include Physics.
When was this patent published?
Publication date: Nov 19, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.