Training action selection neural networks using look-ahead search
US-11449750-B2 · Sep 20, 2022 · US
US12147899B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12147899-B2 |
| Application number | US-202318528640-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 4, 2023 |
| Priority date | May 26, 2017 |
| Publication date | Nov 19, 2024 |
| Grant date | Nov 19, 2024 |
Official abstract text for this publication.
Methods, systems and apparatus, including computer programs encoded on computer storage media, for training an action selection neural network. One of the methods includes receiving an observation characterizing a current state of the environment; determining a target network output for the observation by performing a look ahead search of possible future states of the environment starting from the current state until the environment reaches a possible future state that satisfies one or more termination criteria, wherein the look ahead search is guided by the neural network in accordance with current values of the network parameters; selecting an action to be performed by the agent in response to the observation using the target network output generated by performing the look ahead search; and storing, in an exploration history data store, the target network output in association with the observation for use in updating the current values of the network parameters.
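The four steps in the abstract (receive an observation, run a look-ahead search to obtain a target output, select an action from that target, and store the pair for later training) can be sketched as a single acting step. This is a minimal illustration under stated assumptions, not the patented implementation: `ExplorationHistory`, `act_and_record`, and `uniform_search` are hypothetical names, the search is passed in as a plain function, and sampling the action in proportion to the target probabilities is only one possible way of "using the target network output".

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Observation = Tuple[int, ...]   # hypothetical encoding of an environment state
TargetOutput = List[float]      # assumed form: one probability per action


@dataclass
class ExplorationHistory:
    """Hypothetical stand-in for the exploration history data store."""
    records: List[Tuple[Observation, TargetOutput]] = field(default_factory=list)

    def store(self, observation: Observation, target: TargetOutput) -> None:
        # Keep the target output in association with its observation.
        self.records.append((observation, list(target)))


def act_and_record(
    observation: Observation,
    search: Callable[[Observation], TargetOutput],
    history: ExplorationHistory,
    rng: random.Random,
) -> int:
    """Run one acting step: search, select an action, store the training pair."""
    # 1. Look-ahead search produces the target output for this observation.
    target = search(observation)
    # 2. Select an action using the target output (assumption: sample from it).
    action = rng.choices(range(len(target)), weights=target, k=1)[0]
    # 3. Store (observation, target) for later parameter updates.
    history.store(observation, target)
    return action


def uniform_search(observation: Observation) -> TargetOutput:
    """Placeholder search that ignores the network and returns a uniform target."""
    return [0.25, 0.25, 0.25, 0.25]
```

In a fuller system the `search` argument would be a tree search guided by the neural network under its current parameter values, and the stored records would later be sampled to update those parameters.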
Opening claim text (preview).
What is claimed is:
1. A method of selecting, using a neural network, actions to be performed by an agent interacting with an environment to perform a task in an attempt to achieve a specified result, wherein the neural network has a plurality of network parameters and is configured to receive an input observation characterizing a state of the environment and to process the input observation in accordance with the network parameters to generate a network output that comprises an action selection output that defines an action selection policy for selecting an action to be performed by the agent in response to the input observation, and wherein the method comprises: receiving a current observation characterizing a current state of the environment; determining a target action selection output for the current observation by performing, using the neural network and in accordance with current values of the network parameters, a look ahead search of possible future states of the environment starting from the current state until the environment reaches a possible future state that satisfies one or more termination criteria, wherein the look ahead search is a tree search of a state tree having nodes representing states of the environment starting from a root node that represents the current state, and wherein performing the look ahead search comprises adding noise to prior probabilities for the root node that are used to traverse from the root node to other nodes in the state tree; and selecting an action to be performed by the agent in response to the current observation using the target action selection output generated by performing the look ahead search.
2. The method of claim 1, wherein the action selection output defines a probability distribution over possible actions to be performed by the agent.
3. The method of claim 1, wherein the action selection output comprises a respective Q value for each of a plurality of possible actions that represents an expected return to be received if the agent performs the possible action in response to the observation.
4. The method of claim 1, wherein the action selection output identifies an optimal action to be performed by the agent in response to the observation.
5. The method of claim 1, wherein performing the look ahead search comprises evaluating leaf nodes of the state tree encountered during the look ahead search using the neural network and in accordance with current values of the network parameters.
6. The method of claim 5, wherein evaluating leaf nodes of the state tree encountered during the look ahead search using the trained neural network and in accordance with current values of the network parameters comprises, for each leaf node: adding one or more new edges from the leaf node of the state tree; processing a new observation characterizing a new state of the environment that is characterized by the leaf node using the trained neural network and in accordance with the current values of the network parameters to generate a new action selection output; and generating, using the new action selection output, a respective prior probability for each new edge.
7. The method of claim 1, wherein the current values of the network parameters have been determined by training the neural network using target network outputs determined by performing look ahead searches using the neural network.
8. The method of claim 1, wherein performing the look ahead search comprises determining a respective visit count for each of a plurality of outgoing edges from the root node, each outgoing edge representing a respective action to be performed by the agent.
9. The method of claim 8, wherein the target action selection output comprises a respective probability for each action that is represented by an outgoing edge from the root node, and wherein determining the target action selection output comprises determining the target action selection output from the respective visit counts for the outgoing edges.
10. The method of claim 1, wherein performing the look ahead search comprises traversing the state tree starting from the root node until encountering a leaf node by selecting edges to be traversed using adjusted action scores for edges in the state tree.
11. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations for selecting, using a trained neural network, actions to be performed by an agent interacting with an environment to perform a task in an attempt to achieve a specified result, wherein the trained neural network has a plurality of network parameters and is configured to receive an input observation characterizing a state of the environment and to process the input observation in accordance with the network parameters to generate a network output that comprises an action selection output that defines an action selection policy for selecting an action to be performed by the agent in response to the input observation, and wherein the operations comprise: receiving a current observation characterizing a current state of the environment; determining a target action selection output for the current observation by performing, using the neural network and in accordance with current values of the network parameters, a look ahead search of possible future states of the environment starting from the current state until the environment reaches a possible future state that satisfies one or more termination criteria, wherein the look ahead search is a tree search of a state tree having nodes representing states of the environment starting from a root node that represents the current state, and wherein performing the look ahead search comprises adding noise to prior probabilities for the root node that are used to traverse from the root node to other nodes in the state tree; and selecting an action to be performed by the agent in response to the current observation using the target action selection output generated by performing the look ahead search.
12. The system of claim 11, wherein the action selection output defines a probability distribution over possible actions to be performed by the agent.
13. The system of claim 11, wherein the action selection output comprises a respective Q value for each of a plurality of possible actions that represents an expected return to be received if the agent performs the possible action in response to the observation.
14. The system of claim 11, wherein the action selection output identifies an optimal action to be performed by the agent in response to the observation.
15. The system of claim 11, wherein performing the look ahead search comprises evaluating leaf nodes of the state tree encountered during the look ahead search using the neural network and in accordance with current values of the network parameters.
16. The system of claim 15, wherein evaluating leaf nodes of the state tree encountered during the look ahead search using the trained neural network and in accordance with current values of the network parameters comprises, for each leaf node: adding one or more new edges from the leaf node of the state tree; processing a new observation characterizing a new state of the environment that is characterized by the leaf node using the trained neural network and in accordance with the current values of the network parameters to generate a new action selection output; and generating, using the new action selection output, a respective prior probability for each new edge.
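Three mechanisms named in the claims lend themselves to a short illustration: adding noise to the root's prior probabilities (claims 1 and 11), selecting edges by adjusted action scores (claim 10), and deriving the target action selection output from root visit counts (claims 8 and 9). The sketch below is hedged throughout. The claims do not specify the noise distribution, the score formula, or how counts become probabilities, so the Dirichlet noise, the mixing weight `epsilon`, the exploration bonus scaled by `c_explore`, and the `temperature` exponent are all assumptions borrowed from commonly described tree-search variants. All function and parameter names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def add_root_noise(
    priors: Sequence[float],
    rng: random.Random,
    alpha: float = 0.3,     # assumed Dirichlet concentration
    epsilon: float = 0.25,  # assumed mixing weight for the noise
) -> List[float]:
    """Mix root prior probabilities with Dirichlet noise (assumed noise type)."""
    # A Dirichlet sample is a set of Gamma(alpha, 1) draws, normalized.
    gammas = [rng.gammavariate(alpha, 1.0) for _ in priors]
    total = sum(gammas)
    if total == 0.0:
        return list(priors)  # degenerate draw: leave the priors unchanged
    noise = [g / total for g in gammas]
    return [(1.0 - epsilon) * p + epsilon * n for p, n in zip(priors, noise)]


def select_edge(
    priors: Sequence[float],
    visit_counts: Sequence[int],
    total_values: Sequence[float],
    c_explore: float = 1.0,  # assumed exploration constant
) -> int:
    """Return the index of the edge with the highest adjusted action score.

    Assumed score: mean value plus a bonus that grows with the prior and
    shrinks as the edge is visited more often.
    """
    parent_visits = sum(visit_counts)
    best_index, best_score = 0, -math.inf
    for i, (p, n, w) in enumerate(zip(priors, visit_counts, total_values)):
        mean_value = w / n if n > 0 else 0.0
        bonus = c_explore * p * math.sqrt(parent_visits + 1) / (1 + n)
        score = mean_value + bonus
        if score > best_score:
            best_index, best_score = i, score
    return best_index


def visit_count_target(
    visit_counts: Sequence[int], temperature: float = 1.0
) -> List[float]:
    """Turn root visit counts into a probability per action (assumed rule)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not visit_counts:
        return []
    scaled = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(scaled)
    if total == 0:
        # No edge has been visited yet: fall back to a uniform target.
        return [1.0 / len(visit_counts)] * len(visit_counts)
    return [s / total for s in scaled]
```

With `temperature=1.0` the target is simply each edge's share of the root visits, so counts of 10, 30, and 60 give probabilities of roughly 0.1, 0.3, and 0.6. Lower temperatures concentrate the target on the most visited edge.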
Reinforcement learning · CPC title
Probabilistic graphical models, e.g. probabilistic networks · CPC title
Probabilistic or stochastic networks · CPC title
Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound · CPC title
Learning methods · CPC title