Method and apparatus for providing real-time monitoring of an artificial neural network
US-2015106316-A1 · Apr 16, 2015 · US
US11449750B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11449750-B2 |
| Application number | US-201816617478-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 28, 2018 |
| Priority date | May 26, 2017 |
| Publication date | Sep 20, 2022 |
| Grant date | Sep 20, 2022 |
Official abstract text for this publication.
Methods, systems and apparatus, including computer programs encoded on computer storage media, for training an action selection neural network. One of the methods includes receiving an observation characterizing a current state of the environment; determining a target network output for the observation by performing a look ahead search of possible future states of the environment starting from the current state until the environment reaches a possible future state that satisfies one or more termination criteria, wherein the look ahead search is guided by the neural network in accordance with current values of the network parameters; selecting an action to be performed by the agent in response to the observation using the target network output generated by performing the look ahead search; and storing, in an exploration history data store, the target network output in association with the observation for use in updating the current values of the network parameters.
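The loop described in the abstract can be sketched in a few lines. The example below is only an illustration under loud assumptions, not the patented implementation: the toy environment (`step`, `is_terminal`, `outcome`), the stub `network`, and the rollout-based `look_ahead_search` are all hypothetical names invented here, and the search is a plain Monte Carlo rollout rather than the tree search of claim 7. It shows the four steps in order: receive an observation, run a network-guided look ahead search until a terminal state, select an action from the search result, and store the search result with the observation.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical toy task (not from the patent): start at 0, add 1 or 2 per move,
# and try to land exactly on GOAL. Overshooting counts as failure.
ACTIONS = (1, 2)
GOAL = 5

Policy = Dict[int, float]
NetworkOutput = Tuple[Policy, float]  # (action probabilities, predicted return)


def step(state: int, action: int) -> int:
    return state + action


def is_terminal(state: int) -> bool:
    # Termination criterion for the look ahead search.
    return state >= GOAL


def outcome(state: int) -> float:
    # Return depends on whether the specified result was achieved.
    return 1.0 if state == GOAL else -1.0


def network(state: int) -> NetworkOutput:
    # Stand-in for the neural network: uniform policy, zero value estimate.
    return {a: 1.0 / len(ACTIONS) for a in ACTIONS}, 0.0


def look_ahead_search(state: int, net: Callable[[int], NetworkOutput],
                      rng: random.Random, simulations: int = 200) -> NetworkOutput:
    """Assumed simplification: rollouts sampled from the network's policy."""
    means: Dict[int, float] = {}
    for first in ACTIONS:
        total = 0.0
        for _ in range(simulations):
            s = step(state, first)
            while not is_terminal(s):
                priors, _ = net(s)
                acts = list(priors)
                weights = [priors[a] for a in acts]
                s = step(s, rng.choices(acts, weights=weights)[0])
            total += outcome(s)
        means[first] = total / simulations
    # Map mean returns in [-1, 1] to a probability distribution over actions.
    scores = {a: (m + 1.0) / 2.0 + 1e-6 for a, m in means.items()}
    norm = sum(scores.values())
    target_policy = {a: v / norm for a, v in scores.items()}
    target_value = sum(target_policy[a] * means[a] for a in ACTIONS)
    return target_policy, target_value


def run_episode(net: Callable[[int], NetworkOutput],
                rng: random.Random) -> Tuple[List[Tuple[int, NetworkOutput]], int]:
    exploration_history: List[Tuple[int, NetworkOutput]] = []
    state = 0
    while not is_terminal(state):
        target = look_ahead_search(state, net, rng)   # target network output
        exploration_history.append((state, target))   # stored for later training
        policy = target[0]
        state = step(state, max(policy, key=policy.get))  # act on search result
    return exploration_history, state


if __name__ == "__main__":
    history, final_state = run_episode(network, random.Random(0))
    print("final state:", final_state, "stored examples:", len(history))
```

The stored `(observation, target)` pairs play the role of the exploration history data store; a real system would presumably replace the stub `network` with a trained model and the rollouts with a proper tree search.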
Opening claim text (preview).
What is claimed is:
1. A method of training a neural network having a plurality of network parameters, wherein the neural network is used to select actions to be performed by an agent interacting with an environment to perform a task in an attempt to achieve a specified result, wherein the neural network is configured to receive an input observation characterizing a state of the environment and to process the input observation in accordance with the network parameters to generate a network output that comprises an action selection output that defines an action selection policy for selecting an action to be performed by the agent in response to the input observation, and wherein the method comprises: receiving a current observation characterizing a current state of the environment; determining a target network output for the current observation by performing a look ahead search of possible future states of the environment starting from the current state until the environment reaches a possible future state that satisfies one or more termination criteria, wherein the look ahead search is guided by the neural network in accordance with current values of the network parameters; selecting an action to be performed by the agent in response to the current observation using the target network output generated by performing the look ahead search; and storing, in an exploration history data store, the target network output in association with the current observation for use in updating the current values of the network parameters.
2. The method of claim 1, wherein the action selection output defines a probability distribution over possible actions to be performed by the agent.
3. The method of claim 1, wherein the action selection output comprises a respective Q value for each of a plurality of possible actions that represents an expected return to be received if the agent performs the possible action in response to the observation.
4. The method of claim 1, wherein the action selection output identifies an optimal action to be performed by the agent in response to the observation.
5. The method of claim 1, wherein the network output further comprises a predicted expected return output that is an estimate of a return resulting from the environment being in the state, and wherein determining the target network output comprises: determining a target return based on evaluating a progress of the task as of a terminal state of a current episode of interaction.
6. The method of claim 5, wherein the return is dependent on whether the specified result is achieved as of the terminal state.
7. The method of claim 1, wherein the look ahead search is a tree search of a state tree having nodes representing states of the environment starting from a root node that represents the current state.
8. The method of claim 7, wherein performing the look ahead search comprises adding noise to prior probabilities for the root node that are used to traverse from the root node to other nodes in the state tree.
9. The method claim 7, wherein performing the look ahead search comprises evaluating leaf nodes of the state tree encountered during the look ahead search using the neural network and in accordance with current values of the network parameters.
10. The method of claim 1, further comprising: obtaining, from the exploration history store, a training observation and a training target network output associated with the training observation; processing the training observation using the neural network and in accordance with the current values of the network parameters to generate a training network output; determining a gradient with respect to the network parameters of an objective function that encourages the training network output to match the training target network output; and determining an update to the current values of the network parameters from the gradient.
11. The method of claim 10, wherein the network output includes an action selection output that defines a probability distribution over possible actions to be performed by the agent and a predicted expected return output that is an estimate of a return resulting from the environment being in the state, and wherein the objective function is a weighted sum between (i) a difference between the probability distribution in the training target network output and the probability distribution in the training network output and (ii) a difference between the predicted expected return output in the training target network output and the predicted expected return output in the training network output.
12. A trained neural network system comprising a trained neural network having a plurality of trained network parameters and that is implemented by one or more computers, wherein the neural network system is configured to select actions to be performed by an agent interacting with an environment to perform a task in an attempt to achieve a specified result, wherein the trained neural network is configured to receive an input observation characterizing a state of the environment and to process the input observation in accordance with the trained network parameters to generate a network output that comprises an action selection output that defines an action selection policy for selecting an action to be performed by the agent in response to the input observation, and wherein the neural network system comprises: an input to receive a current observation characterizing a current state of the environment; and an output for selecting an action to be performed by the agent in response to the current observation according to the action selection output; and wherein the neural network system is configured to provide the output for selecting the action by performing a look ahead search, wherein the look ahead search comprises a search of possible future states of the environment starting from the current state until the environment reaches a possible future state that satisfies one or more termination criteria, and wherein the look ahead search is guided by the trained neural network in accordance with values of the network parameters.
13. A trained neural network system as claimed in claim 12 wherein the look ahead search is guided such that the search is dependent upon the action selection output from the trained neural network.
14. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations for training a neural network having a plurality of network parameters, wherein the neural network is used to select actions to be performed by an agent interacting with an environment to perform a task in an attempt to achieve a specified result, wherein the neural network is configured to receive an input observation characterizing a state of the environment and to process the input observation in accordance with the network parameters to generate a network output that comprises an action selection output that defines an action selection policy for selecting an action to be performed by the agent in response to the input observation, and wherein the operations comprise: receiving a current observation characterizing a current state of the environment; determining a target network output for the current observation by performing a look ahead search of possible future states of the environment starting from the current state until the environment reaches a possible future state that satisfies one or more termination criteria, wherein the look ahead search is guided by the neural
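Two of the claimed steps are easy to illustrate in isolation: the root-prior noise of claim 8 and the weighted objective of claim 11. The sketch below is a hedged illustration only. The claims say "adding noise" and "a difference" without naming a noise distribution or a distance measure, so the Dirichlet noise, the cross-entropy policy term, the squared-error value term, and every name and default (`add_root_noise`, `objective`, `epsilon`, `alpha`, `value_weight`) are assumptions made here. The gradient step of claim 10 is not shown.

```python
import math
import random
from typing import Dict


def add_root_noise(priors: Dict[int, float], rng: random.Random,
                   epsilon: float = 0.25, alpha: float = 0.3) -> Dict[int, float]:
    """Mix root prior probabilities with noise (Dirichlet noise is an assumption)."""
    actions = list(priors)
    # A Dirichlet sample is a set of independent Gamma draws, normalized.
    draws = [rng.gammavariate(alpha, 1.0) for _ in actions]
    total = sum(draws)
    if total == 0.0:
        return dict(priors)  # degenerate sample: leave the priors unchanged
    return {a: (1.0 - epsilon) * priors[a] + epsilon * d / total
            for a, d in zip(actions, draws)}


def objective(target_policy: Dict[int, float], target_value: float,
              pred_policy: Dict[int, float], pred_value: float,
              value_weight: float = 1.0) -> float:
    """Weighted sum of a policy difference and a return difference.

    Assumed choices: cross-entropy for the policy term, squared error for
    the predicted-return term.
    """
    policy_term = -sum(p * math.log(max(pred_policy[a], 1e-12))
                       for a, p in target_policy.items())
    value_term = (target_value - pred_value) ** 2
    return policy_term + value_weight * value_term


if __name__ == "__main__":
    rng = random.Random(0)
    print(add_root_noise({0: 0.5, 1: 0.3, 2: 0.2}, rng))
    print(objective({0: 0.7, 1: 0.3}, 1.0, {0: 0.6, 1: 0.4}, 0.5))
```

With these assumed terms the objective is smallest when the predicted distribution equals the search distribution and the predicted return equals the target return, which is the "encourages the training network output to match" behavior claim 10 asks for.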
Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound · CPC title
Probabilistic or stochastic networks · CPC title
Probabilistic graphical models, e.g. probabilistic networks · CPC title
Reinforcement learning · CPC title
Learning methods · CPC title