Training action-selection neural networks from demonstrations using multiple losses
US-11604941-B1 · Mar 14, 2023 · US
US12299574B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12299574-B2 |
| Application number | US-202318487428-A |
| Country | US |
| Kind code | B2 |
| Filing date | Oct 16, 2023 |
| Priority date | Feb 5, 2018 |
| Publication date | May 13, 2025 |
| Grant date | May 13, 2025 |
Official abstract text for this publication.
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training an action selection neural network used to select actions to be performed by an agent interacting with an environment. In one aspect, a system comprises a plurality of actor computing units and a plurality of learner computing units. The actor computing units generate experience tuple trajectories that are used by the learner computing units to update learner action selection neural network parameters using a reinforcement learning technique. The reinforcement learning technique may be an off-policy actor critic reinforcement learning technique.
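The data flow in the abstract, with actors producing experience tuple trajectories and a learner consuming them to update parameters, can be sketched roughly as follows. This is a minimal illustration only, and everything in it is an assumption made for the example: the `Experience` fields, the toy environment, the random stand-in policy, the single learner thread (the abstract describes a plurality of learners), and the placeholder parameter update, which is not the off-policy actor-critic technique the abstract refers to.

```python
import queue
import random
import threading
from typing import List, NamedTuple


class Experience(NamedTuple):
    """Hypothetical experience tuple for one time step."""
    observation: float
    action: int
    reward: float


def actor(actor_id: int, num_trajectories: int, unroll_length: int,
          out_queue: "queue.Queue[List[Experience]]") -> None:
    """Hypothetical actor: steps a toy environment and emits trajectories."""
    rng = random.Random(actor_id)
    for _ in range(num_trajectories):
        trajectory = []
        for _ in range(unroll_length):
            observation = rng.random()
            action = rng.randrange(2)  # stand-in for the actor's policy
            reward = 1.0 if action == 1 else 0.0
            trajectory.append(Experience(observation, action, reward))
        out_queue.put(trajectory)


def learner(in_queue: "queue.Queue[List[Experience]]",
            num_batches: int, batch_size: int) -> float:
    """Hypothetical learner: pulls batches of trajectories and updates a parameter."""
    param = 0.0
    for _ in range(num_batches):
        batch = [in_queue.get() for _ in range(batch_size)]
        rewards = [e.reward for trajectory in batch for e in trajectory]
        mean_reward = sum(rewards) / len(rewards)
        # Placeholder update only; not the reinforcement learning technique
        # described in the patent.
        param += 0.1 * (mean_reward - param)
    return param


NUM_ACTORS, TRAJ_PER_ACTOR, UNROLL, BATCH = 4, 5, 8, 2
trajectory_queue: "queue.Queue[List[Experience]]" = queue.Queue()

# Actors run in parallel and push trajectories to a shared queue.
actor_threads = [
    threading.Thread(target=actor,
                     args=(i, TRAJ_PER_ACTOR, UNROLL, trajectory_queue))
    for i in range(NUM_ACTORS)
]
for thread in actor_threads:
    thread.start()

# The learner consumes every trajectory the actors produce (20 in total).
final_param = learner(trajectory_queue,
                      NUM_ACTORS * TRAJ_PER_ACTOR // BATCH, BATCH)
for thread in actor_threads:
    thread.join()
```

Decoupling actors from the learner through a queue means the trajectories may have been generated under a slightly older policy than the one being trained, which is one reason an off-policy technique is relevant in this kind of setup.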
Opening claim text (preview).
The invention claimed is:
1. A method performed by one or more computers for training an action selection neural network used to select actions to be performed by an agent interacting with an environment, the method comprising: generating a plurality of trajectories of experience tuples, wherein: each trajectory of experience tuples characterizes interaction of a respective instance of the agent with a respective instance of the environment over a sequence of time steps while the respective instance of the agent performs actions selected in accordance with a respective action selection policy; and each trajectory of experience tuples comprises a sequence of experience tuples with each experience tuple corresponding to a respective time step and comprising a respective observation characterizing a state of the respective instance of the environment at the time step; selecting a batch of trajectories of experience tuples from the plurality of trajectories of experience tuples; and training the action selection neural network on the batch of trajectories of experience tuples, comprising: processing the observations included in the experience tuples in the trajectories in the batch using the action selection neural network to generate, for each observation in each experience tuple, a respective action selection output, comprising: processing each observation included in each experience tuple in each trajectory in the batch in parallel by one or more neural network layers of the action selection neural network; and updating values of a set of parameters of the action selection neural network using the action selection outputs generated by the action selection neural network.
2. The method of claim 1, wherein the action selection neural network includes a convolutional block comprising one or more convolutional neural network layers; and wherein processing each observation included in each experience tuple in each trajectory in the batch in parallel by one or more neural network layers of the action selection neural network comprises: processing each observation included in each experience tuple in each trajectory in the batch in parallel using the convolutional block of the action selection neural network to generate a respective convolutional block output for each observation.
3. The method of claim 2, wherein processing each observation included in each experience tuple in each trajectory in the batch in parallel using the convolutional block of the action selection neural network comprises: folding a time dimension of the observations into a batch dimension of the observations prior to processing the observations using the convolutional block of the action selection neural network.
4. The method of claim 3, wherein the action selection neural network further includes a recurrent block comprising one or more recurrent neural network layers; and wherein processing the observations included in the experience tuples in the trajectories in the batch using the action selection neural network comprises: processing a respective convolutional block output for each observation included in each experience tuple in each trajectory using the recurrent block to generate a respective recurrent block output for each observation.
5. The method of claim 4, wherein the action selection neural network further includes a fully connected block comprising one or more fully connected neural network layers; and wherein processing each observation included in each experience tuple in each trajectory in the batch in parallel by one or more neural network layers of the action selection neural network comprises: processing a respective recurrent block output for each observation included in each experience tuple in each trajectory in the batch in parallel using the fully connected block of the action selection neural network to generate the respective action selection output for each observation.
6. The method of claim 5, wherein processing a respective recurrent block output for each observation included in each experience tuple in each trajectory in the batch in parallel using the fully connected block of the action selection neural network comprises: folding a time dimension of the recurrent block outputs into a batch dimension of the recurrent block outputs prior to processing the recurrent block outputs using the fully connected block of the action selection neural network.
7. The method of claim 1, wherein each trajectory of experience tuples of the plurality of trajectories of experience tuples is generated by a respective actor computing unit of a plurality of actor computing units; wherein the plurality of actor computing units operate in parallel to generate the plurality of experience tuple trajectories.
8. A system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform operations computers for training an action selection neural network used to select actions to be performed by an agent interacting with an environment, the operations comprising: generating a plurality of trajectories of experience tuples, wherein: each trajectory of experience tuples characterizes interaction of a respective instance of the agent with a respective instance of the environment over a sequence of time steps while the respective instance of the agent performs actions selected in accordance with a respective action selection policy; and each trajectory of experience tuples comprises a sequence of experience tuples with each experience tuple corresponding to a respective time step and comprising a respective observation characterizing a state of the respective instance of the environment at the time step; selecting a batch of trajectories of experience tuples from the plurality of trajectories of experience tuples; and training the action selection neural network on the batch of trajectories of experience tuples, comprising: processing the observations included in the experience tuples in the trajectories in the batch using the action selection neural network to generate, for each observation in each experience tuple, a respective action selection output, comprising: processing each observation included in each experience tuple in each trajectory in the batch in parallel by one or more neural network layers of the action selection neural network; and updating values of a set of parameters of the action selection neural network using the action selection outputs generated by the action selection neural network.
9. The system of claim 8, wherein the action selection neural network includes a convolutional block comprising one or more convolutional neural network layers; and wherein processing each observation included in each experience tuple in each trajectory in the batch in parallel by one or more neural network layers of the action selection neural network comprises: processing each observation included in each experience tuple in each trajectory in the batch in parallel using the convolutional block of the action selection neural network to generate a respective convolutional block output for each observation.
10. The system of claim 9, wherein processing each observation included in each experience tuple in each trajectory in the batch in parallel using the convolutional block of the action selection neural network comprises: folding a time dimension of the observations into a batch dimension of the observations prior to processing the observations using the convolutional block of the acti
Combinations of networks · CPC title
Reinforcement learning · CPC title
Distributed learning, e.g. federated learning · CPC title
using electronic means · CPC title
based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO] · CPC title