What technology area does this patent fall under?

Primary CPC classification G06N3/092. Mapped technology areas include Physics.

When was this patent published?

Publication date: May 13, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Distributed training using actor-critic reinforcement learning with off-policy correction factors

US12299574B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12299574-B2
Application number	US-202318487428-A
Country	US
Kind code	B2
Filing date	Oct 16, 2023
Priority date	Feb 5, 2018
Publication date	May 13, 2025
Grant date	May 13, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training an action selection neural network used to select actions to be performed by an agent interacting with an environment. In one aspect, a system comprises a plurality of actor computing units and a plurality of learner computing units. The actor computing units generate experience tuple trajectories that are used by the learner computing units to update learner action selection neural network parameters using a reinforcement learning technique. The reinforcement learning technique may be an off-policy actor critic reinforcement learning technique.
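The actor/learner split described in the abstract can be pictured with a small sketch. This is an illustration under stated assumptions, not the patented implementation: it uses Python threads and a single in-process queue in place of distributed computing units, a single learner instead of a plurality, random numbers in place of a real environment and policy, and a placeholder where the reinforcement learning update would go. The names `run_actor`, `run_learner`, and the constants are hypothetical, as are the action and reward fields of each tuple.

```python
import queue
import random
import threading

# Illustrative sizes only; none of these values come from the patent.
TRAJECTORY_LENGTH = 5        # time steps per trajectory
BATCH_SIZE = 4               # trajectories consumed per learner update
NUM_ACTORS = 3
TRAJECTORIES_PER_ACTOR = 4


def run_actor(actor_id, trajectory_queue):
    """Generate trajectories of (observation, action, reward) tuples."""
    rng = random.Random(actor_id)  # per-actor generator, no shared state
    for _ in range(TRAJECTORIES_PER_ACTOR):
        trajectory = []
        for _ in range(TRAJECTORY_LENGTH):
            observation = rng.random()      # stand-in for an environment observation
            action = rng.choice([0, 1])     # stand-in for the actor's policy
            reward = 1.0 if action == 1 else 0.0
            trajectory.append((observation, action, reward))
        trajectory_queue.put((actor_id, trajectory))


def run_learner(trajectory_queue, num_updates):
    """Pull batches of trajectories from the queue, one batch per update."""
    batches = []
    for _ in range(num_updates):
        batch = [trajectory_queue.get() for _ in range(BATCH_SIZE)]
        # Placeholder: a real learner would compute a parameter update here.
        batches.append(batch)
    return batches


def main():
    trajectory_queue = queue.Queue()
    actors = [
        threading.Thread(target=run_actor, args=(i, trajectory_queue))
        for i in range(NUM_ACTORS)
    ]
    for actor in actors:
        actor.start()
    # The actors produce 12 trajectories in total, enough for 3 batches of 4.
    num_updates = NUM_ACTORS * TRAJECTORIES_PER_ACTOR // BATCH_SIZE
    batches = run_learner(trajectory_queue, num_updates)
    for actor in actors:
        actor.join()
    return batches
```

Because actors keep producing while the learner consumes, the trajectories in a batch may come from a policy that is older than the learner's current parameters, which is why the abstract describes the learning technique as off-policy.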

First claim

Opening claim text (preview).

The invention claimed is:
1. A method performed by one or more computers for training an action selection neural network used to select actions to be performed by an agent interacting with an environment, the method comprising: generating a plurality of trajectories of experience tuples, wherein: each trajectory of experience tuples characterizes interaction of a respective instance of the agent with a respective instance of the environment over a sequence of time steps while the respective instance of the agent performs actions selected in accordance with a respective action selection policy; and each trajectory of experience tuples comprises a sequence of experience tuples with each experience tuple corresponding to a respective time step and comprising a respective observation characterizing a state of the respective instance of the environment at the time step; selecting a batch of trajectories of experience tuples from the plurality of trajectories of experience tuples; and training the action selection neural network on the batch of trajectories of experience tuples, comprising: processing the observations included in the experience tuples in the trajectories in the batch using the action selection neural network to generate, for each observation in each experience tuple, a respective action selection output, comprising: processing each observation included in each experience tuple in each trajectory in the batch in parallel by one or more neural network layers of the action selection neural network; and updating values of a set of parameters of the action selection neural network using the action selection outputs generated by the action selection neural network.
2. The method of claim 1, wherein the action selection neural network includes a convolutional block comprising one or more convolutional neural network layers; and wherein processing each observation included in each experience tuple in each trajectory in the batch in parallel by one or more neural network layers of the action selection neural network comprises: processing each observation included in each experience tuple in each trajectory in the batch in parallel using the convolutional block of the action selection neural network to generate a respective convolutional block output for each observation.
3. The method of claim 2, wherein processing each observation included in each experience tuple in each trajectory in the batch in parallel using the convolutional block of the action selection neural network comprises: folding a time dimension of the observations into a batch dimension of the observations prior to processing the observations using the convolutional block of the action selection neural network.
4. The method of claim 3, wherein the action selection neural network further includes a recurrent block comprising one or more recurrent neural network layers; and wherein processing the observations included in the experience tuples in the trajectories in the batch using the action selection neural network comprises: processing a respective convolutional block output for each observation included in each experience tuple in each trajectory using the recurrent block to generate a respective recurrent block output for each observation.
5. The method of claim 4, wherein the action selection neural network further includes a fully connected block comprising one or more fully connected neural network layers; and wherein processing each observation included in each experience tuple in each trajectory in the batch in parallel by one or more neural network layers of the action selection neural network comprises: processing a respective recurrent block output for each observation included in each experience tuple in each trajectory in the batch in parallel using the fully connected block of the action selection neural network to generate the respective action selection output for each observation.
6. The method of claim 5, wherein processing a respective recurrent block output for each observation included in each experience tuple in each trajectory in the batch in parallel using the fully connected block of the action selection neural network comprises: folding a time dimension of the recurrent block outputs into a batch dimension of the recurrent block outputs prior to processing the recurrent block outputs using the fully connected block of the action selection neural network.
7. The method of claim 1, wherein each trajectory of experience tuples of the plurality of trajectories of experience tuples is generated by a respective actor computing unit of a plurality of actor computing units; wherein the plurality of actor computing units operate in parallel to generate the plurality of experience tuple trajectories.
8. A system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform operations computers for training an action selection neural network used to select actions to be performed by an agent interacting with an environment, the operations comprising: generating a plurality of trajectories of experience tuples, wherein: each trajectory of experience tuples characterizes interaction of a respective instance of the agent with a respective instance of the environment over a sequence of time steps while the respective instance of the agent performs actions selected in accordance with a respective action selection policy; and each trajectory of experience tuples comprises a sequence of experience tuples with each experience tuple corresponding to a respective time step and comprising a respective observation characterizing a state of the respective instance of the environment at the time step; selecting a batch of trajectories of experience tuples from the plurality of trajectories of experience tuples; and training the action selection neural network on the batch of trajectories of experience tuples, comprising: processing the observations included in the experience tuples in the trajectories in the batch using the action selection neural network to generate, for each observation in each experience tuple, a respective action selection output, comprising: processing each observation included in each experience tuple in each trajectory in the batch in parallel by one or more neural network layers of the action selection neural network; and updating values of a set of parameters of the action selection neural network using the action selection outputs generated by the action selection neural network.
9. The system of claim 8, wherein the action selection neural network includes a convolutional block comprising one or more convolutional neural network layers; and wherein processing each observation included in each experience tuple in each trajectory in the batch in parallel by one or more neural network layers of the action selection neural network comprises: processing each observation included in each experience tuple in each trajectory in the batch in parallel using the convolutional block of the action selection neural network to generate a respective convolutional block output for each observation.
10. The system of claim 9, wherein processing each observation included in each experience tuple in each trajectory in the batch in parallel using the convolutional block of the action selection neural network comprises: folding a time dimension of the observations into a batch dimension of the observations prior to processing the observations using the convolutional block of the acti

Assignees

Deepmind Tech Ltd

Classifications

G06N3/045
Combinations of networks · CPC title
G06N3/092 · Primary
Reinforcement learning · CPC title
G06N3/098 · Primary
Distributed learning, e.g. federated learning · CPC title
G06N3/063
using electronic means · CPC title
G06N3/006
based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO] · CPC title

Patent family

Related publications grouped by family.

View patent family 65324355

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12299574B2 cover?: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training an action selection neural network used to select actions to be performed by an agent interacting with an environment. In one aspect, a system comprises a plurality of actor computing units and a plurality of learner computing units. The actor computing units generate experience tuple…
Who is the assignee on this patent?: Deepmind Tech Ltd