Multi-agent reinforcement learning with matchmaking policies

US12067491B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12067491-B2
Application number	US-202318131567-A
Country	US
Kind code	B2
Filing date	Apr 6, 2023
Priority date	Jan 24, 2019
Publication date	Aug 20, 2024
Grant date	Aug 20, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training a policy neural network having a plurality of policy parameters and used to select actions to be performed by an agent to control the agent to perform a particular task while interacting with one or more other agents in an environment. In one aspect, the method includes: maintaining data specifying a pool of candidate action selection policies; maintaining data specifying respective matchmaking policy; and training the policy neural network using a reinforcement learning technique to update the policy parameters. The policy parameters define policies to be used in controlling the agent to perform the particular task.

First claim

Opening claim text (preview).

What is claimed is:

1. A method of training a policy neural network having a plurality of policy parameters and used to select actions to be performed by an agent to control the agent to perform a particular task while interacting with one or more other agents in an environment, the method comprising:
maintaining data specifying a pool of candidate action selection policies, the pool of candidate action selection policies comprising:
(i) a plurality of learner polices for controlling the agent, wherein each learner policy is defined by a respective set of adjustable policy parameters of the policy neural network that are adjusted during the training of the policy neural network, and
(ii) one or more fixed policies for controlling the agent, wherein each fixed policy is defined by a respective set of nonadjustable policy parameters that are not adjusted alongside the respective sets of adjustable policy parameters during the training of the policy neural network, and wherein during the training, at least some of actions performed by the one or more other agents are selected by using the one or more fixed policies and in accordance with the respective sets of nonadjustable policy parameters;
maintaining, for each of the learner policies, data specifying a respective matchmaking policy for the learner policy that defines a distribution over the pool of candidate action selection policies;
at each of a plurality of training iterations:
for each of one or more of the learner policies:
selecting one or more policies from the pool of candidate action selection policies using the matchmaking policy for the learner policy;
generating training data for the learner policy by causing a first agent controlled using the learner policy to perform the particular task while interacting with one or more second agents, each second agent controlled by a respective one of the selected policies; and
updating the respective set of policy parameters that define the learner policy by training the learner policy on the training data through reinforcement learning to optimize a reinforcement learning loss function for the learner policy.

2. The method of claim 1, wherein the one or more fixed policies comprise: a first fixed policy defined by a first set of nonadjustable policy parameters having values that have been determined through supervised learning that took place prior to the training of the policy neural network.

3. The method of claim 1, wherein the one or more fixed policies comprise: a second fixed policy defined by a second set of nonadjustable policy parameters that encode deterministic action selection logics.

4. The method of claim 2, wherein selecting the one or more policies from the pool of candidate action selection policies using the matchmaking policy for the learner policy comprises: selecting the first fixed policy.

5. The method of claim 3, wherein selecting the one or more policies from the pool of candidate action selection policies using the matchmaking policy for the learner policy comprises: selecting the second fixed policy.

6. The method of claim 1, wherein the agent is a mechanical agent, and the environment is a real-world environment.

7. The method of claim 6, wherein the mechanical agent comprises a robot, an autonomous, or semi-autonomous vehicle, and wherein causing the agent to perform the action comprises generating control inputs to control the mechanical agent in the real-world environment.

8. The method of claim 1, wherein the agent is an electronic agent, and the environment is a simulated environment.

9. The method of claim 8, wherein causing the agent to perform the action comprises generating control inputs to control the electronic agent in the simulated environment.

10. The method of claim 1, wherein controlling the agent to perform the particular task while interacting with the one or more other agents in the environment comprises: controlling the agent to cooperate with the one or more other agents in the environment to perform the particular task.

11. The method of claim 1, wherein controlling the agent to perform the particular task while interacting with the one or more other agents in the environment comprises: controlling the agent to compete with the one or more other agents in the environment to perform the particular task.

12. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations for training a policy neural network having a plurality of policy parameters and used to select actions to be performed by an agent to control the agent to perform a particular task while interacting with one or more other agents in an environment, wherein the operations comprise:
maintaining data specifying a pool of candidate action selection policies, the pool of candidate action selection policies comprising:
(i) a plurality of learner polices for controlling the agent, wherein each learner policy is defined by a respective set of adjustable policy parameters of the policy neural network that are adjusted during the training of the policy neural network, and
(ii) one or more fixed policies for controlling the agent, wherein each fixed policy is defined by a respective set of nonadjustable policy parameters that are not adjusted alongside the respective sets of adjustable policy parameters during the training of the policy neural network, and wherein during the training, at least some of actions performed by the one or more other agents are selected by using the one or more fixed policies and in accordance with the respective sets of nonadjustable policy parameters;
maintaining, for each of the learner policies, data specifying a respective matchmaking policy for the learner policy that defines a distribution over the pool of candidate action selection policies;
at each of a plurality of training iterations:
for each of one or more of the learner policies:
selecting one or more policies from the pool of candidate action selection policies using the matchmaking policy for the learner policy;
generating training data for the learner policy by causing a first agent controlled using the learner policy to perform the particular task while interacting with one or more second agents, each second agent controlled by a respective one of the selected policies; and
updating the respective set of policy parameters that define the learner policy by training the learner policy on the training data through reinforcement learning to optimize a reinforcement learning loss function for the learner policy.

13. The system of claim 12, wherein the one or more fixed policies comprise one or both of: a first fixed policy defined by a first set of nonadjustable policy parameters having values that have been determined through supervised learning that took place prior to the training of the policy neural network, or a second fixed policy defined by a second set of nonadjustable policy parameters that encode deterministic action selection logics.

14. The system of claim 13, wherein selecting the one or more policies from the pool of candidate action selection policies using the matchmaking policy for the learner policy comprises: selecting the first fixed policy.

15. The system of claim 13, wherein selecting the one or more policies from the pool of candidate action selection policies using the matchmaking policy for the learner policy comprises: selecting the second fixed policy.

16. The system of claim 12, wherein
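
As a rough illustration of the loop recited in claim 1, the Python sketch below keeps a pool of learner and fixed policies, gives each learner a matchmaking distribution over that pool, and at every iteration samples an opponent, plays one interaction, and updates only the learner. Everything concrete in it is an assumption made for illustration and does not come from the patent: the `Policy` class, the single `logit` value standing in for network parameters, the two-action "match the other agent" reward, and the REINFORCE-style update.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Policy:
    """One candidate policy; `logit` is a one-number stand-in for its parameters."""
    name: str
    logit: float
    adjustable: bool  # True for a learner policy, False for a fixed policy

    def prob_one(self) -> float:
        # Probability of choosing action 1 (the only other action is 0).
        return 1.0 / (1.0 + math.exp(-self.logit))

    def act(self, rng: random.Random) -> int:
        return 1 if rng.random() < self.prob_one() else 0


def train(pool, matchmaking, iterations, lr=0.1, seed=0):
    """pool: name -> Policy; matchmaking: learner name -> {pool name: weight}."""
    rng = random.Random(seed)
    names = list(pool)
    learners = [p for p in pool.values() if p.adjustable]
    for _ in range(iterations):
        for learner in learners:
            # Select an opponent using this learner's matchmaking distribution.
            weights = [matchmaking[learner.name].get(n, 0.0) for n in names]
            opponent = pool[rng.choices(names, weights=weights, k=1)[0]]
            # Generate training data: a single toy interaction.
            a, b = learner.act(rng), opponent.act(rng)
            reward = 1.0 if a == b else -1.0  # hypothetical cooperative reward
            # REINFORCE step on the learner only: d/dlogit log pi(a) = a - p.
            learner.logit += lr * reward * (a - learner.prob_one())
    return pool


# Hypothetical pool: two learners plus one fixed policy.
pool = {
    "learner_a": Policy("learner_a", 0.0, adjustable=True),
    "learner_b": Policy("learner_b", 0.0, adjustable=True),
    # A very large logit makes this fixed policy pick action 1 every time.
    "fixed_one": Policy("fixed_one", 50.0, adjustable=False),
}
matchmaking = {
    "learner_a": {"fixed_one": 0.5, "learner_b": 0.5},
    "learner_b": {"fixed_one": 0.5, "learner_a": 0.25, "learner_b": 0.25},
}
train(pool, matchmaking, iterations=500)
print({name: round(p.logit, 2) for name, p in pool.items()})
```

The fixed policy is sampled as an opponent but never appears in the update step, which mirrors the claim's distinction between adjustable and nonadjustable parameter sets. A learner may also be matched against another learner or against itself, since the matchmaking distribution ranges over the whole pool.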

Assignees

Deepmind Tech Ltd

Inventors

Classifications

G06N3/0985
Hyperparameter optimisation; Meta-learning; Learning-to-learn · CPC title
G06N3/092
Reinforcement learning · CPC title
G06N3/09
Supervised learning · CPC title
G06F18/214
Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title
H04L63/205
involving negotiation or determination of the one or more network security mechanisms to be used, e.g. by negotiation between the client and the server or between peers or by selection according to the capabilities of the entities involved (negotiation of communication capabilities H04L69/24) · CPC title
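
The page maps the primary classification G06N3/08 to the technology area "Physics". One plausible reading is that the area is taken from the first letter of the CPC symbol, since section G of the CPC scheme is titled Physics. The sketch below splits a symbol into its parts under that reading. The names `parse_cpc` and `CPC_SECTIONS` are hypothetical, the section titles are abbreviated, and how this site actually derives the mapping is not stated on the page.

```python
import re

# Top-level CPC sections (titles abbreviated).
CPC_SECTIONS = {
    "A": "Human Necessities",
    "B": "Performing Operations; Transporting",
    "C": "Chemistry; Metallurgy",
    "D": "Textiles; Paper",
    "E": "Fixed Constructions",
    "F": "Mechanical Engineering; Lighting; Heating; Weapons; Blasting",
    "G": "Physics",
    "H": "Electricity",
    "Y": "General Tagging of New Technological Developments",
}

# section letter, two-digit class, subclass letter, main group, "/", subgroup
_CPC_PATTERN = re.compile(r"^([A-HY])(\d{2})([A-Z])(\d{1,4})/(\d{2,6})$")


def parse_cpc(symbol: str) -> dict:
    """Split a CPC symbol such as 'G06N3/08' into its hierarchical parts."""
    match = _CPC_PATTERN.match(symbol.replace(" ", ""))
    if match is None:
        raise ValueError(f"not a recognisable CPC symbol: {symbol!r}")
    section, cls, subclass, group, subgroup = match.groups()
    return {
        "section": section,
        "section_title": CPC_SECTIONS[section],
        "class": section + cls,
        "subclass": section + cls + subclass,
        "main_group": group,
        "subgroup": subgroup,
    }


for code in ["G06N3/08", "G06N3/092", "G06F18/214", "H04L63/205"]:
    parts = parse_cpc(code)
    print(code, parts["subclass"], parts["section_title"])
```

Under this reading, the H04L63/205 entry above would fall under section H (Electricity) rather than Physics, which is why only the primary classification is used for the mapped area.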

Patent family

Related publications grouped by family.

View patent family 69232860

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12067491B2 cover?

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training a policy neural network having a plurality of policy parameters and used to select actions to be performed by an agent to control the agent to perform a particular task while interacting with one or more other agents in an environment. In one aspect, the method includes: maintaining d…

Who is the assignee on this patent?

Deepmind Tech Ltd

What technology area does this patent fall under?

Primary CPC classification G06N3/08. Mapped technology areas include Physics.

When was this patent published?

Publication date Aug 20, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Related patents

Multi-agent reinforcement learning with matchmaking policies

Cooperative multi-goal, multi-agent, multi-stage reinforcement learning

Systems, methods and controllers for an autonomous vehicle that implement autonomous driver agents and driving policy learners for generating and improving policies based on collective driving experiences of the autonomous driver agents

Interaction-aware decision making

Methods and systems for reinforcement learning

Hybrid reward architecture for reinforcement learning
