Multi-agent reinforcement learning with matchmaking policies

US12572803B2 · US · B2

Patent metadata
Field                Value
Publication number   US-12572803-B2
Application number   US-202418771770-A
Country              US
Kind code            B2
Filing date          Jul 12, 2024
Priority date        Jan 24, 2019
Publication date     Mar 10, 2026
Grant date           Mar 10, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training a policy neural network having a plurality of policy parameters and used to select actions to be performed by an agent to control the agent to perform a particular task while interacting with one or more other agents in an environment. In one aspect, the method includes: maintaining data specifying a pool of candidate action selection policies; maintaining data specifying respective matchmaking policy; and training the policy neural network using a reinforcement learning technique to update the policy parameters. The policy parameters define policies to be used in controlling the agent to perform the particular task.
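The training procedure summarized in the abstract can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the patented implementation: the `Candidate` class, the scalar `skill` standing in for policy parameters, the win/loss toy environment in `play_episode`, and the placeholder `update` rule are all hypothetical names and simplifications that do not come from the patent text.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """One candidate action selection policy (hypothetical stand-in)."""
    name: str
    skill: float = 0.0  # toy scalar used in place of real policy parameters
    # This candidate's matchmaking policy: a non-negative score per pool member.
    match_scores: dict = field(default_factory=dict)


def select_opponent(learner, pool, rng):
    """Sample one opponent from the pool according to the learner's scores."""
    # Assumption: pool members without a recorded score get a default of 1.0.
    weights = [learner.match_scores.get(c.name, 1.0) for c in pool]
    return rng.choices(pool, weights=weights, k=1)[0]


def play_episode(learner, opponent, rng):
    """Toy environment (assumption): reward 1.0 for a win, 0.0 for a loss."""
    p_win = 1.0 / (1.0 + 10.0 ** (opponent.skill - learner.skill))
    return 1.0 if rng.random() < p_win else 0.0


def update(learner, reward, step_size=0.05):
    """Placeholder for a reinforcement learning update, not a real algorithm."""
    learner.skill += step_size * (reward - 0.5)


def train(pool, steps, seed=0):
    """Pick a learner, matchmake an opponent, generate data, update the learner."""
    rng = random.Random(seed)
    for _ in range(steps):
        learner = rng.choice(pool)
        opponent = select_opponent(learner, pool, rng)
        reward = play_episode(learner, opponent, rng)
        update(learner, reward)
    return pool
```

Calling `train([Candidate("a"), Candidate("b"), Candidate("c")], steps=100)` runs the loop on a three-member pool. In a real system the `update` step would be replaced by an actual reinforcement learning algorithm acting on neural network parameters.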

First claim

Opening claim text (preview).

What is claimed is:

1. A computed-implemented method comprising: maintaining a pool of candidate action selection policies; maintaining, for each candidate action selection policy in the pool, a respective matchmaking policy that is configured to define a score distribution over the pool of candidate action selection policies, wherein the score distribution has a respective score for each of one or more previously selected candidate action selection policies of the candidate action selection policies that have been previously selected for generating training data, and wherein, for each of the one or more previously selected candidate action selection policies, the respective matchmaking policy is configured such that the respective score is dependent on a level of performance of the previously selected candidate action selection policy in controlling an agent to perform a task when interacting with one or more other agents that were controlled by other previously selected candidate action selection policies; training one or more of the candidate action selection policies in the pool using one or more of the matchmaking policies, wherein the training comprises, for a particular candidate action selection policy: selecting, in accordance with the score distribution defined by the respective matchmaking policy for the particular candidate action selection policy, one or more candidate action selection policies from the pool; generating training data for the particular candidate action selection policy by causing an agent controlled using the particular candidate action selection policy to perform the task while interacting with one or more other agents that are controlled by the selected one or more candidate action selection policies; and updating the particular candidate action selection policy through reinforcement learning by using the training data.

2. The method of claim 1, wherein the candidate action selection policies are each defined by a respective set of neural network parameters, and wherein updating the particular candidate action selection policy comprises updating values of a respective set of neural network parameters that define the particular candidate action selection policy.

3. The method of claim 1, wherein the respective score for each of the one or more previously selected candidate action selection policies comprises a probability score.

4. The method of claim 1, wherein the respective score for each of the one or more previously selected candidate action selection policies comprises a weighted probability score that is computed by multiplying a weight value with a probability score, wherein the weight value is computed using a weighting function.

5. The method of claim 4, wherein the weight values to be multiplied with different probability scores are different.

6. The method of claim 4, wherein the weight values to be multiplied with different probability scores are the same.

7. The method of claim 3, wherein for each of one or more previously selected candidate action selection policies, the respective score is configured to be proportional to the level of performance of the previously selected candidate action selection policy in controlling the agent to perform the task.

8. The method of claim 4, wherein for each of one or more previously selected candidate action selection policies, the weight value to be multiplied with the respective score is configured to be proportional to the performance of the previously selected candidate action selection policy in controlling the agent to perform the task.

9. A system comprising one or more computers and one or more storage devices storing instructions that when executed by one or more computers cause the one or more computers to perform operations comprising: maintaining a pool of candidate action selection policies; maintaining, for each candidate action selection policy in the pool, a respective matchmaking policy that is configured to define a score distribution over the pool of candidate action selection policies, wherein the score distribution has a respective score for each of one or more previously selected candidate action selection policies of the candidate action selection policies that have been previously selected for generating training data, and wherein, for each of the one or more previously selected candidate action selection policies, the respective matchmaking policy is configured such that the respective score is dependent on a level of performance of the previously selected candidate action selection policy in controlling an agent to perform a task when interacting with one or more other agents that were controlled by other previously selected candidate action selection policies; training one or more of the candidate action selection policies in the pool using one or more of the matchmaking policies, wherein the training comprises, for a particular candidate action selection policy: selecting, in accordance with the score distribution defined by the respective matchmaking policy for the particular candidate action selection policy, one or more candidate action selection policies from the pool; generating training data for the particular candidate action selection policy by causing an agent controlled using the particular candidate action selection policy to perform the task while interacting with one or more other agents that are controlled by the selected one or more candidate action selection policies; and updating the particular candidate action selection policy through reinforcement learning by using the training data.

10. The system of claim 9, wherein the candidate action selection policies are each defined by a respective set of neural network parameters, and wherein updating the particular candidate action selection policy comprises updating values of a respective set of neural network parameters that define the particular candidate action selection policy.

11. The system of claim 9, wherein the respective score for each of the one or more previously selected candidate action selection policies comprises a probability score.

12. The system of claim 9, wherein the respective score for each of the one or more previously selected candidate action selection policies comprises a weighted probability score that is computed by multiplying a weight value with a probability score, wherein the weight value is computed using a weighting function.

13. The system of claim 12, wherein the weight values to be multiplied with different probability scores are different.

14. The system of claim 12, wherein the weight values to be multiplied with different probability scores are the same.

15. The system of claim 11, wherein for each of one or more previously selected candidate action selection policies, the respective score is configured to be proportional to the level of performance of the previously selected candidate action selection policy in controlling the agent to perform the task.

16. The system of claim 12, wherein for each of one or more previously selected candidate action selection policies, the weight value to be multiplied with the respective score is configured to be proportional to the performance of the previously selected candidate action selection policy in controlling the agent to perform the task.

17. One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations comprising: maintaining a pool of candidate actio…
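The score variants in claims 3 to 8 can be illustrated numerically. The following is a hedged sketch only: the function names, the use of win rate as the "level of performance", the choice to pass that performance value to the weighting function, and the final normalization step are all assumptions made for illustration and are not specified in the claim text.

```python
def probability_scores(performance):
    """Scores proportional to measured performance (in the spirit of claim 7)."""
    total = sum(performance.values())
    if total <= 0:
        # Assumption: fall back to a uniform distribution when nothing is known.
        return {name: 1.0 / len(performance) for name in performance}
    return {name: value / total for name, value in performance.items()}


def weighted_scores(performance, weight_fn):
    """Weight value times probability score (in the spirit of claim 4)."""
    probs = probability_scores(performance)
    # Assumption: the weighting function takes the policy's performance as input.
    return {name: weight_fn(performance[name]) * probs[name] for name in performance}


def normalize(scores):
    """Rescale scores to sum to 1; assumes at least one positive score."""
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}


# Hypothetical win rates of two previously selected policies.
win_rates = {"a": 0.2, "b": 0.6}

# Same weight for every policy (claim 6 style): relative scores are unchanged.
equal = normalize(weighted_scores(win_rates, lambda perf: 2.0))

# Weight proportional to performance (claim 8 style): stronger policies gain mass.
skewed = normalize(weighted_scores(win_rates, lambda perf: perf))
```

With equal weights the normalized distribution matches the plain probability scores, while performance-proportional weights shift more of the distribution toward the stronger policy. A different weighting function could just as well favor weaker or evenly matched opponents.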

Assignees

Gdm Holding Llc

Inventors

Classifications

  • Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title

  • involving negotiation or determination of the one or more network security mechanisms to be used, e.g. by negotiation between the client and the server or between peers or by selection according to the capabilities of the entities involved (negotiation of communication capabilities H04L69/24) · CPC title

  • Hyperparameter optimisation; Meta-learning; Learning-to-learn · CPC title

  • Reinforcement learning · CPC title

  • Supervised learning · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12572803B2 cover?
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training a policy neural network having a plurality of policy parameters and used to select actions to be performed by an agent to control the agent to perform a particular task while interacting with one or more other agents in an environment. In one aspect, the method includes: maintaining d…
Who is the assignee on this patent?
Gdm Holding Llc
What technology area does this patent fall under?
Primary CPC classification G06N3/08. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 10, 2026 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).