Multi-agent reinforcement learning with matchmaking policies
US-12067491-B2 · Aug 20, 2024 · US
US12572803B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12572803-B2 |
| Application number | US-202418771770-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jul 12, 2024 |
| Priority date | Jan 24, 2019 |
| Publication date | Mar 10, 2026 |
| Grant date | Mar 10, 2026 |
Official abstract text for this publication.
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training a policy neural network having a plurality of policy parameters and used to select actions to be performed by an agent to control the agent to perform a particular task while interacting with one or more other agents in an environment. In one aspect, the method includes: maintaining data specifying a pool of candidate action selection policies; maintaining data specifying a respective matchmaking policy; and training the policy neural network using a reinforcement learning technique to update the policy parameters. The policy parameters define policies to be used in controlling the agent to perform the particular task.
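The loop the abstract describes (a pool of policies, a matchmaking policy per pool member, and a reinforcement learning update) can be sketched in a few lines. This is a toy illustration, not the patented implementation: the rock-paper-scissors task, the softmax-over-logits policies, the REINFORCE update, and the names `payoff`, `train`, `pool`, and `matchmaking` are all assumptions chosen to keep the example self-contained.

```python
import math
import random

ACTIONS = 3  # rock, paper, scissors: a toy task assumed for illustration


def softmax(logits):
    """Convert logits to action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def payoff(a, b):
    """Reward for playing action a against b: +1 win, 0 draw, -1 loss."""
    return [0, 1, -1][(a - b) % 3]


def train(pool, matchmaking, steps, lr=0.1, rng=None):
    """pool[i] holds the logits of policy i; matchmaking[i] holds the
    selection weights that policy i uses over the whole pool."""
    rng = rng or random.Random(0)
    n = len(pool)
    for _ in range(steps):
        learner = rng.randrange(n)
        # Select an opponent according to the learner's matchmaking policy.
        opponent = rng.choices(range(n), weights=matchmaking[learner])[0]
        # Generate training data: one interaction between the two agents.
        probs = softmax(pool[learner])
        a = rng.choices(range(ACTIONS), weights=probs)[0]
        b = rng.choices(range(ACTIONS), weights=softmax(pool[opponent]))[0]
        reward = payoff(a, b)
        # REINFORCE update (assumed): grad log pi(a) = onehot(a) - probs.
        for k in range(ACTIONS):
            grad = (1.0 if k == a else 0.0) - probs[k]
            pool[learner][k] += lr * reward * grad
    return pool


pool = [[0.0] * ACTIONS for _ in range(3)]
matchmaking = [[1.0] * 3 for _ in range(3)]  # uniform placeholder weights
train(pool, matchmaking, steps=200)
```

The uniform `matchmaking` weights are only a placeholder and allow a policy to be matched against itself; the claims tie the scores to how previously selected policies performed.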
Opening claim text (preview).
What is claimed is:

1. A computer-implemented method comprising: maintaining a pool of candidate action selection policies; maintaining, for each candidate action selection policy in the pool, a respective matchmaking policy that is configured to define a score distribution over the pool of candidate action selection policies, wherein the score distribution has a respective score for each of one or more previously selected candidate action selection policies of the candidate action selection policies that have been previously selected for generating training data, and wherein, for each of the one or more previously selected candidate action selection policies, the respective matchmaking policy is configured such that the respective score is dependent on a level of performance of the previously selected candidate action selection policy in controlling an agent to perform a task when interacting with one or more other agents that were controlled by other previously selected candidate action selection policies; training one or more of the candidate action selection policies in the pool using one or more of the matchmaking policies, wherein the training comprises, for a particular candidate action selection policy: selecting, in accordance with the score distribution defined by the respective matchmaking policy for the particular candidate action selection policy, one or more candidate action selection policies from the pool; generating training data for the particular candidate action selection policy by causing an agent controlled using the particular candidate action selection policy to perform the task while interacting with one or more other agents that are controlled by the selected one or more candidate action selection policies; and updating the particular candidate action selection policy through reinforcement learning by using the training data.

2. The method of claim 1, wherein the candidate action selection policies are each defined by a respective set of neural network parameters, and wherein updating the particular candidate action selection policy comprises updating values of a respective set of neural network parameters that define the particular candidate action selection policy.

3. The method of claim 1, wherein the respective score for each of the one or more previously selected candidate action selection policies comprises a probability score.

4. The method of claim 1, wherein the respective score for each of the one or more previously selected candidate action selection policies comprises a weighted probability score that is computed by multiplying a weight value with a probability score, wherein the weight value is computed using a weighting function.

5. The method of claim 4, wherein the weight values to be multiplied with different probability scores are different.

6. The method of claim 4, wherein the weight values to be multiplied with different probability scores are the same.

7. The method of claim 3, wherein for each of one or more previously selected candidate action selection policies, the respective score is configured to be proportional to the level of performance of the previously selected candidate action selection policy in controlling the agent to perform the task.

8. The method of claim 4, wherein for each of one or more previously selected candidate action selection policies, the weight value to be multiplied with the respective score is configured to be proportional to the performance of the previously selected candidate action selection policy in controlling the agent to perform the task.

9. A system comprising one or more computers and one or more storage devices storing instructions that when executed by one or more computers cause the one or more computers to perform operations comprising: maintaining a pool of candidate action selection policies; maintaining, for each candidate action selection policy in the pool, a respective matchmaking policy that is configured to define a score distribution over the pool of candidate action selection policies, wherein the score distribution has a respective score for each of one or more previously selected candidate action selection policies of the candidate action selection policies that have been previously selected for generating training data, and wherein, for each of the one or more previously selected candidate action selection policies, the respective matchmaking policy is configured such that the respective score is dependent on a level of performance of the previously selected candidate action selection policy in controlling an agent to perform a task when interacting with one or more other agents that were controlled by other previously selected candidate action selection policies; training one or more of the candidate action selection policies in the pool using one or more of the matchmaking policies, wherein the training comprises, for a particular candidate action selection policy: selecting, in accordance with the score distribution defined by the respective matchmaking policy for the particular candidate action selection policy, one or more candidate action selection policies from the pool; generating training data for the particular candidate action selection policy by causing an agent controlled using the particular candidate action selection policy to perform the task while interacting with one or more other agents that are controlled by the selected one or more candidate action selection policies; and updating the particular candidate action selection policy through reinforcement learning by using the training data.

10. The system of claim 9, wherein the candidate action selection policies are each defined by a respective set of neural network parameters, and wherein updating the particular candidate action selection policy comprises updating values of a respective set of neural network parameters that define the particular candidate action selection policy.

11. The system of claim 9, wherein the respective score for each of the one or more previously selected candidate action selection policies comprises a probability score.

12. The system of claim 9, wherein the respective score for each of the one or more previously selected candidate action selection policies comprises a weighted probability score that is computed by multiplying a weight value with a probability score, wherein the weight value is computed using a weighting function.

13. The system of claim 12, wherein the weight values to be multiplied with different probability scores are different.

14. The system of claim 12, wherein the weight values to be multiplied with different probability scores are the same.

15. The system of claim 11, wherein for each of one or more previously selected candidate action selection policies, the respective score is configured to be proportional to the level of performance of the previously selected candidate action selection policy in controlling the agent to perform the task.

16. The system of claim 12, wherein for each of one or more previously selected candidate action selection policies, the weight value to be multiplied with the respective score is configured to be proportional to the performance of the previously selected candidate action selection policy in controlling the agent to perform the task.

17. One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations comprising: maintaining a pool of candidate actio
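The dependent claims describe scores that are probabilities, optionally multiplied by a weight that a weighting function derives from performance. One way this could look is sketched below. The function names (`weighted_scores`, `normalize`, `select`), the use of a win rate as the "level of performance", the identity weighting function, and the normalization step are all assumptions for illustration; the claims do not prescribe them.

```python
import random


def weighted_scores(base_probs, performance, weighting=lambda perf: perf):
    """Multiply each probability score by a weight computed from the
    policy's performance level. The default weighting makes the weight
    proportional to performance (an assumed choice)."""
    return {
        name: weighting(performance[name]) * prob
        for name, prob in base_probs.items()
    }


def normalize(scores):
    """Rescale scores so they sum to 1 (an assumed step for sampling)."""
    total = sum(scores.values())
    if total <= 0:
        # Degenerate case: fall back to a uniform distribution.
        return {name: 1.0 / len(scores) for name in scores}
    return {name: s / total for name, s in scores.items()}


def select(distribution, k=1, rng=random):
    """Sample k policy names in accordance with the score distribution."""
    names = list(distribution)
    return rng.choices(names, weights=[distribution[n] for n in names], k=k)


# Hypothetical pool members that were previously selected for training.
base = {"A": 0.5, "B": 0.25, "C": 0.25}
win_rate = {"A": 0.8, "B": 0.4, "C": 0.4}

varying = normalize(weighted_scores(base, win_rate))
constant = normalize(weighted_scores(base, win_rate, weighting=lambda p: 2.0))
opponents = select(varying, k=2, rng=random.Random(0))
```

With a performance-dependent weighting, higher-performing policies gain probability mass relative to the base scores. With a constant weighting (equal weights for every score), the normalized distribution is the same as the base distribution.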
Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title
involving negotiation or determination of the one or more network security mechanisms to be used, e.g. by negotiation between the client and the server or between peers or by selection according to the capabilities of the entities involved (negotiation of communication capabilities H04L69/24) · CPC title
Hyperparameter optimisation; Meta-learning; Learning-to-learn · CPC title
Reinforcement learning · CPC title
Supervised learning · CPC title