Multi-agent reinforcement learning with matchmaking policies

US11627165B2 · US · B2

Patent metadata
Publication number: US-11627165-B2
Application number: US-202016752496-A
Country: US
Kind code: B2
Filing date: Jan 24, 2020
Priority date: Jan 24, 2019
Publication date: Apr 11, 2023
Grant date: Apr 11, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training a policy neural network having a plurality of policy parameters and used to select actions to be performed by an agent to control the agent to perform a particular task while interacting with one or more other agents in an environment. In one aspect, the method includes: maintaining data specifying a pool of candidate action selection policies; maintaining data specifying respective matchmaking policy; and training the policy neural network using a reinforcement learning technique to update the policy parameters. The policy parameters define policies to be used in controlling the agent to perform the particular task.
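As an illustration of the two pieces of maintained data named in the abstract, the following minimal Python sketch models a pool of candidate policies and a per-learner matchmaking policy, stored as a weight vector over the pool. The class names (`Policy`, `Pool`), the policy names, and the weight values are hypothetical and do not come from the patent; plain lists of floats stand in for neural network parameters.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Policy:
    name: str
    params: List[float]   # stand-in for the policy network's parameter values
    is_learner: bool      # False marks a fixed policy that is never updated


@dataclass
class Pool:
    policies: List[Policy] = field(default_factory=list)
    # matchmaking[learner_name] holds one weight per entry of `policies`
    matchmaking: Dict[str, List[float]] = field(default_factory=dict)

    def sample_opponents(self, learner_name: str, k: int,
                         rng: random.Random) -> List[Policy]:
        """Draw k opponents from the learner's distribution over the pool."""
        weights = self.matchmaking[learner_name]
        return rng.choices(self.policies, weights=weights, k=k)


pool = Pool(policies=[
    Policy("learner_a", [0.0, 0.0], is_learner=True),
    Policy("learner_b", [0.1, -0.1], is_learner=True),
    Policy("fixed_0", [0.5, 0.5], is_learner=False),
])
# Assumed example distributions: learner_a never plays itself,
# learner_b plays everything in the pool with equal probability.
pool.matchmaking["learner_a"] = [0.0, 0.5, 0.5]
pool.matchmaking["learner_b"] = [1 / 3, 1 / 3, 1 / 3]

opponents = pool.sample_opponents("learner_a", k=4, rng=random.Random(0))
```

Keeping the matchmaking weights separate from the policies lets each learner face a different mix of opponents without changing the pool itself.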

First claim

Opening claim text (preview).

What is claimed is:

1. A method of training a policy neural network having a plurality of policy parameters and used to select actions to be performed by an agent to control the agent to perform a particular task while interacting with one or more other agents in an environment, the method comprising: maintaining data specifying a pool of candidate action selection policies, the pool of candidate action selection policies comprising: (i) a plurality of learner polices for controlling the agent, each learner policy defined by a respective set of values for the policy parameters of the policy neural network, and (ii) one or more fixed policies for controlling the agent; maintaining, for each of the learner policies, data specifying a respective matchmaking policy for the learner policy that defines a distribution over the pool of candidate action selection policies; at a particular training iteration of a plurality of training iterations: for each of one or more of the learner policies: selecting one or more policies from the pool of candidate action selection policies using the matchmaking policy for the learner policy; generating training data for the learner policy by causing a first agent controlled using the learner policy to perform the particular task while interacting with one or more second agents, each second agent controlled by a respective one of the selected policies; updating the respective set of values for the policy parameters that define the learner policy by training the learner policy on the training data through reinforcement learning to optimize a reinforcement learning loss function for the learner policy; determining that criteria for converting a particular one of the plurality of learner policies into a fixed policy have been satisfied; and in response, generating a new fixed policy that is defined by a same set of values for the policy parameters as the particular learner policy.

2. The method of claim 1, wherein the matchmaking policies for two or more of the learner policies are different.

3. The method of claim 2, wherein the learner policies are each assigned a respective type from a plurality of types, wherein each type is associated with a different matchmaking policy from each other type, and wherein each learner policy has the matchmaking policy that is associated with the type to which the learner policy is assigned.

4. The method of claim 1, wherein the matchmaking policy for at least one learner policy is uniform across one or more learner policies that are assigned a particular type and zero for all of the learner policies that are assigned different types and all of the fixed policies.

5. The method of claim 1, wherein the matchmaking policy for at least one learner policy is uniform across all of the learner policies and zero for all of the fixed policies.

6. The method of claim 1, wherein the matchmaking policy for at least one learner policy is uniform across all policies in the pool.

7. The method of claim 1, wherein the reinforcement learning loss function depends on a plurality of hyperparameters, and wherein values for the plurality of hyperparameters are different for two or more of the learner policies.

8. The method of claim 7, wherein the hyperparameters include one or more hyperparameters of a reinforcement learning algorithm used in the training.

9. The method of claim 7, wherein the hyperparameters include one or more internal reward hyperparameters that define whether the reinforcement learning loss function depends on an internal reward and, if so, how the internal reward is computed based on observations received by the agent during performance of the task.

10. The method of claim 1, wherein the one or more fixed policies include a first fixed policy that is defined by values of the policy parameters that have been determined through supervised learning on labeled task instances.

11. The method of claim 10, wherein the supervised learning comprises a first supervised learning using first training data and a second supervised learning using only a selected portion of the first training data that includes only labeled task instances performed by agents that have attained at least a threshold level of performance on the particular task.

12. The method of claim 1, wherein determining that criteria have been satisfied comprises determining that a predetermined number of training iterations have been completed.

13. The method of claim 1, further comprising: in response to determining that criteria for converting the particular one of the plurality of learner policies into the fixed policy have been satisfied: setting the set of values for the policy parameters that define the particular learner policy to a new set of values that is determined based on current sets of values for policy parameters that define one or more of the other policies in the pool.

14. The method of claim 13, wherein setting the set of values for the policy parameters that define the particular learner policy to the new set of values that is determined based on the current sets of values for policy parameters that define one or more of the other policies in the pool comprises: setting the set of values for the policy parameters to a current set of values for policy parameters that define one of the fixed policies.

15. The method of claim 14, further comprising: in response: modifying hyperparameters of the reinforcement learning loss function for the particular learner policy.

16. The method of claim 15, further comprising: in response: modifying the matchmaking policy for the particular learner policy.

17. The method of claim 1, further comprising, for at least one of the selected policies: updating the respective set of values for the policy parameters that define the selected policy by training the selected policy on the training data through reinforcement learning to optimize a reinforcement learning loss function for the selected policy.

18. The method of claim 1, wherein determining that criteria have been satisfied comprises determining that the agent controlled by the particular leaner policy has attained a threshold level of performance on the particular task.

19. The method of claim 1, wherein the matchmaking policy for at least one learner policy specifies that the learner policies controlling respective agents that have attained higher levels of performance on the particular task are more likely to be selected than other learner policies controlling the respective agents that have attained lower levels of performance on the particular task.

20. One or more non-transitory computer-readable storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations for training a policy neural network having a plurality of policy parameters and used to select actions to be performed by an agent to control the agent to perform a particular task while interacting with one or more other agents in an environment, the operations comprising: maintaining data specifying a pool of candidate action selection policies, the pool of candidate action selection policies comprising: (i) a plurality of learner polices for controlling the agent, each learner policy defined by a respective set of values for the policy parameters of the policy neural network, and (ii) one or more fixed policies for controlling the agent; maintaining, for each of the learner policies, data specifying a respective matchma
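The training iteration recited in claim 1, with the matchmaking variants of claims 5 and 6 and the iteration-count criterion of claim 12, can be sketched in Python roughly as follows. This is a toy illustration under stated assumptions, not the patented implementation: `play_episode` and `rl_update` are placeholders for a real environment rollout and a real reinforcement learning update, and all function names, policy names, and the dictionary layout are hypothetical.

```python
import random


def uniform_over(pool, keep):
    """Weights that are uniform over entries where keep(entry) holds, zero elsewhere."""
    mask = [1.0 if keep(p) else 0.0 for p in pool]
    total = sum(mask)
    return [m / total for m in mask]


def self_play(pool):
    # Claim 5 variant: uniform over learner policies, zero for fixed policies.
    return uniform_over(pool, lambda p: p["learner"])


def whole_pool(pool):
    # Claim 6 variant: uniform over every policy in the pool.
    return uniform_over(pool, lambda p: True)


def make_policy(name, params, matchmaking=None):
    # A policy with no matchmaking function is treated as fixed.
    return {"name": name, "params": list(params),
            "learner": matchmaking is not None, "matchmaking": matchmaking}


def play_episode(learner, opponent, rng):
    # Placeholder: a real system would roll out both policies in the environment.
    return rng.uniform(-1.0, 1.0)


def rl_update(params, episode_return, lr=0.1):
    # Placeholder: stands in for a gradient step on an RL loss.
    return [w + lr * episode_return for w in params]


def train(pool, num_iterations, freeze_every, rng):
    for it in range(1, num_iterations + 1):
        learners = [p for p in pool if p["learner"]]
        for learner in learners:
            weights = learner["matchmaking"](pool)            # distribution over the pool
            opponent = rng.choices(pool, weights=weights, k=1)[0]
            episode_return = play_episode(learner, opponent, rng)
            learner["params"] = rl_update(learner["params"], episode_return)
        if it % freeze_every == 0:                            # claim 12 style criterion
            for learner in learners:
                # New fixed policy with the same parameter values as the learner.
                pool.append(make_policy(f"{learner['name']}@{it}", learner["params"]))
    return pool


pool = [
    make_policy("main", [0.0, 0.0], whole_pool),
    make_policy("sparring", [0.0, 0.0], self_play),
    make_policy("supervised_init", [0.3, -0.2]),  # fixed policy, as in claim 10
]
train(pool, num_iterations=4, freeze_every=2, rng=random.Random(0))
```

Because each snapshot copies the learner's parameter list, later updates to the learner do not alter the fixed policies already added to the pool.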

Assignees

Deepmind Tech Ltd

Inventors

Classifications

  • Hyperparameter optimisation; Meta-learning; Learning-to-learn

  • Reinforcement learning

  • Supervised learning

  • Generating training patterns; Bootstrap methods, e.g. bagging or boosting

  • based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11627165B2 cover?
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training a policy neural network having a plurality of policy parameters and used to select actions to be performed by an agent to control the agent to perform a particular task while interacting with one or more other agents in an environment. In one aspect, the method includes: maintaining d…
Who is the assignee on this patent?
Deepmind Tech Ltd
What technology area does this patent fall under?
Primary CPC classification G06N3/08. Mapped technology areas include Physics.
When was this patent published?
Publication date: Apr 11, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).