Multi-agent reinforcement learning with matchmaking policies
US-11627165-B2 · Apr 11, 2023 · US
US12067491B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12067491-B2 |
| Application number | US-202318131567-A |
| Country | US |
| Kind code | B2 |
| Filing date | Apr 6, 2023 |
| Priority date | Jan 24, 2019 |
| Publication date | Aug 20, 2024 |
| Grant date | Aug 20, 2024 |
Official abstract text for this publication.
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training a policy neural network having a plurality of policy parameters and used to select actions to be performed by an agent to control the agent to perform a particular task while interacting with one or more other agents in an environment. In one aspect, the method includes: maintaining data specifying a pool of candidate action selection policies; maintaining data specifying a respective matchmaking policy; and training the policy neural network using a reinforcement learning technique to update the policy parameters. The policy parameters define policies to be used in controlling the agent to perform the particular task.
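The loop the abstract describes (a pool of candidate policies, a matchmaking distribution over that pool, and a reinforcement learning update) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and does not come from the patent: the `Policy` class, the one-parameter toy policy, the "match the opponent's action" reward, and the score-function update standing in for whichever reinforcement learning technique is actually used.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Policy:
    """Hypothetical toy policy with a single parameter."""
    name: str
    p_action_one: float  # probability of choosing action 1
    adjustable: bool     # True for learner policies, False for fixed policies

    def act(self, rng: random.Random) -> int:
        return 1 if rng.random() < self.p_action_one else 0


def select_opponents(pool: List[Policy], matchmaking: Dict[str, float],
                     rng: random.Random, k: int = 1) -> List[Policy]:
    """Sample k policies from the pool using one learner's matchmaking distribution."""
    weights = [matchmaking[p.name] for p in pool]
    return rng.choices(pool, weights=weights, k=k)


def generate_training_data(learner: Policy, opponent: Policy, rng: random.Random,
                           steps: int = 32) -> List[Tuple[int, float]]:
    """Toy interaction: the learner earns reward 1.0 when its action matches the opponent's."""
    data = []
    for _ in range(steps):
        a, b = learner.act(rng), opponent.act(rng)
        data.append((a, 1.0 if a == b else 0.0))
    return data


def update_learner(learner: Policy, data: List[Tuple[int, float]], lr: float = 0.05) -> None:
    """One score-function gradient step on the learner's single parameter."""
    p = learner.p_action_one
    baseline = sum(r for _, r in data) / len(data)
    grad = 0.0
    for a, r in data:
        # Derivative of log pi(a) with respect to p for a Bernoulli policy.
        dlogp = 1.0 / p if a == 1 else -1.0 / (1.0 - p)
        grad += (r - baseline) * dlogp
    grad /= len(data)
    # Clamp so the parameter stays strictly inside (0, 1).
    learner.p_action_one = min(0.95, max(0.05, p + lr * grad))


def train(pool: List[Policy], matchmaking: Dict[str, Dict[str, float]],
          iterations: int, seed: int = 0) -> None:
    rng = random.Random(seed)
    learners = [p for p in pool if p.adjustable]  # fixed policies are never updated
    for _ in range(iterations):
        for learner in learners:
            for opponent in select_opponents(pool, matchmaking[learner.name], rng):
                data = generate_training_data(learner, opponent, rng)
                update_learner(learner, data)


# Hypothetical pool: two learner policies and one fixed (scripted) policy.
pool = [
    Policy("learner_a", 0.5, adjustable=True),
    Policy("learner_b", 0.5, adjustable=True),
    Policy("scripted", 0.9, adjustable=False),
]
# One matchmaking distribution over the whole pool per learner policy.
matchmaking = {
    "learner_a": {"learner_a": 0.2, "learner_b": 0.3, "scripted": 0.5},
    "learner_b": {"learner_a": 0.5, "learner_b": 0.0, "scripted": 0.5},
}
train(pool, matchmaking, iterations=200)
```

In this sketch each learner has its own matchmaking weights, so one learner can be steered toward the fixed policy while another mostly plays the other learners. The fixed policy takes part as an opponent but its parameter is never changed.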
Opening claim text (preview).
What is claimed is:
1. A method of training a policy neural network having a plurality of policy parameters and used to select actions to be performed by an agent to control the agent to perform a particular task while interacting with one or more other agents in an environment, the method comprising: maintaining data specifying a pool of candidate action selection policies, the pool of candidate action selection policies comprising: (i) a plurality of learner policies for controlling the agent, wherein each learner policy is defined by a respective set of adjustable policy parameters of the policy neural network that are adjusted during the training of the policy neural network, and (ii) one or more fixed policies for controlling the agent, wherein each fixed policy is defined by a respective set of nonadjustable policy parameters that are not adjusted alongside the respective sets of adjustable policy parameters during the training of the policy neural network, and wherein during the training, at least some of actions performed by the one or more other agents are selected by using the one or more fixed policies and in accordance with the respective sets of nonadjustable policy parameters; maintaining, for each of the learner policies, data specifying a respective matchmaking policy for the learner policy that defines a distribution over the pool of candidate action selection policies; at each of a plurality of training iterations: for each of one or more of the learner policies: selecting one or more policies from the pool of candidate action selection policies using the matchmaking policy for the learner policy; generating training data for the learner policy by causing a first agent controlled using the learner policy to perform the particular task while interacting with one or more second agents, each second agent controlled by a respective one of the selected policies; and updating the respective set of policy parameters that define the learner policy by training the learner policy on the training data through reinforcement learning to optimize a reinforcement learning loss function for the learner policy.
2. The method of claim 1, wherein the one or more fixed policies comprise: a first fixed policy defined by a first set of nonadjustable policy parameters having values that have been determined through supervised learning that took place prior to the training of the policy neural network.
3. The method of claim 1, wherein the one or more fixed policies comprise: a second fixed policy defined by a second set of nonadjustable policy parameters that encode deterministic action selection logics.
4. The method of claim 2, wherein selecting the one or more policies from the pool of candidate action selection policies using the matchmaking policy for the learner policy comprises: selecting the first fixed policy.
5. The method of claim 3, wherein selecting the one or more policies from the pool of candidate action selection policies using the matchmaking policy for the learner policy comprises: selecting the second fixed policy.
6. The method of claim 1, wherein the agent is a mechanical agent, and the environment is a real-world environment.
7. The method of claim 6, wherein the mechanical agent comprises a robot, an autonomous, or semi-autonomous vehicle, and wherein causing the agent to perform the action comprises generating control inputs to control the mechanical agent in the real-world environment.
8. The method of claim 1, wherein the agent is an electronic agent, and the environment is a simulated environment.
9. The method of claim 8, wherein causing the agent to perform the action comprises generating control inputs to control the electronic agent in the simulated environment.
10. The method of claim 1, wherein controlling the agent to perform the particular task while interacting with the one or more other agents in the environment comprises: controlling the agent to cooperate with the one or more other agents in the environment to perform the particular task.
11. The method of claim 1, wherein controlling the agent to perform the particular task while interacting with the one or more other agents in the environment comprises: controlling the agent to compete with the one or more other agents in the environment to perform the particular task.
12. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations for training a policy neural network having a plurality of policy parameters and used to select actions to be performed by an agent to control the agent to perform a particular task while interacting with one or more other agents in an environment, wherein the operations comprise: maintaining data specifying a pool of candidate action selection policies, the pool of candidate action selection policies comprising: (i) a plurality of learner policies for controlling the agent, wherein each learner policy is defined by a respective set of adjustable policy parameters of the policy neural network that are adjusted during the training of the policy neural network, and (ii) one or more fixed policies for controlling the agent, wherein each fixed policy is defined by a respective set of nonadjustable policy parameters that are not adjusted alongside the respective sets of adjustable policy parameters during the training of the policy neural network, and wherein during the training, at least some of actions performed by the one or more other agents are selected by using the one or more fixed policies and in accordance with the respective sets of nonadjustable policy parameters; maintaining, for each of the learner policies, data specifying a respective matchmaking policy for the learner policy that defines a distribution over the pool of candidate action selection policies; at each of a plurality of training iterations: for each of one or more of the learner policies: selecting one or more policies from the pool of candidate action selection policies using the matchmaking policy for the learner policy; generating training data for the learner policy by causing a first agent controlled using the learner policy to perform the particular task while interacting with one or more second agents, each second agent controlled by a respective one of the selected policies; and updating the respective set of policy parameters that define the learner policy by training the learner policy on the training data through reinforcement learning to optimize a reinforcement learning loss function for the learner policy.
13. The system of claim 12, wherein the one or more fixed policies comprise one or both of: a first fixed policy defined by a first set of nonadjustable policy parameters having values that have been determined through supervised learning that took place prior to the training of the policy neural network, or a second fixed policy defined by a second set of nonadjustable policy parameters that encode deterministic action selection logics.
14. The system of claim 13, wherein selecting the one or more policies from the pool of candidate action selection policies using the matchmaking policy for the learner policy comprises: selecting the first fixed policy.
15. The system of claim 13, wherein selecting the one or more policies from the pool of candidate action selection policies using the matchmaking policy for the learner policy comprises: selecting the second fixed policy.
16. The system of claim 12, wherein
Hyperparameter optimisation; Meta-learning; Learning-to-learn · CPC title
Reinforcement learning · CPC title
Supervised learning · CPC title
Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title
involving negotiation or determination of the one or more network security mechanisms to be used, e.g. by negotiation between the client and the server or between peers or by selection according to the capabilities of the entities involved (negotiation of communication capabilities H04L69/24) · CPC title