Systems and methods for solving multi-agent decision processes with network constraints

US2024160943A1 · US · A1

Patent metadata
Field: Value
Publication number: US-2024160943-A1
Application number: US-202218054009-A
Country: US
Kind code: A1
Filing date: Nov 9, 2022
Priority date: Nov 9, 2022
Publication date: May 16, 2024
Grant date: (none listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments described herein provide systems and methods for solving and applying a multi-agent decision process. A system performs a process, where at each iterative step, the system determines policies for a plurality of agents that optimize respective reward values based on the plurality of costs, and the characteristics of the plurality of agents. The system simulates the multi-agent decision process using the determined policies, thereby generating respective reward values and aggregated resource contribution values. The system increments or decrements the plurality of costs based on the constraints and the aggregated resource contribution values. The system updates a final reward value based on the respective reward values. The system updates a final plurality of costs based on the plurality of costs. After performing the iterative step for a predetermined number of iterations, the system outputs the final reward value and the final plurality of costs.
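The iterative step in the abstract can be sketched as a small price-adjustment loop. This is a toy illustration only, not the patented implementation: the names `Agent`, `best_response`, `solve`, and `capacities` are hypothetical, a simple threshold rule stands in for whatever policy solver the patent uses, and a uniform running average stands in for the final reward and final costs.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    """Hypothetical agent: earns `value` when it uses resource `resource`."""
    value: float
    resource: int


def best_response(agent, costs):
    """Toy policy step (assumption): use the resource only if value exceeds its cost."""
    return agent.value > costs[agent.resource]


def solve(agents, capacities, learning_rate=0.1, iterations=200):
    """Run the iterative step a fixed number of times; return averaged reward and costs."""
    costs = [0.0] * len(capacities)        # initial cost for each resource
    final_reward = 0.0
    final_costs = [0.0] * len(capacities)

    for t in range(1, iterations + 1):
        # Determine a policy for every agent given the current costs.
        policies = [best_response(a, costs) for a in agents]

        # Simulate: total reward and aggregated resource contribution values.
        reward = 0.0
        usage = [0.0] * len(capacities)
        for agent, uses in zip(agents, policies):
            if uses:
                reward += agent.value
                usage[agent.resource] += 1.0

        # Increment costs of over-used resources, decrement under-used ones.
        # Clamping at zero is an assumption of this sketch.
        costs = [
            max(0.0, c + learning_rate * (u - cap))
            for c, u, cap in zip(costs, usage, capacities)
        ]

        # Update the final reward and final costs as uniform running averages.
        final_reward += (reward - final_reward) / t
        final_costs = [f + (c - f) / t for f, c in zip(final_costs, costs)]

    return final_reward, final_costs
```

For example, `solve([Agent(1.0, 0), Agent(2.0, 0), Agent(3.0, 0)], [1.0])` models three agents competing for one resource with capacity 1. The cost rises until only the highest-value agent still chooses to use the resource.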

First claim

Opening claim text (preview).

What is claimed is:

1. A system for policy control in a dynamic system via a multi-agent reinforcement learning network, the system comprising: a memory that stores network information and a plurality of processor-executable instructions; a communication interface that receives characteristics of a plurality of agents, and constraints for a plurality of resources of a dynamic system; and one or more hardware processors that read and execute the plurality of processor-executable instructions from the memory to perform operations including: allocating initial values for a plurality of costs associated with the plurality of resources; and performing, at an iterative step: determining policies for the plurality of agents that optimize respective reward values based on the plurality of costs, and the characteristics of the plurality of agents; simulating a multi-agent decision process using the determined policies, the plurality of costs, and the characteristics of the plurality of agents, thereby generating respective reward values and aggregated resource contribution values; incrementing or decrementing the plurality of costs based on the constraints and the aggregated resource contribution values; updating a final reward value based on the generated respective reward values; and updating a final plurality of costs based on the plurality of costs; continuing performing the iterative step for a predetermined number of iterations; and outputting the final reward value and the final plurality of costs.

2. The system of claim 1, wherein determining policies for the plurality of agents includes: determining policies for the plurality of agents that maximizes the respective reward values subject to the constraints for the plurality of resources of the dynamic system.

3. The system of claim 2, wherein determining policies for the plurality of agents includes: computing a Lagrangian having a mixed deterministic Markov policy for each agent, and taking an expectation with respect to a probability distribution of the respective reward values induced by the mixed deterministic Markov policy.

4. The system of claim 1, wherein the final plurality of costs is a weighted average of the costs over multiple iterative steps.

5. The system of claim 1, wherein the final reward value is a weighted average of the respective reward values over multiple iterative steps.

6. The system of claim 1, wherein: the communication interface further receives a learning rate value, and the incrementing or decrementing is further based on the learning rate value.

7. The system of claim 1, wherein the incrementing or decrementing is based on respective differences between the constraints and the aggregated resource contribution values associated with respective constraints.

8. The system of claim 1, wherein determining policies for the plurality of agents is performed on a subset of the plurality of agents at each time step.

9. A non-transitory machine-readable medium comprising a plurality of machine-executable instructions which, when executed by one or more processors, are adapted to cause the one or more processors to perform operations comprising: receiving characteristics of a plurality of agents, and constraints for a plurality of resources of a dynamic system; allocating initial values for a plurality of costs associated with the plurality of resources; and performing, at an iterative step: determining policies for the plurality of agents that optimize respective reward values based on the plurality of costs, and the characteristics of the plurality of agents; simulating a multi-agent decision process using the determined policies, the plurality of costs, and the characteristics of the plurality of agents, thereby generating respective reward values and aggregated resource contribution values; incrementing or decrementing the plurality of costs based on the constraints and the aggregated resource contribution values; updating a final reward value based on the generated respective reward values; and updating a final plurality of costs based on the plurality of costs; and continuing performing the iterative step for a predetermined number of iterations; and outputting the final reward value and the final plurality of costs.

10. The non-transitory machine-readable medium of claim 9, wherein determining policies for the plurality of agents includes: determining policies for the plurality of agents that maximizes the respective reward values subject to the constraints for the plurality of resources of the dynamic system.

11. The non-transitory machine-readable medium of claim 10, wherein determining policies for the plurality of agents includes: computing a Lagrangian having a mixed deterministic Markov policy for each agent, and taking an expectation with respect to a probability distribution of the respective reward values induced by the mixed deterministic Markov policy.

12. The non-transitory machine-readable medium of claim 9, wherein the final plurality of costs is a weighted average of the costs over multiple iterative steps.

13. The non-transitory machine-readable medium of claim 9, wherein the final reward value is a weighted average of the respective reward values over multiple iterative steps.

14. The non-transitory machine-readable medium of claim 9, wherein the operations further comprise receiving a learning rate value, and the incrementing or decrementing is further based on the learning rate value.

15. The non-transitory machine-readable medium of claim 9, wherein the incrementing or decrementing is based on respective differences between the constraints and the aggregated resource contribution values associated with respective constraints.

16. The non-transitory machine-readable medium of claim 9, wherein determining policies for the plurality of agents is performed on a subset of the plurality of agents at each time step.

17. A method of policy control in a dynamic system via a multi-agent reinforcement learning network, the method comprising: receiving, via a data interface, characteristics of a plurality of agents, and constraints for a plurality of resources of a dynamic system; allocating initial values for a plurality of costs associated with the plurality of resources; and performing, at an iterative step: determining policies for the plurality of agents that optimize respective reward values based on the plurality of costs, and the characteristics of the plurality of agents; simulating a multi-agent decision process using the determined policies, the plurality of costs, and the characteristics of the plurality of agents, thereby generating respective reward values and aggregated resource contribution values; incrementing or decrementing the plurality of costs based on the constraints and the aggregated resource contribution values; updating a final reward value based on the generated respective reward values; and updating a final plurality of costs based on the plurality of costs; and continuing performing the iterative step for a predetermined number of iterations; and outputting the final reward value and the final plurality of costs.

18. The method of claim 17, wherein determining policies for the plurality of agents includes: determining policies for the plurality of agents that maximizes the respective reward values subject to the constraints for the plurality of resources of the dynamic system.

19. The method
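The cost update and averaging recited in the dependent claims (learning rate, constraint differences, weighted averages) can be written compactly. The notation below is an assumption for illustration, not taken from the patent: \lambda_k^{(t)} is the cost of resource k at iteration t, g_k^{(t)} the aggregated resource contribution value, c_k the constraint, \eta the learning rate, R^{(t)} the reward at iteration t, and w_t the averaging weights. The sign convention and the projection onto non-negative costs are also assumptions.

```latex
% Cost update: increment when aggregated usage exceeds the constraint,
% decrement when it falls below (sign and projection are assumptions).
\lambda_k^{(t+1)} = \max\!\bigl(0,\; \lambda_k^{(t)} + \eta\,\bigl(g_k^{(t)} - c_k\bigr)\bigr)

% Final costs and final reward as weighted averages over T iterations,
% with w_t \ge 0 and \sum_{t=1}^{T} w_t = 1.
\bar{\lambda}_k = \sum_{t=1}^{T} w_t\,\lambda_k^{(t)},
\qquad
\bar{R} = \sum_{t=1}^{T} w_t\,R^{(t)}
```

Choosing w_t = 1/T gives a plain average over the iterations; other weightings are equally consistent with the claim language.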

Assignees

Salesforce Com Inc

Inventors

Classifications

  • G06N3/092 (Primary)

    Reinforcement learning · CPC title

  • based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO] · CPC title

  • G06N7/01 (Primary)

    Probabilistic graphical models, e.g. probabilistic networks · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2024160943A1 cover?
Embodiments described herein provide systems and methods for solving and applying a multi-agent decision process. A system performs a process, where at each iterative step, the system determines policies for a plurality of agents that optimize respective reward values based on the plurality of costs, and the characteristics of the plurality of agents. The system simulates the multi-agent decisi…
Who is the assignee on this patent?
Salesforce Com Inc
What technology area does this patent fall under?
Primary CPC classification G06N3/092. Mapped technology areas include Physics.
When was this patent published?
Publication date: May 16, 2024 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).