Determining action selection policies of an execution device

US11077368B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-11077368-B2
Application numberUS-202017084363-A
CountryUS
Kind codeB2
Filing dateOct 29, 2020
Priority dateDec 12, 2019
Publication dateAug 3, 2021
Grant dateAug 3, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on computer storage media, are described, for generating an action selection policy of an execution device for completing a task in an environment. The method includes, in a current iteration, computing a counterfactual value (CFV) of the execution device in a terminal state based on a payoff of the execution device and a reach probability of other devices reaching the terminal state; computing a baseline-corrected CFV of the execution device in the terminal state; for each non-terminal state having child states, computing a CFV of the execution device in the non-terminal state based on a weighted sum of the baseline-corrected CFVs of the execution device in the child states; computing a baseline-corrected CFV and a CFV baseline of the execution device in the non-terminal state; and determining an action selection policy in the non-terminal state for the next iteration.

First claim

Opening claim text (preview).

The invention claimed is: 1. A computer-implemented method of an execution device for generating an action selection policy for completing a task in an environment that includes the execution device and one or more other devices, the method comprising: in a current iteration of a plurality of iterations, computing, by the execution device, a counterfactual value (CFV) of the execution device in a terminal state of completing a task based on a payoff of the execution device at the terminal state and a reach probability of the one or more other devices reaching the terminal state, wherein the terminal state results from a sequence of actions taken at a plurality of non-terminal states by the execution device and by the one or more other devices, wherein each of the plurality of non-terminal states has one or more child states; computing, by the execution device, a baseline-corrected CFV of the execution device in the terminal state based on the CFV of the execution device in the terminal state, a CFV baseline of the execution device in the terminal state of a previous iteration, or the CFV of the execution device in the terminal state and the CFV baseline of the execution device in the terminal state of the previous iteration; for each of the non-terminal states and starting from a non-terminal state that has the terminal state and one or more other terminal states as child states: computing, by the execution device, a CFV of the execution device in the non-terminal state based on a weighted sum of baseline-corrected CFVs of the execution device in the child states of the non-terminal state; computing, by the execution device, a baseline-corrected CFV of the execution device in the non-terminal state based on the CFV of the execution device in the non-terminal state, a CFV baseline of the execution device in the non-terminal state of a previous iteration, or the CFV of the execution device in the non-terminal state and the CFV baseline of the execution device in the non-terminal state of the previous iteration; computing, by the execution device, a CFV baseline of the execution device in the non-terminal state of the current iteration based on a weighted sum of the CFV baseline of the execution device in the non-terminal state of the previous iteration and the CFV of the execution device in the non-terminal state or the baseline-corrected CFV of the execution device in the non-terminal state; and determining, by the execution device, an action selection policy in the non-terminal state for the next iteration based on the baseline-corrected CFV of the execution device in the non-terminal state of the current iteration. 2. The computer-implemented method of claim 1 , further comprising, in response to determining that a convergence condition is met, controlling operations of the execution device in the non-terminal state based on the action selection policy in the non-terminal state for the next iteration. 3. The computer-implemented method of claim 1 , wherein determining the action selection policy in the non-terminal state for the next iteration based on the baseline-corrected CFV of the execution device in the non-terminal state of the current iteration comprises: calculating a regret value based on the baseline-corrected CFV of the execution device in the non-terminal state of the current iteration; and determining the action selection policy in the non-terminal state for the next iteration based on the regret value according to regret matching. 4. The computer-implemented method of claim 1 , wherein the reach probability of the one or more other devices reaching the terminal state comprises a product of probabilities of actions taken by the one or more other devices reach the terminal state. 5. The computer-implemented method of claim 1 , wherein computing a baseline-corrected CFV of the execution device in the non-terminal state based on the CFV of the execution device in the non-terminal state, a CFV baseline of the execution device in the non-terminal state of a previous iteration, or the CFV of the execution device in the non-terminal state and the CFV baseline of the execution device in the non-terminal state of the previous iteration comprises: computing a sampled CFV baseline of the execution device that takes an action in the terminal state of the previous iteration based on the CFV baseline of the execution device in the terminal state of the previous iteration, a sampling policy of the execution device that takes the action in the terminal state of the previous iteration, and a probability of reaching the terminal state that results from the sequence of actions taken by the execution device; in response to determining that the action is sampled, computing a baseline-corrected CFV of the execution device that takes the action in the non-terminal state based on the CFV of the execution device in the non-terminal state and the sampled CFV baseline of the execution device that takes the action in the terminal state of the previous iteration; and in response to determining that the action is not sampled, using the sampled CFV baseline of the execution device that takes the action in the terminal state of the previous iteration as the baseline-corrected CFV of the execution device in the non-terminal state. 6. The computer-implemented method of claim 1 , wherein the weighted sum of the baseline-corrected CFV of the execution device in the terminal state and corresponding baseline-corrected CFVs of the execution device in the one or more other terminal states is computed based on the baseline-corrected CFV of the execution device in the terminal state and corresponding baseline-corrected CFVs of the execution device in the one or more other terminal states weighted by an action selection policy in the non-terminal state in the current iteration. 7. The computer-implemented method of claim 1 , wherein the weighted sum of the CFV baseline of the execution device in the non-terminal state of the previous iteration and the CFV or the baseline-corrected CFV of the execution device in the non-terminal state comprises a sum of: the CFV baseline of the execution device in the non-terminal state of the previous iteration weighted by a scalar; and the CFV or the baseline-corrected CFV of the execution device in the non-terminal state weighted by a second scalar and a probability of considering the non-terminal state. 8. A non-transitory, computer-readable medium storing one or more instructions executable by a computer system to perform operations comprising: in a current iteration of a plurality of iterations, computing a counterfactual value (CFV) of an execution device in a terminal state of completing a task based on a payoff of the execution device at the terminal state and a reach probability of one or more other devices reaching the terminal state, wherein the terminal state results from a sequence of actions taken at a plurality of non-terminal states by the execution device and by the one or more other devices, wherein each of the plurality of non-terminal states has one or more child states; computing a baseline-corrected CFV of the execution device in the terminal state based on the CFV of the execution device in the terminal state, a CFV baseline of the execution device in the terminal state of a previous iteration, or the CFV of the execution device in the terminal state and the CFV baseline of the execution device in the terminal state of the previous iteration; for each of the non-terminal states and starting from a non-terminal state that has the terminal state and one or more other terminal states as child states: computing a CFV of the execution device in the non-terminal state based on a weighted sum of baseline-corrected

Assignees

Inventors

Classifications

  • Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound · CPC title

  • Probabilistic graphical models, e.g. probabilistic networks · CPC title

  • based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO] · CPC title

  • considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration (scheduling strategies G06F9/4881 and subgroups) · CPC title

  • G06Q20/405Primary

    Establishing or using transaction specific rules · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11077368B2 cover?
Methods, systems, and apparatus, including computer programs encoded on computer storage media, are described, for generating an action selection policy of an execution device for completing a task in an environment. The method includes, in a current iteration, computing a counterfactual value (CFV) of the execution device in a terminal state based on a payoff of the execution device and a reac…
Who is the assignee on this patent?
Alipay Hangzhou Inf Tech Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06Q20/405. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Aug 03 2021 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).