Determining action selection policies of an execution device

US11204803B2 · US · B2

Patent metadata
Publication number: US-11204803-B2
Application number: US-202117219038-A
Country: US
Kind code: B2
Filing date: Mar 31, 2021
Priority date: Apr 2, 2020
Publication date: Dec 21, 2021
Grant date: Dec 21, 2021

Abstract

Official abstract text for this publication.

Computer-implemented methods, systems, and apparatus, including computer-readable media, for generating an action selection policy for causing an execution device to complete a task are described. Data representing a task that is divided into a sequence of subtasks are obtained. Data specifying a strategy neural network (SNN) for a subtask in the sequence of subtasks are obtained. The SNN receives inputs that include a sequence of actions that reach an initial state of the subtask, and predicts an action selection policy of the execution device for the subtask. The SNN is trained based on a value neural network (VNN) for a next subtask that follows the subtask in the sequence of subtasks. An input to the SNN is determined. The input includes a sequence of actions that reach a subtask initial state of the subtask. An action selection policy for completing the subtask is determined based on an output of the SNN.
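The input/output contract the abstract describes for the SNN (a sequence of actions in, an action selection policy out) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the patent: the names `ToyStrategyNet` and `encode_history`, the sizes `NUM_ACTIONS` and `MAX_HISTORY`, the one-hot encoding, and the single linear layer with random, untrained weights.

```python
import math
import random
from typing import List, Sequence

NUM_ACTIONS = 3   # hypothetical size of the action set
MAX_HISTORY = 4   # hypothetical cap on how many past actions are encoded


def encode_history(history: Sequence[int]) -> List[float]:
    """One-hot encode up to MAX_HISTORY most recent actions into a flat vector."""
    vec = [0.0] * (NUM_ACTIONS * MAX_HISTORY)
    for slot, action in enumerate(history[-MAX_HISTORY:]):
        vec[slot * NUM_ACTIONS + action] = 1.0
    return vec


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax; the result is a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class ToyStrategyNet:
    """Illustrative stand-in for an SNN: one linear layer followed by softmax.

    The weights are random and untrained; a real SNN would be trained,
    per the abstract, using a value neural network for the next subtask.
    """

    def __init__(self, seed: int = 0) -> None:
        rng = random.Random(seed)
        n_in = NUM_ACTIONS * MAX_HISTORY
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(n_in)]
            for _ in range(NUM_ACTIONS)
        ]

    def policy(self, history: Sequence[int]) -> List[float]:
        """Map the action sequence that reached the subtask to action probabilities."""
        x = encode_history(history)
        logits = [sum(w * xi for w, xi in zip(row, x)) for row in self.weights]
        return softmax(logits)


net = ToyStrategyNet()
policy = net.policy([0, 2, 1])  # actions that reached the subtask initial state
```

The output is a probability vector over the hypothetical action set; the execution device would sample or pick an action from it.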

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for generating an action selection policy causing an execution device to complete a task in an environment that includes the execution device and one or more other devices, the method comprising: obtaining data representing a task that is divided into a sequence of subtasks, wherein: the task comprises a task initial state, a plurality of non-task-terminal states, and a plurality of task terminal states, wherein each of the plurality of the task terminal states results from a sequence of actions taken by the execution device and by the one or more other devices in a subset of the plurality of non-task-terminal states, and the plurality of the task terminal states have respective rewards in the task terminal states, each subtask in the sequence of subtasks comprises one or more subtask initial states and a plurality of subtask terminal states of the subtask, and except for a last subtask in the sequence of subtasks, the plurality of subtask terminal states of the subtask are a plurality of subtask initial states of a next subtask that follows the subtask in the sequence of subtasks; determining that a specified subtask in the sequence of subtasks has a complexity that exceeds a threshold; in response to determining that the specified subtask in the sequence of subtasks has the complexity that exceeds the threshold, obtaining data specifying a strategy neural network (SNN) for the specified subtask in the sequence of subtasks, wherein the SNN for the specified subtask receives inputs comprising a sequence of actions taken by the execution device and by the one or more other devices that reach a subtask initial state of the specified subtask, and predicts an action selection policy of the execution device for the specified subtask, wherein the SNN for the specified subtask is trained based on a value neural network (VNN) for a next subtask that follows the specified subtask in the sequence of subtasks, wherein the VNN for the next subtask receives inputs comprising reach probabilities of the execution device and the one or more other devices reaching a subtask initial state of the next subtask, and predicts a reward of the execution device in the subtask initial state of the next subtask; determining a specified input to the SNN for the specified subtask, wherein the specified input comprises a specified sequence of actions taken by the execution device and by the one or more other devices that reach the subtask initial state of the specified subtask; and determining an action selection policy for completing the specified subtask based on an output of the SNN for the specified subtask with the specified input to the SNN for the specified subtask.

2. The method of claim 1, further comprising controlling operations of the execution device in the specified subtask according to the action selection policy for completing the specified subtask.

3. The method of claim 1, further comprising: determining another subtask in the sequence of subtasks that has another complexity below the threshold; and determining another action selection policy for completing the another subtask by performing a tabular counterfactual regret minimization (CFR) algorithm to the another subtask.

4. The method of claim 1, further comprising: obtaining another SNN for another subtask in the sequence of subtasks that has a second complexity exceeding a second threshold, wherein the another subtask is behind the specified subtask in the sequence of subtasks, and the another SNN for the another subtask is trained independently from the SNN for the specified subtask; and determining another action selection policy for completing the another subtask by inputting, into the another SNN for the another subtask, another sequence of actions taken by the execution device and by the one or more other devices that reach an initial state of the another subtask from the task initial state, wherein the another sequence of actions comprises the sequence of actions.

5. The method of claim 1, further comprising determining an overall action selection policy for completing the task by determining a respective action selection policy for each of the subtask according to an order of the sequence of subtasks from a first subtask that comprises the task initial state to the last subtask that comprises the plurality of task terminal states.

6. The method of claim 1, wherein the SNN for the specified subtask is trained based on the VNN for the next subtask by: predicting a plurality of rewards in the plurality of subtask terminal states of the specified subtask based on the VNN for the next subtask; and training the SNN for the specified subtask based on the plurality of rewards in the plurality of subtask terminal states of the specified subtask according to a neural-network-based CFR algorithm.

7. A computer-implemented system for generating an action selection policy causing an execution device to complete a task in an environment that includes the execution device and one or more other devices, the computer-implemented system comprising: one or more processors; and one or more computer-readable memories coupled to the one or more processors and having instructions stored thereon that are executable by the one or more processors to perform operations comprising: obtaining data representing a task that is divided into a sequence of subtasks, wherein: the task comprises a task initial state, a plurality of non-task-terminal states, and a plurality of task terminal states, wherein each of the plurality of the task terminal states results from a sequence of actions taken by the execution device and by the one or more other devices in a subset of the plurality of non-task-terminal states, and the plurality of the task terminal states have respective rewards in the task terminal states, each subtask in the sequence of subtasks comprises one or more subtask initial states and a plurality of subtask terminal states of the subtask, and except for a last subtask in the sequence of subtasks, the plurality of subtask terminal states of the subtask are a plurality of subtask initial states of a next subtask that follows the subtask in the sequence of subtasks; in response to determining that a specified subtask in the sequence of subtasks has a complexity that exceeds a threshold, obtaining data specifying a strategy neural network (SNN) for the specified subtask in the sequence of subtasks, wherein the SNN for the specified subtask receives inputs comprising a sequence of actions taken by the execution device and by the one or more other devices that reach a subtask initial state of the specified subtask, and predicts an action selection policy of the execution device for the specified subtask, wherein the SNN for the specified subtask is trained based on a value neural network (VNN) for a next subtask that follows the specified subtask in the sequence of subtasks, wherein the VNN for the next subtask receives inputs comprising reach probabilities of the execution device and the one or more other devices reaching a subtask initial state of the next subtask, and predicts a reward of the execution device in the subtask initial state of the next subtask; determining a specified input to the SNN for the specified subtask, wherein the specified input comprises a specified sequence of actions taken by the execution device and by the one or more other devices that reach the subtask initial state of the specified subtask; and determining an action selection policy for completing the specified subtask based on an output of the SNN for the specified subtask with the specified input to the SNN for the specified subtask.

8. The computer-implemented system of claim
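Claims 1 and 3 together describe a per-subtask choice: a subtask whose complexity exceeds a threshold is handled with an SNN, while one below the threshold is handled with tabular CFR. A minimal sketch of that dispatch follows, together with regret matching, the standard step tabular CFR uses to turn cumulative regrets into a policy. The `Subtask` class, the integer `complexity` measure, the function names, and the example numbers are hypothetical. Routing a complexity exactly equal to the threshold to tabular CFR is also an assumption, since the claims do not cover that case.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Subtask:
    name: str
    complexity: int  # hypothetical measure, e.g. a count of states


def choose_solver(subtask: Subtask, threshold: int) -> str:
    """Pick a solver label for one subtask.

    Assumption: complexity equal to the threshold falls back to tabular CFR.
    """
    return "snn" if subtask.complexity > threshold else "tabular_cfr"


def plan(subtasks: Sequence[Subtask], threshold: int) -> List[Tuple[str, str]]:
    """Walk the subtasks in sequence order and record the solver for each."""
    return [(s.name, choose_solver(s, threshold)) for s in subtasks]


def regret_matching(cumulative_regrets: Sequence[float]) -> List[float]:
    """Policy proportional to positive cumulative regret; uniform if none is positive."""
    positive = [max(r, 0.0) for r in cumulative_regrets]
    total = sum(positive)
    if total <= 0.0:
        return [1.0 / len(positive)] * len(positive)
    return [p / total for p in positive]


# Hypothetical three-stage task with an illustrative threshold.
stages = [Subtask("opening", 50), Subtask("middle", 10_000), Subtask("end", 200)]
assignment = plan(stages, threshold=1_000)
# assignment == [("opening", "tabular_cfr"), ("middle", "snn"), ("end", "tabular_cfr")]

policy = regret_matching([2.0, -1.0, 6.0])
# policy == [0.25, 0.0, 0.75]
```

Iterating in sequence order mirrors claim 5, which determines a policy for each subtask from the first subtask to the last.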

Assignees

Inventors

Classifications

  • Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound

  • Combinations of networks

  • G06N3/08 (Primary): Learning methods

  • Feedforward networks

  • Reinforcement learning

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11204803B2 cover?
Computer-implemented methods, systems, and apparatus, including computer-readable medium, for generating an action selection policy for causing an execution device to complete a task are described. Data representing a task that is divided into a sequence of subtasks are obtained. Data specifying a strategy neural network (SNN) for a subtask in the sequence of subtasks are obtained. The SNN rece…
Who is the assignee on this patent?
Alipay Hangzhou Inf Tech Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06N3/08. Mapped technology areas include Physics.
When was this patent published?
Publication date: Dec 21, 2021 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).