Composite task execution
US-2019324795-A1 · Oct 24, 2019 · US
US11204803B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11204803-B2 |
| Application number | US-202117219038-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 31, 2021 |
| Priority date | Apr 2, 2020 |
| Publication date | Dec 21, 2021 |
| Grant date | Dec 21, 2021 |
Official abstract text for this publication.
Computer-implemented methods, systems, and apparatus, including computer-readable medium, for generating an action selection policy for causing an execution device to complete a task are described. Data representing a task that is divided into a sequence of subtasks are obtained. Data specifying a strategy neural network (SNN) for a subtask in the sequence of subtasks are obtained. The SNN receives inputs that include a sequence of actions that reach an initial state of the subtask, and predicts an action selection policy of the execution device for the subtask. The SNN is trained based on a value neural network (VNN) for a next subtask that follows the subtask in the sequence of subtasks. An input to the SNN is determined. The input includes a sequence of actions that reach a subtask initial state of the subtask. An action selection policy for completing the subtask is determined based on an output of the SNN.
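The input/output contract the abstract describes for the SNN (a sequence of actions in, an action selection policy out) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the action labels, the one-hot encoding of the last four actions, the single linear layer, and the random, untrained weights.

```python
import math
import random

ACTIONS = ["a0", "a1", "a2"]  # hypothetical action labels
MAX_LEN = 4                   # hypothetical history window


def encode_history(history, max_len=MAX_LEN):
    """One-hot encode the last max_len actions, zero-padded at the front."""
    recent = list(history[-max_len:])
    padded = [None] * (max_len - len(recent)) + recent
    vec = []
    for action in padded:
        vec.extend(1.0 if action == name else 0.0 for name in ACTIONS)
    return vec


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class StrategyNet:
    """Untrained single-layer stand-in for an SNN (illustrative only)."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        input_size = MAX_LEN * len(ACTIONS)
        # One weight row per action; random values stand in for training.
        self.weights = [
            [rng.uniform(-0.5, 0.5) for _ in range(input_size)]
            for _ in ACTIONS
        ]

    def policy(self, history):
        """Map an action sequence to a probability for each action."""
        x = encode_history(history)
        logits = [sum(w * xi for w, xi in zip(row, x)) for row in self.weights]
        return dict(zip(ACTIONS, softmax(logits)))


if __name__ == "__main__":
    net = StrategyNet(seed=0)
    # Actions taken so far that reach the subtask initial state.
    print(net.policy(["a0", "a2"]))
```

The softmax output guarantees a valid probability distribution over actions, which is the minimum a policy needs; a trained network would replace the random weights.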
Opening claim text (preview).
What is claimed is:

1. A computer-implemented method for generating an action selection policy causing an execution device to complete a task in an environment that includes the execution device and one or more other devices, the method comprising:

obtaining data representing a task that is divided into a sequence of subtasks, wherein: the task comprises a task initial state, a plurality of non-task-terminal states, and a plurality of task terminal states, wherein each of the plurality of the task terminal states results from a sequence of actions taken by the execution device and by the one or more other devices in a subset of the plurality of non-task-terminal states, and the plurality of the task terminal states have respective rewards in the task terminal states, each subtask in the sequence of subtasks comprises one or more subtask initial states and a plurality of subtask terminal states of the subtask, and except for a last subtask in the sequence of subtasks, the plurality of subtask terminal states of the subtask are a plurality of subtask initial states of a next subtask that follows the subtask in the sequence of subtasks;

determining that a specified subtask in the sequence of subtasks has a complexity that exceeds a threshold;

in response to determining that the specified subtask in the sequence of subtasks has the complexity that exceeds the threshold, obtaining data specifying a strategy neural network (SNN) for the specified subtask in the sequence of subtasks, wherein the SNN for the specified subtask receives inputs comprising a sequence of actions taken by the execution device and by the one or more other devices that reach a subtask initial state of the specified subtask, and predicts an action selection policy of the execution device for the specified subtask, wherein the SNN for the specified subtask is trained based on a value neural network (VNN) for a next subtask that follows the specified subtask in the sequence of subtasks, wherein the VNN for the next subtask receives inputs comprising reach probabilities of the execution device and the one or more other devices reaching a subtask initial state of the next subtask, and predicts a reward of the execution device in the subtask initial state of the next subtask;

determining a specified input to the SNN for the specified subtask, wherein the specified input comprises a specified sequence of actions taken by the execution device and by the one or more other devices that reach the subtask initial state of the specified subtask; and

determining an action selection policy for completing the specified subtask based on an output of the SNN for the specified subtask with the specified input to the SNN for the specified subtask.

2. The method of claim 1, further comprising controlling operations of the execution device in the specified subtask according to the action selection policy for completing the specified subtask.

3. The method of claim 1, further comprising: determining another subtask in the sequence of subtasks that has another complexity below the threshold; and determining another action selection policy for completing the another subtask by performing a tabular counterfactual regret minimization (CFR) algorithm to the another subtask.

4. The method of claim 1, further comprising: obtaining another SNN for another subtask in the sequence of subtasks that has a second complexity exceeding a second threshold, wherein the another subtask is behind the specified subtask in the sequence of subtasks, and the another SNN for the another subtask is trained independently from the SNN for the specified subtask; and determining another action selection policy for completing the another subtask by inputting, into the another SNN for the another subtask, another sequence of actions taken by the execution device and by the one or more other devices that reach an initial state of the another subtask from the task initial state, wherein the another sequence of actions comprises the sequence of actions.

5. The method of claim 1, further comprising determining an overall action selection policy for completing the task by determining a respective action selection policy for each of the subtask according to an order of the sequence of subtasks from a first subtask that comprises the task initial state to the last subtask that comprises the plurality of task terminal states.

6. The method of claim 1, wherein the SNN for the specified subtask is trained based on the VNN for the next subtask by: predicting a plurality of rewards in the plurality of subtask terminal states of the specified subtask based on the VNN for the next subtask; and training the SNN for the specified subtask based on the plurality of rewards in the plurality of subtask terminal states of the specified subtask according to a neural-network-based CFR algorithm.

7. A computer-implemented system for generating an action selection policy causing an execution device to complete a task in an environment that includes the execution device and one or more other devices, the computer-implemented system comprising: one or more processors; and one or more computer-readable memories coupled to the one or more processors and having instructions stored thereon that are executable by the one or more processors to perform operations comprising:

obtaining data representing a task that is divided into a sequence of subtasks, wherein: the task comprises a task initial state, a plurality of non-task-terminal states, and a plurality of task terminal states, wherein each of the plurality of the task terminal states results from a sequence of actions taken by the execution device and by the one or more other devices in a subset of the plurality of non-task-terminal states, and the plurality of the task terminal states have respective rewards in the task terminal states, each subtask in the sequence of subtasks comprises one or more subtask initial states and a plurality of subtask terminal states of the subtask, and except for a last subtask in the sequence of subtasks, the plurality of subtask terminal states of the subtask are a plurality of subtask initial states of a next subtask that follows the subtask in the sequence of subtasks;

in response to determining that a specified subtask in the sequence of subtasks has a complexity that exceeds a threshold, obtaining data specifying a strategy neural network (SNN) for the specified subtask in the sequence of subtasks, wherein the SNN for the specified subtask receives inputs comprising a sequence of actions taken by the execution device and by the one or more other devices that reach a subtask initial state of the specified subtask, and predicts an action selection policy of the execution device for the specified subtask, wherein the SNN for the specified subtask is trained based on a value neural network (VNN) for a next subtask that follows the specified subtask in the sequence of subtasks, wherein the VNN for the next subtask receives inputs comprising reach probabilities of the execution device and the one or more other devices reaching a subtask initial state of the next subtask, and predicts a reward of the execution device in the subtask initial state of the next subtask;

determining a specified input to the SNN for the specified subtask, wherein the specified input comprises a specified sequence of actions taken by the execution device and by the one or more other devices that reach the subtask initial state of the specified subtask; and

determining an action selection policy for completing the specified subtask based on an output of the SNN for the specified subtask with the specified input to the SNN for the specified subtask.

8. The computer-implemented system of claim
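Claims 1 and 3 together describe a per-subtask choice: a subtask whose complexity exceeds a threshold gets an SNN, and one below the threshold is solved with tabular CFR. The following is a rough sketch of that dispatch, plus regret matching, the standard per-decision update used inside CFR. The `Subtask` class, the integer complexity measure, the subtask names, and the routing of the "exactly equal to the threshold" case are all hypothetical; the claims do not specify them.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Policy = Dict[str, float]


@dataclass
class Subtask:
    name: str
    complexity: int  # hypothetical measure, e.g. a count of decision points


def choose_solver(subtask: Subtask, threshold: int) -> str:
    """Pick an SNN above the threshold, tabular CFR otherwise.

    Assumption: a complexity equal to the threshold goes to tabular CFR;
    the claims only cover the "exceeds" and "below" cases.
    """
    return "snn" if subtask.complexity > threshold else "tabular_cfr"


def plan(subtasks: List[Subtask], threshold: int) -> List[Tuple[str, str]]:
    """Assign a solver to each subtask, in sequence order (first to last)."""
    return [(s.name, choose_solver(s, threshold)) for s in subtasks]


def regret_matching(regrets: Dict[str, float]) -> Policy:
    """Turn cumulative regrets into a policy.

    Each action's probability is proportional to its positive regret;
    if no regret is positive, fall back to the uniform policy.
    """
    positive = {a: max(r, 0.0) for a, r in regrets.items()}
    total = sum(positive.values())
    if total > 0:
        return {a: p / total for a, p in positive.items()}
    return {a: 1.0 / len(regrets) for a in regrets}


if __name__ == "__main__":
    tasks = [Subtask("first", 50), Subtask("second", 5000), Subtask("last", 200)]
    print(plan(tasks, threshold=1000))
    print(regret_matching({"a": 3.0, "b": -1.0, "c": 1.0}))
```

Keeping the solver choice in one small function makes the threshold easy to tune per deployment, and iterating in sequence order mirrors claim 5, where the overall policy is assembled subtask by subtask from the first to the last.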
Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound · CPC title
Combinations of networks · CPC title
Learning methods · CPC title
Feedforward networks · CPC title
Reinforcement learning · CPC title