US12430564B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12430564-B2 |
| Application number | US-202217684245-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 1, 2022 |
| Priority date | Mar 1, 2022 |
| Publication date | Sep 30, 2025 |
| Grant date | Sep 30, 2025 |
Official abstract text for this publication.
A manipulation task may include operations performed by one or more manipulation entities on one or more objects. This manipulation task may be broken down into a plurality of sequential sub-tasks (policies). These policies may be fine-tuned so that a terminal state distribution of a given policy matches an initial state distribution of another policy that immediately follows the given policy within the plurality of policies. The fine-tuned plurality of policies may then be chained together and implemented within a manipulation environment.
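The chaining step in the abstract can be illustrated with a toy example. This is a minimal sketch, not the patented implementation: it assumes a one-dimensional state, deterministic policies, and a fixed rollout length, and the names `run_policy`, `run_chain`, and `move_toward` are hypothetical.

```python
from typing import Callable, List, Sequence

State = float
# Assumption: a policy maps the current state to an action (a position delta).
Policy = Callable[[State], float]


def run_policy(policy: Policy, state: State, steps: int) -> State:
    """Roll one policy forward for `steps` steps and return its terminal state."""
    for _ in range(steps):
        state = state + policy(state)
    return state


def run_chain(policies: Sequence[Policy], initial_state: State, steps: int) -> List[State]:
    """Run the policies in order; each terminal state is the next initial state."""
    handoffs = [initial_state]
    state = initial_state
    for policy in policies:
        state = run_policy(policy, state, steps)
        handoffs.append(state)
    return handoffs


def move_toward(target: float, gain: float = 0.5) -> Policy:
    """Hypothetical sub-task policy: step a fraction of the way toward `target`."""
    return lambda s: gain * (target - s)


# Two sequential sub-tasks: reach position 1.0, then carry on to position 3.0.
reach = move_toward(1.0)
place = move_toward(3.0)
handoffs = run_chain([reach, place], initial_state=0.0, steps=20)
```

`handoffs` records the state at each boundary between sub-tasks, which is where a mismatch between one policy's terminal states and the next policy's expected initial states would show up.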
Opening claim text (preview).
What is claimed is:
1. A method comprising, at a device: determining an initial state distribution of a second state-action policy, the initial state distribution including possible states of an environment immediately before the second state-action policy is implemented; fine-tuning a first state-action policy to match a terminal state distribution of the first state-action policy to the initial state distribution of the second state-action policy, the terminal state distribution including possible states of the environment immediately after the first state-action policy is implemented; implementing the fine-tuned first state-action policy and the second state-action policy in sequence, wherein a terminal state of the environment resulting from implementation of the first state-action policy is provided as an initial state of the environment to the second state-action policy.
2. The method of claim 1, wherein the first state-action policy and the second state-action policy each describes one or more manipulations performed by one or more manipulation entities on one or more objects.
3. The method of claim 2, wherein the one or more manipulation entities include one or more robotic manipulation devices.
4. The method of claim 2, wherein the one or more manipulation entities include one or more vehicle manipulation devices.
5. The method of claim 2, wherein the one or more objects include one or more components of a product being assembled.
6. The method of claim 2, wherein the one or more objects include one or more components of a vehicle being controlled.
7. The method of claim 2, wherein the initial state distribution of the second state-action policy identifies all possible states of the one or more manipulation entities and the one or more objects being manipulated immediately before the second state-action policy is implemented.
8. The method of claim 2, wherein the terminal state distribution of the first state-action policy identifies all possible states of the one or more manipulation entities and the one or more objects being manipulated immediately after the first state-action policy is implemented.
9. The method of claim 1, wherein the first state-action policy is adjusted so that the terminal state distribution of the first policy is within a predetermined threshold of the initial state distribution of the second state-action policy.
10. The method of claim 1, wherein the fine-tuning is performed within a simulation of the environment.
11. The method of claim 1, wherein the fine-tuning is performed within the environment.
12. The method of claim 1, comprising implementing the chained policies within the environment.
13. A system comprising: a hardware processor of a device that is configured to: determine an initial state distribution of a second state-action policy, the initial state distribution including possible states of an environment immediately before the second state-action policy is implemented; fine-tune a first state-action policy to match a terminal state distribution of the first state-action policy to the initial state distribution of the second state-action policy, the terminal state distribution including possible states of the environment immediately after the first state-action policy is implemented; implement the fine-tuned first state-action policy and the second state-action policy in sequence, wherein a terminal state of the environment resulting from implementation of the first state-action policy is provided as an initial state of the environment to the second state-action policy.
14. The system of claim 13, wherein the first state-action policy and the second state-action policy each describes one or more manipulations performed by one or more manipulation entities on one or more objects.
15. The system of claim 14, wherein the one or more manipulation entities include one or more robotic manipulation devices.
16. The system of claim 14, wherein the one or more manipulation entities include one or more vehicle manipulation devices.
17. The system of claim 14, wherein the one or more objects include one or more components of a product being assembled.
18. A non-transitory computer-readable storage medium storing instructions that, when executed by a processor of a device, causes the processor to cause the device to: determine an initial state distribution of a second state-action policy, the initial state distribution including possible states of an environment immediately before the second state-action policy is implemented; fine-tune a first state-action policy to match a terminal state distribution of the first state-action policy to the initial state distribution of the second state-action policy, the terminal state distribution including possible states of the environment immediately after the first state-action policy is implemented; implement the fine-tuned first state-action policy and the second state-action policy in sequence, wherein a terminal state of the environment resulting from implementation of the first state-action policy is provided as an initial state of the environment to the second state-action policy.
19. The non-transitory computer-readable medium of claim 18, wherein the first state-action policy is adjusted so that the terminal state distribution of the first policy is within a predetermined threshold of the initial state distribution of the second state-action policy.
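The threshold condition recited in claims 9 and 19 can be illustrated with a small numerical sketch. Everything here is an assumption made for illustration: the claims do not name a distance measure or an update rule, so this example uses an empirical one-dimensional Wasserstein-1 distance and a simple mean-matching update in place of reinforcement-learning fine-tuning. The first policy is reduced to a single hypothetical parameter `theta`, and the helper names are invented.

```python
import random
from statistics import fmean
from typing import List, Tuple


def wasserstein_1d(a: List[float], b: List[float]) -> float:
    """Empirical 1-D Wasserstein-1 distance between two equal-size samples."""
    assert len(a) == len(b)
    return fmean(abs(x - y) for x, y in zip(sorted(a), sorted(b)))


def sample_terminal_states(theta: float, rng: random.Random, n: int,
                           noise: float = 0.05) -> List[float]:
    """Hypothetical first policy: its terminal states scatter around `theta`."""
    return [theta + rng.gauss(0.0, noise) for _ in range(n)]


def fine_tune(theta: float, init_states: List[float], threshold: float,
              rng: random.Random, lr: float = 0.5,
              max_iters: int = 200) -> Tuple[float, float]:
    """Adjust `theta` until the terminal-state sample of the first policy is
    within `threshold` of the second policy's initial-state sample."""
    n = len(init_states)
    terminal = sample_terminal_states(theta, rng, n)
    gap = wasserstein_1d(terminal, init_states)
    for _ in range(max_iters):
        if gap <= threshold:
            break
        # Mean-matching step (a stand-in for an RL fine-tuning update).
        theta += lr * (fmean(init_states) - fmean(terminal))
        terminal = sample_terminal_states(theta, rng, n)
        gap = wasserstein_1d(terminal, init_states)
    return theta, gap


rng = random.Random(0)
# Assumed initial-state distribution of the second policy: centred on 2.0.
init_states = [rng.gauss(2.0, 0.05) for _ in range(500)]
theta, gap = fine_tune(theta=0.0, init_states=init_states, threshold=0.02, rng=rng)
```

The loop stops as soon as the measured gap falls to or below the threshold, mirroring the "within a predetermined threshold" wording. A real system would replace the sampled scalars with environment states and the mean-matching step with an actual policy update.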
characterised by modeling, simulation of the manufacturing system · CPC title
using automatic guided vehicles [AGV] (control of position or course of AGV's G05D1/00) · CPC title
characterised by job scheduling, process planning, material flow · CPC title
Reinforcement learning · CPC title
Probabilistic graphical models, e.g. probabilistic networks · CPC title