What technology area does this patent fall under?

Primary CPC classification G06N3/092. Mapped technology areas include Physics.

When was this patent published?

Publication date: September 30, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Fine-tuning policies to facilitate chaining

US12430564B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12430564-B2
Application number	US-202217684245-A
Country	US
Kind code	B2
Filing date	Mar 1, 2022
Priority date	Mar 1, 2022
Publication date	Sep 30, 2025
Grant date	Sep 30, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A manipulation task may include operations performed by one or more manipulation entities on one or more objects. This manipulation task may be broken down into a plurality of sequential sub-tasks (policies). These policies may be fine-tuned so that a terminal state distribution of a given policy matches an initial state distribution of another policy that immediately follows the given policy within the plurality of policies. The fine-tuned plurality of policies may then be chained together and implemented within a manipulation environment.
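The chaining step in the abstract (and in claim 1, where the terminal state produced by the first policy is provided as the initial state of the second) can be pictured with a toy rollout. This is a minimal sketch under stated assumptions, not the patented implementation: the one-dimensional `State`, the hypothetical `make_reach_policy` sub-task with its `target`, `gain` and `noise` parameters, and the `run_chain` helper are illustrative names that do not appear in the patent.

```python
import random
from typing import Callable, List

# Assumption: the whole environment state is reduced to one float (e.g. an object's position).
State = float
# A policy here is a rollout function: given an initial state, it returns a terminal state.
Policy = Callable[[State, random.Random], State]


def make_reach_policy(target: float, gain: float, noise: float) -> Policy:
    """Hypothetical sub-task: move the object a fraction `gain` of the way toward
    `target`, finishing with Gaussian noise of standard deviation `noise`."""
    def policy(state: State, rng: random.Random) -> State:
        return state + gain * (target - state) + rng.gauss(0.0, noise)
    return policy


def run_chain(policies: List[Policy], initial_state: State, rng: random.Random) -> List[State]:
    """Run the policies in order; each terminal state is handed to the next policy
    as its initial state. Returns the initial state followed by every terminal state."""
    trace = [initial_state]
    state = initial_state
    for policy in policies:
        state = policy(state, rng)
        trace.append(state)
    return trace


if __name__ == "__main__":
    rng = random.Random(0)
    # Two hypothetical sequential sub-tasks.
    first = make_reach_policy(target=1.0, gain=0.9, noise=0.05)
    second = make_reach_policy(target=2.0, gain=0.9, noise=0.05)
    print(run_chain([first, second], initial_state=0.0, rng=rng))
```

In this toy setting nothing guarantees that the states `first` ends in are states `second` was trained to start from; that gap is what the fine-tuning step described in the abstract addresses.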

First claim

Opening claim text (preview).

What is claimed is:
1. A method comprising, at a device: determining an initial state distribution of a second state-action policy, the initial state distribution including possible states of an environment immediately before the second state-action policy is implemented; fine-tuning a first state-action policy to match a terminal state distribution of the first state-action policy to the initial state distribution of the second state-action policy, the terminal state distribution including possible states of the environment immediately after the first state-action policy is implemented; implementing the fine-tuned first state-action policy and the second state-action policy in sequence, wherein a terminal state of the environment resulting from implementation of the first state-action policy is provided as an initial state of the environment to the second state-action policy.
2. The method of claim 1, wherein the first state-action policy and the second state-action policy each describes one or more manipulations performed by one or more manipulation entities on one or more objects.
3. The method of claim 2, wherein the one or more manipulation entities include one or more robotic manipulation devices.
4. The method of claim 2, wherein the one or more manipulation entities include one or more vehicle manipulation devices.
5. The method of claim 2, wherein the one or more objects include one or more components of a product being assembled.
6. The method of claim 2, wherein the one or more objects include one or more components of a vehicle being controlled.
7. The method of claim 2, wherein the initial state distribution of the second state-action policy identifies all possible states of the one or more manipulation entities and the one or more objects being manipulated immediately before the second state-action policy is implemented.
8. The method of claim 2, wherein the terminal state distribution of the first state-action policy identifies all possible states of the one or more manipulation entities and the one or more objects being manipulated immediately after the first state-action policy is implemented.
9. The method of claim 1, wherein the first state-action policy is adjusted so that the terminal state distribution of the first policy is within a predetermined threshold of the initial state distribution of the second state-action policy.
10. The method of claim 1, wherein the fine-tuning is performed within a simulation of the environment.
11. The method of claim 1, wherein the fine-tuning is performed within the environment.
12. The method of claim 1, comprising implementing the chained policies within the environment.
13. A system comprising: a hardware processor of a device that is configured to: determine an initial state distribution of a second state-action policy, the initial state distribution including possible states of an environment immediately before the second state-action policy is implemented; fine-tune a first state-action policy to match a terminal state distribution of the first state-action policy to the initial state distribution of the second state-action policy, the terminal state distribution including possible states of the environment immediately after the first state-action policy is implemented; implement the fine-tuned first state-action policy and the second state-action policy in sequence, wherein a terminal state of the environment resulting from implementation of the first state-action policy is provided as an initial state of the environment to the second state-action policy.
14. The system of claim 13, wherein the first state-action policy and the second state-action policy each describes one or more manipulations performed by one or more manipulation entities on one or more objects.
15. The system of claim 14, wherein the one or more manipulation entities include one or more robotic manipulation devices.
16. The system of claim 14, wherein the one or more manipulation entities include one or more vehicle manipulation devices.
17. The system of claim 14, wherein the one or more objects include one or more components of a product being assembled.
18. A non-transitory computer-readable storage medium storing instructions that, when executed by a processor of a device, causes the processor to cause the device to: determine an initial state distribution of a second state-action policy, the initial state distribution including possible states of an environment immediately before the second state-action policy is implemented; fine-tune a first state-action policy to match a terminal state distribution of the first state-action policy to the initial state distribution of the second state-action policy, the terminal state distribution including possible states of the environment immediately after the first state-action policy is implemented; implement the fine-tuned first state-action policy and the second state-action policy in sequence, wherein a terminal state of the environment resulting from implementation of the first state-action policy is provided as an initial state of the environment to the second state-action policy.
19. The non-transitory computer-readable medium of claim 18, wherein the first state-action policy is adjusted so that the terminal state distribution of the first policy is within a predetermined threshold of the initial state distribution of the second state-action policy.
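Claims 9 and 19 only require that the first policy be adjusted until its terminal state distribution is within a predetermined threshold of the second policy's initial state distribution; the claims do not say how the two distributions are compared or how the adjustment is made. The sketch below is a minimal illustration under explicit assumptions: both distributions are represented by equal-size samples of a one-dimensional state, they are compared with the 1-D Wasserstein-1 distance (the mean absolute difference of the sorted samples), and reinforcement-learning fine-tuning is replaced by a plain grid search over one hypothetical policy parameter, `target`. The names `wasserstein_1d`, `rollout_terminal_states` and `fine_tune_target` are illustrative only.

```python
import random
from typing import List, Sequence, Tuple


def wasserstein_1d(a: Sequence[float], b: Sequence[float]) -> float:
    """Wasserstein-1 distance between two equal-size 1-D empirical samples:
    the mean absolute difference between the sorted samples."""
    if len(a) != len(b) or not a:
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def rollout_terminal_states(target: float, starts: Sequence[float],
                            gain: float = 0.9, noise: float = 0.05,
                            seed: int = 0) -> List[float]:
    """Hypothetical first policy: from each start state, move a fraction `gain`
    toward `target` and add Gaussian noise. Returns the sampled terminal states."""
    rng = random.Random(seed)  # fixed seed so every candidate sees the same noise
    return [s + gain * (target - s) + rng.gauss(0.0, noise) for s in starts]


def fine_tune_target(starts: Sequence[float], next_initial: Sequence[float],
                     candidates: Sequence[float],
                     threshold: float) -> Tuple[float, float, bool]:
    """Grid search standing in for fine-tuning: pick the candidate `target` whose
    terminal-state samples are closest to the next policy's initial-state samples,
    and report whether that distance is within `threshold`."""
    if not candidates:
        raise ValueError("need at least one candidate")
    best_target, best_dist = candidates[0], float("inf")
    for target in candidates:
        dist = wasserstein_1d(rollout_terminal_states(target, starts), next_initial)
        if dist < best_dist:
            best_target, best_dist = target, dist
    return best_target, best_dist, best_dist <= threshold


if __name__ == "__main__":
    rng = random.Random(1)
    starts = [rng.uniform(-1.0, 1.0) for _ in range(200)]            # first policy's start states
    next_initial = [2.0 + rng.gauss(0.0, 0.05) for _ in range(200)]  # assumed samples of policy 2's initial states
    candidates = [i * 0.1 for i in range(41)]                        # targets 0.0 .. 4.0
    print(fine_tune_target(starts, next_initial, candidates, threshold=0.1))
```

Because this toy policy covers only 90% of the distance to its `target`, the search is expected to select a value near 2.2 rather than 2.0; that offset is the kind of mismatch between one policy's terminal states and the next policy's initial states that the claimed fine-tuning is meant to remove.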

Assignees

Nvidia Corp

Inventors

Classifications

Code	CPC title
G05B19/41885	characterised by modeling, simulation of the manufacturing system
G05B19/41895	using automatic guided vehicles [AGV] (control of position or course of AGV's G05D1/00)
G05B19/41865	characterised by job scheduling, process planning, material flow
G06N3/092 (primary)	Reinforcement learning
G06N7/01	Probabilistic graphical models, e.g. probabilistic networks

Patent family

Related publications grouped by family.

Patent family ID: 87850446


Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12430564B2 cover?: A manipulation task may include operations performed by one or more manipulation entities on one or more objects. This manipulation task may be broken down into a plurality of sequential sub-tasks (policies). These policies may be fine-tuned so that a terminal state distribution of a given policy matches an initial state distribution of another policy that immediately follows the given policy w…
Who is the assignee on this patent?: Nvidia Corp


Related patents

Continual reinforcement learning with a multi-task agent

Cooperative multi-goal, multi-agent, multi-stage reinforcement learning

Training a policy model for a robotic task, using reinforcement learning and utilizing data that is based on episodes, of the robotic task, guided by an engineered policy

Device and method for controlling a robot

Maximum entropy regularised multi-goal reinforcement learning

Solving goal recognition using planning

Hybrid reward architecture for reinforcement learning
