What technology area does this patent fall under?

Primary CPC classification B25J9/163. Mapped technology areas include Operations & Transport.

When was this patent published?

Publication date: Mar 11, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Device and method to improve learning of a policy for robots

US12246450B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12246450-B2
Application number	US-202217652983-A
Country	US
Kind code	B2
Filing date	Mar 1, 2022
Priority date	Mar 16, 2021
Publication date	Mar 11, 2025
Grant date	Mar 11, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer-implemented method for learning a policy. The method includes: recording at least an episode of interactions of the agent with its environment following policy and adding the recorded episode to a set of training data; optimizing a transition dynamics model based on the training data such that the transition dynamics model predicts the next states of the environment depending on the states and actions contained in the training data; optimizing policy parameters based on the training data and the transition dynamics model by optimizing a reward. In the method, the transition dynamics model comprises a first model characterizing the global model and a second model characterizing a correction model, which is configured to correct outputs of the first model.
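The loop described in the abstract can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the patented implementation: the 1-D environment `true_step`, the linear policy `a = -k * s`, the least-squares first model, the per-time-step additive correction, the quadratic reward, and the grid search over `k` are all hypothetical choices made here for brevity.

```python
import numpy as np

def true_step(s, a):
    # Hypothetical 1-D environment; the learner never sees this function.
    return 0.9 * s + 0.2 * np.sin(s) + 0.5 * a

def record_episode(k, rng, s0=1.0, horizon=20, noise=0.3):
    # Follow the policy a = -k * s (plus exploration noise) and log transitions.
    s, episode = s0, []
    for _ in range(horizon):
        a = -k * s + noise * rng.normal()
        s_next = true_step(s, a)
        episode.append((s, a, s_next))
        s = s_next
    return episode

def fit_first_model(data):
    # First ("global") model: least-squares fit of s' ~ w_s * s + w_a * a.
    X = np.array([[s, a] for s, a, _ in data])
    y = np.array([s_next for _, _, s_next in data])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def fit_correction(w, episode):
    # Correction model (assumed additive): one residual per time step.
    return np.array([s_next - (w[0] * s + w[1] * a) for s, a, s_next in episode])

def model_return(k, w, d, s0=1.0):
    # Reward of policy gain k when rolled out in the corrected model.
    s, total = s0, 0.0
    for d_t in d:
        a = -k * s
        total -= s**2 + 0.1 * a**2
        s = w[0] * s + w[1] * a + d_t
    return total

def learn_policy(iterations=5, seed=0):
    rng = np.random.default_rng(seed)
    k, data = 0.0, []                      # initial policy, empty training set
    for _ in range(iterations):            # fixed count as termination condition
        episode = record_episode(k, rng)   # 1. record an episode, grow the data
        data.extend(episode)
        w = fit_first_model(data)          # 2. optimize the dynamics model
        d = fit_correction(w, episode)
        grid = np.linspace(0.0, 2.0, 41)   # 3. optimize the policy parameter
        k = float(max(grid, key=lambda g: model_return(g, w, d)))
    return k

def true_return(k, s0=1.0, horizon=20):
    # Noise-free evaluation in the real environment, for comparison only.
    s, total = s0, 0.0
    for _ in range(horizon):
        a = -k * s
        total -= s**2 + 0.1 * a**2
        s = true_step(s, a)
    return total
```

Calling `learn_policy()` returns the learned gain, and `true_return` can be used to compare it against the initial policy `k = 0.0`. The grid search stands in for whatever policy optimizer is actually used; it is chosen here only because it keeps the sketch short and deterministic.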

First claim

Opening claim text (preview).

What is claimed is:
1. A computer-implemented method for operating an agent depending on a learned policy obtained by: initializing a policy and a transition dynamics model which predicts a next state of an environment and/or of the agent; and repeating the following steps until a termination condition is fulfilled: recording at least an episode of interactions of the agent with the environment following the policy and adding the recorded episode to a set of training data, optimizing the transition dynamics model based on the training data such that the transition dynamics model predicts next states of the environment depending on states and actions contained in the training data, and optimizing policy parameters of the policy based on the training data and the transition dynamics model by optimizing a reward over at least one episode by following the policy; wherein the transition dynamics model includes a first model representing a learned model of the environment and a correction model which is configured to correct errors of the first model; wherein the method includes: sensing the environment using a sensor of the agent; determining a current state depending on the sensed environment; determining, using the learning policy, an action for the agent depending on the current state; carrying out the determined action, by the agent, wherein the correction model is optimized such that, when actions are selected as done when recording the episodes of the training data, then a sequence of states predicted by the transition dynamics model will be equal to the recorded states of the training data.
2. The method according to claim 1, wherein the correction model is selected by minimizing a difference between an output of the correction model and a difference between the recorded state of the training data and the predicted state by the first model.
3. A computer-implemented method for operating an agent depending on a learned policy obtained by: initializing a policy and a transition dynamics model which predicts a next state of an environment and/or of the agent; and repeating the following steps until a termination condition is fulfilled: recording at least an episode of interactions of the agent with the environment following the policy and adding the recorded episode to a set of training data, optimizing the transition dynamics model based on the training data such that the transition dynamics model predicts next states of the environment depending on states and actions contained in the training data, and optimizing policy parameters of the policy based on the training data and the transition dynamics model by optimizing a reward over at least one episode by following the policy; wherein the transition dynamics model includes a first model representing a learned model of the environment and a correction model which is configured to correct errors of the first model; wherein the method includes: sensing the environment using a sensor of the agent; determining a current state depending on the sensed environment; determining, using the learning policy, an action for the agent depending on the current state; carrying out the determined action, by the agent, wherein the correction model is dependent on a state or a time, wherein the time characterizes a time span elapsed since a beginning of a respective episode.
4. The method according to claim 3, wherein the environment is deterministic and the correction model is dependent on the time.
5. A computer-implemented method for operating an agent depending on a learned policy obtained by: initializing a policy and a transition dynamics model which predicts a next state of an environment and/or of the agent; and repeating the following steps until a termination condition is fulfilled: recording at least an episode of interactions of the agent with the environment following the policy and adding the recorded episode to a set of training data, optimizing the transition dynamics model based on the training data such that the transition dynamics model predicts next states of the environment depending on states and actions contained in the training data, and optimizing policy parameters of the policy based on the training data and the transition dynamics model by optimizing a reward over at least one episode by following the policy; wherein the transition dynamics model includes a first model representing a learned model of the environment and a correction model which is configured to correct errors of the first model; wherein the method includes: sensing the environment using a sensor of the agent; determining a current state depending on the sensed environment; determining, using the learning policy, an action for the agent depending on the current state; carrying out the determined action, by the agent, wherein the correction model is a probabilistic function, and wherein the probabilistic function is optimized by approximate inference.
6. A computer-implemented method for operating an agent depending on a learned policy obtained by: initializing a policy and a transition dynamics model which predicts a next state of an environment and/or of the agent; and repeating the following steps until a termination condition is fulfilled: recording at least an episode of interactions of the agent with the environment following the policy and adding the recorded episode to a set of training data, optimizing the transition dynamics model based on the training data such that the transition dynamics model predicts next states of the environment depending on states and actions contained in the training data, and optimizing policy parameters of the policy based on the training data and the transition dynamics model by optimizing a reward over at least one episode by following the policy; wherein the transition dynamics model includes a first model representing a learned model of the environment and a correction model which is configured to correct errors of the first model; wherein the method includes: sensing the environment using a sensor of the agent; determining a current state depending on the sensed environment; determining, using the learning policy, an action for the agent depending on the current state; carrying out the determined action, by the agent, wherein the correction model is optimized jointly with the first model.
7. The method according to claim 6, wherein the agent is an at least partially autonomous robot and/or a manufacturing machine and/or an access control system.
8. The method according to claim 6, carrying out the determined action, by the agent, wherein for optimizing the transition dynamics model, after optimizing the first model on the training data, the correction model is selected such that error of the first model is minimized for actions selected from the policy on the training data.
9. A machine-readable storage medium on which is stored a computer program for learning a policy for an agent, the computer program, when executed by a computer, causing the computer to perform the following steps: initializing a policy and a transition dynamics model which predicts a next state of an environment and/or of the agent; and repeating the following steps until a termination condition is fulfilled: recording at least an episode of interactions of the agent with the environment following the policy and adding the recorded episode to a set of training data, optimizing the transition dynamics model based on the training data such that the transition dynamics model predicts next states of the environment depending on states and actions contained in the training data, and optimizing policy parameter
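One way to read the final clause of claim 1 together with claims 2 and 4 is as an additive, time-indexed correction. The derivation below is a sketch under that assumption. The symbols are notation introduced here, not taken from the patent: f_θ is the first model, d_t the correction at time step t, s_t and a_t the recorded states and actions, and ŝ_t the state predicted by the combined transition dynamics model.

```latex
\begin{align*}
\hat{s}_{t+1} &= f_\theta(\hat{s}_t, a_t) + d_t, \qquad \hat{s}_0 = s_0 \\
d_t &= \arg\min_{d} \bigl\| d - \bigl(s_{t+1} - f_\theta(s_t, a_t)\bigr) \bigr\|^2
     = s_{t+1} - f_\theta(s_t, a_t) \\
\hat{s}_t = s_t \;&\Rightarrow\;
\hat{s}_{t+1} = f_\theta(s_t, a_t) + s_{t+1} - f_\theta(s_t, a_t) = s_{t+1}
\end{align*}
```

By induction from ŝ_0 = s_0, replaying the recorded actions through the corrected model reproduces the recorded states. A purely time-indexed correction of this kind assumes that the same action sequence always produces the same states, which matches the deterministic-environment condition of claim 4.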

Assignees

Robert Bosch GmbH

Inventors

Classifications

Code	CPC title
G06N3/092	Reinforcement learning
G06F17/17	Function evaluation by approximation methods, e.g. inter- or extrapolation, smoothing, least mean square method (G06F17/18 takes precedence; interpolation for numerical control G05B19/18)
G06N3/045	Combinations of networks
G06N7/01	Probabilistic graphical models, e.g. probabilistic networks
G05B13/0265	the criterion being a learning criterion

Patent family

Related publications grouped by family.

View patent family 74946996

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12246450B2 cover?

A computer-implemented method for learning a policy. The method includes: recording at least an episode of interactions of the agent with its environment following policy and adding the recorded episode to a set of training data; optimizing a transition dynamics model based on the training data such that the transition dynamics model predicts the next states of the environment depending on …

Who is the assignee on this patent?

Robert Bosch GmbH

Related patents

Leveraging dynamical priors for symbolic mappings in safe reinforcement learning

System and Method for Robust Optimization for Trajectory-Centric Model-Based Reinforcement Learning

Monitored machine performance as a maintenance predictor
