Device and method to improve learning of a policy for robots

US12246450B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12246450-B2
Application number: US-202217652983-A
Country: US
Kind code: B2
Filing date: Mar 1, 2022
Priority date: Mar 16, 2021
Publication date: Mar 11, 2025
Grant date: Mar 11, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer-implemented method for learning a policy. The method includes: recording at least an episode of interactions of the agent with its environment following policy and adding the recorded episode to a set of training data; optimizing a transition dynamics model based on the training data such that the transition dynamics model predicts the next states of the environment depending on the states and actions contained in the training data; optimizing policy parameters based on the training data and the transition dynamics model by optimizing a reward. In the method, the transition dynamics model comprises a first model characterizing the global model and a second model characterizing a correction model, which is configured to correct outputs of the first model.
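The loop described in the abstract (record an episode, refit the dynamics model, re-optimize the policy) can be sketched as follows. This is only an illustrative toy under stated assumptions, not the patented implementation: the 1-D environment `true_step`, the linear policy `a = -k * s`, the least-squares global model without a bias term, the per-time-step residual used as the correction model, the quadratic reward, and the grid search over `k` are all hypothetical choices made for the example.

```python
import random

def true_step(s, a):
    # Hypothetical 1-D environment; the learner never reads these constants.
    return 0.9 * s + 0.5 * a + 0.1

def record_episode(k, rng, s0=1.0, horizon=10, noise=0.1):
    """Follow the policy a = -k*s (plus exploration noise); log (s, a, s_next)."""
    episode, s = [], s0
    for _ in range(horizon):
        a = -k * s + rng.gauss(0.0, noise)
        s_next = true_step(s, a)
        episode.append((s, a, s_next))
        s = s_next
    return episode

def fit_global_model(data):
    """First model: least-squares fit of s_next ~ ws*s + wa*a (no bias term,
    so it is deliberately imperfect and leaves something to correct)."""
    sss = sum(s * s for s, a, n in data)
    ssa = sum(s * a for s, a, n in data)
    saa = sum(a * a for s, a, n in data)
    ssn = sum(s * n for s, a, n in data)
    san = sum(a * n for s, a, n in data)
    det = sss * saa - ssa * ssa
    if abs(det) < 1e-12:          # degenerate data: fall back to identity model
        return 1.0, 0.0
    ws = (ssn * saa - san * ssa) / det
    wa = (san * sss - ssn * ssa) / det
    return ws, wa

def fit_correction(model, episode):
    """Correction model (assumed time-indexed): residual of the first model
    at each time step of the most recent episode."""
    ws, wa = model
    return [n - (ws * s + wa * a) for s, a, n in episode]

def model_return(k, model, correction, s0=1.0):
    """Reward of policy parameter k under first model + correction.
    Assumed reward: negative squared state at every step."""
    ws, wa = model
    s, total = s0, 0.0
    for d in correction:
        s = ws * s + wa * (-k * s) + d
        total -= s * s
    return total

def learn_policy(iterations=5, seed=0):
    rng = random.Random(seed)
    candidates = [i / 10 for i in range(21)]   # k in {0.0, 0.1, ..., 2.0}
    k, data = 0.0, []
    for _ in range(iterations):                # termination: fixed iteration count
        episode = record_episode(k, rng)       # 1. interact and record
        data.extend(episode)                   #    add to the training data
        model = fit_global_model(data)         # 2. optimize the dynamics model
        correction = fit_correction(model, episode)
        k = max(candidates,                    # 3. optimize the policy parameter
                key=lambda c: model_return(c, model, correction))
    return k
```

Calling `learn_policy()` returns the policy gain selected after the last iteration. A grid search stands in for whatever policy optimizer a real system would use; it keeps the example short and dependency-free.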

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for operating an agent depending on a learned policy obtained by: initializing a policy and a transition dynamics model which predicts a next state of an environment and/or of the agent; and repeating the following steps until a termination condition is fulfilled: recording at least an episode of interactions of the agent with the environment following the policy and adding the recorded episode to a set of training data, optimizing the transition dynamics model based on the training data such that the transition dynamics model predicts next states of the environment depending on states and actions contained in the training data, and optimizing policy parameters of the policy based on the training data and the transition dynamics model by optimizing a reward over at least one episode by following the policy; wherein the transition dynamics model includes a first model representing a learned model of the environment and a correction model which is configured to correct errors of the first model; wherein the method includes: sensing the environment using a sensor of the agent; determining a current state depending on the sensed environment; determining, using the learning policy, an action for the agent depending on the current state; carrying out the determined action, by the agent, wherein the correction model is optimized such that, when actions are selected as done when recording the episodes of the training data, then a sequence of states predicted by the transition dynamics model will be equal to the recorded states of the training data.

2. The method according to claim 1, wherein the correction model is selected by minimizing a difference between an output of the correction model and a difference between the recorded state of the training data and the predicted state by the first model.

3. A computer-implemented method for operating an agent depending on a learned policy obtained by: initializing a policy and a transition dynamics model which predicts a next state of an environment and/or of the agent; and repeating the following steps until a termination condition is fulfilled: recording at least an episode of interactions of the agent with the environment following the policy and adding the recorded episode to a set of training data, optimizing the transition dynamics model based on the training data such that the transition dynamics model predicts next states of the environment depending on states and actions contained in the training data, and optimizing policy parameters of the policy based on the training data and the transition dynamics model by optimizing a reward over at least one episode by following the policy; wherein the transition dynamics model includes a first model representing a learned model of the environment and a correction model which is configured to correct errors of the first model; wherein the method includes: sensing the environment using a sensor of the agent; determining a current state depending on the sensed environment; determining, using the learning policy, an action for the agent depending on the current state; carrying out the determined action, by the agent, wherein the correction model is dependent on a state or a time, wherein the time characterizes a time span elapsed since a beginning of a respective episode.

4. The method according to claim 3, wherein the environment is deterministic and the correction model is dependent on the time.

5. A computer-implemented method for operating an agent depending on a learned policy obtained by: initializing a policy and a transition dynamics model which predicts a next state of an environment and/or of the agent; and repeating the following steps until a termination condition is fulfilled: recording at least an episode of interactions of the agent with the environment following the policy and adding the recorded episode to a set of training data, optimizing the transition dynamics model based on the training data such that the transition dynamics model predicts next states of the environment depending on states and actions contained in the training data, and optimizing policy parameters of the policy based on the training data and the transition dynamics model by optimizing a reward over at least one episode by following the policy; wherein the transition dynamics model includes a first model representing a learned model of the environment and a correction model which is configured to correct errors of the first model; wherein the method includes: sensing the environment using a sensor of the agent; determining a current state depending on the sensed environment; determining, using the learning policy, an action for the agent depending on the current state; carrying out the determined action, by the agent, wherein the correction model is a probabilistic function, and wherein the probabilistic function is optimized by approximate inference.

6. A computer-implemented method for operating an agent depending on a learned policy obtained by: initializing a policy and a transition dynamics model which predicts a next state of an environment and/or of the agent; and repeating the following steps until a termination condition is fulfilled: recording at least an episode of interactions of the agent with the environment following the policy and adding the recorded episode to a set of training data, optimizing the transition dynamics model based on the training data such that the transition dynamics model predicts next states of the environment depending on states and actions contained in the training data, and optimizing policy parameters of the policy based on the training data and the transition dynamics model by optimizing a reward over at least one episode by following the policy; wherein the transition dynamics model includes a first model representing a learned model of the environment and a correction model which is configured to correct errors of the first model; wherein the method includes: sensing the environment using a sensor of the agent; determining a current state depending on the sensed environment; determining, using the learning policy, an action for the agent depending on the current state; carrying out the determined action, by the agent, wherein the correction model is optimized jointly with the first model.

7. The method according to claim 6, wherein the agent is an at least partially autonomous robot and/or a manufacturing machine and/or an access control system.

8. The method according to claim 6, carrying out the determined action, by the agent, wherein for optimizing the transition dynamics model, after optimizing the first model on the training data, the correction model is selected such that error of the first model is minimized for actions selected from the policy on the training data.

9. A machine-readable storage medium on which is stored a computer program for learning a policy for an agent, the computer program, when executed by a computer, causing the computer to perform the following steps: initializing a policy and a transition dynamics model which predicts a next state of an environment and/or of the agent; and repeating the following steps until a termination condition is fulfilled: recording at least an episode of interactions of the agent with the environment following the policy and adding the recorded episode to a set of training data, optimizing the transition dynamics model based on the training data such that the transition dynamics model predicts next states of the environment depending on states and actions contained in the training data, and optimizing policy parameter
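Claims 1, 2 and 4 describe a correction model that, for a deterministic environment, can depend on the elapsed time in the episode and that makes the combined model reproduce the recorded states when the recorded actions are replayed. One way to illustrate that property is a per-time-step residual table, d_t = s_(t+1) - g(s_t, a_t), which drives the difference named in claim 2 to zero. The sketch below is a hedged illustration only: `true_step`, `global_model`, the scalar state, and the table-based correction are assumptions made for the example, not details taken from the patent.

```python
def true_step(s, a):
    # Hypothetical deterministic environment used only to generate a recording.
    return 0.9 * s + 0.5 * a + 0.1

def global_model(s, a):
    # Hypothetical, deliberately imperfect first model g(s, a).
    return 0.8 * s + 0.4 * a

def record_with_actions(s0, actions):
    """Apply a fixed action sequence to the environment; log (s, a, s_next)."""
    episode, s = [], s0
    for a in actions:
        s_next = true_step(s, a)
        episode.append((s, a, s_next))
        s = s_next
    return episode

def fit_time_correction(model, episode):
    """Time-indexed correction: d[t] = recorded next state minus the first
    model's prediction at step t. For a table with one entry per time step,
    this is the exact minimizer of |d[t] - (s_next - model(s, a))|."""
    return [s_next - model(s, a) for s, a, s_next in episode]

def rollout(model, correction, s0, actions):
    """Predict a state sequence with first model + time-indexed correction,
    feeding each predicted state back in as the next input."""
    states, s = [], s0
    for t, a in enumerate(actions):
        s = model(s, a) + correction[t]
        states.append(s)
    return states
```

Replaying the recorded actions through `rollout` with the fitted correction should match the recorded next states up to floating-point rounding, while a rollout with an all-zero correction drifts away from them. Because the table is indexed by time rather than state, it is only meaningful when the environment is deterministic and episodes start from the same initial state.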

Assignees

Inventors

Classifications

  • Reinforcement learning · CPC title

  • Function evaluation by approximation methods, e.g. inter- or extrapolation, smoothing, least mean square method (G06F17/18 takes precedence; interpolation for numerical control G05B19/18) · CPC title

  • Combinations of networks · CPC title

  • Probabilistic graphical models, e.g. probabilistic networks · CPC title

  • the criterion being a learning criterion · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12246450B2 cover?
A computer-implemented method for learning a policy. The method includes: recording at least an episode of interactions of the agent with its environment following policy and adding the recorded episode to a set of training data; optimizing a transition dynamics model based on the training data such that the transition dynamics model predicts the next states of the environment depending on …
Who is the assignee on this patent?
Robert Bosch GmbH
What technology area does this patent fall under?
Primary CPC classification B25J9/163. Mapped technology areas include Operations & Transport.
When was this patent published?
Publication date Mar 11, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).