Data-efficient hierarchical reinforcement learning

US12479093B2 · US · B2

Patent metadata
Publication number: US-12479093-B2
Application number: US-202418673510-A
Country: US
Kind code: B2
Filing date: May 24, 2024
Priority date: May 18, 2018
Publication date: Nov 25, 2025
Grant date: Nov 25, 2025

Abstract

Official abstract text for this publication.

Training and/or utilizing a hierarchical reinforcement learning (HRL) model for robotic control. The HRL model can include at least a higher-level policy model and a lower-level policy model. Some implementations relate to technique(s) that enable more efficient off-policy training to be utilized in training of the higher-level policy model and/or the lower-level policy model. Some of those implementations utilize off-policy correction, which re-labels higher-level actions of experience data, generated in the past utilizing a previously trained version of the HRL model, with modified higher-level actions. The modified higher-level actions are then utilized to off-policy train the higher-level policy model. This can enable effective off-policy training despite the lower-level policy model being a different version at training time (relative to the version when the experience data was collected).
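The re-labeling step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the modified higher-level action is selected, so the criterion used here is an assumption: among a few candidates, pick the one under which the logged lower-level actions are most likely given the current lower-level policy, scored with an unnormalized Gaussian log-likelihood. The names `action_log_likelihood`, `relabel_higher_level_action`, `num_random`, and `noise_scale` are hypothetical and do not come from the patent.

```python
import random


def action_log_likelihood(lower_policy, states, actions, higher_action):
    """Unnormalized Gaussian log-likelihood of the logged lower-level actions,
    assuming the current lower-level policy is conditioned on `higher_action`."""
    total = 0.0
    for state, action in zip(states, actions):
        predicted = lower_policy(state, higher_action)
        total -= 0.5 * sum((a - p) ** 2 for a, p in zip(action, predicted))
    return total


def relabel_higher_level_action(lower_policy, states, actions, original_action,
                                num_random=8, noise_scale=0.5, seed=0):
    """Return a modified higher-level action for one stored experience segment.

    `states` holds len(actions) + 1 state observations. The candidate set
    (original action, achieved state change, random perturbations of it) is an
    illustrative assumption, not something the abstract specifies.
    """
    rng = random.Random(seed)
    # State change actually achieved over the segment.
    achieved = [end - start for start, end in zip(states[0], states[-1])]
    candidates = [list(original_action), achieved]
    for _ in range(num_random):
        candidates.append([x + rng.gauss(0.0, noise_scale) for x in achieved])
    # Keep the candidate that best explains the logged lower-level actions.
    return max(
        candidates,
        key=lambda c: action_log_likelihood(lower_policy, states[:-1], actions, c),
    )
```

The relabeled action would then replace the stored higher-level action before that experience is used to off-policy train the higher-level policy model. In this sketch the candidate is held fixed across the segment, which is a simplification.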

First claim

Opening claim text (preview).

We claim:

1. A method implemented by one or more processors of a robot, the method comprising: at a first control step: identifying a first current state observation; determining, using a higher-level policy model, a higher-level action for transitioning from the first current state to a goal state; generating a first lower-level action based on processing, using a lower-level policy model, the first current state observation and the higher-level action; and applying the first lower-level action to cause the robot to transition to an updated state; at a second control step: identifying a second current state observation; generating a second lower-level action based on processing, using the lower-level policy model, the second current state observation and the higher-level action that was determined in the first control step; and applying the second lower-level action to cause the robot to transition to a further updated state; and at a subsequent control step that is subsequent to the second control step: identifying a subsequent current state observation; determining, using the higher-level policy model, a subsequent higher-level action for transitioning from the subsequent current state to the goal state; generating a subsequent lower-level action based on processing, using the lower-level policy model, the subsequent current state observation and the subsequent higher-level action; and applying the subsequent lower-level action to cause the robot to transition to a subsequent updated state.

2. The method of claim 1, wherein the subsequent control step is immediately subsequent to the second control step.

3. The method of claim 2, further comprising: generating a first intrinsic reward for the first lower-level action, the first intrinsic reward generated based on the updated state and the goal state; generating a second intrinsic reward for the second lower-level action, the second intrinsic reward generated based on the further updated state and the goal state; and training the lower-level policy model based on the first and second intrinsic rewards.

4. The method of claim 3, wherein: generating the first intrinsic reward based on the updated state and the goal state comprises generating the first intrinsic reward based on an L2 difference between the updated state and the goal state observation; and generating the second intrinsic reward based on the further updated state and the goal state comprises generating the second intrinsic reward based on an L2 difference between the further updated state and the goal state observation.

5. The method of claim 1, wherein the first lower-level action includes torques that are applied to actuators of the robot to cause the robot to transition to the updated state.

6. The method of claim 5, wherein the higher-level action is a robotic state differential indicating the goal state.

7. The method of claim 1, wherein the higher-level action is a robotic state differential indicating the goal state.

8. The method of claim 1, wherein the first lower-level action includes one or more commands that are directly applied to actuators of the robot to cause the robot to transition to the updated state.

9. A robot comprising: one or more vision components; one or more actuators; memory storing instructions; one or more processors operable to execute the instructions to: at a first control step: identify a first current state observation; determine, using a higher-level policy model, a higher-level action for transitioning from the first current state to a goal state; generate a first lower-level action based on processing, using a lower-level policy model, the first current state observation and the higher-level action; and apply the first lower-level action to one or more of the actuators to cause the robot to transition to an updated state; at a second control step: identify a second current state observation; generate a second lower-level action based on processing, using the lower-level policy model, the second current state observation and the higher-level action that was determined in the first control step; and apply the second lower-level action to one or more of the actuators to cause the robot to transition to a further updated state; and at a subsequent control step that is subsequent to the second control step: identify a subsequent current state observation; determine, using the higher-level policy model, a subsequent higher-level action for transitioning from the subsequent current state to the goal state; generate a subsequent lower-level action based on processing, using the lower-level policy model, the subsequent current state observation and the subsequent higher-level action; and apply the subsequent lower-level action to one or more of the actuators to cause the robot to transition to a subsequent updated state.

10. The robot of claim 9, wherein the subsequent control step is immediately subsequent to the second control step.

11. The robot of claim 9, wherein one or more of the processors are further operable to execute the instructions to: generate a first intrinsic reward for the first lower-level action, the first intrinsic reward generated based on the updated state and the goal state; generate a second intrinsic reward for the second lower-level action, the second intrinsic reward generated based on the further updated state and the goal state; and train the lower-level policy model based on the first and second intrinsic rewards.

12. The robot of claim 11, wherein: in generating the first intrinsic reward based on the updated state and the goal state one or more of the processors are to generate the first intrinsic reward based on an L2 difference between the updated state and the goal state observation; and in generating the second intrinsic reward based on the further updated state and the goal state one or more of the processors are to generate the second intrinsic reward based on an L2 difference between the further updated state and the goal state observation.

13. The robot of claim 9, wherein the first lower-level action includes torques.

14. The robot of claim 13, wherein the higher-level action is a robotic state differential indicating the goal state.

15. The robot of claim 9, wherein the higher-level action is a robotic state differential indicating the goal state.

16. The robot of claim 9, wherein the first lower-level action includes one or more commands that are directly applied to one or more of the actuators.
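The control steps recited in claims 1 to 4 can be sketched as a short loop. This is an illustration under stated assumptions, not the patented implementation. The higher-level action is treated as a state differential (claims 6 and 7), so the goal state is the planning-time state plus that differential. The intrinsic reward is taken as the negative L2 distance to the goal state, where the sign is an assumption because the claims only say the reward is "based on" an L2 difference. The names `run_hierarchical_episode`, `intrinsic_reward`, `env_step`, and `replan_every` are hypothetical.

```python
import math


def intrinsic_reward(next_state, goal_state):
    """Negative L2 distance between the reached state and the goal state
    (the negative sign is an assumption)."""
    return -math.dist(next_state, goal_state)


def run_hierarchical_episode(env_step, higher_policy, lower_policy,
                             state, num_steps, replan_every=2):
    """Roll out a two-level policy and log one entry per control step.

    A new higher-level action is chosen every `replan_every` steps; in between,
    the lower-level policy reuses the previously determined higher-level action,
    as in the second control step of claim 1.
    """
    state = list(state)
    log = []
    higher_action, goal_state = None, None
    for t in range(num_steps):
        if t % replan_every == 0:
            # First / subsequent control step: query the higher-level policy.
            higher_action = list(higher_policy(state))
            goal_state = [s + h for s, h in zip(state, higher_action)]
        # Every control step: lower-level action from observation + higher-level action.
        low_action = list(lower_policy(state, higher_action))
        next_state = list(env_step(state, low_action))
        log.append({
            "step": t,
            "higher_action": list(higher_action),
            "low_action": low_action,
            "next_state": next_state,
            "reward": intrinsic_reward(next_state, goal_state),
        })
        state = next_state
    return log
```

With `replan_every=2`, the third control step is immediately subsequent to the second, matching claims 2 and 10. The logged rewards are the per-step quantities that claims 3 and 11 describe as the training signal for the lower-level policy model.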

Assignees

Google LLC

Inventors

Classifications

  • G06N3/092 (Primary)

    Reinforcement learning · CPC title

  • Feedforward networks · CPC title

  • Learning methods · CPC title

  • Machine learning · CPC title

  • based on physical entities controlled by simulated intelligence so as to replicate intelligent life forms, e.g. based on robots replicating pets or humans in their appearance or behaviour · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12479093B2 cover?
Training and/or utilizing a hierarchical reinforcement learning (HRL) model for robotic control. The HRL model can include at least a higher-level policy model and a lower-level policy model. Some implementations relate to technique(s) that enable more efficient off-policy training to be utilized in training of the higher-level policy model and/or the lower-level policy model. Some of those imp…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G06N3/092. Mapped technology areas include Physics.
When was this patent published?
Publication date: Nov 25, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).