Data-efficient hierarchical reinforcement learning
US-11992944-B2 · May 28, 2024 · US
| Field | Value |
|---|---|
| Publication number | US-12479093-B2 |
| Application number | US-202418673510-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 24, 2024 |
| Priority date | May 18, 2018 |
| Publication date | Nov 25, 2025 |
| Grant date | Nov 25, 2025 |
Official abstract text for this publication.
Training and/or utilizing a hierarchical reinforcement learning (HRL) model for robotic control. The HRL model can include at least a higher-level policy model and a lower-level policy model. Some implementations relate to technique(s) that enable more efficient off-policy training to be utilized in training of the higher-level policy model and/or the lower-level policy model. Some of those implementations utilize off-policy correction, which re-labels higher-level actions of experience data, generated in the past utilizing a previously trained version of the HRL model, with modified higher-level actions. The modified higher-level actions are then utilized to off-policy train the higher-level policy model. This can enable effective off-policy training despite the lower-level policy model being a different version at training time (relative to the version when the experience data was collected).
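The abstract states that off-policy correction re-labels stored higher-level actions with modified ones, but it does not say how a modified action is chosen. The sketch below assumes one plausible selection rule: from a small candidate set, keep the candidate under which the current lower-level policy best reproduces the logged lower-level actions (least squared error). The function `relabel_higher_action`, its parameters, the candidate set, and the toy policy in the usage note are all hypothetical and are not taken from the patent text.

```python
import numpy as np

def relabel_higher_action(states, lower_actions, original_action,
                          lower_policy, num_random=8, seed=0):
    """Pick a modified higher-level action for one stored experience segment.

    states:          logged states s_0..s_c, shape (c + 1, d)
    lower_actions:   logged lower-level actions a_0..a_{c-1}, shape (c, k)
    original_action: higher-level action stored with the segment, shape (d,)
    lower_policy:    current lower-level policy, called as f(state, higher_action)
    """
    rng = np.random.default_rng(seed)
    states = np.asarray(states, dtype=float)
    lower_actions = np.asarray(lower_actions, dtype=float)
    original_action = np.asarray(original_action, dtype=float)

    # Assumed candidate set: the stored action, the state change that was
    # actually achieved, and random samples around that achieved change.
    achieved = states[-1] - states[0]
    samples = rng.normal(loc=achieved, scale=0.5,
                         size=(num_random, achieved.shape[0]))
    candidates = np.vstack([original_action, achieved, samples])

    def reproduction_error(candidate):
        # Simplification: the candidate is held fixed across the segment.
        predicted = np.stack([lower_policy(s, candidate) for s in states[:-1]])
        return float(np.sum((predicted - lower_actions) ** 2))

    errors = [reproduction_error(c) for c in candidates]
    return candidates[int(np.argmin(errors))]
```

As a usage illustration, suppose the experience was logged when the lower-level policy returned one quarter of the requested differential, and the current version returns one half. Calling `relabel_higher_action([[0, 0], [0.25, 0], [0.5, 0]], [[0.25, 0], [0.25, 0]], [1.0, 0.0], lambda s, g: 0.5 * np.asarray(g))` then selects the achieved change `[0.5, 0.0]`, because the current policy reproduces the logged actions exactly under that relabeled action.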
Opening claim text (preview).
We claim:
1. A method implemented by one or more processors of a robot, the method comprising: at a first control step: identifying a first current state observation; determining, using a higher-level policy model, a higher-level action for transitioning from the first current state to a goal state; generating a first lower-level action based on processing, using a lower-level policy model, the first current state observation and the higher-level action; and applying the first lower-level action to cause the robot to transition to an updated state; at a second control step: identifying a second current state observation; generating a second lower-level action based on processing, using the lower-level policy model, the second current state observation and the higher-level action that was determined in the first control step; and applying the second lower-level action to cause the robot to transition to a further updated state; and at a subsequent control step that is subsequent to the second control step: identifying a subsequent current state observation; determining, using the higher-level policy model, a subsequent higher-level action for transitioning from the subsequent current state to the goal state; generating a subsequent lower-level action based on processing, using the lower-level policy model, the subsequent current state observation and the subsequent higher-level action; and applying the subsequent lower-level action to cause the robot to transition to a subsequent updated state.
2. The method of claim 1, wherein the subsequent control step is immediately subsequent to the second control step.
3. The method of claim 2, further comprising: generating a first intrinsic reward for the first lower-level action, the first intrinsic reward generated based on the updated state and the goal state; generating a second intrinsic reward for the second lower-level action, the second intrinsic reward generated based on the further updated state and the goal state; and training the lower-level policy model based on the first and second intrinsic rewards.
4. The method of claim 3, wherein: generating the first intrinsic reward based on the updated state and the goal state comprises generating the first intrinsic reward based on an L2 difference between the updated state and the goal state observation; and generating the second intrinsic reward based on the further updated state and the goal state comprises generating the second intrinsic reward based on an L2 difference between the further updated state and the goal state observation.
5. The method of claim 1, wherein the first lower-level action includes torques that are applied to actuators of the robot to cause the robot to transition to the updated state.
6. The method of claim 5, wherein the higher-level action is a robotic state differential indicating the goal state.
7. The method of claim 1, wherein the higher-level action is a robotic state differential indicating the goal state.
8. The method of claim 1, wherein the first lower-level action includes one or more commands that are directly applied to actuators of the robot to cause the robot to transition to the updated state.
9. A robot comprising: one or more vision components; one or more actuators; memory storing instructions; one or more processors operable to execute the instructions to: at a first control step: identify a first current state observation; determine, using a higher-level policy model, a higher-level action for transitioning from the first current state to a goal state; generate a first lower-level action based on processing, using a lower-level policy model, the first current state observation and the higher-level action; and apply the first lower-level action to one or more of the actuators to cause the robot to transition to an updated state; at a second control step: identify a second current state observation; generate a second lower-level action based on processing, using the lower-level policy model, the second current state observation and the higher-level action that was determined in the first control step; and apply the second lower-level action to one or more of the actuators to cause the robot to transition to a further updated state; and at a subsequent control step that is subsequent to the second control step: identify a subsequent current state observation; determine, using the higher-level policy model, a subsequent higher-level action for transitioning from the subsequent current state to the goal state; generate a subsequent lower-level action based on processing, using the lower-level policy model, the subsequent current state observation and the subsequent higher-level action; and apply the subsequent lower-level action to one or more of the actuators to cause the robot to transition to a subsequent updated state.
10. The robot of claim 9, wherein the subsequent control step is immediately subsequent to the second control step.
11. The robot of claim 9, wherein one or more of the processors are further operable to execute the instructions to: generate a first intrinsic reward for the first lower-level action, the first intrinsic reward generated based on the updated state and the goal state; generate a second intrinsic reward for the second lower-level action, the second intrinsic reward generated based on the further updated state and the goal state; and train the lower-level policy model based on the first and second intrinsic rewards.
12. The robot of claim 11, wherein: in generating the first intrinsic reward based on the updated state and the goal state one or more of the processors are to generate the first intrinsic reward based on an L2 difference between the updated state and the goal state observation; and in generating the second intrinsic reward based on the further updated state and the goal state one or more of the processors are to generate the second intrinsic reward based on an L2 difference between the further updated state and the goal state observation.
13. The robot of claim 9, wherein the first lower-level action includes torques.
14. The robot of claim 13, wherein the higher-level action is a robotic state differential indicating the goal state.
15. The robot of claim 9, wherein the higher-level action is a robotic state differential indicating the goal state.
16. The robot of claim 9, wherein the first lower-level action includes one or more commands that are directly applied to one or more of the actuators.
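The control flow recited in claims 1 through 4 can be illustrated with a toy loop: a higher-level action is determined at one control step, reused unchanged at the next step, and then determined again at the subsequent step, with an intrinsic reward computed from the L2 difference between the reached state and the goal state. The sketch below is only an illustration under stated assumptions. The policies are hand-written stand-ins rather than trained models, the dynamics simply add the action to the state, the higher-level action is treated as a state differential (as in claims 6 and 7), and the reward is taken as the negative L2 distance, a sign convention the claims do not specify. All function names are hypothetical.

```python
import numpy as np

def higher_policy(state, target):
    # Hypothetical stand-in: a clipped state differential toward a task target.
    return np.clip(target - state, -1.0, 1.0)

def lower_policy(state, higher_action):
    # Hypothetical stand-in: act on half of the requested differential.
    return 0.5 * higher_action

def env_step(state, lower_action):
    # Toy dynamics: the action displaces the state directly
    # (a placeholder for applying torques or other actuator commands).
    return state + lower_action

def intrinsic_reward(next_state, goal_state):
    # Assumed sign convention: negative L2 distance to the goal state.
    return -float(np.linalg.norm(next_state - goal_state))

def rollout(state, target, num_steps, period=2):
    """Run the two-level loop; a new higher-level action every `period` steps."""
    state = np.asarray(state, dtype=float)
    target = np.asarray(target, dtype=float)
    log = []
    higher_action, goal_state = None, None
    for t in range(num_steps):
        if t % period == 0:
            # First and subsequent control steps: query the higher level.
            higher_action = higher_policy(state, target)
            goal_state = state + higher_action
        # Every step, including the second: the lower level sees the
        # current state and the most recent higher-level action.
        lower_action = lower_policy(state, higher_action)
        next_state = env_step(state, lower_action)
        reward = intrinsic_reward(next_state, goal_state)
        log.append((t, higher_action.copy(), lower_action, next_state.copy(), reward))
        state = next_state
    return state, log
```

For example, `rollout([0.0, 0.0], [1.5, 0.0], num_steps=4)` reuses the higher-level action `[1.0, 0.0]` at steps 0 and 1, then determines a new one, `[0.5, 0.0]`, at step 2. With `period=2`, the step at which the higher level is queried again is immediately subsequent to the second control step, matching claim 2.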
Reinforcement learning · CPC title
Feedforward networks · CPC title
Learning methods · CPC title
Machine learning · CPC title
based on physical entities controlled by simulated intelligence so as to replicate intelligent life forms, e.g. based on robots replicating pets or humans in their appearance or behaviour · CPC title