Data-efficient hierarchical reinforcement learning
US-2021187733-A1 · Jun 24, 2021 · US
US12354027B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12354027-B2 |
| Application number | US-201815943947-A |
| Country | US |
| Kind code | B2 |
| Filing date | Apr 3, 2018 |
| Priority date | Apr 3, 2018 |
| Publication date | Jul 8, 2025 |
| Grant date | Jul 8, 2025 |
Official abstract text for this publication.
A method and system for teaching an artificial intelligent agent where the agent can be placed in a state that it would like it to learn how to achieve. By giving the agent several examples, it can learn to identify what is important about these example states. Once the agent has the ability to recognize a goal configuration, it can use that information to then learn how to achieve the goal states on its own. An agent may be provided with positive and negative examples to demonstrate a goal configuration. Once the agent has learned certain goal configurations, the agent can learn policies and skills that achieve the learned goal configuration. The agent may create a collection of these policies and skills from which to select based on a particular command or state.
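The abstract describes learning "what is important" about a goal from positive and negative examples. The following is a minimal sketch of one way that idea could work, not the method disclosed in the patent. It assumes each state is a dictionary of numeric features. The feature names (`lamp_on`, `cup_x`, `noise`), the thresholds, and the functions `learn_goal_model` and `is_goal` are hypothetical. A feature is treated as important when the positive examples agree on its value and at least one negative example departs from that value.

```python
from statistics import mean, pvariance


def learn_goal_model(positives, negatives, max_variance=0.05, tolerance=0.2):
    """Return {feature: target_value} for features that appear to define the goal.

    Assumption: a feature matters if positives agree on it (low variance)
    and at least one negative example differs from the positive consensus.
    """
    model = {}
    for name in positives[0]:
        pos_values = [example[name] for example in positives]
        target = mean(pos_values)
        consistent = pvariance(pos_values) <= max_variance
        discriminative = any(
            abs(example[name] - target) > tolerance for example in negatives
        )
        if consistent and discriminative:
            model[name] = target
    return model


def is_goal(state, model, tolerance=0.2):
    """Goal detection: every important feature is close to its target value."""
    return bool(model) and all(
        abs(state[name] - target) <= tolerance for name, target in model.items()
    )


# Hypothetical example states (all feature names and values are illustrative).
positives = [
    {"lamp_on": 1.0, "cup_x": 0.50, "noise": 0.1},
    {"lamp_on": 1.0, "cup_x": 0.52, "noise": 0.9},
]
negatives = [
    {"lamp_on": 0.0, "cup_x": 0.50, "noise": 0.5},
    {"lamp_on": 1.0, "cup_x": 0.10, "noise": 0.2},
]
model = learn_goal_model(positives, negatives)
```

In this toy data, `noise` varies widely across the positive examples, so the sketch discards it and keeps `lamp_on` and `cup_x` as the features that characterize the goal.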
Opening claim text (preview).
What is claimed is:

1. A method for training an artificial intelligent agent to recognize a goal configuration, comprising: placing the agent in the goal configuration and identifying a resulting state as a positive example; providing negative examples to the agent that demonstrate the agent in a state failing to achieve the goal configuration; extracting key state features when the agent is in the goal configuration, the key state features including at least one of a room feature, object positioning, ambient lighting, and ambient sounds; determining what feature categories are important in the goal configuration during receipt of positive examples to the agent; learning and recognizing, by the agent, the goal configuration based on the extracted key state features and the determined important feature categories; creating policies, by the agent, based on the learned goal configuration; converting state features into a distance function to determine how far the agent is from the goal configuration; using goal detection as a final reward; and using a goal distance as an intermediate reward.

2. The method of claim 1, wherein an interface is used to indicate an example as being either the positive example or the negative example, the interface includes at least one of a spoken word received by the agent, an electronic signal received from a computing device, and a physical button on the agent.

3. The method of claim 1, wherein the step of extracting key state features includes looking for similarity in state features in each of the positive and negative examples.

4. The method of claim 3, further comprising increasing a confidence of the agent as the positive and negative examples are received by the agent.

5. The method of claim 4, wherein the agent takes an action upon reaching a predetermined level of confidence.

6. The method of claim 1, wherein the key state features are weighted according to a predetermined weight value.

7. The method of claim 1, further comprising asking, by the agent, for human feedback regarding whether the agent is in a goal state.

8. A system comprising a processor and a computer-usable medium embodying a computer program code, the computer program code comprising instructions executable by the processor and configured to provide a method of learning to recognize a goal configuration of an artificial agent, the method comprising: placing the agent in the goal configuration and identifying a resulting state as a positive example; providing negative examples to the agent that demonstrate the agent in a state failing to achieve the goal configuration; extracting key state features when the agent is in the goal configuration, the key state features including at least one of a room feature, object positioning, ambient lighting, and ambient sounds; determining what feature categories are important in the goal configuration during receipt of positive examples to the agent; learning and recognizing, by the agent, the goal configuration based on the extracted key state features and the determined important feature categories; creating policies, by the agent, based on the learned goal configuration; converting state features into a distance function to determine how far the agent is from the goal configuration; using the distance function as an intermediate reward for the agent; and using goal detection as a final reward.

9. The system of claim 8, wherein the method further comprises recognizing whether the agent is in an initialization state.

10. The system of claim 8, wherein the method further comprises self-practice by the agent.

11. The system of claim 10, wherein a selected goal configuration for self-practice is selected based on at least one of a random determination, which goal configuration needs the most improvement, which goal configuration is most likely to improve, which goal configuration has been used least recently, and which goal configuration is most used.

12. The system of claim 10, wherein the method further comprises biasing action choices based on which actions are important to achieving the goal configuration.

13. The system of claim 8, wherein the method further comprises updating a policy for achieving a goal configuration based on performance of the agent.

14. A non-transitory computer-readable storage medium with an executable program stored thereon, wherein the program instructs one or more processors to perform the following steps to cause an agent to recognize and learn a goal configuration: placing the agent in the goal configuration and identifying a resulting state as a positive example; providing negative examples to the agent that demonstrate the agent in a state failing to achieve the goal configuration; extracting key state features when the agent is in the goal configuration, the key state features including at least one of a room feature, object positioning, ambient lighting, and ambient sounds; determining what feature categories are important in the goal configuration during receipt of positive examples to the agent; learning and recognizing, by the agent, the goal configuration based on the extracted key state features and the determined important feature categories; creating policies, by the agent, based on the learned goal configuration; converting state features into a distance function to determine how far the agent is from the goal configuration; using the distance function as an intermediate reward for the agent; and using goal detection as a final reward.

15. The non-transitory computer-readable storage medium of claim 14, wherein the step of extracting key state features includes looking for similarity of the key state features in each of the positive and negative examples.

16. The non-transitory computer-readable storage medium of claim 14, wherein the program instructs one or more processors to perform the following steps: increasing a confidence of the agent as additional ones of the positive and negative examples are received by the agent; and taking an action by the agent upon reaching a predetermined level of confidence.
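The independent claims recite converting state features into a distance function, using the goal distance as an intermediate reward, and using goal detection as a final reward. The sketch below illustrates one possible reading of that reward structure and is not taken from the patent. The weighted Euclidean distance, the negative-distance shaping, the `tolerance` and `final_reward` values, and the names `goal_distance` and `reward` are all assumptions made for illustration. The optional `weights` argument loosely mirrors the predetermined feature weights of claim 6.

```python
import math


def goal_distance(state, goal_model, weights=None):
    """Weighted Euclidean distance between a state and the goal's target features.

    Assumption: goal_model maps feature name -> target value, and any feature
    without an explicit weight gets weight 1.0.
    """
    weights = weights or {}
    return math.sqrt(
        sum(
            weights.get(name, 1.0) * (state[name] - target) ** 2
            for name, target in goal_model.items()
        )
    )


def reward(state, goal_model, weights=None, tolerance=0.05, final_reward=10.0):
    """Final reward on goal detection, otherwise a distance-based intermediate reward."""
    distance = goal_distance(state, goal_model, weights)
    if distance <= tolerance:
        # Goal detected: pay the (hypothetical) final reward.
        return final_reward
    # Intermediate reward: less negative as the agent moves closer to the goal.
    return -distance


# Hypothetical two-feature goal used only to exercise the functions.
goal = {"x": 1.0, "y": 0.0}
at_goal = reward({"x": 1.0, "y": 0.0}, goal)
far_away = reward({"x": 4.0, "y": 4.0}, goal)
ignoring_y = goal_distance({"x": 4.0, "y": 4.0}, goal, weights={"y": 0.0})
```

Under these assumptions an agent that moves toward the goal sees its intermediate reward rise toward zero, then receives the larger final reward once the goal detector fires.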