Semi-supervised reinforcement learning

US11645498B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11645498-B2
Application number: US-201916582092-A
Country: US
Kind code: B2
Filing date: Sep 25, 2019
Priority date: Sep 25, 2019
Publication date: May 9, 2023
Grant date: May 9, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Provided is a method, a system, and a program product for determining a policy using semi-supervised reinforcement learning. The method includes observing a state of an environment by a learning agent. The method also includes taking an action by the learning agent. The method further includes observing a new state of the environment and calculating a reward for the action taken by the learning agent. The method also includes determining whether a policy related to the learning agent should be changed. The determination is conducted by a teaching agent that inputs the state of the environment and the reward as features. The method can also include changing the policy related to the learning agent upon a determination that a label outputted by the teaching agent exceeds a reward threshold.
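The loop described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the toy one-dimensional environment, the names `LearningAgent`, `TeachingAgent`, `transition`, `reward_fn`, and `step`, and the stand-in `score` function, which simply returns the reward where the patent describes a trained teaching agent.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

GOAL = 3  # hypothetical target position in a toy 1-D environment


@dataclass
class LearningAgent:
    # Policy: for each observed state, the action currently preferred.
    policy: Dict[int, str] = field(default_factory=dict)


@dataclass
class TeachingAgent:
    reward_threshold: float
    # Stand-in for a trained labeling model: (state, reward) -> label.
    score: Callable[[int, float], float]

    def label(self, state: int, reward: float) -> float:
        return self.score(state, reward)


def transition(state: int, action: str) -> int:
    # Hypothetical dynamics: "right" moves +1, anything else moves -1.
    return state + 1 if action == "right" else state - 1


def reward_fn(state: int, new_state: int) -> float:
    # +1 when the move brings the agent closer to GOAL, -1 when it moves away.
    return float(abs(GOAL - state) - abs(GOAL - new_state))


def step(agent: LearningAgent, teacher: TeachingAgent, state: int, action: str) -> int:
    new_state = transition(state, action)      # observe the new state
    reward = reward_fn(state, new_state)       # calculate the reward
    label = teacher.label(new_state, reward)   # teacher uses state and reward as features
    if label > teacher.reward_threshold:       # change the policy only above the threshold
        agent.policy[state] = action
    return new_state


agent = LearningAgent()
teacher = TeachingAgent(reward_threshold=0.0, score=lambda state, reward: reward)

step(agent, teacher, state=0, action="left")   # moves away from GOAL
policy_after_left = dict(agent.policy)
step(agent, teacher, state=0, action="right")  # moves toward GOAL
policy_after_right = dict(agent.policy)
print(policy_after_left, policy_after_right)
```

In this sketch the policy is left unchanged after the move away from the goal and is updated to match the action taken after the move toward it, which mirrors the threshold test in the abstract.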

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for determining a policy related to moving objects in an environment, the computer-implemented method comprising: observing a state of an object in an environment by a learning agent; taking an action, by the learning agent, based on a task assigned to the learning agent and on the state of the object; observing a new state of the object by the learning agent; calculating a reward for the action taken by the learning agent; and providing the reward, the new state of the object, and a velocity vector and an acceleration vector associated with the object to a teaching agent, wherein the teaching agent produces a label using features of the reward, the new state of the object, the velocity vector, and the acceleration vector, and wherein the teaching agent changes a policy relating to the learning agent to correspond to the action taken by the learning agent in response to the label exceeding a reward threshold.

2. The method of claim 1, further comprising: retaining the policy related to the learning agent upon a determination that the label produced by the teaching agent does not exceed the reward threshold.

3. The method of claim 1, further comprising: inputting the new state of the object and the reward to a neural network; outputting, by the neural network, a probability for the action taken by the learning agent; and providing the probability to the learning agent for analysis.

4. The method of claim 3, wherein the neural network includes an unbiased layer and a biased layer to regulate overfitting and underfitting.

5. The method of claim 1, further comprising: detecting, by a sensor, a velocity and an acceleration related to the object within the environment; generating the velocity vector related to the velocity and the acceleration vector related to the acceleration; and providing the teaching agent with the velocity vector and the acceleration vector for analysis.

6. The method of claim 1, wherein the teaching agent is trained with a velocity feature vector and an acceleration feature vector, wherein the velocity feature vector and the acceleration feature vector relate to the object within the environment and produced by an IoT device.

7. The method of claim 1, further comprising: determining a criticality of situation by a natural language processor; and providing the teaching agent with the criticality of situation.

8. The method of claim 7, wherein the criticality of situation is another feature used by the teaching agent to produce the label.

9. The method of claim 7, wherein the natural language processor determines the criticality of situation by analyzing an input from an administrator.

10. A system comprising: a learning agent configured to take an action based on a state of an object in an environment and a policy related to the learning agent, wherein the action produces a new state of the object in the environment; a neural network configured to produce a probability for the action taken, wherein the new state of the object and a reward related to the action are inputted into the neural network to produce the probability; a teaching agent configured to produce a label using features of the new state of the object, the reward, a velocity vector, and an acceleration vector associated with the object to produce the label, wherein the teaching agent changes the policy related to the learning agent to correspond to the action taken by the learning agent in response to the label exceeding a reward threshold; a remote sensor configured to detect the object within the environment and determine the velocity and the acceleration related to the object, the remote sensor further configured to provide the teaching agent with the features of the velocity vector and the acceleration vector related to the velocity and the acceleration respectively; and a natural language processor configured to analyze speech and text provided by an administrator observing the action taken by the learning agent, the natural language processor further configured to provide the teaching agent with an output produced.

11. The system of claim 10, wherein the neural network includes an unbiased layer and a biased layer, wherein the unbiased layer and the biased layer regulate overfitting and underfitting.

12. The system of claim 11, wherein the teaching agent is configured to input the probability, based on the unbiased layer and the biased layer, as an additional feature in determining the label.

13. The system of claim 10, wherein the teaching agent is initially trained with a labeled data in a controlled environment.

14. The system of claim 10, wherein the natural language processor is configured to perform a sentiment analysis on the speech and provide a sentiment level related to the sentiment analysis to the teaching agent.

15. The system of claim 14, wherein the teaching agent is configured to input the sentiment level as an additional feature in determining the label.

16. A computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor to cause the processor to: observe a state of an object in an environment by a learning agent; take an action, by the learning agent, based on a task assigned to the learning agent and on the state of the object; observe a new state of the object by the learning agent; calculate a reward for the action taken by the learning agent; and provide the reward, the new state of the object, and a velocity vector and an acceleration vector associated with the object to a teaching agent, wherein the teaching agent produces a label using features of the reward, the new state of the object, the velocity vector, and the acceleration vector, and wherein the teaching agent changes a policy related to the learning agent to correspond to the action taken by the learning agent in response to the label exceeding a reward threshold.

17. The computer program product of claim 16, wherein the program instructions further cause the processor to: input the new state of the object and the reward to a neural network; output, by the neural network, a probability for the action taken by the learning agent; and provide the probability to the learning agent for analysis.

18. The computer program product of claim 16, wherein the program instructions further cause the processor to: detect, by an Internet of Things (IoT) sensor, a velocity and an acceleration related to the object within the environment; generate the velocity vector related to the velocity and the acceleration vector related to the acceleration; and provide the teaching agent with the velocity vector and the acceleration vector for analysis.
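The feature set named in the independent claims (reward, new state, velocity vector, acceleration vector) and the threshold test can be sketched as follows. This is only an illustration under stated assumptions: the claims do not say how the sensor derives the vectors or what model the teaching agent uses, so the finite-difference estimate, the logistic scorer, the weights, and all function names (`finite_difference_vectors`, `build_features`, `teacher_label`, `update_policy`) are hypothetical.

```python
from math import exp
from typing import Dict, Hashable, List, Sequence, Tuple


def finite_difference_vectors(
    p0: Sequence[float], p1: Sequence[float], p2: Sequence[float], dt: float
) -> Tuple[List[float], List[float]]:
    """Estimate velocity and acceleration vectors from three positions sampled dt apart."""
    v_prev = [(b - a) / dt for a, b in zip(p0, p1)]
    v_curr = [(b - a) / dt for a, b in zip(p1, p2)]
    accel = [(b - a) / dt for a, b in zip(v_prev, v_curr)]
    return v_curr, accel


def build_features(
    reward: float,
    new_state: Sequence[float],
    velocity: Sequence[float],
    acceleration: Sequence[float],
) -> List[float]:
    # Concatenate the four inputs named in claim 1 into one flat feature vector.
    return [reward, *new_state, *velocity, *acceleration]


def teacher_label(features: Sequence[float], weights: Sequence[float], bias: float = 0.0) -> float:
    # Hypothetical stand-in for the teaching agent: a logistic score in (0, 1).
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + exp(-z))


def update_policy(
    policy: Dict[Hashable, str], state: Hashable, action: str, label: float, threshold: float
) -> Dict[Hashable, str]:
    # Claim 1: change the policy when the label exceeds the threshold.
    # Claim 2: otherwise retain the existing policy.
    if label > threshold:
        return {**policy, state: action}
    return policy


# Toy data: an object observed at three positions, one time unit apart.
velocity, acceleration = finite_difference_vectors((0, 0), (1, 0), (3, 0), dt=1.0)
features = build_features(reward=1.0, new_state=(3.0, 0.0),
                          velocity=velocity, acceleration=acceleration)
weights = [1.0, 0.1, 0.1, 0.2, 0.2, -0.1, -0.1]  # illustrative values only
label = teacher_label(features, weights)

policy = update_policy({}, state=(1, 0), action="accelerate", label=label, threshold=0.5)
retained = update_policy(policy, state=(1, 0), action="brake", label=0.2, threshold=0.5)
```

Additional inputs mentioned in the dependent claims, such as the neural-network probability, the criticality of situation, or a sentiment level, would in this sketch simply be appended to the feature vector with matching weights.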

Assignees

IBM

Inventors

Classifications

  • Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title

  • Reinforcement learning · CPC title

  • Feedforward networks · CPC title

  • using artificial neural networks · CPC title

  • G06N3/08 · Primary

    Learning methods · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11645498B2 cover?
Provided is a method, a system, and a program product for determining a policy using semi-supervised reinforcement learning. The method includes observing a state of an environment by a learning agent. The method also includes taking an action by the learning agent. The method further includes observing a new state of the environment and calculating a reward for the action taken by the learning…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06N3/08. Mapped technology areas include Physics.
When was this patent published?
Publication date May 9, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).