Entropy-based techniques for improved automated selection in computer-based reasoning systems
US-11880775-B1 · Jan 23, 2024 · US
US2022019866A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2022019866-A1 |
| Application number | US-201917297902-A |
| Country | US |
| Kind code | A1 |
| Filing date | Dec 2, 2019 |
| Priority date | Nov 30, 2018 |
| Publication date | Jan 20, 2022 |
| Grant date | — |
Official abstract text for this publication.
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for training a policy neural network having policy parameters. One of the methods includes obtaining trajectory data comprising one or more tuples; updating, using the trajectory data, current values of the policy parameters using a maximum entropy reinforcement learning technique that maximizes both (i) a reward term and (ii) an entropy term, wherein a relative weight between the entropy term and the reward term in the maximization is determined by a temperature parameter; and updating, using the probability distributions defined by the policy outputs generated in accordance with the current values of the policy parameters for the tuples in the trajectory data, the temperature parameter to regulate an expected entropy of the probability distributions to at least equal a minimum expected entropy value.
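To illustrate how the temperature parameter sets the relative weight between the reward term and the entropy term, here is a minimal sketch in plain Python. It assumes a small discrete action set so the entropy can be computed exactly; the function names `entropy` and `max_entropy_objective` and the toy numbers are hypothetical and do not come from the publication.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def max_entropy_objective(rewards, action_probs, temperature):
    """Reward term plus temperature-weighted entropy term over a batch of tuples.

    rewards:      one reward per tuple in the trajectory data
    action_probs: one action distribution per tuple (policy output)
    temperature:  relative weight of the entropy term (assumed >= 0)
    """
    reward_term = sum(rewards)
    entropy_term = sum(entropy(p) for p in action_probs)
    return reward_term + temperature * entropy_term


# Toy batch of two tuples (illustrative values only).
rewards = [1.0, 0.5]
action_probs = [[0.5, 0.5], [0.9, 0.1]]

low = max_entropy_objective(rewards, action_probs, temperature=0.0)
high = max_entropy_objective(rewards, action_probs, temperature=1.0)
```

With a temperature of zero the objective reduces to the plain reward sum; raising the temperature rewards the policy for keeping its action distributions spread out.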
Opening claim text (preview).
1. A method of training a policy neural network having a plurality of policy parameters and used to control a robot interacting with an environment, wherein the policy neural network is configured to receive as input a state representation characterizing a state of the environment and to process the state representation in accordance with the policy parameters to generate a policy output that defines a probability distribution over a set of actions that can be performed by the robot, the method comprising:

   obtaining trajectory data comprising one or more tuples, each tuple identifying a state representation characterizing a state of the environment, an action performed by the robot when the environment was in the state characterized by the state representation, a reward received in response to the robot performing the action, and a next state representation characterizing a next state of the environment after the robot performed the action;

   updating, using the trajectory data, current values of the policy parameters using a maximum entropy reinforcement learning technique that maximizes both (i) a reward term that measures total rewards in the tuples in the trajectory data and (ii) an entropy term that measures an entropy of probability distributions defined by policy outputs generated by processing the state representations in the tuples in the trajectory data in accordance with the current values of the policy parameters, wherein a relative weight between the entropy term and the reward term in the maximization is determined by a temperature parameter; and

   updating, using the probability distributions defined by the policy outputs generated in accordance with the current values of the policy parameters for the tuples in the trajectory data, the temperature parameter to regulate an expected entropy of the probability distributions to at least equal a minimum expected entropy value.

2. The method of claim 1, further comprising: controlling the robot using the policy neural network and in accordance with the updated values of the policy parameters.

3. The method of claim 1, wherein updating the temperature parameter comprises updating the temperature parameter using an objective function that depends on, for each of the probability distributions, the temperature parameter, the entropy of the probability distribution, and the minimum expected entropy value.

4. The method of claim 3, wherein the minimum expected entropy value is based on a number of action dimensions in the actions in the possible set of actions.

5. The method of claim 4, wherein the minimum expected entropy value is a negative of the number of action dimensions.

6. The method of claim 3, wherein updating the temperature parameter comprises: determining, for each of the one or more tuples, a gradient with respect to the temperature parameter of the objective function; and updating the temperature parameter using the determined gradients.

7. The method of claim 6, wherein determining the gradient comprises, for each of the tuples: sampling an action from the probability distribution generated by the policy neural network for the tuple; and determining a difference between (i) a negative of a logarithm of the probability assigned to the sampled action by the probability distribution and (ii) the minimum expected entropy value.

8. The method of claim 3, wherein the objective function satisfies:

   $$J(\alpha) = \mathbb{E}_{a_t \sim \pi_t}\left[-\alpha \log \pi_t(a_t \mid s_t) - \alpha H\right],$$

   where $\alpha$ is the temperature parameter, $\mathbb{E}$ is the expectation operator, $a_t$ is an action sampled from the probability distribution generated by the policy neural network $\pi_t$ for a $t$-th tuple by processing the state representation $s_t$ in the tuple in accordance with the current values of the policy parameters, and $H$ is the minimum expected entropy value.

9. (canceled)

10. A system comprising one or more computers and one or more storage devices storing instructions that when executed by one or more computers cause the one or more computers to perform operations for training a policy neural network having a plurality of policy parameters and used to control a robot interacting with an environment, wherein the policy neural network is configured to receive as input a state representation characterizing a state of the environment and to process the state representation in accordance with the policy parameters to generate a policy output that defines a probability distribution over a set of actions that can be performed by the robot, the operations comprising:

    obtaining trajectory data comprising one or more tuples, each tuple identifying a state representation characterizing a state of the environment, an action performed by the robot when the environment was in the state characterized by the state representation, a reward received in response to the robot performing the action, and a next state representation characterizing a next state of the environment after the robot performed the action;

    updating, using the trajectory data, current values of the policy parameters using a maximum entropy reinforcement learning technique that maximizes both (i) a reward term that measures total rewards in the tuples in the trajectory data and (ii) an entropy term that measures an entropy of probability distributions defined by policy outputs generated by processing the state representations in the tuples in the trajectory data in accordance with the current values of the policy parameters, wherein a relative weight between the entropy term and the reward term in the maximization is determined by a temperature parameter; and

    updating, using the probability distributions defined by the policy outputs generated in accordance with the current values of the policy parameters for the tuples in the trajectory data, the temperature parameter to regulate an expected entropy of the probability distributions to at least equal a minimum expected entropy value.

11. The system of claim 10, the operations further comprising: controlling the robot using the policy neural network and in accordance with the updated values of the policy parameters.

12. The system of claim 10, wherein updating the temperature parameter comprises updating the temperature parameter using an objective function that depends on, for each of the probability distributions, the temperature parameter, the entropy of the probability distribution, and the minimum expected entropy value.

13. The system of claim 12, wherein the minimum expected entropy value is based on a number of action dimensions in the actions in the possible set of actions.

14. The system of claim 13, wherein the minimum expected entropy value is a negative of the number of action dimensions.

15. The system of claim 12, wherein updating the temperature parameter comprises: determining, for each of the one or more tuples, a gradient with respect to the temperature parameter of the objective function; and updating the temperature parameter using the determined gradients.

16. The system of claim 15, wherein determining the gradient comprises, for each of the tuples: sampling an action from the probability distribution generated by the policy neural network for the tuple; and determining a difference between (i) a negative of a logarithm of the probability assigned to the sampled action by the probability distribution and (ii) the minimum expected entropy value.

17. The system of claim 12, wherein the objective function satisfies: J(α) = E a_t∼π_t [−α log π_t
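
The temperature update described in claims 6 to 8 can be sketched in plain Python as follows. For each tuple the sketch samples an action from the policy's distribution, forms the difference between the negative log-probability of that action and the minimum expected entropy value (the gradient of the claim 8 objective with respect to the temperature), and averages over the batch. Several choices here are assumptions, not statements from the claims: the discrete action set, the names `temperature_gradient` and `update_temperature`, the learning rate, the gradient-descent direction (minimizing the objective), and the clamp at zero.

```python
import math
import random


def temperature_gradient(action_probs_batch, min_entropy, rng):
    """Average per-tuple gradient of J(alpha) with respect to alpha.

    For each tuple: sample an action a from the distribution, then take
    (-log pi(a | s)) - min_entropy, as described in claim 7.
    """
    grads = []
    for probs in action_probs_batch:
        action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        grads.append(-math.log(probs[action]) - min_entropy)
    return sum(grads) / len(grads)


def update_temperature(alpha, action_probs_batch, min_entropy, lr=0.01, rng=None):
    """One gradient step on the temperature.

    Assumption: the objective is minimized by plain gradient descent, and
    the temperature is clamped to stay non-negative.
    """
    rng = rng or random.Random(0)
    grad = temperature_gradient(action_probs_batch, min_entropy, rng)
    return max(alpha - lr * grad, 0.0)


# A deterministic policy has entropy below the target, so alpha goes up.
raised = update_temperature(1.0, [[1.0, 0.0]], min_entropy=0.5, lr=0.1)

# A uniform policy over two actions exceeds the target, so alpha goes down.
lowered = update_temperature(1.0, [[0.5, 0.5]], min_entropy=0.5, lr=0.1)
```

Under these assumptions the temperature rises when the sampled entropy estimate falls below the minimum expected entropy value, which increases the weight of the entropy term in the policy update, and falls when the policy is already more random than required.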
Convolutional networks [CNN, ConvNet] · CPC title
Hyperparameter optimisation; Meta-learning; Learning-to-learn · CPC title
Reinforcement learning · CPC title
based on physical entities controlled by simulated intelligence so as to replicate intelligent life forms, e.g. based on robots replicating pets or humans in their appearance or behaviour · CPC title
Backpropagation, e.g. using gradient descent · CPC title