Controlling robots using entropy constraints

US2022019866A1 · US · A1

Patent metadata
Field                Value
Publication number   US-2022019866-A1
Application number   US-201917297902-A
Country              US
Kind code            A1
Filing date          Dec 2, 2019
Priority date        Nov 30, 2018
Publication date     Jan 20, 2022
Grant date

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for training a policy neural network having policy parameters. One of the methods includes obtaining trajectory data comprising one or more tuples; updating, using the trajectory data, current values of the policy parameters using a maximum entropy reinforcement learning technique that maximizes both (i) a reward term and (ii) an entropy term, wherein a relative weight between the entropy term and the reward term in the maximization is determined by a temperature parameter; and updating, using the probability distributions defined by the policy outputs generated in accordance with the current values of the policy parameters for the tuples in the trajectory data, the temperature parameter to regulate an expected entropy of the probability distributions to at least equal a minimum expected entropy value.
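
The combined objective described in the abstract can be written compactly. The form below is a common way to state maximum entropy reinforcement learning with an entropy floor; it is a sketch under assumptions, not a formula quoted from this publication. Here r is the reward, α is the temperature parameter, 𝓗 denotes the entropy of the action distribution, and H is the minimum expected entropy value. The per-step sum and the exact form of the constraint are assumptions.

```latex
\max_{\pi}\; \mathbb{E}_{(s_t, a_t) \sim \pi}\left[ \sum_t r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]
\quad \text{with } \alpha \text{ adjusted so that} \quad
\mathbb{E}_{(s_t, a_t) \sim \pi}\left[ -\log \pi(a_t \mid s_t) \right] \ge H
```

A larger α weights the entropy term more heavily relative to the reward term, which pushes the policy toward more random actions.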

First claim

Opening claim text (preview).

1. A method of training a policy neural network having a plurality of policy parameters and used to control a robot interacting with an environment, wherein the policy neural network is configured to receive as input a state representation characterizing a state of the environment and to process the state representation in accordance with the policy parameters to generate a policy output that defines a probability distribution over a set of actions that can be performed by the robot, the method comprising:

    obtaining trajectory data comprising one or more tuples, each tuple identifying a state representation characterizing a state of the environment, an action performed by the robot when the environment was in the state characterized by the state representation, a reward received in response to the robot performing the action, and a next state representation characterizing a next state of the environment after the robot performed the action;

    updating, using the trajectory data, current values of the policy parameters using a maximum entropy reinforcement learning technique that maximizes both (i) a reward term that measures total rewards in the tuples in the trajectory data and (ii) an entropy term that measures an entropy of probability distributions defined by policy outputs generated by processing the state representations in the tuples in the trajectory data in accordance with the current values of the policy parameters, wherein a relative weight between the entropy term and the reward term in the maximization is determined by a temperature parameter; and

    updating, using the probability distributions defined by the policy outputs generated in accordance with the current values of the policy parameters for the tuples in the trajectory data, the temperature parameter to regulate an expected entropy of the probability distributions to at least equal a minimum expected entropy value.

2. The method of claim 1, further comprising: controlling the robot using the policy neural network and in accordance with the updated values of the policy parameters.

3. The method of claim 1, wherein updating the temperature parameter comprises updating the temperature parameter using an objective function that depends on, for each of the probability distributions, the temperature parameter, the entropy of the probability distribution, and the minimum expected entropy value.

4. The method of claim 3, wherein the minimum expected entropy value is based on a number of action dimensions in the actions in the possible set of actions.

5. The method of claim 4, wherein the minimum expected entropy value is a negative of the number of action dimensions.

6. The method of claim 3, wherein updating the temperature parameter comprises: determining, for each of the one [or] more tuples, a gradient with respect to the temperature parameter of the objective function; and updating the temperature parameter using the determined gradients.

7. The method of claim 6, wherein determining the gradient comprises, for each of the tuples: sampling an action from the probability distribution generated by the policy neural network for the tuple; and determining a difference between (i) a negative of a logarithm of the probability assigned to the sampled action by the probability distribution and (ii) the minimum expected entropy value.

8. The method of claim 3, wherein the objective function satisfies: $J(\alpha) = \mathbb{E}_{a_t \sim \pi_t}\left[-\alpha \log \pi_t(a_t \mid s_t) - \alpha H\right]$, where $\alpha$ is the temperature parameter, $\mathbb{E}$ is the expectation operator, $a_t$ is an action sampled from the probability distribution generated by the policy neural network $\pi_t$ for a $t$-th tuple by processing the state representation $s_t$ in the tuple in accordance with the current values of the policy parameters, and $H$ is the minimum expected entropy value.

9. (canceled)

10. A system comprising one or more computers and one or more storage devices storing instructions that when executed by one or more computers cause the one or more computers to perform operations for training a policy neural network having a plurality of policy parameters and used to control a robot interacting with an environment, wherein the policy neural network is configured to receive as input a state representation characterizing a state of the environment and to process the state representation in accordance with the policy parameters to generate a policy output that defines a probability distribution over a set of actions that can be performed by the robot, the operations comprising:

    obtaining trajectory data comprising one or more tuples, each tuple identifying a state representation characterizing a state of the environment, an action performed by the robot when the environment was in the state characterized by the state representation, a reward received in response to the robot performing the action, and a next state representation characterizing a next state of the environment after the robot performed the action;

    updating, using the trajectory data, current values of the policy parameters using a maximum entropy reinforcement learning technique that maximizes both (i) a reward term that measures total rewards in the tuples in the trajectory data and (ii) an entropy term that measures an entropy of probability distributions defined by policy outputs generated by processing the state representations in the tuples in the trajectory data in accordance with the current values of the policy parameters, wherein a relative weight between the entropy term and the reward term in the maximization is determined by a temperature parameter; and

    updating, using the probability distributions defined by the policy outputs generated in accordance with the current values of the policy parameters for the tuples in the trajectory data, the temperature parameter to regulate an expected entropy of the probability distributions to at least equal a minimum expected entropy value.

11. The system of claim 10, the operations further comprising: controlling the robot using the policy neural network and in accordance with the updated values of the policy parameters.

12. The system of claim 10, wherein updating the temperature parameter comprises updating the temperature parameter using an objective function that depends on, for each of the probability distributions, the temperature parameter, the entropy of the probability distribution, and the minimum expected entropy value.

13. The system of claim 12, wherein the minimum expected entropy value is based on a number of action dimensions in the actions in the possible set of actions.

14. The system of claim 13, wherein the minimum expected entropy value is a negative of the number of action dimensions.

15. The system of claim 12, wherein updating the temperature parameter comprises: determining, for each of the one [or] more tuples, a gradient with respect to the temperature parameter of the objective function; and updating the temperature parameter using the determined gradients.

16. The system of claim 15, wherein determining the gradient comprises, for each of the tuples: sampling an action from the probability distribution generated by the policy neural network for the tuple; and determining a difference between (i) a negative of a logarithm of the probability assigned to the sampled action by the probability distribution and (ii) the minimum expected entropy value.

17. The system of claim 12, wherein the objective function satisfies: $J(\alpha) = \mathbb{E}_{a_t \sim \pi_t}[-\alpha \log \pi_t$ …
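
The temperature update in the dependent claims (sample an action, take the negative log-probability, subtract the minimum expected entropy value, then step the temperature) can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The function names, the diagonal Gaussian policy, the plain gradient-descent step, the learning rate, and the clamp at zero are all assumptions made for illustration; only the per-tuple difference and the choice of the negative of the number of action dimensions come from the claim text.

```python
import math
import random


def sample_with_log_prob(mean, std, rng):
    """Sample an action from a diagonal Gaussian (a stand-in policy, assumed
    here) and return the action together with its log-probability density."""
    action = [rng.gauss(m, s) for m, s in zip(mean, std)]
    log_prob = sum(
        -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
        for a, m, s in zip(action, mean, std)
    )
    return action, log_prob


def temperature_gradient(log_probs, min_entropy):
    """Estimate dJ/d(alpha) for J(alpha) = E[-alpha * log pi(a|s) - alpha * H].

    For each tuple the term is (-log pi(a|s)) - H, i.e. the difference between
    the negative log-probability of the sampled action and the minimum
    expected entropy value; the estimate averages these over the tuples.
    """
    diffs = [-lp - min_entropy for lp in log_probs]
    return sum(diffs) / len(diffs)


def update_temperature(alpha, log_probs, min_entropy, lr=0.01):
    """One gradient-descent step on J(alpha).

    Descent direction, learning rate, and the clamp at zero are assumptions.
    """
    grad = temperature_gradient(log_probs, min_entropy)
    return max(alpha - lr * grad, 0.0)


# Usage: a hypothetical 2-dimensional action space, so the minimum expected
# entropy value is set to the negative of the number of action dimensions.
rng = random.Random(0)
num_action_dims = 2
min_entropy = -float(num_action_dims)
alpha = 1.0

# A narrow policy (std 0.05 per dimension) has entropy below the floor.
log_probs = [
    sample_with_log_prob([0.0, 0.0], [0.05, 0.05], rng)[1] for _ in range(256)
]
new_alpha = update_temperature(alpha, log_probs, min_entropy)
print(alpha, new_alpha)
```

Under these assumptions, when the sampled negative log-probabilities average below the minimum expected entropy value, the gradient is negative and the step raises the temperature, which increases the weight on the entropy term in the policy update. When the average is above the floor, the temperature falls.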

Assignees

Google LLC

Inventors

Classifications

  • Convolutional networks [CNN, ConvNet] · CPC title

  • Hyperparameter optimisation; Meta-learning; Learning-to-learn · CPC title

  • Reinforcement learning · CPC title

  • G06N3/008 (Primary)

    based on physical entities controlled by simulated intelligence so as to replicate intelligent life forms, e.g. based on robots replicating pets or humans in their appearance or behaviour · CPC title

  • G06N3/084 (Primary)

    Backpropagation, e.g. using gradient descent · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2022019866A1 cover?
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for training a policy neural network having policy parameters. One of the methods includes obtaining trajectory data comprising one or more tuples; updating, using the trajectory data, current values of the policy parameters using a maximum entropy reinforcement learning technique that maximizes both…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G06N3/008. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 20, 2022 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).