US2021237266A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2021237266-A1 |
| Application number | US-201917052679-A |
| Country | US |
| Kind code | A1 |
| Filing date | Jun 14, 2019 |
| Priority date | Jun 15, 2018 |
| Publication date | Aug 5, 2021 |
| Grant date | — |
Official abstract text for this publication.
Using large-scale reinforcement learning to train a policy model that can be utilized by a robot in performing a robotic task in which the robot interacts with one or more environmental objects. In various implementations, off-policy deep reinforcement learning is used to train the policy model, and the off-policy deep reinforcement learning is based on self-supervised data collection. The policy model can be a neural network model. Implementations of the reinforcement learning utilized in training the neural network model utilize a continuous-action variant of Q-learning. Through techniques disclosed herein, implementations can learn policies that generalize effectively to previously unseen objects, previously unseen environments, etc.
Opening claim text (preview).
1. A method of training a neural network model that represents a Q-function, the method implemented by a plurality of processors, and the method comprising: retrieving a robotic transition, the robotic transition generated based on data from an episode of a robot performing a robotic task, and the robotic transition including: state data that comprises vision data captured by a vision component at a state of the robot during the episode, next state data that comprises next vision data captured by the vision component at a next state of the robot during the episode, the next state being transitioned to from the state, an action executed to transition from the state to the next state, and a reward for the robotic transition; determining a target Q-value for the robotic transition, wherein determining the target Q-value comprises: performing an optimization over candidate robotic actions using, as an objective function, a version of a neural network model that represents the Q-function, wherein performing the optimization comprises generating Q-values for a subset of the candidate robotic actions that are considered in the optimization, wherein generating each of the Q-values is based on processing of the next state data and a corresponding one of the candidate robotic actions of the subset using the version of the neural network model, selecting, from the generated Q-values, a maximum Q-value, and determining the target Q-value based on the maximum Q-value and the reward; storing, in a training buffer: the state data, the action, and the target Q-value; retrieving, from the training buffer: the state data, the action, and the target Q-value; generating a predicted Q-value, wherein generating the predicted Q-value comprises processing the retrieved state data and the retrieved action using a current version of the neural network model, wherein the current version of the neural network model is updated relative to the version; generating a loss based on the predicted Q-value and the target Q-value; and updating the current version of the neural network model based on the loss.
2. The method of claim 1, wherein the robotic transition is generated based on offline data and is retrieved from an offline buffer.
3. The method of claim 2, wherein retrieving the robotic transition from the offline buffer is based on a dynamic offline sampling rate for sampling from the offline buffer, wherein the dynamic offline sampling rate decreases as a duration of training the neural network model increases.
4. The method of claim 3, further comprising generating the robotic transition by accessing an offline database that stores offline episodes.
5. The method of claim 1, wherein the robotic transition is generated based on online data and is retrieved from an online buffer, wherein the online data is generated by a robot performing episodes of the robotic task using a robot version of the neural network model.
6. The method of claim 5, wherein retrieving the robotic transition from the online buffer is based on a dynamic online sampling rate for sampling from the online buffer, wherein the dynamic online sampling rate increases as a duration of training the neural network model increases.
7. The method of claim 5, further comprising updating the robot version of the neural network model based on the loss.
8. The method of claim 1, wherein the action comprises a pose change for a component of the robot, wherein the pose change defines a difference between a pose of the component at the state and a next pose of the component at the next state.
9. The method of claim 8, wherein the component is an end effector and the pose change defines a translation difference for the end effector and a rotation difference for the end effector.
10. The method of claim 9, wherein the end effector is a gripper and the robotic task is a grasping task.
11. The method of claim 8, wherein the action further comprises a termination command when the next state is a terminal state of the episode.
12. The method of claim 8, wherein the action further comprises a component action command that defines a dynamic state, of the component, in the next state of the episode the dynamic state being in addition to translation and rotation of the component.
13. The method of claim 12, wherein the component is a gripper and wherein the dynamic state dictated by the component action command indicates that the gripper is to be closed.
14. The method of claim 1, wherein the state data further comprises a current status of a component of the robot.
15. The method of claim 14, wherein the component of the robot is a gripper and the current status indicates whether the gripper is opened or closed.
16. The method of claim 1, wherein the optimization is a stochastic optimization or is a cross-entropy method (CEM).
17. The method of claim 1, wherein performing the optimization over the candidate robotic actions comprises: selecting an initial batch of the candidate robotic actions; generating a corresponding one of the Q-values for each of the candidate robotic actions in the initial batch; selecting an initial subset of the candidate robotic actions in the initial batch based on the Q-values for the candidate robotic actions in the initial batch; fitting a Gaussian distribution to the selected initial subset of the candidate robotic actions; selecting a next batch of the candidate robotic actions based on the Gaussian distribution; and generating a corresponding one of the Q-values for each of the candidate robotic actions in the next batch.
18. The method of claim 17, wherein the maximum Q-value is one of the Q-values of the candidate robotic actions in the next batch and wherein selecting the maximum Q-value is based on the maximum Q-value being the maximum Q-value of the corresponding Q-values of the next batch.
19. A method of training a neural network model that represents a policy, the method implemented by a plurality of processors, and the method comprising: retrieving a robotic transition, the robotic transition generated based on data from an episode of a robot performing a robotic task, and the robotic transition including state data and an action; determining a target value for the robotic transition, wherein determining the target value comprises: performing an optimization over candidate robotic actions using, as an objective function, a version of a neural network model that represents the policy; storing, in a training buffer: the state data, the action, and the target value; retrieving, from the training buffer: the state data, the action data, and the target value; generating a predicted value, wherein generating the predicted value comprises processing the retrieved state data and the retrieved action data using a current version of the neural network model, wherein the current version of the neural network model is updated relative to the version; generating a loss based on the predicted value and the target value; and updating the current version of the neural network model based on the loss.
20. A method implemented by one or more processors of a robot during performance of a robotic task, the method comprising: receiving current state data for the robot, the current state data comprising current sensor data of the robot; selecting a robotic action to be performed for the robotic task, wherein selecting the robotic action comprises: performing an optimization over
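Illustrative note (not part of the publication): the target Q-value step of claim 1 and the batch-and-refit optimization of claims 17 and 18 can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the applicant's implementation. The function names `cem_max_q` and `target_q_value`, the uniform initial sampling range, the batch and subset sizes, the number of refinement rounds, the discount factor, and the toy Q-function are all hypothetical choices made for the example. The claims say only that the target is determined "based on the maximum Q-value and the reward".

```python
import random
import statistics


def cem_max_q(q_fn, next_state, action_dim, rounds=2,
              batch_size=64, elite_size=6, seed=0):
    """Approximate max over actions of q_fn(next_state, action), in the style of claim 17."""
    rng = random.Random(seed)
    # Initial batch of candidate actions (assumed: uniform in [-1, 1] per dimension).
    batch = [[rng.uniform(-1.0, 1.0) for _ in range(action_dim)]
             for _ in range(batch_size)]
    q_values = [q_fn(next_state, a) for a in batch]
    for _ in range(rounds):
        # Select the subset of candidates with the highest Q-values.
        ranked = sorted(zip(q_values, batch), key=lambda pair: pair[0], reverse=True)
        elites = [action for _, action in ranked[:elite_size]]
        # Fit an axis-aligned Gaussian to that subset.
        mean = [statistics.fmean(dim) for dim in zip(*elites)]
        std = [statistics.pstdev(dim) + 1e-6 for dim in zip(*elites)]
        # Sample the next batch from the fitted Gaussian and score it.
        batch = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                 for _ in range(batch_size)]
        q_values = [q_fn(next_state, a) for a in batch]
    # As in claim 18, the maximum is taken over the final ("next") batch.
    return max(q_values)


def target_q_value(q_fn, transition, discount=0.9):
    """Combine reward and maximum next-state Q-value (discounting is an assumption)."""
    max_next_q = cem_max_q(q_fn, transition["next_state"], len(transition["action"]))
    return transition["reward"] + discount * max_next_q


def toy_q(state, action):
    # Hypothetical stand-in for the neural network: peaks at action = (0.5, 0.5).
    return -sum((x - 0.5) ** 2 for x in action)


transition = {"state": None, "next_state": None, "action": [0.0, 0.0], "reward": 1.0}
target = target_q_value(toy_q, transition)
# A squared-error loss between a predicted Q-value and the stored target.
loss = (toy_q(transition["state"], transition["action"]) - target) ** 2
```

In a real setting `q_fn` would wrap a forward pass of the lagged network version on vision data. The sketch also ignores terminal states, for which the next-state term would normally be dropped.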
Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title
Reinforcement learning · CPC title
Distributed learning, e.g. federated learning · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Machine learning · CPC title