System and method for deep reinforcement learning
US-2020143206-A1 · May 7, 2020 · US
US11816591B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11816591-B2 |
| Application number | US-202016800463-A |
| Country | US |
| Kind code | B2 |
| Filing date | Feb 25, 2020 |
| Priority date | Mar 20, 2019 |
| Publication date | Nov 14, 2023 |
| Grant date | Nov 14, 2023 |
Official abstract text for this publication.
The Double Actor Critic (DAC) reinforcement-learning algorithm affords stable policy improvement and aggressive neural-net optimization without catastrophic overfitting of the policy. DAC trains models using an arbitrary history of data in both offline and online learning and can be used to smoothly improve on an existing policy learned or defined by some other means. Finally, DAC can optimize reinforcement learning problems with discrete and continuous action spaces.
Opening claim text (preview).
What is claimed is:

1. A reinforcement learning algorithm for an agent, the algorithm comprising: using an action-value model for training a policy model, the action-value model estimating, within one or more processors of the agent, an expected future discounted reward that would be received if a hypothetical action was selected under a current observation of the agent and the agent's behavior was followed thereafter; and maintaining a stale copy of both the action-value model and the policy model, wherein the stale copy is initialized identically to a fresh copy of a fresh action-value model and a fresh policy model and is slowly moved to match the fresh copy as learning updates are performed on the fresh copy, wherein the stale copy of the policy model acts as an old policy to be evaluated by the fresh action-value model; the stale copy of the action-value model provides Q-values of an earlier policy model on which the fresh policy model improves; and the algorithm has both an offline variant, in which the algorithm is trained using previously collected data, and an online variant, in which data is collected as the algorithm trains the policy model.

2. The algorithm of claim 1, wherein the action-value model estimates the expected future discounted reward, $Q$, as

$$Q(s, a) = \mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, s, a, \pi\right],$$

where $r_t$ is a reward received at timestep $t$, $s$ is the current observation of an environment state, $a$ is the hypothetical action, $\pi$ is the policy model, and $\gamma$ is a discount factor in a domain $[0, 1)$ that defines how valued future rewards are to more immediate rewards.

3. The algorithm of claim 1, wherein: the stale model converges to a same convergence point as the fresh model, albeit the stale model, due to its slow movement toward the fresh model, reaches the same convergence point at a time later than that of the fresh model.

4. The algorithm of claim 1, wherein an output of the policy model, $\pi(s)$, for a given observation ($s$) of an environment state, are parameters of probability distributions over a domain of an action space.

5. The algorithm of claim 4, wherein, when the action space is a discrete action space, the parameters outputted are probability mass values.

6. The algorithm of claim 4, wherein, when the action space is a continuous n-dimensional action space, the parameters outputted are a mean and a covariance of a multivariate Gaussian distribution over the action space.

7. The algorithm of claim 1, wherein the offline variant includes an offline algorithm comprising: sampling minibatches of tuples from available data; computing a critic loss function, $L_Q$, and an actor loss function, $L_\pi$; differentiating each of the critic loss function and the actor loss function with respect to neural-net parameters; performing a stochastic gradient-descent-based update to the neural-net parameters; and updating the stale copy toward the fresh copy by a geometric coefficient.

8. The algorithm of claim 7, wherein: for a discrete-action case, a target for the critic loss function is computed exactly by marginalizing over a probability of each action selection by the stale policy model; and for a discrete action case, a target for the actor loss is computed exactly and a cross entropy loss function is used to make the policy model match the target.

9. The algorithm of claim 7, wherein, for a continuous-action case, targets of the critic loss function and the actor loss function are not computed exactly, where sampling from the policy model and the stale copy of the policy model are used to stochastically approximate the targets, where a variance from the sampling is smoothed by a stochastic gradient descent process.

10. The algorithm of claim 7, wherein a target of each of the critic loss function and the actor loss function is an optimal solution that minimizing the respective critic loss function and the actor loss function would produce.

11. The algorithm of claim 7, wherein a target ($T_Q$) of the critic loss function for a given reward, and resulting observation is a scalar value defined by the formula

$$T_Q(r, s') = r + \gamma \, \mathbb{E}_{a' \sim \pi(s'; \phi')}\left[Q(s', a'; \theta')\right].$$

12. The algorithm of claim 7, wherein a target ($T_\pi$) of the actor loss function is a probability distribution over the Q-values from the stale copy of the action-value model in which a density for each action is defined as

$$T_\pi(s, a) \triangleq \frac{\exp\left(\frac{1}{\tau} Q(s, a; \theta')\right)}{\int_{a'} \exp\left(\frac{1}{\tau} Q(s, a'; \theta')\right)},$$

wherein $\tau$ is a temperature hyperparameter that defines how greedy a target distribution is towards a highest scoring Q-value, where as the temperature hyperparameter approaches zero, the probability distribut
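
The discrete-action quantities named in claims 7, 8, 11, and 12 can be illustrated with a short NumPy sketch. This is an illustration under stated assumptions, not the patentee's implementation: the function names, the array layout (one row per sample, one column per action), the default values of `gamma`, `tau`, and `k`, and the linear-interpolation form of the stale-copy update are all hypothetical. The claims say only that the stale copy is moved toward the fresh copy "by a geometric coefficient". For discrete actions the integral in the claim 12 denominator becomes a sum over actions.

```python
import numpy as np


def critic_target(r, q_stale_next, pi_stale_next, gamma=0.99):
    """Critic target of claim 11, computed exactly for discrete actions.

    r             : rewards, shape (batch,)
    q_stale_next  : stale Q-values Q(s', a'; theta'), shape (batch, n_actions)
    pi_stale_next : stale policy probabilities pi(a' | s'; phi'), same shape
    The expectation is a marginalization over the stale policy (claim 8).
    """
    expected_q = np.sum(pi_stale_next * q_stale_next, axis=-1)
    return r + gamma * expected_q


def actor_target(q_stale, tau=1.0):
    """Actor target of claim 12: Boltzmann distribution over stale Q-values.

    Subtracting the row maximum does not change the result; it only avoids
    overflow in exp() when tau is small.
    """
    z = (q_stale - q_stale.max(axis=-1, keepdims=True)) / tau
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def cross_entropy(target, probs, eps=1e-12):
    """Mean cross entropy between the actor target and the fresh policy output."""
    return -np.sum(target * np.log(probs + eps), axis=-1).mean()


def soft_update(stale, fresh, k=0.01):
    """Assumed form of the stale-copy update: move a fraction k toward fresh."""
    return (1.0 - k) * stale + k * fresh


if __name__ == "__main__":
    q = np.array([[1.0, 2.0, 0.5]])      # stale Q-values for one observation
    pi = np.array([[0.2, 0.5, 0.3]])     # stale policy probabilities
    print("critic target:", critic_target(np.array([1.0]), q, pi))
    print("actor target :", actor_target(q, tau=0.5))
    print("actor loss   :", cross_entropy(actor_target(q, tau=0.5), pi))
```

Lowering `tau` concentrates the actor target on the highest-scoring action, which matches the claim's description of the temperature as controlling how greedy the target distribution is. In the continuous-action case of claim 9, the exact sums above would be replaced by sample averages drawn from the policy models.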