Reinforcement learning through a double actor critic algorithm

US11816591B2 · US · B2

Patent metadata
Publication number: US-11816591-B2
Application number: US-202016800463-A
Country: US
Kind code: B2
Filing date: Feb 25, 2020
Priority date: Mar 20, 2019
Publication date: Nov 14, 2023
Grant date: Nov 14, 2023

Abstract

The Double Actor Critic (DAC) reinforcement-learning algorithm affords stable policy improvement and aggressive neural-net optimization without catastrophic overfitting of the policy. DAC trains models using an arbitrary history of data in both offline and online learning and can be used to smoothly improve on an existing policy learned or defined by some other means. Finally, DAC can optimize reinforcement learning problems with discrete and continuous action spaces.

First claim

Opening claim text (preview).

What is claimed is:

1. A reinforcement learning algorithm for an agent, the algorithm comprising: using an action-value model for training a policy model, the action-value model estimating, within one or more processors of the agent, an expected future discounted reward that would be received if a hypothetical action was selected under a current observation of the agent and the agent's behavior was followed thereafter; and maintaining a stale copy of both the action-value model and the policy model, wherein the stale copy is initialized identically to a fresh copy of a fresh action-value model and a fresh policy model and is slowly moved to match the fresh copy as learning updates are performed on the fresh copy, wherein the stale copy of the policy model acts as an old policy to be evaluated by the fresh action-value model; the stale copy of the action-value model provides Q-values of an earlier policy model on which the fresh policy model improves; and the algorithm has both an offline variant, in which the algorithm is trained using previously collected data, and an online variant, in which data is collected as the algorithm trains the policy model.

2. The algorithm of claim 1, wherein the action-value model estimates the expected future discounted reward, Q, as $Q(s,a) = \mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \mid s, a, \pi\right]$, where $r_t$ is a reward received at timestep $t$, $s$ is the current observation of an environment state, $a$ is the hypothetical action, $\pi$ is the policy model, and $\gamma$ is a discount factor in a domain $[0, 1)$ that defines how valued future rewards are to more immediate rewards.

3. The algorithm of claim 1, wherein: the stale model converges to a same convergence point as the fresh model, albeit the stale model, due to its slow movement toward the fresh model, reaches the same convergence point at a time later than that of the fresh model.

4. The algorithm of claim 1, wherein an output of the policy model, $\pi(s)$, for a given observation ($s$) of an environment state, are parameters of probability distributions over a domain of an action space.

5. The algorithm of claim 4, wherein, when the action space is a discrete action space, the parameters outputted are probability mass values.

6. The algorithm of claim 4, wherein, when the action space is a continuous n-dimensional action space, the parameters outputted are a mean and a covariance of a multivariate Gaussian distribution over the action space.

7. The algorithm of claim 1, wherein the offline variant includes an offline algorithm comprising: sampling minibatches of tuples from available data; computing a critic loss function, $L_Q$, and an actor loss function, $L_\pi$; differentiating each of the critic loss function and the actor loss function with respect to neural-net parameters; performing a stochastic gradient-descent-based update to the neural-net parameters; and updating the stale copy toward the fresh copy by a geometric coefficient.

8. The algorithm of claim 7, wherein: for a discrete-action case, a target for the critic loss function is computed exactly by marginalizing over a probability of each action selection by the stale policy model; and for a discrete action case, a target for the actor loss is computed exactly and a cross entropy loss function is used to make the policy model match the target.

9. The algorithm of claim 7, wherein, for a continuous-action case, targets of the critic loss function and the actor loss function are not computed exactly, where sampling from the policy model and the stale copy of the policy model are used to stochastically approximate the targets, where a variance from the sampling is smoothed by a stochastic gradient descent process.

10. The algorithm of claim 7, wherein a target of each of the critic loss function and the actor loss function is an optimal solution that minimizing the respective critic loss function and the actor loss function would produce.

11. The algorithm of claim 7, wherein a target ($T_Q$) of the critic loss function for a given reward, and resulting observation is a scalar value defined by the formula $T_Q(r, s') \triangleq r + \gamma\, \mathbb{E}_{a' \sim \pi(s'; \phi')}\left[Q(s', a'; \theta')\right]$.

12. The algorithm of claim 7, wherein a target ($T_\pi$) of the actor loss function is a probability distribution over the Q-values from the stale copy of the action-value model in which a density for each action is defined as $T_\pi(s, a) \triangleq \frac{\exp\left(\frac{1}{\tau} Q(s, a; \theta')\right)}{\int_{a'} \exp\left(\frac{1}{\tau} Q(s, a'; \theta')\right)}$, wherein $\tau$ is a temperature hyperparameter that defines how greedy a target distribution is towards a highest scoring Q-value, where as the temperature hyperparameter approaches zero, the probability distribut…
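
To make the discrete-action targets in claims 7, 8, 11, and 12 more concrete, the following is a minimal sketch in plain Python. It is not the patent's implementation: the function names, the list-based representation of Q-values and probabilities, and the coefficient name k are all assumptions chosen for illustration, and the neural networks and gradient steps are left out entirely.

```python
import math


def critic_target(reward, stale_next_q, stale_next_probs, gamma):
    """Critic target for one transition in the discrete case.

    Computes r + gamma * sum_a' pi_stale(a'|s') * Q_stale(s', a'),
    i.e. the expectation is marginalized exactly over the stale policy.
    All argument names are hypothetical.
    """
    expected_q = sum(p * q for p, q in zip(stale_next_probs, stale_next_q))
    return reward + gamma * expected_q


def actor_target(stale_q, tau):
    """Boltzmann (softmax) distribution over stale Q-values.

    tau is the temperature; the maximum is subtracted before
    exponentiating purely for numerical stability.
    """
    scaled = [q / tau for q in stale_q]
    top = max(scaled)
    weights = [math.exp(x - top) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def cross_entropy(target_probs, policy_probs):
    """Cross entropy between the actor target and the policy output."""
    return -sum(
        t * math.log(p)
        for t, p in zip(target_probs, policy_probs)
        if t > 0.0
    )


def move_stale_toward_fresh(stale_params, fresh_params, k):
    """Geometric update: stale <- (1 - k) * stale + k * fresh.

    k is an assumed name for the geometric coefficient, with 0 < k <= 1.
    """
    return [(1.0 - k) * s + k * f for s, f in zip(stale_params, fresh_params)]


if __name__ == "__main__":
    print(critic_target(1.0, [1.0, 3.0], [0.5, 0.5], gamma=0.9))
    print(actor_target([1.0, 2.0, 3.0], tau=0.5))
    print(move_stale_toward_fresh([0.0, 1.0], [1.0, 1.0], k=0.1))
```

In this sketch a smaller tau concentrates the softmax on the highest Q-value, which is a general property of the Boltzmann distribution. In a full training loop these targets would feed the critic and actor losses, and the stale parameters would be nudged toward the fresh ones after each gradient update.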

Assignees

Sony Corp, Sony Corp America, Sony Group Corp

Inventors

Classifications

  • Feedforward networks · CPC title

  • Reinforcement learning · CPC title

  • G06N7/01 · Primary

    Probabilistic graphical models, e.g. probabilistic networks · CPC title

  • Machine learning · CPC title

  • G06N3/084 · Primary

    Backpropagation, e.g. using gradient descent · CPC title

Frequently asked questions

What does patent US11816591B2 cover?
The Double Actor Critic (DAC) reinforcement-learning algorithm affords stable policy improvement and aggressive neural-net optimization without catastrophic overfitting of the policy. DAC trains models using an arbitrary history of data in both offline and online learning and can be used to smoothly improve on an existing policy learned or defined by some other means. Finally, DAC can optimize …
Who is the assignee on this patent?
Sony Corp, Sony Corp America, Sony Group Corp
What technology area does this patent fall under?
Primary CPC classification G06N7/01. Mapped technology areas include Physics.
When was this patent published?
Publication date: Nov 14, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).