Who is the assignee on this patent?

Sony Corp, Sony Corp America, Sony Group Corp

What technology area does this patent fall under?

Primary CPC classification G06N7/01. Mapped technology areas include Physics.

When was this patent published?

Publication date Nov 14, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Reinforcement learning through a double actor critic algorithm

Patent metadata
Field	Value
Publication number	US-11816591-B2
Application number	US-202016800463-A
Country	US
Kind code	B2
Filing date	Feb 25, 2020
Priority date	Mar 20, 2019
Publication date	Nov 14, 2023
Grant date	Nov 14, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The Double Actor Critic (DAC) reinforcement-learning algorithm affords stable policy improvement and aggressive neural-net optimization without catastrophic overfitting of the policy. DAC trains models using an arbitrary history of data in both offline and online learning and can be used to smoothly improve on an existing policy learned or defined by some other means. Finally, DAC can optimize reinforcement learning problems with discrete and continuous action spaces.

First claim

Opening claim text (preview).

What is claimed is:

1. A reinforcement learning algorithm for an agent, the algorithm comprising: using an action-value model for training a policy model, the action-value model estimating, within one or more processors of the agent, an expected future discounted reward that would be received if a hypothetical action was selected under a current observation of the agent and the agent's behavior was followed thereafter; and maintaining a stale copy of both the action-value model and the policy model, wherein the stale copy is initialized identically to a fresh copy of a fresh action-value model and a fresh policy model and is slowly moved to match the fresh copy as learning updates are performed on the fresh copy, wherein the stale copy of the policy model acts as an old policy to be evaluated by the fresh action-value model; the stale copy of the action-value model provides Q-values of an earlier policy model on which the fresh policy model improves; and the algorithm has both an offline variant, in which the algorithm is trained using previously collected data, and an online variant, in which data is collected as the algorithm trains the policy model.

2. The algorithm of claim 1, wherein the action-value model estimates the expected future discounted reward, Q, as $Q(s,a) = \mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \mid s, a, \pi\right]$, where $r_t$ is a reward received at timestep $t$, $s$ is the current observation of an environment state, $a$ is the hypothetical action, $\pi$ is the policy model, and $\gamma$ is a discount factor in a domain $[0, 1)$ that defines how valued future rewards are to more immediate rewards.

3. The algorithm of claim 1, wherein: the stale model converges to a same convergence point as the fresh model, albeit the stale model, due to its slow movement toward the fresh model, reaches the same convergence point at a time later than that of the fresh model.

4. The algorithm of claim 1, wherein an output of the policy model, $\pi(s)$, for a given observation ($s$) of an environment state, are parameters of probability distributions over a domain of an action space.

5. The algorithm of claim 4, wherein, when the action space is a discrete action space, the parameters outputted are probability mass values.

6. The algorithm of claim 4, wherein, when the action space is a continuous n-dimensional action space, the parameters outputted are a mean and a covariance of a multivariate Gaussian distribution over the action space.

7. The algorithm of claim 1, wherein the offline variant includes an offline algorithm comprising: sampling minibatches of tuples from available data; computing a critic loss function, $L_Q$, and an actor loss function, $L_\pi$; differentiating each of the critic loss function and the actor loss function with respect to neural-net parameters; performing a stochastic gradient-descent-based update to the neural-net parameters; and updating the stale copy toward the fresh copy by a geometric coefficient.

8. The algorithm of claim 7, wherein: for a discrete-action case, a target for the critic loss function is computed exactly by marginalizing over a probability of each action selection by the stale policy model; and for a discrete action case, a target for the actor loss is computed exactly and a cross entropy loss function is used to make the policy model match the target.

9. The algorithm of claim 7, wherein, for a continuous-action case, targets of the critic loss function and the actor loss function are not computed exactly, where sampling from the policy model and the stale copy of the policy model are used to stochastically approximate the targets, where a variance from the sampling is smoothed by a stochastic gradient descent process.

10. The algorithm of claim 7, wherein a target of each of the critic loss function and the actor loss function is an optimal solution that minimizing the respective critic loss function and the actor loss function would produce.

11. The algorithm of claim 7, wherein a target ($T_Q$) of the critic loss function for a given reward, and resulting observation is a scalar value defined by the formula: $T_Q(r, s') \triangleq r + \gamma \, \mathbb{E}_{a' \sim \pi(s'; \phi')}\left[Q(s', a'; \theta')\right]$.

12. The algorithm of claim 7, wherein a target ($T_\pi$) of the actor loss function is a probability distribution over the Q-values from the stale copy of the action-value model in which a density for each action is defined as $T_\pi(s,a) \triangleq \dfrac{\exp\left(\frac{1}{\tau} Q(s,a;\theta')\right)}{\int_{a'} \exp\left(\frac{1}{\tau} Q(s,a';\theta')\right)}$, wherein $\tau$ is a temperature hyperparameter that defines how greedy a target distribution is towards a highest scoring Q-value, where as the temperature hyperparameter approaches zero, the probability distribut
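The discrete-action targets described in claims 8, 11, and 12, together with the stale-copy update of claim 7, can be illustrated with a small sketch. This is not the patented implementation: it uses plain Python lists in place of neural networks, all function names are hypothetical, and the stale update rule `stale = (1 - k) * stale + k * fresh` is only one assumed reading of "updating the stale copy toward the fresh copy by a geometric coefficient" (the name `k` is also an assumption).

```python
import math

def critic_target(reward, next_q_stale, next_pi_stale, gamma):
    """Discrete-case T_Q(r, s'): marginalize the stale Q-values at s'
    over the stale policy's action probabilities (claims 8 and 11)."""
    expected_q = sum(p * q for p, q in zip(next_pi_stale, next_q_stale))
    return reward + gamma * expected_q

def actor_target(q_stale, tau):
    """Discrete-case T_pi(s, .): softmax of stale Q-values divided by the
    temperature tau (claim 12). Subtracting the max is a standard
    numerical-stability step and does not change the result."""
    scaled = [q / tau for q in q_stale]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy(target, probs, eps=1e-12):
    """Cross entropy between the actor target and the fresh policy's
    probability mass values (claim 8). eps guards against log(0)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs))

def move_stale_toward_fresh(stale, fresh, k):
    """Assumed form of the stale-copy update: an exponential moving
    average with hypothetical coefficient k in (0, 1]."""
    return [(1.0 - k) * s + k * f for s, f in zip(stale, fresh)]

# Toy example with two actions.
t_q = critic_target(reward=1.0, next_q_stale=[1.0, 3.0],
                    next_pi_stale=[0.25, 0.75], gamma=0.5)
print(t_q)  # 2.25

t_pi = actor_target([1.0, 3.0], tau=1.0)
loss = cross_entropy(t_pi, [0.5, 0.5])
stale_params = move_stale_toward_fresh([0.0, 0.0], [1.0, 2.0], k=0.1)
```

Lowering `tau` in `actor_target` concentrates the target on the highest-scoring action, which matches the claim's description of the temperature controlling how greedy the target distribution is.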

Assignees

Sony Corp, Sony Corp America, Sony Group Corp

Inventors

MacGlashan, James

Classifications

Code	CPC title
G06N3/0499	Feedforward networks
G06N3/092	Reinforcement learning
G06N7/01 (Primary)	Probabilistic graphical models, e.g. probabilistic networks
G06N20/00	Machine learning
G06N3/084 (Primary)	Backpropagation, e.g. using gradient descent

Patent family

Related publications grouped by family.

Patent family ID: 72515874

Related patents

System and method for deep reinforcement learning

Reinforcement learning with auxiliary tasks

System monitoring

Methods and apparatus for reinforcement learning
