What technology area does this patent fall under?

Primary CPC classification G06N3/006. Mapped technology areas include Physics.

When was this patent published?

Publication date: Apr 19, 2022 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Interactive reinforcement learning with dynamic reuse of prior knowledge

Patent metadata
Field	Value
Publication number	US-11308401-B2
Application number	US-201916263930-A
Country	US
Kind code	B2
Filing date	Jan 31, 2019
Priority date	Jan 31, 2018
Publication date	Apr 19, 2022
Grant date	Apr 19, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems, methods, and computer readable media directed to interactive reinforcement learning with dynamic reuse of prior knowledge are described in various embodiments. The interactive reinforcement learning is adapted for providing computer implemented systems for dynamic action selection based on confidence levels associated with demonstrator data or portions thereof.
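As a rough illustration of "dynamic action selection based on confidence levels", the sketch below picks, for a given state, whichever action source currently has the highest confidence value. The selection rule itself (highest confidence wins, ties go to the agent's own policy), the per-source confidence tables, and every name used (`select_action_source`, `internal_policy`, `demo_A`) are assumptions made for illustration; the abstract only says that selection is based on confidence levels, and it does not say whether the internal policy carries a confidence value of its own.

```python
def select_action_source(state, confidence_by_source, default="internal_policy"):
    """Return the name of the action source with the highest confidence at `state`.

    confidence_by_source maps a source name to a {state: confidence} table.
    Unknown states count as confidence 0.0. Ties keep the default source.
    (Hypothetical rule: the patent text does not spell out the comparison.)
    """
    best_source = default
    best_conf = confidence_by_source.get(default, {}).get(state, 0.0)
    for source, table in confidence_by_source.items():
        conf = table.get(state, 0.0)
        if conf > best_conf:  # strict comparison, so ties stay with the default
            best_source, best_conf = source, conf
    return best_source


# Hypothetical confidence tables for the agent's own policy and one demonstrator.
confidence = {
    "internal_policy": {"s0": 0.2, "s1": 0.4},
    "demo_A": {"s0": 0.7, "s1": 0.1},
}
chosen = [select_action_source(s, confidence) for s in ("s0", "s1", "s2")]
```

In this toy setup the demonstrator is followed only in the state where its confidence exceeds that of the agent's own policy; a state that neither table has seen falls back to the default source.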

First claim

Opening claim text (preview).

What is claimed is:

1. A system for biasing a machine learning architecture using one or more demonstrator data sets, the machine learning architecture for controlling one or more actions conducted by an agent in an environment which transitions between one or more states, the system comprising:
a physical computer processor operating in conjunction with computer memory and computer storage, the processor configured to provide:
a receiver configured to obtain one or more demonstrator data sets, each demonstrator data set including a data structure representing one or more state-action pairs observed in one or more interactions with the environment;
a data storage configured to maintain, for each demonstrator data set or sub-portions thereof, one or more confidence data values, associated with at least one state of the one or more states;
a supervised classifier for training using the one or more demonstrator data sets or sub-portions thereof;
an action execution processor configured to generate control signals for executing an action associated with an action-source selected from at least one of the one or more demonstrator data sets based on the supervised classifier or an internal policy function maintained by the machine learning architecture, the selecting based at least upon the one or more confidence data values; and
a state observer configured to monitor a new state resulting from the execution of the action and an associated reward outcome; and to update the internal policy function maintained by the machine learning architecture based at least on the observed reward outcome;
wherein the one or more confidence data values are generated using a dynamic temporal difference confidence measurement; and
wherein the dynamic temporal difference confidence measurement is based on the relation:

$$C(s) \leftarrow (1 - F(\alpha)) \times C(s) + F(\alpha) \times [F(r) + \gamma \times C(s')]$$

where γ is a discount factor, r is a reward function, and α is an update parameter.

2. The system of claim 1, wherein the state observer is configured to update at least one of the confidence data values of the one or more confidence data values based on the observed reward outcome.

3. The system of claim 1, wherein the temporal difference confidence measurement includes a dynamic rate update function based on the relation:

$$F(\alpha) = \alpha \times \max\left\{ \frac{1}{\sum_i \exp(\theta_i^T \cdot x)} \begin{bmatrix} \exp(\theta_1^T \cdot x) \\ \exp(\theta_2^T \cdot x) \\ \vdots \\ \exp(\theta_i^T \cdot x) \end{bmatrix} \right\}.$$

4. The system of claim 1, wherein the temporal difference confidence measurement includes a dynamic confidence update function based on the relation: F(r) = (r / r_max) × max{ (1 / Σ_i exp …
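The confidence update in claim 1 and the dynamic rate in claim 3 can be sketched in a few lines. This is a minimal illustration, not the patented implementation: reading x as a feature vector and each θ_i as a per-class weight vector (so that the bracketed term is a softmax and F(α) is α times its largest entry) is an assumption, since the preview does not define those symbols. Because the relation for F(r) in claim 4 is cut off in the preview, the sketch takes F(r) as an input (`f_r`) instead of computing it. The function names and the default confidence of 0.0 for unseen states are hypothetical.

```python
import math


def max_softmax(theta, x):
    """Largest softmax probability for logits theta_i . x (assumed reading of claim 3)."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in theta]
    shift = max(logits)  # subtracting the max leaves the softmax unchanged but avoids overflow
    exps = [math.exp(z - shift) for z in logits]
    return max(exps) / sum(exps)


def dynamic_rate(alpha, theta, x):
    """F(alpha) = alpha * max(softmax(theta . x))."""
    return alpha * max_softmax(theta, x)


def update_confidence(C, s, s_next, f_r, f_alpha, gamma):
    """C(s) <- (1 - F(alpha)) * C(s) + F(alpha) * [F(r) + gamma * C(s')].

    C is a dict from state to confidence; unseen states default to 0.0
    (a hypothetical choice). f_r stands in for F(r), which the truncated
    claim preview does not fully specify.
    """
    old = C.get(s, 0.0)
    target = f_r + gamma * C.get(s_next, 0.0)
    C[s] = (1.0 - f_alpha) * old + f_alpha * target
    return C[s]


# Toy example: two classes with identical logits, so the max softmax entry is 0.5.
theta = [[0.0, 0.0], [0.0, 0.0]]
x = [1.0, 2.0]
f_alpha = dynamic_rate(0.2, theta, x)

C = {"s": 1.0, "s_next": 2.0}
new_value = update_confidence(C, "s", "s_next", f_r=0.5, f_alpha=f_alpha, gamma=0.9)
```

With this reading, a classifier that is unsure about the current state (a flat softmax) shrinks the effective update rate, so the confidence value for that state moves more slowly toward the bootstrapped target.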

Assignees

Royal Bank Of Canada

Inventors

Classifications

CPC code	CPC title
G06N7/01	Probabilistic graphical models, e.g. probabilistic networks
G06N3/045	Combinations of networks
G06N3/092	Reinforcement learning
G06N3/09	Supervised learning
G06N3/096	Transfer learning

Patent family

Related publications grouped by family.

View patent family 67393585

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11308401B2 cover?: Systems, methods, and computer readable media directed to interactive reinforcement learning with dynamic reuse of prior knowledge are described in various embodiments. The interactive reinforcement learning is adapted for providing computer implemented systems for dynamic action selection based on confidence levels associated with demonstrator data or portions thereof.
Who is the assignee on this patent?: Royal Bank Of Canada