Interactive reinforcement learning with dynamic reuse of prior knowledge

US2019236458A1 · US · A1

Patent metadata
Field               Value
Publication number  US-2019236458-A1
Application number  US-201916263930-A
Country             US
Kind code           A1
Filing date         Jan 31, 2019
Priority date       Jan 31, 2018
Publication date    Aug 1, 2019
Grant date

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems, methods, and computer readable media directed to interactive reinforcement learning with dynamic reuse of prior knowledge are described in various embodiments. The interactive reinforcement learning is adapted for providing computer implemented systems for dynamic action selection based on confidence levels associated with demonstrator data or portions thereof.

First claim

Opening claim text (preview).

What is claimed is:

1. A system for biasing a machine learning architecture using one or more demonstrator data sets, the machine learning architecture for controlling one or more actions conducted by an agent in an environment which transitions between one or more states, the system comprising:

    a receiver configured to obtain one or more demonstrator data sets, each demonstrator data set including a data structure representing one or more state-action pairs observed in one or more interactions with the environment;

    a data storage configured to maintain, for each demonstrator data set or sub-portions thereof, one or more confidence data values, associated with at least one state of the one or more states;

    a supervised classifier for training using the one or more demonstrator data sets or sub-portions thereof;

    an action execution processor configured to generate control signals for executing an action associated with an action-source selected from at least one of the one or more demonstrator data sets based on the supervised classifier or an internal policy function maintained by the machine learning architecture, the selecting based at least upon the one or more confidence data values; and

    a state observer configured to monitor a new state resulting from the execution of the action and an associated reward outcome; and to update the internal policy function maintained by the machine learning architecture based at least on the observed reward outcome.

2. The system of claim 1, wherein the state observer is configured to update at least one of the confidence data values of the one or more confidence data values based on the observed reward outcome.

3. The system of claim 1, wherein the confidence data values are generated using a dynamic temporal difference confidence measurement based on the relation:

    $$C(s) \leftarrow (1 - F(\alpha)) \times C(s) + F(\alpha) \times [F(r) + \gamma \times C(s')]$$

    where γ is a discount factor, r is a reward function, and α is an update parameter.

4. The system of claim 3, wherein the temporal difference confidence measurement includes a dynamic rate update function based on the relation:

    $$F(\alpha) = \alpha \times \max\left\{ \frac{1}{\sum_i \exp(\theta_i^T \cdot x)} \begin{bmatrix} \exp(\theta_1^T \cdot x) \\ \exp(\theta_2^T \cdot x) \\ \vdots \\ \exp(\theta_i^T \cdot x) \end{bmatrix} \right\}$$

5. The system of claim 3, wherein the temporal difference confidence measurement includes a dynamic confidence update function based on the relation: F(r) = (r / r_max) × max{ 1 / Σ_i exp(θ_i
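
The relations in claims 3 and 4 can be illustrated with a minimal Python sketch. It is not taken from the patent, and it rests on several assumptions: `thetas` is treated as a list of per-class weight vectors and `x` as a feature vector for the current state (the preview does not define θ or x), confidence values are kept in a plain dictionary keyed by state, unseen states default to a confidence of 0.0, and F(r) is passed in as an already computed number because the claim 5 formula is cut off in this preview. All function names are hypothetical.

```python
import math


def max_softmax(thetas, x):
    """Return the largest softmax probability over the scores theta_i^T . x."""
    scores = [sum(t * xi for t, xi in zip(theta, x)) for theta in thetas]
    m = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - m) for s in scores]
    return max(exps) / sum(exps)


def dynamic_rate(alpha, thetas, x):
    """F(alpha) = alpha * max softmax probability (the claim 4 relation)."""
    return alpha * max_softmax(thetas, x)


def update_confidence(C, s, s_next, f_alpha, f_r, gamma):
    """Apply C(s) <- (1 - F(alpha)) * C(s) + F(alpha) * [F(r) + gamma * C(s')].

    C is a dict mapping states to confidence values; a missing state is
    read as 0.0 (an assumption of this sketch, not stated in the claims).
    f_r stands in for F(r), supplied by the caller.
    """
    c_s = C.get(s, 0.0)
    c_next = C.get(s_next, 0.0)
    C[s] = (1.0 - f_alpha) * c_s + f_alpha * (f_r + gamma * c_next)
    return C[s]


# Illustrative values only: two classes with equal scores, so the
# largest softmax probability is 0.5 and F(alpha) is half of alpha.
confidence = {"s0": 0.5, "s1": 1.0}
f_alpha = dynamic_rate(0.2, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
update_confidence(confidence, "s0", "s1", f_alpha, f_r=1.0, gamma=0.9)
```

With these illustrative inputs F(α) is 0.1, and the confidence for `"s0"` moves from 0.5 to about 0.64. When the classifier is less certain about a state, the largest softmax probability shrinks, so the effective update rate shrinks with it.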

Assignees

  • Royal Bank Of Canada

Inventors

Classifications

  • Combinations of networks · CPC title

  • Probabilistic graphical models, e.g. probabilistic networks · CPC title

  • G06N5/022 · Primary

    Knowledge engineering; Knowledge acquisition · CPC title

  • G06N3/006 · Primary

    based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO] · CPC title

  • Function evaluation by approximation methods, e.g. inter- or extrapolation, smoothing, least mean square method (G06F17/18 takes precedence; interpolation for numerical control G05B19/18) · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2019236458A1 cover?
Systems, methods, and computer readable media directed to interactive reinforcement learning with dynamic reuse of prior knowledge are described in various embodiments. The interactive reinforcement learning is adapted for providing computer implemented systems for dynamic action selection based on confidence levels associated with demonstrator data or portions thereof.
Who is the assignee on this patent?
Royal Bank Of Canada
What technology area does this patent fall under?
Primary CPC classification G06N5/022. Mapped technology areas include Physics.
When was this patent published?
Publication date: Aug 1, 2019 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).