Meta-gradient updates for training return functions for reinforcement learning systems

US11836620B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11836620-B2
Application number  US-202017112220-A
Country             US
Kind code           B2
Filing date         Dec 4, 2020
Priority date       May 18, 2018
Publication date    Dec 5, 2023
Grant date          Dec 5, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for reinforcement learning. The embodiments described herein apply meta-learning (and in particular, meta-gradient reinforcement learning) to learn an optimum return function G so that the training of the system is improved. This provides a more effective and efficient means of training a reinforcement learning system as the system is able to converge on an optimum set of one or more policy parameters θ more quickly by training the return function G as it goes. In particular, the return function G is made dependent on the one or more policy parameters θ and a meta-objective function J′ is used that is differentiated with respect to the one or more return parameters η to improve the training of the return function G.
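To make the idea of a return function G with tunable return parameters η concrete, here is a minimal Python sketch. It assumes η = (gamma, lam), a discount factor and a bootstrapping factor (the two parameters named in claim 6), and uses the standard λ-return recursion. That recursion, the name lambda_return, and the argument layout are assumptions made for illustration and are not taken from the patent text.

```python
def lambda_return(rewards, next_values, gamma, lam):
    """Compute a lambda-return G_t for every step of one trajectory.

    Assumed recursion (illustrative, not quoted from the patent):
        G_t = r_{t+1} + gamma * ((1 - lam) * v(s_{t+1}) + lam * G_{t+1})

    rewards[t]      reward received after the action at step t
    next_values[t]  value estimate of the state reached after step t
    gamma, lam      the return parameters eta (discount, bootstrapping)
    """
    if len(rewards) != len(next_values):
        raise ValueError("rewards and next_values must have equal length")
    if not rewards:
        return []

    # Beyond the last step there is no further return, so bootstrap
    # entirely from the final value estimate.
    g = next_values[-1]
    returns = []
    for r, v_next in zip(reversed(rewards), reversed(next_values)):
        g = r + gamma * ((1.0 - lam) * v_next + lam * g)
        returns.append(g)
    returns.reverse()
    return returns
```

With lam = 0 each return reduces to a one-step bootstrapped target, and with lam = 1 it becomes a discounted sum of rewards plus a bootstrap at the end of the trajectory, so changing η changes what the policy parameters θ are trained toward.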

First claim

Opening claim text (preview).

What is claimed is:

1. A reinforcement learning system comprising one or more computers configured to:

    retrieve training data comprising a plurality of experiences generated as a result of an agent interacting with an environment to perform a task in an attempt to achieve a specified result, each experience comprising an observation characterizing a state of the environment, an action performed by the agent in response to the observation and a reward received in response to the action; and

    train a reinforcement learning neural network having one or more policy parameters to control the agent to perform the task by jointly training (i) the reinforcement learning neural network and (ii) an objective function that has one or more objective function parameters and that evaluates performance of the agent based on the actions performed by the agent, comprising:

    updating the one or more policy parameters for the reinforcement learning neural network based on a first set of the experiences using the objective function;

    updating the one or more objective function parameters of the objective function based on the one or more updated policy parameters and a second set of the experiences, wherein the one or more objective function parameters are updated via a gradient ascent or descent method using a meta-objective function differentiated with respect to the one or more objective function parameters, wherein the meta-objective function is dependent on the one or more policy parameters;

    retrieving updated experiences generated as a result of the agent interacting with the environment to perform the task under the control of the reinforcement neural network using the one or more updated policy parameters and the one or more updated objective function parameters;

    further updating the one or more policy parameters based on a first set of the updated experiences using the one or more updated objective function parameters; and

    further updating the one or more objective function parameters based on the further updated policy parameters and a second set of the updated experiences via the gradient ascent or descent method.

2. The reinforcement learning system of claim 1, wherein updating the one or more objective function parameters utilizes a differential of the one or more updated policy parameters with respect to the one or more objective function parameters.

3. The reinforcement learning system of claim 1, wherein updating the one or more objective function parameters comprises applying a further objective function as part of the meta-objective function and evaluating the updated policy from the further objective function when applied to the second set of updated experiences.

4. The reinforcement learning system of claim 1, wherein the updating of the one or more policy parameters applies one or more of a policy and a value function that are conditioned on the one or more objective function parameters.

5. The reinforcement learning system of claim 4, wherein the conditioning is via an embedding of the one or more objective function parameters.

6. The reinforcement learning system of claim 1, wherein the one or more objective function parameters comprise one or more of a discount factor of the objective function and a bootstrapping factor of the objective function.

7. The reinforcement learning system of claim 1, wherein the one or more computers are further configured to: update the one or more policy parameters for the reinforcement learning neural network based on the second set of the experiences; and update the one or more objective function parameters of the objective function based on the one or more updated policy parameters and the first set of the experiences, wherein the one or more objective function parameters are updated via the gradient ascent or descent method.

8. The reinforcement learning system of claim 1, wherein the differentiated meta-objective function is:

$$\frac{\partial J'(\tau', \theta', \eta')}{\partial \eta} = \frac{\partial J'(\tau', \theta', \eta')}{\partial \theta'} \frac{d\theta'}{d\eta}$$

where: η are the one or more objective function parameters; and J′(τ′, θ′, η′) is the meta-objective function conditioned on the second set of experiences τ′, the one or more updated policy parameters θ′ and one or more further objective function parameters η′ of a further objective function forming part of the meta-objective function.

9. The reinforcement learning system of claim 8, wherein the reinforcement learning system is configured to calculate the differentiated meta-objective function based on a differential of the updated policy parameters θ′ with respect to the one or more objective function parameters η, dθ′/dη, calculated by adding a differential of an update function with respect to the one or more objective function parameters, dƒ(τ, θ, η)/dη, the update function being for updating the policy, to a differential of the policy parameters θ with respect to the objective function parameters η, dθ/dη.

10. The reinforcement learning system of claim 9, wherein the differential of the update function with respect to the objective function parameters, ∂ƒ(τ, θ, η)/∂η, is calculated via:

$$\frac{\partial f(\tau, \theta, \eta)}{\partial \eta}$$
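The joint update described in claims 1, 8 and 9 can be sketched on a toy problem. The Python below is a hedged illustration under strong simplifying assumptions, not the patented implementation: θ is a single scalar value estimate rather than a neural network, η is only a discount factor gamma, the update function f is a squared-error step toward the return, the meta-objective J′ is a negative squared error against a fixed reference discount gamma_ref (standing in for η′), and dƒ/dη is expanded with the chain rule as ∂ƒ/∂η + (∂ƒ/∂θ)(dθ/dη). All names, step sizes and data values are hypothetical.

```python
def meta_gradient_step(theta, gamma, z, tau, tau_prime,
                       alpha=0.1, beta=0.005, gamma_ref=0.9):
    """One joint update of theta (policy side) and gamma (eta).

    z carries d(theta)/d(eta) from the previous step.
    tau and tau_prime are (reward, next_value) pairs standing in for
    the first and second sets of experiences.
    """
    # 1. Update theta on the first sample using the return under gamma.
    r, v_next = tau
    g = r + gamma * v_next
    f = alpha * (g - theta)              # update function f(tau, theta, eta)
    theta_new = theta + f

    # 2. d(theta')/d(eta) = d(theta)/d(eta) + d f/d(eta), with
    #    d f/d(eta) = (partial f/partial eta) + (partial f/partial theta) * z.
    df_deta = alpha * v_next + (-alpha) * z
    z_new = z + df_deta

    # 3. Meta-objective on the second sample, J' = -0.5 * (g_ref - theta')**2,
    #    so dJ'/d(theta') = g_ref - theta'.
    r2, v2 = tau_prime
    dj_dtheta = (r2 + gamma_ref * v2) - theta_new

    # 4. Chain rule of claim 8, then gradient ascent on eta.
    meta_grad = dj_dtheta * z_new
    gamma_new = gamma + beta * meta_grad
    return theta_new, gamma_new, z_new


def train(steps=3000):
    """Run the toy loop on fixed, deterministic data (hypothetical values)."""
    theta, gamma, z = 0.0, 0.5, 0.0
    tau = (1.0, 2.0)          # first set of experiences
    tau_prime = (1.0, 2.0)    # second set of experiences
    for _ in range(steps):
        theta, gamma, z = meta_gradient_step(theta, gamma, z, tau, tau_prime)
    return theta, gamma
```

Calling train() in this toy setting should move gamma from its starting value of 0.5 toward the reference 0.9, with theta settling near the return that the reference discount implies. A real system would replace the scalar θ with network parameters, draw τ and τ′ from fresh agent experience after each update, and obtain the derivatives by automatic differentiation.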

Assignees

Inventors

Classifications

  • G06N3/092 (Primary)

    Reinforcement learning · CPC title

  • Hyperparameter optimisation; Meta-learning; Learning-to-learn · CPC title

  • G06N3/08 (Primary)

    Learning methods · CPC title

  • G06N3/084 (Primary)

    Backpropagation, e.g. using gradient descent · CPC title

  • based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO] · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11836620B2 cover?
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for reinforcement learning. The embodiments described herein apply meta-learning (and in particular, meta-gradient reinforcement learning) to learn an optimum return function G so that the training of the system is improved. This provides a more effective and efficient means of training a reinforc…
Who is the assignee on this patent?
Deepmind Tech Ltd
What technology area does this patent fall under?
Primary CPC classification G06N3/092. Mapped technology areas include Physics.
When was this patent published?
Publication date: Dec 5, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).