Environment prediction using reinforcement learning

US10733501B2 · US · B2

Patent metadata
Field               Value
Publication number  US-10733501-B2
Application number  US-201916403314-A
Country             US
Kind code           B2
Filing date         May 3, 2019
Priority date       Nov 4, 2016
Publication date    Aug 4, 2020
Grant date          Aug 4, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for prediction of an outcome related to an environment. In one aspect, a system comprises a state representation neural network that is configured to: receive an observation characterizing a state of an environment being interacted with by an agent and process the observation to generate an internal state representation of the environment state; a prediction neural network that is configured to receive a current internal state representation of a current environment state and process the current internal state representation to generate a predicted subsequent state representation of a subsequent state of the environment and a predicted reward for the subsequent state; and a value prediction neural network that is configured to receive a current internal state representation of a current environment state and process the current internal state representation to generate a value prediction.
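
The data flow between the three networks named in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names (`state_representation`, `prediction`, `value_prediction`, `rollout`) and the fixed arithmetic inside them are hypothetical stand-ins for trained neural networks and do not come from the patent. The sketch only shows which component consumes which input and what it returns.

```python
from typing import List, Tuple

State = List[float]


def state_representation(observation: List[float]) -> State:
    """Hypothetical stand-in: maps an observation to an internal state."""
    return [0.5 * x for x in observation]


def prediction(state: State) -> Tuple[State, float]:
    """Hypothetical stand-in: maps a state to (next state, predicted reward)."""
    next_state = [0.9 * s for s in state]
    reward = sum(next_state) / len(next_state)
    return next_state, reward


def value_prediction(state: State) -> float:
    """Hypothetical stand-in: maps a state to a value estimate."""
    return 2.0 * sum(state) / len(state)


def rollout(observation: List[float], num_steps: int) -> Tuple[List[float], List[float]]:
    """Unroll the internal model for num_steps internal time steps.

    At each internal step the current state yields a value prediction,
    then the prediction function advances the state and emits a reward.
    """
    state = state_representation(observation)
    rewards: List[float] = []
    values: List[float] = []
    for _ in range(num_steps):
        values.append(value_prediction(state))
        state, reward = prediction(state)
        rewards.append(reward)
    return rewards, values


if __name__ == "__main__":
    rewards, values = rollout([2.0, 4.0], num_steps=3)
    print("predicted rewards:", rewards)
    print("value predictions:", values)
```

Note that the rollout never consults the real environment after the first observation: every later state is produced by the prediction function, which is why the abstract calls these representations internal.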

First claim

Opening claim text (preview).

What is claimed is:

1. A method performed by one or more data processing apparatus, the method comprising: receiving, by the one or more data processing apparatus, one or more observations characterizing states of an environment being interacted with by an agent; providing, by the one or more data processing apparatus, the one or more observations as input to a state representation neural network, wherein the state representation neural network is configured to: receive the one or more observations, and process the one or more observations to generate an internal state representation of a current environment state; for each of a plurality of internal time steps: generating, by the one or more data processing apparatus, using a prediction neural network and a value prediction neural network and from an internal state representation for the internal time step: (i) an internal state representation for a next internal time step, (ii) a predicted reward for the next internal time step, and (iii) a value prediction that is an estimate of a future cumulative discounted reward from the next internal time step onwards; wherein the prediction neural network is configured to, for each of the plurality of internal time steps: receive the internal state representation for the internal time step; and process the internal state representation for the internal time step to generate: an internal state representation for a next internal time step, and a predicted reward for the next internal time step; wherein the value prediction neural network is configured to, for each of the plurality of internal time steps: receive the internal state representation for the internal time step, and process the internal state representation for the internal time step to generate a value prediction that is an estimate of a future cumulative discounted reward from the next internal time step onwards; and determining, by the one or more data processing apparatus, an aggregate reward from the predicted rewards and the value predictions for the internal time steps, wherein the aggregate reward is an estimate of an outcome associated with the states of the environment characterized by the observations.

2. The method of claim 1, further comprising: providing the aggregate reward as an estimate of the outcome associated with the states of the environment characterized by the observations.

3. The method of claim 1, wherein the prediction neural network is further configured to generate a predicted discount factor for the next internal time step, and further comprising using the predicted discount factors for the internal time steps in determining the aggregate reward.

4. The method of claim 1, wherein the state representation neural network comprises a recurrent neural network.

5. The method of claim 3, further comprising: for each internal time step, processing, by the one or more data processing apparatus, an internal state representation for the internal time step using a lambda neural network to generate a lambda factor for the next internal time step; determining, by the one or more data processing apparatus, a respective k-step return for each internal time step by combining the predicted reward and the predicted discount factor for each of a first k internal time steps and the value prediction for a k-th internal time step; and using the lambda factors to determine weights for the k-step returns in determining the aggregate reward.

6. The method of claim 1, wherein the state representation neural network comprises a feedforward neural network.

7. The method of claim 1, wherein the prediction neural network comprises a recurrent neural network.

8. The method of claim 1, wherein the prediction neural network comprises a feedforward neural network that has different parameter values at each of the plurality of internal time steps.

9. A system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform operations comprising: receiving, by the one or more computers, one or more observations characterizing states of an environment being interacted with by an agent; providing, by the one or more computers, the one or more observations as input to a state representation neural network, wherein the state representation neural network is configured to: receive the one or more observations, and process the one or more observations to generate an internal state representation of a current environment state; for each of a plurality of internal time steps: generating, by the one or more computers, using a prediction neural network and a value prediction neural network and from an internal state representation for the internal time step: (i) an internal state representation for a next internal time step, (ii) a predicted reward for the next internal time step, and (iii) a value prediction that is an estimate of a future cumulative discounted reward from the next internal time step onwards; wherein the prediction neural network is configured to, for each of the plurality of internal time steps: receive the internal state representation for the internal time step; and process the internal state representation for the internal time step to generate: an internal state representation for a next internal time step, and a predicted reward for the next internal time step; wherein the value prediction neural network is configured to, for each of the plurality of internal time steps: receive the internal state representation for the internal time step, and process the internal state representation for the internal time step to generate a value prediction that is an estimate of a future cumulative discounted reward from the next internal time step onwards; and determining, by the one or more computers, an aggregate reward from the predicted rewards and the value predictions for the internal time steps, wherein the aggregate reward is an estimate of an outcome associated with the states of the environment characterized by the observations.

10. The system of claim 9, wherein the operations further comprise: providing the aggregate reward as an estimate of the outcome associated with the states of the environment characterized by the observations.

11. The system of claim 9, wherein the prediction neural network is further configured to generate a predicted discount factor for the next internal time step, and wherein the operations further comprise using the predicted discount factors for the internal time steps in determining the aggregate reward.

12. The system of claim 9, wherein the state representation neural network comprises a recurrent neural network.

13. The system of claim 11, wherein the operations further comprise: for each internal time step, processing, by the one or more data processing apparatus, an internal state representation for the internal time step using a lambda neural network to generate a lambda factor for the next internal time step; determining, by the one or more computers, a respective k-step return for each internal time step by combining the predicted reward and the predicted discount factor for each of a first k internal time steps and the value prediction for a k-th internal time step; and using the lambda factors to determine weights for the k-step returns in determining the aggregate reward.

14. The system of claim 9, wherein the state representation neural network comprises a
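
The k-step returns and lambda weighting recited in claims 3, 5, 11 and 13 can be sketched in a few lines of Python. This is a hedged illustration, not the claimed method: the claims state that lambda factors determine weights for the k-step returns but do not fix a formula, so the TD(lambda)-style weighting below is an assumption, as are the function names (`k_step_returns`, `lambda_weights`, `aggregate_reward`) and the indexing convention (`rewards[i]` and `discounts[i]` belong to internal step i+1; `values[k]` is the value prediction made from the state at internal step k).

```python
from typing import List


def k_step_returns(rewards: List[float], discounts: List[float],
                   values: List[float]) -> List[float]:
    """Return [g_0, ..., g_K], where
    g_k = r_1 + d_1 * (r_2 + d_2 * (... + d_k * v_k)).

    Assumed convention: len(rewards) == len(discounts) == K and
    len(values) == K + 1.
    """
    num_steps = len(rewards)
    if len(discounts) != num_steps or len(values) != num_steps + 1:
        raise ValueError("expected K rewards, K discounts and K + 1 values")
    returns = []
    for k in range(num_steps + 1):
        g = values[k]  # bootstrap from the value prediction at step k
        for i in reversed(range(k)):
            g = rewards[i] + discounts[i] * g  # fold in earlier steps
        returns.append(g)
    return returns


def lambda_weights(lambdas: List[float]) -> List[float]:
    """Assumed weighting: w_k = (1 - lambda_k) * prod_{j<k} lambda_j for k < K,
    and w_K = prod_{j<K} lambda_j, so the weights sum to 1."""
    weights = []
    carried = 1.0  # product of the lambda factors seen so far
    for lam in lambdas:
        weights.append((1.0 - lam) * carried)
        carried *= lam
    weights.append(carried)
    return weights


def aggregate_reward(rewards: List[float], discounts: List[float],
                     values: List[float], lambdas: List[float]) -> float:
    """Weighted mixture of the k-step returns (one possible aggregate)."""
    returns = k_step_returns(rewards, discounts, values)
    weights = lambda_weights(lambdas)
    return sum(w * g for w, g in zip(weights, returns))


if __name__ == "__main__":
    print(aggregate_reward(rewards=[1.0, 1.0], discounts=[0.5, 0.5],
                           values=[0.0, 0.0, 4.0], lambdas=[0.5, 0.5]))
```

Under this assumed weighting, setting every lambda factor to 1 puts all the weight on the deepest return g_K, while setting the first lambda factor to 0 reduces the aggregate to the value prediction made directly from the initial internal state.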

Assignees

Deepmind Tech Ltd

Inventors

Classifications

  • Recurrent networks, e.g. Hopfield networks · CPC title

  • G06N3/045 · Primary

    Combinations of networks · CPC title

  • G06N3/006 · Primary

    based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO] · CPC title

  • Probabilistic or stochastic networks · CPC title

  • G06N3/092 · Primary

    Reinforcement learning · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10733501B2 cover?
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for prediction of an outcome related to an environment. In one aspect, a system comprises a state representation neural network that is configured to: receive an observation characterizing a state of an environment being interacted with by an agent and process the observation to generate an intern…
Who is the assignee on this patent?
Deepmind Tech Ltd
What technology area does this patent fall under?
Primary CPC classification G06N3/045. Mapped technology areas include Physics.
When was this patent published?
Publication date Aug 4, 2020 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).