Methods and apparatus for reinforcement learning

US9679258B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9679258-B2
Application number  US-201314097862-A
Country             US
Kind code           B2
Filing date         Dec 5, 2013
Priority date       Oct 8, 2013
Publication date    Jun 13, 2017
Grant date          Jun 13, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

We describe a method of reinforcement learning for a subject system having multiple states and actions to move from one state to the next. Training data is generated by operating on the system with a succession of actions and used to train a second neural network. Target values for training the second neural network are derived from a first neural network which is generated by copying weights of the second neural network at intervals.
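
The copy-at-intervals arrangement in the abstract can be illustrated with a minimal sketch. It is not taken from the patent: the function name `train_with_periodic_copy`, the `update_step` callback, and the `copy_interval` parameter are hypothetical, and plain Python lists stand in for neural network weights. The point it shows is only that the first network stays frozen between copies while the second network keeps changing.

```python
import copy


def train_with_periodic_copy(second_weights, update_step, num_steps, copy_interval):
    """Update the second network every step; refresh the first network at intervals.

    `update_step(second, first)` is a hypothetical callback that returns new
    weights for the second network and may read targets from the first.
    """
    first_weights = copy.deepcopy(second_weights)  # first network starts as a copy
    copies_made = 0
    for step in range(1, num_steps + 1):
        second_weights = update_step(second_weights, first_weights)
        if step % copy_interval == 0:
            # Periodic copy: the first network takes the second network's current weights.
            first_weights = copy.deepcopy(second_weights)
            copies_made += 1
    return first_weights, second_weights, copies_made


def toy_update(second, first):
    """Placeholder for a real training step: nudge every weight by 1.0."""
    return [w + 1.0 for w in second]


first, second, copies = train_with_periodic_copy(
    [0.0, 0.0], toy_update, num_steps=7, copy_interval=3
)
print(first, second, copies)
```

With 7 steps and an interval of 3, copies happen at steps 3 and 6, so the first network lags the second by one step at the end of the run.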

First claim

Opening claim text (preview).

The invention claimed is:

1. A method of reinforcement learning, the method comprising: obtaining training data relating to a subject system being interacted with by a reinforcement learning agent that performs actions from a set of actions to cause the subject system to move from one state to another state; wherein the training data comprises a plurality of transitions, each transition comprising respective starting state data, action data and next state data defining, respectively, a starting state of the subject system, an action performed by the reinforcement learning agent when the subject system was in the starting state, and a next state of the subject system resulting from the action being performed by the reinforcement learning system; and training a second neural network used to select actions to be performed by the reinforcement learning agent on the transitions in the training data and, for each transition, a respective target output generated by a first neural network, wherein the first neural network is another instance of the second neural network but with possibly different parameter values than those of the first neural network; and during the training, periodically updating the parameter values of the first neural network from current parameter values of the second neural network, wherein the state data and the next state data in each transition are image data.

2. A method as claimed in claim 1 further comprising, after the training selecting actions to be performed by the reinforcement learning agent using the second neural network and in accordance with trained parameter values of the second neural network.

3. A method as claimed in claim 1, further comprising generating transitions by storing data defining actions selected using the second neural network in association with data defining respective said starting states and next states for the selected actions.

4. A method as claimed in claim 1 wherein the first neural network and the second neural network receive an input comprising state data and generate as output a respective Q-value for each of one or more of the actions in the set of actions, the method further comprising, for each transition, generating the target output by providing the data defining the actions and the next state data as input to the first neural network.

5. A method as claimed in claim 3 further comprising: providing the second neural network with particular state data characterizing a particular state of the subject system; retrieving from the second neural network a respective Q-value for each action of the set of actions; and selecting an action to be performed by the reinforcement learning agent having a maximum or minimum Q-value as generated by the second neural network.

6. A method as claimed in claim 1 wherein the transitions in the training data are generated using the second neural network.

7. A method as claimed in claim 1, wherein the training comprises, for each transition: providing the first neural network with the next state data; determining, from the first neural network, a maximum or minimum Q-value for the next state; determining the target output for the transition from the maximum or minimum Q-value for the next state.

8. A method as claimed in claim 7 wherein the training further comprises, for each transition: providing the second neural network with the starting state data and adjusting weights of the second neural network to bring a Q-value for the action defined by the action data closer to the target output.

9. A method as claimed in claim 7 wherein each transition further comprises reward data defining a reward value or cost value resulting from the action defined by the action data, and wherein determining the target output comprises adjusting the maximum or minimum parameter-value for the next state by the reward data.

10. A method as claimed in claim 1 wherein the first and second neural networks comprise deep neural networks with a convolutional neural network input stage.

11. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations comprising: obtaining training data relating to a subject system being interacted with by a reinforcement learning agent that performs actions from a set of actions to cause the subject system to move from one state to another state; wherein the training data comprises a plurality of transitions, each transition comprising respective starting state data, action data and next state data defining, respectively, a starting state of the subject system, an action performed by the reinforcement learning agent when the subject system was in the starting state, and a next state of the subject system resulting from the action being performed by the reinforcement learning system; and training a second neural network used to select actions to be performed by the reinforcement learning agent on the transitions in the training data and, for each transition, a respective target output generated by a first neural network, wherein the first neural network is another instance of the second neural network but with possibly different parameter values than those of the first neural network; and during the training, periodically updating the parameter values of the first neural network from current parameter values of the second neural network, wherein the state data and the next state data in each transition are image data.

12. A system as claimed in claim 11 the operations further comprising, after the training selecting actions to be performed by the reinforcement learning agent using the second neural network and in accordance with trained parameter values of the second neural network.

13. A system as claimed in claim 11, the operations further comprising generating transitions by storing data defining actions selected using the second neural network in association with data defining respective said starting states and next states for the selected actions.

14. A system as claimed in claim 11 wherein the first neural network and the second neural network receive an input comprising state data and generate as output a respective Q-value for each of one or more of the actions in the set of actions, the operations further comprising, for each transition, generating the target output by providing the data defining the action and the next state data as input to the first neural network.

15. A system as claimed in claim 14 the operations further comprising: providing the second neural network with particular state data characterizing a particular state of the subject system; retrieving from the second neural network a respective Q-value for each action of the set of actions; and selecting an action to be performed by the reinforcement learning agent having a maximum or minimum Q-value as generated by the second neural network.

16. A system as claimed in claim 11 wherein the transitions in the training data are generated using the second neural network.

17. A system as claimed in claim 11, wherein the training comprises, for each transition: providing the first neural network with the next state data; determining, from the first neural network, a maximum or minimum Q-value for the next state; determining the target output for the transition from the maximum or minimum Q-value for the next state.

18. A system as claimed in claim 17 wherein the training further comprises, for each transition: provid
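
The per-transition steps described in claims 5 and 7 to 9 can be sketched in a few lines. This is an illustrative reading, not the patent's implementation: the function names, the `discount` parameter (a standard Q-learning assumption that the claim text shown here does not mention), and the `costs` flag that switches from maximum to minimum are all hypothetical. Lists of numbers stand in for the Q-value outputs of the two networks.

```python
def target_output(reward, next_q_values, discount=0.99, costs=False):
    """Target for one transition, using Q-values from the FIRST network.

    Takes the best Q-value for the next state (maximum for rewards, minimum
    for costs) and adjusts it by the reward. `discount` is an assumption.
    """
    best = min(next_q_values) if costs else max(next_q_values)
    return reward + discount * best


def select_action(q_values, costs=False):
    """Index of the action with the maximum (or minimum) Q-value from the SECOND network."""
    pick = min if costs else max
    return pick(range(len(q_values)), key=lambda a: q_values[a])


def q_error(q_values, action, target):
    """Gap that training would shrink: target minus the Q-value of the action taken."""
    return target - q_values[action]


# Toy numbers for a single transition with three possible actions.
next_q_from_first = [0.5, 2.0, 1.0]    # first network evaluated on the next state
start_q_from_second = [0.1, 0.7, 0.3]  # second network evaluated on the starting state

target = target_output(reward=1.0, next_q_values=next_q_from_first, discount=0.5)
chosen = select_action(start_q_from_second)
error = q_error(start_q_from_second, action=1, target=target)
print(target, chosen, error)
```

A training step in the sense of claim 8 would then adjust the second network's weights so that this error moves toward zero; the weight update itself depends on the network architecture and is left out of the sketch.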

Assignees

Google Inc

Inventors

Classifications

  • Combinations of networks · CPC title

  • A63F13/67 · Primary

    adaptively or by learning from player actions, e.g. skill level adjustment or by storing successful combat sequences for re-use · CPC title

  • based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO] · CPC title

  • G06N3/092 · Primary

    Reinforcement learning · CPC title

  • Convolutional networks [CNN, ConvNet] · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9679258B2 cover?
We describe a method of reinforcement learning for a subject system having multiple states and actions to move from one state to the next. Training data is generated by operating on the system with a succession of actions and used to train a second neural network. Target values for training the second neural network are derived from a first neural network which is generated by copying weights o…
Who is the assignee on this patent?
Google Inc
What technology area does this patent fall under?
Primary CPC classification A63F13/67. Mapped technology areas include Human Necessities.
When was this patent published?
Publication date Jun 13, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).