Reinforcement learning using target neural networks

US11049008B2 · US · B2

Patent metadata
Publication number: US-11049008-B2
Application number: US-201715619393-A
Country: US
Kind code: B2
Filing date: Jun 9, 2017
Priority date: Oct 8, 2013
Publication date: Jun 29, 2021
Grant date: Jun 29, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

We describe a method of reinforcement learning for a subject system having multiple states and actions to move from one state to the next. Training data is generated by operating on the system with a succession of actions and used to train a second neural network. Target values for training the second neural network are derived from a first neural network which is generated by copying weights of the second neural network at intervals.
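As an illustration only (not taken from the patent text), the two-network arrangement in the abstract can be sketched in a few lines of Python. Everything named here is hypothetical: `LinearQ` is a toy linear stand-in for a neural network, and the reward term, `discount`, `lr`, and `copy_every` are assumed details that the abstract does not specify. The point of the sketch is the structure: the second network is updated at every step, while the first network only changes when weights are copied into it at intervals.

```python
import copy
import random


class LinearQ:
    """Toy stand-in for a neural network: one weight vector per action."""

    def __init__(self, n_features, n_actions, seed=0):
        rng = random.Random(seed)
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(n_features)]
                  for _ in range(n_actions)]

    def q_values(self, state):
        # One action-value per action: dot product of weights and state.
        return [sum(w * x for w, x in zip(row, state)) for row in self.w]


def train_with_target_copy(transitions, n_features, n_actions, steps=100,
                           copy_every=20, discount=0.9, lr=0.05, seed=0):
    """Train `second`; refresh `first` by copying weights every `copy_every` steps."""
    rng = random.Random(seed)
    second = LinearQ(n_features, n_actions, seed)  # network being trained
    first = copy.deepcopy(second)                  # source of target values
    copies = 0
    for step in range(1, steps + 1):
        state, action, reward, next_state = rng.choice(transitions)
        # Target comes from the first network (assumed reward + discounted max form).
        target = reward + discount * max(first.q_values(next_state))
        error = target - second.q_values(state)[action]
        # Gradient step on the second network only.
        second.w[action] = [w + lr * error * x
                            for w, x in zip(second.w[action], state)]
        if step % copy_every == 0:
            # Update the first network from the second by copying weights.
            first.w = copy.deepcopy(second.w)
            copies += 1
    return first, second, copies
```

Between copies the targets are computed from frozen weights, so the second network is regressed towards values that do not shift with each of its own updates.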

First claim

Opening claim text (preview).

What is claimed is:

1. A method of reinforcement learning performed by one or more computers, the method comprising: obtaining training data relating to a subject system being interacted with by a reinforcement learning agent that performs actions from a set of actions to cause the subject system to change states; wherein the training data is generated by performing a succession of actions from the set of actions to interact with the environment and comprises a plurality of transitions each including respective starting state data defining a respective starting state of the environment, respective action data defining a respective action from the set of actions, and respective next state data defining respective next state resulting from the respective action being performed; and training a second neural network using the training data and target values for the second neural network derived from a first neural network; and updating the first neural network from the second neural network.

2. A method as claimed in claim 1 further comprising selecting the actions performed to generate the training data using learnt action-value parameters obtained as output from the second neural network, wherein the actions are selected responsive to respective action-value parameters for each action of a set of actions available at a state of the subject system that are generated by the second neural network.

3. A method as claimed in claim 2, the method further comprising generating the transitions by storing data defining actions selected using the second neural network in association with data defining respective starting states and next states for the actions.

4. A method as claimed in claim 3 further comprising generating the target values by providing at least the data defining the next states as input to the first neural network, and training the second neural network using the target values and the data defining the starting states.

5. A method as claimed in claim 4 further comprising: selecting a first transition comprising first starting state data, first action data, and first next state data; providing the first neural network with a representation of the first next state data; determining, from the first neural network, a maximum or minimum learnt action-value parameter for the next state data; determining a first target value for training the second neural network from the maximum or minimum learnt action-value parameter for the next state data.

6. A method as claimed in claim 5 wherein the training of the second neural network comprises providing the second neural network with a representation of the first starting state data and adjusting weights of the second neural network to bring a learnt action-value parameter for an action defined by the first action data closer to the first target value.

7. A method as claimed in claim 5 wherein the first transition further comprises reward data defining a reward value or cost value resulting from an action defined by the first action data being performed, and wherein determining the target value comprises adjusting the maximum or minimum learnt action-value parameter for the first next state data by the reward value or the cost value.

8. A method as claimed in claim 2 wherein selecting the actions further comprises: obtaining state data defining an input state of the subject system; providing, as input to the second neural network, a representation of the input state of the system; obtaining, as output from the second neural network, a respective learnt action-value parameter for each action of the set of actions that is available at the input state; and selecting an action to perform having a maximum or minimum respective learnt action-value parameter.

9. A method as claimed in claim 2 wherein the training of second neural network alternates with the selecting actions and comprises incrementally updating a set of weights of the second neural network used for the selecting actions.

10. A method as claimed in claim 9 wherein the updating the first neural network from the second neural network is performed at intervals after repeated selecting of actions using the second neural network and training of the second neural network.

11. A method as claimed in claim 10 wherein updating the first neural network from the second neural network comprises copying a set of weights of the second neural network to the first neural network.

12. A method as claimed in claim 1 wherein a state of the subject system comprises a sequence of observations of the subject system over time representing a history of the subject system.

13. A method as claimed in claim 1 wherein each state is defined by image data.

14. A method as claimed in claim 1 wherein the first and second neural networks comprise deep neural networks with a convolutional neural network input stage.

15. A system for reinforcement learning, the system comprising one or more computers and one or more storage devices storing instructions that when executed by one or more computers cause the one or more computers to perform operations comprising: obtaining training data relating to a subject system being interacted with by a reinforcement learning agent that performs actions from a set of actions to cause the subject system to change states; wherein the training data is generated by performing a succession of actions from the set of actions to interact with the environment and comprises a plurality of transitions each including respective starting state data defining a respective starting state of the environment, respective action data defining a respective action from the set of actions, and respective next state data defining a respective next state resulting from the respective action being performed; and training a second neural network using the training data and target values for the second neural network derived from a first neural network; and updating the first neural network from the second neural network.

16. A system as claimed in claim 15 the operations further comprising selecting the actions performed to generate the training data using learnt action-value parameters obtained as output from the second neural network, wherein the actions are selected responsive to respective action-value parameters for each action of a set of actions available at a state of the subject system that are generated by the second neural network.

17. A system as claimed in claim 16, the operations further comprising generating the transitions by storing data defining actions selected using the second neural network in association with data defining respective starting states and next states for the actions.

18. A system as claimed in claim 17 the operations further comprising generating the target values by providing at least the data defining the next states as input to the first neural network, and training the second neural network using the target values and the data defining the starting states.

19. A system as claimed in claim 18 the operations further comprising: selecting a first transition comprising first starting state data, first action data, and first next state data; providing the first neural network with a representation of the first next state data; determining, from the first neural network, a maximum or minimum learnt action-value parameter for the next state data; determining a first target value for training the second neural network from the maximum or minimum learnt action-value parameter for the next state data
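For readers who find code easier to follow than claim language, the steps recited in claims 5 to 8 can be approximated by the hedged Python sketch below. It is not the patentee's implementation. The names `Transition`, `select_action`, `target_value`, and `td_error` are hypothetical, networks are modelled as plain callables returning one action-value per action, and the `discount` factor is an assumption: the claim preview only says the maximum or minimum value is "adjusted by" the reward or cost.

```python
from typing import Callable, NamedTuple, Sequence

# A "network" here is any callable mapping a state to one value per action.
QFunction = Callable[[Sequence[float]], Sequence[float]]


class Transition(NamedTuple):
    """Starting state, action, reward (claim 7), and resulting next state."""
    start_state: Sequence[float]
    action: int
    reward: float
    next_state: Sequence[float]


def select_action(second_net: QFunction, state, best=max) -> int:
    """Pick the action whose learnt action-value is largest (or smallest
    when `best=min`, e.g. for cost values), using the second network."""
    q = second_net(state)
    return best(range(len(q)), key=lambda a: q[a])


def target_value(first_net: QFunction, t: Transition,
                 discount: float = 0.99, best=max) -> float:
    """Take the max (or min) action-value of the next state from the first
    network and adjust it by the reward. `discount` is an assumed detail."""
    return t.reward + discount * best(first_net(t.next_state))


def td_error(second_net: QFunction, t: Transition, target: float) -> float:
    """Gap that training would shrink: target minus the second network's
    current value for the action actually taken in the starting state."""
    return target - second_net(t.start_state)[t.action]
```

A training step in the sense of claim 6 would adjust the second network's weights so that `td_error` moves towards zero; how that adjustment is done (for example, by gradient descent) is left out of this sketch.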

Assignees

Deepmind Tech Ltd

Inventors

Classifications

  • Combinations of networks

  • G06N3/092 (Primary): Reinforcement learning

  • Convolutional networks [CNN, ConvNet]

  • A63F13/67 (Primary): adaptively or by learning from player actions, e.g. skill level adjustment or by storing successful combat sequences for re-use

  • based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11049008B2 cover?
We describe a method of reinforcement learning for a subject system having multiple states and actions to move from one state to the next. Training data is generated by operating on the system with a succession of actions and used to train a second neural network. Target values for training the second neural network are derived from a first neural network which is generated by copying weights o…
Who is the assignee on this patent?
Deepmind Tech Ltd
What technology area does this patent fall under?
Primary CPC classification G06N3/092. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jun 29, 2021 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).