Control policy learning and vehicle control method based on reinforcement learning without active exploration

US10061316B2 · US · B2

Patent metadata
Field               Value
Publication number  US-10061316-B2
Application number  US-201715594020-A
Country             US
Kind code           B2
Filing date         May 12, 2017
Priority date       Jul 8, 2016
Publication date    Aug 28, 2018
Grant date          Aug 28, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer-implemented method is provided for autonomously controlling a vehicle to perform a vehicle operation. The method includes steps of applying a passive actor-critic reinforcement learning method to passively-collected data relating to the vehicle operation, to learn a control policy configured for controlling the vehicle so as to perform the vehicle operation with a minimum expected cumulative cost; and controlling the vehicle in accordance with the control policy to perform the vehicle operation.
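
As a rough illustration of learning a minimum-cost policy purely from passively collected transitions, the sketch below uses a small tabular setting. It is not the patent's actor-critic networks. It assumes one well-known form of the linearized Bellman equation from the linearly-solvable MDP literature, Z_avg * z(s) = exp(-q(s)) * sum over s' of p(s'|s) * z(s'), and solves it by power iteration; this page does not state the equation the patent actually uses. The function names, the three-state transition log, and the state costs q are all hypothetical.

```python
import numpy as np


def estimate_passive_dynamics(transitions, n_states):
    """Row-normalised transition counts from passively observed (s, s_next) pairs."""
    counts = np.zeros((n_states, n_states))
    for s, s_next in transitions:
        counts[s, s_next] += 1.0
    row_sums = counts.sum(axis=1, keepdims=True)
    if np.any(row_sums == 0):
        raise ValueError("every state needs at least one observed transition")
    return counts / row_sums


def solve_linearized_bellman(P, q, tol=1e-10, max_iter=10_000):
    """Power iteration on Z_avg * z = exp(-q) * (P @ z).

    Returns (z, avg_cost), where avg_cost = -log(Z_avg). Iteration stops
    once the average-cost estimate changes by less than tol.
    """
    g = np.exp(-np.asarray(q, dtype=float))
    z = np.ones(len(g))
    avg_cost = np.inf
    for _ in range(max_iter):
        z_new = g * (P @ z)
        z_avg = z_new.sum() / z.sum()        # eigenvalue estimate
        z = z_new / z_new.sum() * len(g)     # z is only defined up to scale
        new_cost = -np.log(z_avg)
        if abs(new_cost - avg_cost) < tol:
            return z, new_cost
        avg_cost = new_cost
    return z, avg_cost


def optimal_transitions(P, z):
    """Reweight the passive dynamics by z: u*(s'|s) is proportional to p(s'|s) * z(s')."""
    weighted = P * z[None, :]
    return weighted / weighted.sum(axis=1, keepdims=True)


# Hypothetical passive log over three states; state 1 is the cheapest to occupy.
transitions = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 2), (2, 1), (2, 2)]
P = estimate_passive_dynamics(transitions, n_states=3)
q = [1.0, 0.2, 1.0]

z, avg_cost = solve_linearized_bellman(P, q)
policy = optimal_transitions(P, z)
print("estimated average cost:", avg_cost)
print("controlled transition matrix:")
print(policy)
```

In this toy setting the controlled transitions shift probability toward the low-cost state relative to the passive dynamics, and no transition is ever tried that was not already present in the log, which is the sense in which no active exploration is needed.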

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for autonomously controlling a vehicle to perform a vehicle operation, the method comprising steps of: applying a passive actor-critic reinforcement learning method to passively-collected data relating to the vehicle operation, to adapt an existing control policy so as to enable control of the vehicle by the control policy so as to perform the vehicle operation with a minimum expected cumulative cost, the step of applying a passive actor-critic reinforcement learning method to passively-collected data including steps of:
  a) in a critic network, estimating a Z-value and an average cost under an optimal control policy using samples of the passively collected data;
  b) in an actor network operatively coupled to the critic network, revising the control policy using samples of the passively collected data, the estimated Z-value, and the estimated average cost under an optimal control policy from the critic network; and
  c) iteratively repeating steps (a)-(b) until the estimated average cost converges; and
controlling the vehicle in accordance with the adapted control policy to perform the vehicle operation.

2. The method of claim 1 wherein the vehicle operation is an operation for merging the vehicle into a traffic lane between a second vehicle and a third vehicle traveling in the traffic lane, and wherein the control policy is configured for controlling the vehicle to merge the vehicle midway between the second vehicle and the third vehicle.

3. The method of claim 1 wherein the Z-value is estimated using a linearized version of a Bellman equation.

4. The method of claim 1 wherein the step of estimating the average cost under an optimal policy comprises the step of, prior to the step of revising the control policy, updating the average cost.

5. The method of claim 1 wherein the step of estimating a Z-value comprises the steps of: approximating a Z-value function using a linear combination of weighted radial basis functions; and approximating a Z-value using the approximated Z-value function and samples of the passively-collected data.

6. The method of claim 5 wherein the step of approximating a Z-value function using a linear combination of weighted radial basis functions comprises the step of optimizing weights used in the weighted radial basis functions.

7. The method of claim 6 wherein the step of approximating a Z-value function using a linear combination of weighted radial basis functions comprises the step of, prior to the step of optimizing the weights, updating the weights used in the weighted radial basis functions.

8. The method of claim 1 wherein the step of revising the control policy comprises steps of: approximating a control gain; optimizing the control gain to provide an optimized control gain; and revising the control policy using the optimized control gain.

9. The method of claim 8 further comprising the steps of, prior to optimizing the control gain: determining a control input; and determining a value of an action-value function using the control input, samples of the passively-collected data, and the approximated control gain.

10. The method of claim 8 wherein the step of approximating a control gain comprises the step of approximating the control gain using a linear combination of weighted radial basis functions.

11. The method of claim 10 further comprising the step of, prior to the step of approximating the control gain using a linear combination of weighted radial basis functions, updating weights used in the weighted radial basis functions.

12. A computer-implemented method for optimizing a control policy usable for controlling a system to perform an operation, the method comprising steps of: providing a control policy usable for controlling the system; and applying a passive actor-critic reinforcement learning method to passively-collected data relating to the operation to be performed, to revise the control policy such that the control policy is operable to control the system to perform the operation with a minimum expected cumulative cost, wherein the step of applying a passive actor-critic reinforcement learning method to passively-collected data includes steps of:
  a) in a critic network, estimating a Z-value using samples of the passively-collected data, and estimating an average cost under an optimal policy using samples of the passively-collected data;
  b) in an actor network, revising the control policy using samples of the passively-collected data, a control dynamics for the system, a cost-to-go, and a control gain;
  c) updating parameters used in revising the control policy and in estimating the Z-value and the average cost under an optimal policy; and
  d) iteratively repeating steps (a)-(c) until the estimated average cost converges.

13. A computing system configured for optimizing a control policy usable for autonomously controlling a vehicle to perform a vehicle operation, the computing system including one or more processors for controlling operation of the computing system, and a memory for storing data and program instructions usable by the one or more processors, wherein the memory is configured to store computer code that, when executed by the one or more processors, causes the one or more processors to:
  a) receive passively-collected data relating to the vehicle operation;
  b) determine a Z-value function usable for estimating a cost-to-go for the vehicle;
  c) in a critic network in the computing system:
    c1) determine a Z-value using the Z-value function and samples of the passively-collected data;
    c2) estimate an average cost under an optimal policy using samples of the passively-collected data
  d) in an actor network in the computing system, revise the control policy using samples of the passively-collected data; a control dynamics for the vehicle; a cost-to-go, and a control gain; and
  e) iteratively repeat steps (c) and (d) until the estimated average cost converges.
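
Claims 5, 6, and 10 describe approximating a function as a linear combination of weighted radial basis functions and optimizing the weights. The following is a minimal sketch of that general idea only, assuming Gaussian basis functions and an ordinary ridge-regularized least-squares fit to sample targets. The patent's own weight optimization is not reproduced on this page, and the function names, centers, width, and the stand-in target values are hypothetical.

```python
import numpy as np


def rbf_features(states, centers, width):
    """Gaussian RBF activations; states has shape (n, d), result has shape (n, n_centers)."""
    states = np.atleast_2d(states)
    sq_dist = ((states[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * width ** 2))


def fit_weights(states, targets, centers, width, ridge=1e-8):
    """Least-squares weights so that rbf_features(states) @ weights approximates targets."""
    phi = rbf_features(states, centers, width)
    gram = phi.T @ phi + ridge * np.eye(centers.shape[0])
    return np.linalg.solve(gram, phi.T @ targets)


def approx_value(state, weights, centers, width):
    """Evaluate the weighted RBF combination at a single state of shape (d,)."""
    return float(rbf_features(state, centers, width)[0] @ weights)


# Hypothetical one-dimensional example: fit a smooth positive stand-in function.
centers = np.linspace(-1.0, 1.0, 7).reshape(-1, 1)
width = 0.4
states = np.linspace(-1.0, 1.0, 50).reshape(-1, 1)
targets = np.exp(-states[:, 0] ** 2)   # stand-in values, not patent data

weights = fit_weights(states, targets, centers, width)
fitted = rbf_features(states, centers, width) @ weights
print("max fit error:", np.abs(fitted - targets).max())
print("value at 0.3:", approx_value(np.array([0.3]), weights, centers, width))
```

The same feature map could in principle be reused for a second weight vector, for example one representing a control gain as in claim 10, since only the weights differ between the two approximations.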

Assignees

Inventors

Classifications

  • Combinations of networks · CPC title

  • based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO] · CPC title

  • Reinforcement learning · CPC title

  • Feedforward networks · CPC title

  • Physics · mapped topic

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10061316B2 cover?
A computer-implemented method is provided for autonomously controlling a vehicle to perform a vehicle operation. The method includes steps of applying a passive actor-critic reinforcement learning method to passively-collected data relating to the vehicle operation, to learn a control policy configured for controlling the vehicle so as to perform the vehicle operation with a minimum expected cu…
Who is the assignee on this patent?
Toyota Eng & Mfg North America
What technology area does this patent fall under?
Primary CPC classification G05D1/0088. Mapped technology areas include Physics.
When was this patent published?
Publication date Aug 28, 2018 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).