Automatic Navigation Using Deep Reinforcement Learning
US-2019299978-A1 · Oct 3, 2019 · US
US10940863B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10940863-B2 |
| Application number | US-201816177834-A |
| Country | US |
| Kind code | B2 |
| Filing date | Nov 1, 2018 |
| Priority date | Nov 1, 2018 |
| Publication date | Mar 9, 2021 |
| Grant date | Mar 9, 2021 |
Official abstract text for this publication.
Systems and methods are provided that employ spatial and temporal attention-based deep reinforcement learning of hierarchical lane-change policies for controlling an autonomous vehicle. An actor-critic network architecture includes an actor network that processes image data received from an environment to learn the lane-change policies as a set of hierarchical actions, and a critic network that evaluates the lane-change policies to calculate loss and gradients to predict an action-value function (Q) that is used to drive learning and update parameters of the lane-change policies. The actor-critic network architecture implements a spatial attention module to select relevant regions in the image data that are of importance, and a temporal attention module to learn temporal attention weights to be applied to past frames of image data to indicate relative importance in deciding which lane-change policy to select.
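The two attention steps named in the abstract can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the additive (tanh) scoring form, the parameter names `W_r`, `W_h`, `v`, and `w_t`, and all array shapes are assumptions chosen for illustration. The sketch only shows the general pattern of scoring items, normalizing the scores with a softmax, and taking a weighted sum.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)


def spatial_attention(regions, h_prev, W_r, W_h, v):
    """Weight image-region vectors by importance and sum them.

    regions: (K, D) region vectors taken from a CNN feature map
    h_prev:  (H,)   hidden state from the previous time step
    W_r (D, A), W_h (H, A), v (A,): hypothetical attention parameters
    """
    # Additive scoring is an assumption, not taken from the patent text.
    scores = np.tanh(regions @ W_r + h_prev @ W_h) @ v  # shape (K,)
    weights = softmax(scores)                            # importance weights
    context = weights @ regions                          # weighted sum, (D,)
    return context, weights


def temporal_attention(outputs, w_t):
    """Weight per-frame recurrent outputs by importance and sum them.

    outputs: (T, H) one output vector per past frame
    w_t:     (H,)   hypothetical scoring vector
    """
    weights = softmax(outputs @ w_t)  # one weight per frame, shape (T,)
    combined = weights @ outputs      # combined context vector, (H,)
    return combined, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    regions = rng.normal(size=(6, 8))   # 6 regions, 8 features each
    h_prev = rng.normal(size=(4,))
    W_r, W_h, v = rng.normal(size=(8, 5)), rng.normal(size=(4, 5)), rng.normal(size=(5,))
    context, region_weights = spatial_attention(regions, h_prev, W_r, W_h, v)

    outputs = rng.normal(size=(3, 4))   # 3 past frames
    combined, frame_weights = temporal_attention(outputs, rng.normal(size=(4,)))
    print(context.shape, combined.shape)
```

In both functions the weights are non-negative and sum to one, so the context vector is a convex combination of its inputs. In a trained network the scoring parameters would be learned rather than sampled at random.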
Opening claim text (preview).
What is claimed is:
1. A method for learning lane-change policies via an actor-critic network architecture, wherein each lane-change policy describes one or more actions selected to be taken by an autonomous vehicle, the method comprising: processing, via an actor network over time, image data received from an environment to learn the lane-change policies as a set of hierarchical actions, wherein the lane-change policies each comprise a high-level action and associated low-level actions, wherein the high-level actions comprise: a left lane-change, lane following, and a right lane-change, and wherein each of the associated low-level actions comprises a steering angle command parameter and an acceleration-brake rate parameter; and predicting action values via an action-value function at a critic network; evaluating, via the critic network, a lane-change policy; calculating, via the critic network, loss and gradients to drive learning and update the critic network; wherein processing via the actor network at each particular time step comprises: processing, at a convolutional neural network (CNN) of the actor network, the image data to generate a feature map that comprises a machine-readable representation of the driving environment that includes features of the environment acquired at the particular time step; processing, at a spatial attention module of the actor network, the feature map to select relevant regions in the image data that are of importance to focus on for computing actions when making lane-changes while driving; learning, at the spatial attention module, importance weights for each of the relevant regions of the image data; applying, at the spatial attention module, the learned importance weights to each of the relevant regions of the image data to add importance to the relevant regions of the image data; generating, at the spatial attention module, a spatial context vector; and processing, at a temporal attention module of the actor network, the spatial context vector to learn temporal attention weights that are applied to past frames of image data to indicate relative importance of the past frames; generating, at the temporal attention module, a combined context vector; and processing, via at least one fully connected layer, the combined context vector to generate the set of hierarchical actions.
2. The method according to claim 1, wherein processing, via the actor network over time, the image data received from the environment, comprises: processing the image data received from the environment to learn the lane-change policies as the set of hierarchical actions that are represented as a vector of a probability of action choices and a first set of parameters coupled to each discrete hierarchical action, and wherein predicting the action values via the action-value function at the critic network, comprises: predicting action values via the action-value function at the critic network using a second set of parameters, wherein the action-value function is represented as a neural network using the second set of parameters; wherein evaluating, via the critic network, the lane-change policy, comprises: evaluating, via the critic network based on transitions generated by the actor network, the lane-change policy, wherein the transitions comprise the image data, the hierarchical actions, rewards, and next image data generated by the actor network.
3. The method according to claim 2, wherein the calculating, via the critic network, the loss and the gradients to drive learning and update the critic network, comprises: calculating, via the critic network, loss and gradients to drive learning and update the second set of parameters of the critic network, wherein the calculating, via the critic network, comprises: processing, at the critic network during a back-propagation mode, an obtained mini-batch of transitions comprising the image data, the hierarchical actions, rewards, next image data generated by the actor network; computing, at the critic network, first gradients of the action-value function by differentiating a loss of the critic network with respect to the second set of parameters, wherein the first gradients are gradients of an error in predicting the action-value function with respect to the second set of parameters, wherein the first gradients are to be used for updating for the second set of parameters of the critic network; updating the second set of parameters at the critic network based on the first gradients; computing, at the critic network, second gradients of the action-value function with respect to the hierarchical actions generated by the actor network by differentiating a loss of the critic network with respect to the hierarchical actions taken by the actor network; and further comprising: back-propagating the second gradients to the actor network; processing the second gradients at the actor network along with third gradients generated by the actor network to update the first set of parameters, wherein the third gradients are generated by differentiating a loss of the actor network with respect to the hierarchical actions taken by the actor network.
4. The method according to claim 1, wherein the spatial attention module comprises: an attention network comprising at least one fully connected layer in which each neuron receives input from all activations of a previous layer; and an activation function coupled to the fully connected layer that converts values into action probabilities, and wherein a set of region vectors are extracted from the feature map by the CNN, wherein each region vector corresponds to a different feature layer of features extracted from a different image region of the image data by the CNN; and wherein learning, at the spatial attention module, importance weights for each of the relevant regions of the image data, comprises: applying, at the attention network, the set of region vectors along with a previous hidden state vector that was generated by an LSTM network during a past time step, to learn an importance weight for each region vector of the set of region vectors; wherein applying, at the spatial attention module, the learned importance weights to each of the relevant regions of the image data to add importance to the relevant regions of the image data, comprises: applying, at the attention network, the learned importance weights to each region vector of the set of region vectors to add importance to each region vector of the set of region vectors in proportion to importance of that region vector as learned by the attention network, and wherein generating, at the spatial attention module, the spatial context vector, comprises: generating, at the attention network, the spatial context vector that is a lower dimensional weighted version of the set of the region vectors that is represented by a weighted sum of all of the set of the region vectors.
5. The method according to claim 4, wherein the spatial attention module and the temporal attention module each comprise: a Long Short-Term Memory (LSTM) network of LSTM cells, wherein each LSTM cell processes input data sequentially and keeps a hidden state of that input data through time, and wherein the processing, at the temporal attention module of the actor network, the spatial context vector to learn temporal attention weights to be applied to past frames of image data to indicate relative importance in deciding which lane-change policy to select, comprises: processing, at the LSTM network at each time step, the spatial context vector for that time step and the previous hidden state vector that was generated by the LSTM network during the past time step to generate an LSTM output; learning, at the LSTM network, a temporal attention weight for each LSTM output at each time
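The hierarchical action structure recited in claims 1 and 2 (a probability vector over three high-level actions, with a steering angle and an acceleration-brake rate attached to each) might be represented as in the following sketch. The names `HierarchicalAction` and `select_action`, the array layout, and the greedy arg-max selection rule are illustrative assumptions; the claims do not specify how an action is chosen from the probabilities.

```python
from dataclasses import dataclass

import numpy as np

# High-level actions as listed in claim 1; the string labels are hypothetical.
HIGH_LEVEL_ACTIONS = ("left_lane_change", "lane_following", "right_lane_change")


@dataclass
class HierarchicalAction:
    high_level: str
    steering_angle: float    # low-level steering angle command parameter
    accel_brake_rate: float  # low-level acceleration-brake rate parameter


def select_action(probs, params):
    """Pick the most probable high-level action and its low-level parameters.

    probs:  length-3 probabilities, one per high-level action
    params: (3, 2) array; row i holds (steering, accel_brake) for action i
    Greedy selection is an assumption made for this sketch.
    """
    probs = np.asarray(probs, dtype=float)
    params = np.asarray(params, dtype=float)
    if probs.shape != (3,) or params.shape != (3, 2):
        raise ValueError("expected probs of shape (3,) and params of shape (3, 2)")
    idx = int(np.argmax(probs))
    return HierarchicalAction(
        high_level=HIGH_LEVEL_ACTIONS[idx],
        steering_angle=float(params[idx, 0]),
        accel_brake_rate=float(params[idx, 1]),
    )


if __name__ == "__main__":
    action = select_action(
        probs=[0.1, 0.7, 0.2],
        params=[[-0.3, 0.1], [0.0, 0.2], [0.3, -0.1]],
    )
    print(action)
```

During training, a policy would more likely sample from the probabilities or add exploration noise instead of always taking the arg-max; the greedy rule here just keeps the example deterministic.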
Combinations of networks · CPC title
Activation functions · CPC title
Recurrent networks, e.g. Hopfield networks · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Reinforcement learning · CPC title