Hybrid reinforcement learning for autonomous driving
US-2020150672-A1 · May 14, 2020 · US
US12444244B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12444244-B2 |
| Application number | US-202218145557-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 22, 2022 |
| Priority date | Jun 23, 2020 |
| Publication date | Oct 14, 2025 |
| Grant date | Oct 14, 2025 |
Official abstract text for this publication.
The present disclosure relates to driving decision-making methods, apparatuses, and chips. One example method includes building a Monte Carlo tree based on a current driving environment state, where the Monte Carlo tree includes a root node and N−1 non-root nodes, each node represents one driving environment state, and a driving environment state represented by any non-root node is predicted by a stochastic model of driving environments. Based on at least one of an access count or a value function of each node in the Monte Carlo tree, a node sequence that starts from the root node and ends at a leaf node is determined, and a driving action sequence is determined based on a driving action corresponding to each node in the node sequence.
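The node-sequence step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Node` class, the state and action labels, and all numbers are hypothetical, and the tree is built by hand rather than expanded with a stochastic environment model. The selection order follows the "maximum access count first, maximum value function next" rule named in claim 5.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    state: str                      # driving environment state (a label in this toy)
    action: Optional[str] = None    # driving action that led to this node
    visits: int = 1                 # access count (initial access count is 1)
    value: float = 0.0              # value function estimate
    children: List["Node"] = field(default_factory=list)


def best_node_sequence(root: Node) -> List[Node]:
    """Walk from the root to a leaf, preferring the child with the highest
    access count and breaking ties by the highest value function."""
    sequence = [root]
    node = root
    while node.children:
        node = max(node.children, key=lambda c: (c.visits, c.value))
        sequence.append(node)
    return sequence


def action_sequence(sequence: List[Node]) -> List[str]:
    """Driving actions along the path; the root has no incoming action."""
    return [n.action for n in sequence[1:]]


# Hand-built toy tree. Assumption: each access count is 1 plus the sum of
# the children's access counts; the value numbers are invented.
root = Node("s0", visits=6, value=0.4, children=[
    Node("s1", "keep_lane", visits=3, value=0.5, children=[
        Node("s3", "accelerate", visits=1, value=0.7),
        Node("s4", "brake", visits=1, value=0.2),
    ]),
    Node("s2", "change_left", visits=2, value=0.9),
])

path = best_node_sequence(root)
states = [n.state for n in path]
actions = action_sequence(path)
print(states)   # ['s0', 's1', 's3']
print(actions)  # ['keep_lane', 'accelerate']
```

In this toy tree the first level is decided by access count (`s1` beats `s2` despite its lower value), and the second level is a tie on access count that the value function resolves. Per claim 1, only the first action of the resulting sequence would be executed before the environment is observed again.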
Opening claim text (preview).
What is claimed is:

1. A driving decision-making method, comprising:
obtaining, by an autonomous driving vehicle, information of a current driving environment state of the autonomous driving vehicle, wherein the autonomous driving vehicle includes one or more sensors;
constructing, by the autonomous driving vehicle, a Monte Carlo tree based on the current driving environment state, wherein the Monte Carlo tree comprises N nodes, each node represents a corresponding driving environment state, the N nodes comprise a root node and N−1 non-root nodes, the root node represents the current driving environment state, a first driving environment state represented by a first node is predicted by using a stochastic model of driving environments based on a second driving environment state represented by a parent node of the first node and based on a driving action, the driving action is determined by the parent node of the first node in a process of obtaining the first node through expansion, the first node is any node of the N−1 non-root nodes, and N is a positive integer greater than or equal to 2;
determining, by the autonomous driving vehicle, in the Monte Carlo tree based on at least one of an access count or a value function of each node in the Monte Carlo tree, a node sequence, wherein the node sequence comprises a plurality of nodes that starts from the root node and ends at a leaf node;
in response to determining the node sequence, determining, by the autonomous driving vehicle, a driving action sequence of a plurality of future driving steps, wherein each future driving step in the driving action sequence comprises a driving action corresponding to each node comprised in the node sequence, and wherein the driving action sequence is used by the autonomous driving vehicle for driving decision-making;
autonomously driving, by the autonomous driving vehicle, the autonomous driving vehicle based on a first driving action in the driving action sequence;
obtaining, by the autonomous driving vehicle, an actual driving environment state after the first driving action is executed; and
updating, by the autonomous driving vehicle, the stochastic model of driving environments based on the current driving environment state, the first driving action, and the actual driving environment state,
wherein the access count of each node is determined based on access counts of subnodes of the each node and an initial access count of the each node, the value function of the each node is determined based on value functions of subnodes of the each node and an initial value function of the each node, the initial access count of the each node is 1, and the initial value function of the each node is determined based on a value function that matches the corresponding driving environment state represented by the each node.

2. The method according to claim 1, wherein that the first driving environment state represented by the first node is predicted by using the stochastic model of driving environments based on the second driving environment state represented by the parent node of the first node and based on the driving action comprises:
predicting, through dropout-based forward propagation by using the stochastic model of driving environments, a probability distribution of a driving environment state after the driving action is executed based on the second driving environment state represented by the parent node of the first node; and
obtaining the first driving environment state represented by the first node through sampling from the probability distribution.

3. The method according to claim 1, wherein that the initial value function of the node is determined based on the value function that matches the driving environment state represented by the node comprises:
selecting, from an episodic memory, a first quantity of target driving environment states that have a highest matching degree with the driving environment state represented by the node; and
determining the initial value function of the node based on value functions respectively corresponding to the first quantity of target driving environment states.

4. The method according to claim 3, wherein the method further comprises:
when a driving episode ends, determining a cumulative reward return value corresponding to an actual driving environment state after each driving action in the driving episode is executed; and
updating the episodic memory by using, as a value function corresponding to the actual driving environment state, the cumulative reward return value corresponding to the actual driving environment state after each driving action is executed.

5. The method according to claim 1, wherein the node sequence is determined:
based on the access count of the each node in the Monte Carlo tree according to a maximum access count rule;
based on the value function of the each node in the Monte Carlo tree according to a maximum value function rule; or
based on the access count and the value function of the each node in the Monte Carlo tree according to a “maximum access count first, maximum value function next” rule.

6. The method according to claim 1, wherein obtaining, by the autonomous driving vehicle, information of the current driving environment state of the autonomous driving vehicle comprises: receiving, from a vehicle velocity sensor of the autonomous driving vehicle, a velocity of the autonomous driving vehicle.

7. The method according to claim 1, obtaining, by the autonomous driving vehicle, information of the current driving environment state of the autonomous driving vehicle comprises: receiving, from an acceleration sensor of the autonomous driving vehicle, an acceleration of the autonomous driving vehicle.

8. The method according to claim 1, wherein obtaining, by the autonomous driving vehicle, information of the current driving environment state of the autonomous driving vehicle comprises: receiving, from a distance sensor of the autonomous driving vehicle, a relative distance between the autonomous driving vehicle and another vehicle.

9. A driving decision-making apparatus in an autonomous driving vehicle, comprising:
at least one processor; and
one or more memories coupled to the at least one processor and storing programming instructions for execution by the at least one processor to:
obtain information of a current driving environment state of the autonomous driving vehicle, wherein the autonomous driving vehicle includes one or more sensors;
construct a Monte Carlo tree based on the current driving environment state, wherein the Monte Carlo tree comprises N nodes, each node represents a corresponding driving environment state, the N nodes comprise a root node and N−1 non-root nodes, the root node represents the current driving environment state, a first driving environment state represented by a first node is predicted by using a stochastic model of driving environments based on a second driving environment state represented by a parent node of the first node and based on a driving action, the driving action is determined by the parent node of the first node in a process of obtaining the first node through expansion, the first node is any node of the N−1 non-root nodes, and N is a positive integer greater than or equal to 2;
determine, in the Monte Carlo tree based on at least one of an access count or a value function of each node in the Monte Carlo tree, a node sequence, wherein the node sequence comprises a plurality of nodes that starts from the root node and ends at a leaf node;
in response to determining the node sequence, determine a driving action sequence of a plurality of future driving steps, wherein each driving future step in the driving action sequenc
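The episodic-memory steps in claims 3 and 4 can be illustrated with a small sketch. The claims do not say how the "matching degree" is measured or how the selected value functions are combined, so the code below assumes Euclidean distance and a plain mean. The state layout (velocity, acceleration, relative distance, echoing claims 6 to 8), the function names, and the discount factor `gamma` are hypothetical.

```python
import math
from typing import List, Tuple

State = Tuple[float, ...]          # assumed layout: (velocity, acceleration, gap)
Memory = List[Tuple[State, float]]  # (driving environment state, value function)


def initial_value(state: State, memory: Memory, k: int = 2) -> float:
    """Initial value of a new node: mean value of the k stored states that
    best match `state` (assumption: best match = smallest Euclidean distance)."""
    if not memory:
        return 0.0  # assumption: neutral value when nothing is stored yet
    nearest = sorted(memory, key=lambda item: math.dist(state, item[0]))[:k]
    return sum(value for _, value in nearest) / len(nearest)


def episode_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Cumulative reward return from each step to the end of the episode."""
    returns: List[float] = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    return returns[::-1]


def update_memory(memory: Memory, states: List[State], rewards: List[float]) -> None:
    """At episode end, store each actual state with its return as its value."""
    for state, ret in zip(states, episode_returns(rewards)):
        memory.append((state, ret))


# Toy data: every number here is invented for illustration.
memory: Memory = [
    ((10.0, 0.0, 30.0), 1.0),
    ((10.0, 0.0, 28.0), 0.5),
    ((25.0, 1.0, 5.0), -2.0),
]
v0 = initial_value((10.0, 0.0, 29.0), memory, k=2)
print(v0)                                # 0.75
print(episode_returns([1.0, 0.0, 2.0]))  # [3.0, 2.0, 2.0]

update_memory(memory, [(12.0, 0.5, 25.0)], [1.5])
print(len(memory))                       # 4
```

A real system would likely replace the linear scan with an approximate nearest-neighbour index and a learned state embedding; the sketch only shows the data flow from stored returns to a node's initial value function.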
Method for the design of a control system · CPC title
Historical data · CPC title
Mathematical models, e.g. for simulation · CPC title
State machine analysis · CPC title
Reinforcement learning · CPC title