Systems and methods for vision-language planning (VLP) foundation models for autonomous driving

US12528507B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12528507-B2
Application number  US-202318388606-A
Country             US
Kind code           B2
Filing date         Nov 10, 2023
Priority date       Nov 10, 2023
Publication date    Jan 20, 2026
Grant date          Jan 20, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods and systems for training an autonomous driving system using a vision-language planning (VLP) model. Image data is obtained from a vehicle-mounted camera, encompassing details about agents situated within the external environment. Via image processing, the system identifies these agents within the environment. A Bird's Eye View (BEV) representation of the surroundings is then generated, encapsulating the spatiotemporal information linked to the vehicle and the recognized agents. Execution of the VLP machine learning model begins by extracting vision-based planning features from the BEV, and receiving or generating textual information characterizing various attributes of the vehicle within the environment. Text-based planning features are extracted from this textual information. To enhance model performance, a contrastive learning model is engaged to establish similarities between the vision-based and text-based planning features, and a predicted trajectory is output based on the similarities.
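The contrastive step summarized in the abstract can be illustrated with a minimal sketch. The text on this page does not state which loss function is used, so the symmetric cross-entropy over temperature-scaled dot products below is an assumption, as are the function names and the `temperature` value. Matched vision/text pairs are assumed to share the same index in each batch.

```python
import math


def normalize(v):
    """Scale a vector to unit length so a dot product becomes a cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(vision_feats, text_feats):
    """Dot product between every vision feature vector and every text feature vector."""
    v = [normalize(x) for x in vision_feats]
    t = [normalize(x) for x in text_feats]
    return [[sum(a * b for a, b in zip(vi, tj)) for tj in t] for vi in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def contrastive_loss(vision_feats, text_feats, temperature=0.07):
    """Assumed symmetric loss: pull matched pairs together, push mismatched pairs apart."""
    sim = similarity_matrix(vision_feats, text_feats)
    logits = [[s / temperature for s in row] for row in sim]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))
```

Calling `contrastive_loss(vision, text)` on two equally sized batches returns a small value when each vision vector is most similar to the text vector at the same index, and a large value when the pairing is scrambled.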

First claim

Opening claim text (preview).

What is claimed is:

1. A method of training an autonomous driving system utilizing a vision-language planning (VLP) machine learning model, the method comprising: receiving image data generated from a camera mounted to a vehicle, wherein the image data includes agents in an environment outside the vehicle; via image processing, detecting the agents in the environment based on the image data; generating a bird eye view (BEV) of the environment based on the image data, wherein the BEV includes spatiotemporal information associated with the vehicle and the detected agents; and executing a vision-language planning (VLP) machine learning model to: extract vision-based planning features from the BEV, wherein the vision-based planning features include the spatiotemporal information associated with the vehicle, generate text information associated with the environment, wherein the text information describes qualities of the vehicle in the environment; extract text-based planning features from the text information, execute a contrastive learning model to derive similarities between the vision-based planning features and the text-based planning features, and generate a predicted trajectory of the vehicle based on the similarities.

2. The method of claim 1, further comprising: determining a loss between the predicted trajectory of the vehicle and a ground truth trajectory of the vehicle; and repeat the steps of claim 1 until convergence to minimize the loss.

3. The method of claim 1, wherein the text information is generated using a template and ground truth information associated with the environment existent in training data.

4. The method of claim 1, wherein an output of the contrastive learning is used as a training loss for training the VLP machine learning model.

5. The method of claim 1, wherein the contrastive learning model includes: a text encoder configured to output a text-based vector representing text-based features associated with the text information of the environment; and an image encoder configured to output an image-based vector representing image-based features associated with the detected agents in the BEV.

6. The method of claim 5, wherein the contrastive learning model is further configured to execute a dot product to evaluate similarities between the text-based vector and the image-based vector.

7. The method of claim 1, wherein the contrastive learning model is further configured to push apart dissimilarities between the vision-based planning features and the text-based planning features.

8. The method of claim 1, wherein the agents in the environment include at least one of a pedestrian, another vehicle, or a cyclist.

9. A system utilizing a vision-language planning (VLP) machine learning model, the system comprising: a camera mounted to a vehicle and configured to generate image data associated with agents in an environment outside the vehicle; a processor; and memory including instructions that, when executed by the processor, cause the processor to: process the image data to detect the agents in the environment, generate a bird eye view (BEV) of the environment based on the image data, wherein the BEV includes spatiotemporal information associated with the vehicle and the detected agents, and execute a vision-language planning (VLP) machine learning model to: extract vision-based planning features from the BEV, wherein the vision-based planning features include the at least some of the spatiotemporal information associated with the vehicle, receive text information associated with the environment, wherein the text information describes qualities of the vehicle in the environment, extract text-based planning features from the text information, execute a contrastive learning model to derive similarities between the vision-based planning features and the text-based planning features, and generate a predicted trajectory of the vehicle based on the similarities.

10. The system of claim 9, wherein the memory includes further instructions that, when executed by the processor, cause the processor to: determine a loss between the predicted trajectory of the vehicle and a ground truth trajectory of the vehicle; and execute the VLP model until convergence to minimize the loss.

11. The system of claim 9, wherein the text information is generated using a template and ground truth information associated with the environment existent in training data.

12. The system of claim 9, wherein an output of the contrastive learning is used as a training loss for training the VLP machine learning model.

13. The system of claim 9, wherein the contrastive learning model includes: a text encoder configured to output a text-based vector representing text-based features associated with the text information of the environment; and an image encoder configured to output an image-based vector representing image-based features associated with the detected agents in the BEV.

14. The system of claim 13, wherein the contrastive learning model is further configured to execute a dot product to evaluate similarities between the text-based vector and the image-based vector.

15. The system of claim 9, wherein the contrastive learning model is further configured to push apart dissimilarities between the vision-based planning features and the text-based planning features.

16. The system of claim 9, wherein the agents in the environment include at least one of a pedestrian, another vehicle, or a cyclist.

17. A method of training an autonomous driving system, the method comprising: receiving image data generated from a camera mounted to a vehicle, wherein the image data includes agents in an environment outside the vehicle; generating a bird eye view (BEV) of the environment based on the image data, wherein the BEV includes spatiotemporal information associated with the vehicle and the agents; based on the BEV, executing a perception model to detect the agents in the environment and associated information about the detected agents; based on the BEV, executing a prediction model to estimate trajectories of the detected agents; based on the BEV, executing a vision-language planning (VLP) model to output a predicted trajectory of the vehicle, wherein the VLP model is configured to: extract vision-based planning features from the BEV, wherein the vision-based planning features include the spatiotemporal information associated with the vehicle, receive text information associated with the environment, wherein the text information describes qualities of one or more of the agents in the environment, extract text-based planning features from the text information, perform contrastive learning to derive similarities between the vision-based planning features and the text-based planning features, and output the predicted trajectory based on the similarities.

18. The method of claim 17, wherein the text information is generated using a template and ground truth information associated with the environment existent in training data.

19. The method of claim 17, wherein an output of the contrastive learning is used as a training loss for training the VLP machine learning model.

20. The method of claim 17, wherein the agents in the environment include at least one of a pedestrian, another vehicle, or a cyclist.
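Two of the dependent claims lend themselves to a short illustration: generating text from a template plus ground-truth information (claims 3, 11, and 18) and computing a loss between predicted and ground-truth trajectories (claims 2 and 10). The sketch below is only one possible reading. The `AgentLabel` fields, the template wording, and the choice of mean Euclidean waypoint distance as the loss are all hypothetical; the claims do not specify any of them.

```python
import math
from dataclasses import dataclass


@dataclass
class AgentLabel:
    # Hypothetical ground-truth fields; the claims do not list specific ones.
    category: str
    x_m: float  # longitudinal offset from the ego vehicle, in meters
    y_m: float  # lateral offset from the ego vehicle, in meters


# Hypothetical template; the actual wording used in training is not given here.
TEMPLATE = "A {category} is located {x:.1f} m ahead and {y:.1f} m to the left of the ego vehicle."


def describe(agent):
    """Fill the template with ground-truth values to produce training text."""
    return TEMPLATE.format(category=agent.category, x=agent.x_m, y=agent.y_m)


def trajectory_loss(predicted, ground_truth):
    """Mean Euclidean distance between matching (x, y) waypoints (an assumed loss)."""
    if len(predicted) != len(ground_truth):
        raise ValueError("trajectories must have the same number of waypoints")
    dists = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    ]
    return sum(dists) / len(dists)
```

For example, `describe(AgentLabel("pedestrian", 12.0, -1.5))` yields a sentence that a text encoder could consume, and `trajectory_loss` gives the scalar that a training loop would drive down over repeated iterations.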

Assignees

Inventors

Classifications

  • Image sensing, e.g. optical camera · CPC title

  • G06V20/56 · Primary

    exterior to a vehicle by using sensors mounted on the vehicle · CPC title

  • specially adapted for safety · CPC title

  • Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads · CPC title

  • using neural networks · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Robert Bosch GmbH
What technology area does this patent fall under?
Primary CPC classification G06V20/56. Mapped technology areas include Physics.
When was this patent published?
Publication date Jan 20, 2026 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).