Enriching feature maps using multiple pluralities of windows to generate bounding boxes
US-2024062520-A1 · Feb 22, 2024 · US
US12528501B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12528501-B2 |
| Application number | US-202318326922-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 31, 2023 |
| Priority date | May 31, 2023 |
| Publication date | Jan 20, 2026 |
| Grant date | Jan 20, 2026 |
Official abstract text for this publication.
Autonomous vehicles utilize perception and understanding of vehicles to predict behaviors of the vehicles, and to plan a trajectory. Understanding of attributes of vehicles may be improved through sensor fusion. Sensor fusion can be computationally expensive and may be difficult to implement in a real-time vehicle understanding system. To limit computational complexity while benefiting from machine learning across modalities, sensor fusion may be selectively implemented for a subset of task groups of a multi-task machine learning model. In some cases, part-based understanding may be implemented before fusion to limit the features being fused together to part features that are most salient for the task group. In addition, sensor data and features that may be fused together can be limited to sensor data and features within a desired field of view. A model that implements sensor fusion may be disabled for objects that are beyond a threshold distance.
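The gating idea in the abstract (run the fusion model only for tracked vehicles within a threshold distance) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `TrackedObject` type, the node names, and the 60 m threshold are all hypothetical, since the abstract gives no concrete values.

```python
from dataclasses import dataclass


@dataclass
class TrackedObject:
    """Hypothetical tracked-object record; field names are assumptions."""
    object_id: str
    classification: str
    distance_m: float


# Assumed threshold; the source only says "a threshold distance".
FUSION_MAX_DISTANCE_M = 60.0


def select_nodes(obj, max_fusion_distance_m=FUSION_MAX_DISTANCE_M):
    """Return the names of the model nodes that should process this object."""
    # Both nodes only handle tracked objects classified as vehicles.
    if obj.classification != "vehicle":
        return []
    # The single-modality node always runs for vehicles.
    nodes = ["single_modality_node"]
    # The more expensive fusion node is skipped for distant objects.
    if obj.distance_m <= max_fusion_distance_m:
        nodes.append("fusion_node")
    return nodes
```

Under these assumptions, a vehicle at 20 m is routed to both nodes, while a vehicle at 120 m is routed only to the single-modality node, which is where the compute saving would come from.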
Opening claim text (preview).
What is claimed is:

1. A vehicle comprising: sensors to generate first sensor data in a first modality and second sensor data in a second modality; one or more processors; and one or more storage media encoding instructions executable by the one or more processors to implement an understanding part, wherein the understanding part includes: a first node to output first inferences for a plurality of first task groups, the first node including: a first shared backbone to receive and process first sensor data corresponding to tracked objects having a vehicle classification; and task group specific heads to output first inferences for the first task groups; and a second node to output second inferences for a second task group, the second node including: a second backbone to receive and process second sensor data corresponding to tracked objects having the vehicle classification; a cross attention neural network to receive first machine learning features from the first shared backbone and second machine learning features from the second backbone; and heads downstream of the cross attention neural network to output inferences for the second task group; and the one or more storage media further encoding instructions for causing the vehicle to: extract a first set of machine learning features from first sensor data using the first backbone, and determine a set of first inferences based on the first set of machine learning features using the first backbone; extract machine learning features from a second set of sensor data using the second backbone, fuse the first set of machine learning features and the second set of machine learning features using the cross attention neural network and determining a second set of inferences from the fusion of first machine learning features and second machine learning features; planning a trajectory of the vehicle using the first inferences and the second inferences; and automatically implementing the planned trajectory by engaging at least one of a vehicle propulsion system, a braking system, and a steering system.

2. The vehicle of claim 1, wherein the first node further includes: a plurality of first temporal networks dedicated to respective first task groups.

3. The vehicle of claim 1, wherein the second node further includes: a second temporal network downstream of the cross attention neural network.

4. The vehicle of claim 1, wherein the second inferences comprise two or more vehicle open door attributes.

5. The vehicle of claim 1, wherein the first sensor data comprises image data generated by a camera, and second sensor data comprises point clouds generated by a light detection and ranging sensor.

6. The vehicle of claim 1, wherein the second inferences comprise two or more vehicle signal attributes.

7. The vehicle of claim 1, wherein the first sensor data comprises color channels image data generated by a camera, and second sensor data comprises signal channel image data generated by the camera.

8. The vehicle of claim 1, wherein the first sensor data comprises color image data generated by a first camera, and second sensor data comprises signal image data generated by a second camera.

9. The vehicle of claim 1, wherein the first task groups comprise two or more of: a first task group to extract an emergency vehicle classification, extract emergency vehicle subtype classifications, and extract one or more emergency vehicle flashing light attributes, a second task group to extract vehicle signal attributes, a third task group to extract school bus classification, extract one or more school bus flashing light attributes, and extract one or more school bus activeness attributes, a fourth task group to extract vehicle subtype classifications and extract one or more vehicle attributes, and a fifth task group to extract vehicle subtype classifications.

10. The vehicle of claim 1, wherein the cross attention neural network encodes attention relationships between the first machine learning features and the second machine learning features, and outputs fused machine learning features based on the attention relationships.

11. The vehicle of claim 1, wherein the first shared backbone comprises a part-based backbone to output global machine learning features per frame of the sensor data, and one or more part machine learning features per frame of the sensor data.

12. The vehicle of claim 11, wherein the part-based backbone further outputs one or more bounding boxes corresponding to the one or more part machine learning features.

13. The vehicle of claim 11, wherein the first machine learning features received by the cross attention neural network comprise one or more selected part machine learning features generated by the part-based backbone.

14. The vehicle of claim 11, wherein the first node further comprises one or more task group specific masking filters to mask the one or more part machine learning features.

15. The vehicle of claim 1, wherein the second node is deactivated and does not perform processing of sensor data corresponding to tracked objects that are beyond a threshold distance from the vehicle.

16. The vehicle of claim 1, wherein the first set of sensor data has a first modality, the second set of sensor data has a second modality, and the first modality is distinct from the second modality.
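Claims 1 and 10 describe a cross attention neural network that relates features from the first backbone to features from the second backbone and outputs fused features. The sketch below shows one generic way such a step can work, using plain scaled dot-product attention. It is an illustration under assumptions, not the claimed network: it has no learned query/key/value projections, and the residual addition and the function names are choices made here for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(first_features, second_features):
    """Fuse two feature sets with scaled dot-product cross attention.

    first_features:  list of vectors (queries), e.g. from a first backbone.
    second_features: list of vectors (keys and values), e.g. from a second backbone.
    All vectors are assumed to share the same dimension.
    """
    dim = len(first_features[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in first_features:
        # Attention relationship between this query and every second-modality feature.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in second_features]
        weights = softmax(scores)
        # Weighted sum of second-modality features.
        attended = [sum(w * value[i] for w, value in zip(weights, second_features))
                    for i in range(dim)]
        # Residual add (an assumption) keeps the first-modality information.
        fused.append([q + a for q, a in zip(query, attended)])
    return fused
```

In a trained model the queries, keys, and values would normally pass through learned linear projections first; the output here has one fused vector per first-modality feature, which is the shape that downstream heads would consume.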
Image sensing, e.g. optical camera · CPC title
using neural networks · CPC title
Data fusion · CPC title
Spatial relation or speed relative to objects · CPC title
using classification, e.g. of video objects · CPC title