Augmenting of driving scenarios using contrastive learning
US-2025156685-A1 · May 15, 2025 · US
US12597262B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12597262-B2 |
| Application number | US-202217899734-A |
| Country | US |
| Kind code | B2 |
| Filing date | Aug 31, 2022 |
| Priority date | Jul 13, 2022 |
| Publication date | Apr 7, 2026 |
| Grant date | Apr 7, 2026 |
Official abstract text for this publication.
A system includes a first sensor system of a first modality and a second sensor system of a second modality. The system further includes a computing system that is configured to detect and identify objects represented in sensor signals output by the first and second sensor systems. The computing system employs a hierarchical arrangement of transformers to fuse features of first sensor data output by the first sensor system and second sensor data output by the second sensor system.
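The hierarchy described in the abstract can be pictured as two per-modality stages feeding a third fusion stage. The following is a minimal wiring sketch only, not the patented implementation: the names `Detection`, `detect_and_fuse`, and the three `fake_*` stand-in models are hypothetical, and the stand-in fusion rule (averaging locations of same-identity pairs) is an assumption used purely to make the data flow runnable.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Detection:
    """One detected object: what it is and where it is (hypothetical type)."""
    identity: str
    location: Tuple[float, float]


# Hypothetical signatures: a per-modality model maps raw sensor data to
# detections; the fusion model maps two detection lists to one.
Detector = Callable[[object], List[Detection]]
Fuser = Callable[[List[Detection], List[Detection]], List[Detection]]


def detect_and_fuse(first_data: object, second_data: object,
                    first_model: Detector, second_model: Detector,
                    fusion_model: Fuser) -> List[Detection]:
    """Run both per-modality stages, then the fusion stage on their outputs."""
    first_output = first_model(first_data)      # e.g. from a camera image
    second_output = second_model(second_data)   # e.g. from a radar point cloud
    return fusion_model(first_output, second_output)


# Stand-in models (assumptions): real stages would be trained transformers.
def fake_camera_model(image: object) -> List[Detection]:
    return [Detection("car", (10.0, 2.0))]


def fake_radar_model(points: object) -> List[Detection]:
    return [Detection("car", (10.4, 2.2))]


def fake_fusion_model(first: List[Detection],
                      second: List[Detection]) -> List[Detection]:
    # Placeholder rule: average the locations of same-identity pairs.
    return [
        Detection(a.identity,
                  ((a.location[0] + b.location[0]) / 2,
                   (a.location[1] + b.location[1]) / 2))
        for a, b in zip(first, second)
        if a.identity == b.identity
    ]


fused = detect_and_fuse(None, None, fake_camera_model,
                        fake_radar_model, fake_fusion_model)
```

Passing the three models as parameters keeps the sketch agnostic about which modality plays the "first" or "second" role.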
Opening claim text (preview).
What is claimed is:

1. A system comprising: a first sensor system that generates first sensor data, the first sensor data corresponding to a first modality; a second sensor system that generates second sensor data, the second sensor data corresponding to a second modality; a computing system that is in communication with the first sensor system and the second sensor system, wherein the computing system comprises: a processor; and memory that stores computer-executable instructions that, when executed by the processor, cause the processor to perform acts comprising: generating, by a first transformer, a first output based upon the first sensor data, wherein the first output comprises identities of objects determined by the first transformer to be represented in the first sensor data and corresponding locations of the objects in the first sensor data; generating, by a second transformer, a second output based upon the second sensor data, wherein the second output comprises identities of objects determined by the second transformer to be represented in the second sensor data and corresponding locations of the objects in the second sensor data; and generating, by a third transformer, a third output based upon the first output and the second output, wherein the third transformer comprises an encoder and a decoder, wherein the encoder processes the second output and the decoder, using cross-attention and self-attention, processes the first output, wherein the cross-attention correlates objects between the first output and the second output, wherein the self-attention evaluates consistency of relationships between the correlated objects, and wherein the third output comprises identities of objects determined by the third transformer to be in an environment of the system and corresponding locations of the objects in the environment of the system.

2. The system of claim 1, wherein the first sensor system is a camera and the second sensor system is a radar sensor system.

3. The system of claim 1, wherein at least one of the first sensor system or the second sensor system is a lidar system.

4. The system of claim 1, the acts further comprising: extracting features from the first sensor data; and providing the features and positional encodings to the first transformer, wherein the first transformer generates the first output based upon the extracted features and the positional encodings.

5. The system of claim 4, the acts further comprising: extracting second features from the second sensor data; and providing the second features and second positional encodings to the second transformer, wherein the second transformer generates the second output based upon the extracted second features and the second positional encodings.

6. The system of claim 1, wherein the first sensor data is an image and the second sensor data is a point cloud.

7. The system of claim 1, wherein: the first transformer outputs the first output comprising first vectors, where each vector in the first vectors corresponds to a respective region in the first sensor data and each vector in the first vectors indicates a type of object predicted as being included in the region in the first sensor data; the second transformer outputs the second output comprising second vectors, wherein each vector in the second vectors corresponds to a respective region in the second sensor data and each vector in the second vectors indicates a type of object predicted as being included in the region in the second sensor data; and the third transformer receives the first vectors and the second vectors as input data and outputs the third output comprising third vectors, where each vector in the third vectors corresponds to a respective region in the environment of the system and each vector in the third vectors indicates a type of object predicted as being included in the region in the environment of the system.

8. A method performed by a computing system, the method comprising: generating, by a first transformer, a first output based upon first sensor data generated by a first sensor system, wherein the first output comprises identities of objects determined by the first transformer to be represented in the first sensor data and corresponding locations of the objects in the first sensor data, and further wherein the first sensor data is in a first modality; generating, by a second transformer, a second output based upon second sensor data generated by a second sensor system, wherein the second output comprises identities of objects determined by the second transformer to be represented in the second sensor data and corresponding locations of the objects in the second sensor data, and further wherein the second sensor data is in a second modality; and generating, by a third transformer, a third output based upon the first output and the second output, wherein the third transformer comprises an encoder and a decoder, wherein the encoder processes the second output and the decoder, using cross-attention and self-attention, processes the first output, wherein the cross-attention correlates objects between the first output and the second output, wherein the self-attention evaluates consistency of relationships between the correlated objects, and wherein the third output comprises identities of objects determined by the third transformer to be in an environment of the first sensor system and the second sensor system and corresponding locations of the objects in the environment of the first sensor system and the second sensor system.

9. The method of claim 8, wherein the first sensor system is a camera and the second sensor system is a radar sensor system.

10. The method of claim 8, wherein at least one of the first sensor system or the second sensor system is a lidar system.

11. The method of claim 8, further comprising: extracting features from the first sensor data; and providing the features and positional encodings to the first transformer, wherein the first transformer generates the first output based upon the extracted features and the positional encodings.

12. The method of claim 11, further comprising: extracting second features from the second sensor data; and providing the second features and second positional encodings to the second transformer, wherein the second transformer generates the second output based upon the extracted second features and the second positional encodings.

13. The method of claim 8, wherein the first sensor data is an image and the second sensor data is a point cloud.

14. The method of claim 8, wherein: the first transformer outputs the first output comprising first vectors, where each vector in the first vectors corresponds to a respective region in the first sensor data and each vector in the first vectors indicates a type of object predicted as being included in the region in the first sensor data; the second transformer outputs the second output comprising second vectors, wherein each vector in the second vectors corresponds to a respective region in the second sensor data and each vector in the second vectors indicates a type of object predicted as being included in the region in the second sensor data; and the third transformer receives the first vectors and the second vectors as input data and outputs the third output comprising third vectors, where each vector in the third vectors corresponds to a respective region in the environment of the first sensor system and the second sensor system and each vector in the third vectors indicates a type of object predicted as being included in the reg
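The fusion step recited in claims 1 and 8 (an encoder over the second output, then a decoder applying cross-attention and self-attention to the first output) can be illustrated with plain scaled dot-product attention. This is a hedged sketch under several assumptions that the claims do not state: the helper names `attention` and `fuse` are hypothetical, projections are identity (no learned weights), a single head is used, the decoder applies cross-attention before self-attention, and residual connections, normalization, and feed-forward layers are omitted. The toy two-dimensional vectors stand in for the per-region vectors of claims 7 and 14.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over one row of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: Matrix, keys: Matrix,
              values: Matrix) -> Tuple[Matrix, Matrix]:
    """Scaled dot-product attention; returns (outputs, attention weights)."""
    scale = math.sqrt(len(keys[0]))
    weights = [
        softmax([sum(q_i * k_i for q_i, k_i in zip(q, k)) / scale
                 for k in keys])
        for q in queries
    ]
    dim = len(values[0])
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(dim)]
        for row in weights
    ]
    return outputs, weights


def fuse(first_vectors: Matrix,
         second_vectors: Matrix) -> Tuple[Matrix, Matrix, Matrix]:
    """Hypothetical fusion step: encode second output, decode first output."""
    # Encoder side: self-attention over the second-modality vectors.
    memory, _ = attention(second_vectors, second_vectors, second_vectors)
    # Decoder cross-attention: each first-modality vector queries the
    # encoded second-modality vectors, correlating objects across outputs.
    correlated, cross_weights = attention(first_vectors, memory, memory)
    # Decoder self-attention: the correlated vectors attend to one another.
    fused, self_weights = attention(correlated, correlated, correlated)
    return fused, cross_weights, self_weights


# Toy inputs (assumptions): two camera-side vectors, three radar-side vectors.
camera = [[4.0, 0.0], [0.0, 4.0]]
radar = [[0.0, 5.0], [5.0, 0.0], [0.1, 0.1]]
fused, cross_w, self_w = fuse(camera, radar)
```

Each row of `cross_w` is a probability distribution over the second-modality vectors, so inspecting it shows which second-output entry a given first-output entry was matched to most strongly.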
of land vehicles · CPC title
Combinations of lidar systems with systems other than lidar, radar or sonar, e.g. with direction finders · CPC title
of aircraft or spacecraft · CPC title
of marine craft · CPC title