Multi-modal encoder channel fusion with cross-modality awareness
US-2025029355-A1 · Jan 23, 2025 · US
| Field | Value |
|---|---|
| Publication number | US-2025054286-A1 |
| Application number | US-202318475988-A |
| Country | US |
| Kind code | A1 |
| Filing date | Sep 27, 2023 |
| Priority date | Aug 7, 2023 |
| Publication date | Feb 13, 2025 |
| Grant date | — |
Official abstract text for this publication.
An image processing method includes performing, using images obtained from one or more sensors onboard a vehicle, a 2-dimensional (2D) feature extraction; performing, a 3-dimensional (3D) feature extraction on the images; detecting objects in the images by fusing detection results from the 2D feature extraction and the 3D feature extraction.
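The abstract does not say how 3D features are obtained from camera images; the claims name a "back-projection model" applied to 2D features. As a minimal sketch only, assuming a pinhole camera and a known per-pixel depth (neither is specified in the publication), back-projection into a bird's eye view grid could look like the following. The names `Pinhole`, `back_project`, `bev_cell`, and `lift_features`, the grid ranges, and the averaging of features per cell are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Pinhole:
    """Assumed pinhole intrinsics: focal lengths and principal point, in pixels."""
    fx: float
    fy: float
    cx: float
    cy: float


def back_project(u, v, depth, cam):
    """Lift pixel (u, v) at the given depth to a 3D point in the camera frame."""
    x = (u - cam.cx) * depth / cam.fx
    y = (v - cam.cy) * depth / cam.fy
    return (x, y, depth)


def bev_cell(point, cell_size, x_range, z_range):
    """Map a camera-frame point to a (row, col) BEV cell, or None if out of range."""
    x, _, z = point  # height (y) is dropped in the top-down view
    if not (x_range[0] <= x < x_range[1] and z_range[0] <= z < z_range[1]):
        return None
    col = int((x - x_range[0]) // cell_size)
    row = int((z - z_range[0]) // cell_size)
    return (row, col)


def lift_features(pixels, cam, cell_size=1.0,
                  x_range=(-10.0, 10.0), z_range=(0.0, 20.0)):
    """Average 2D feature vectors into the BEV cells they back-project to.

    pixels: iterable of (u, v, depth, feature_vector) tuples.
    Returns a dict mapping (row, col) to the mean feature vector of that cell.
    """
    sums, counts = {}, {}
    for u, v, depth, feat in pixels:
        cell = bev_cell(back_project(u, v, depth, cam), cell_size, x_range, z_range)
        if cell is None:
            continue  # point falls outside the BEV grid
        acc = sums.setdefault(cell, [0.0] * len(feat))
        for i, value in enumerate(feat):
            acc[i] += value
        counts[cell] = counts.get(cell, 0) + 1
    return {cell: [s / counts[cell] for s in acc] for cell, acc in sums.items()}
```

In practice the depth would itself be estimated or taken from another sensor such as lidar, and a learned model would typically replace the simple per-cell average used here.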
Opening claim text (preview).
1. A method of detecting objects in sensor data, comprising: performing, using images obtained from one or more sensors onboard a vehicle, a 2-dimensional (2D) feature extraction; performing, a 3-dimensional (3D) feature extraction on the images; detecting objects in the images by fusing detection results from the 2D feature extraction and the 3D feature extraction.
2. The method of claim 1, wherein the 2D feature extraction comprises a perspective view (PV) analysis of the images.
3. The method of claim 1, wherein the 3D feature extraction comprises a bird's eye view (BEV) analysis of the images.
4. The method of any claim 1, wherein the 3D feature extraction is performed by: generating 3D features from 2D features resulting from the 2D feature extraction; and the method further includes refining 3D feature estimates using dual-space object queries that include joint proposals formed based on 2D features resulting from the 2D feature extraction and 3D features resulting from the 3D feature extraction.
5. The method of claim 4, wherein the generating the 3D features from the 2D features comprises applying a back-projection model to the 2D features.
6. The method of claim 4, wherein the refining comprises performing a multi-level refinement wherein, at each layer of the multi-level refinement, a self-attention layer that acts on both the 2D features and the 3D features a first cross-attention layer that acts only on the 2D features and a second cross-attention layer that acts only on the 3D features are used.
7. The method of claim 4, wherein a shared pose is further used during the refining.
8. The method of claim 1, wherein the 2D feature extraction method comprises a 3D objection method.
9. An apparatus for detecting objects in sensor data, the apparatus comprising at least one processor configured to: perform, from the sensor data, a 2-dimensional (2D) feature extraction; perform, from the sensor data, a 3-dimensional (3D) feature extraction; detect objects in the images by fusing detection results from the 2D feature extraction and the 3D feature extraction.
10. The apparatus of claim 9, wherein the 2D feature extraction comprises a perspective view (PV) analysis of the images and the 3D feature extraction comprises a bird's eye view (BEV) analysis of the images.
11. The method of claim 9, wherein the at least one processor performs the 3D feature extraction by: generating 3D features from 2D features resulting from the 2D feature extraction; and the method further includes refining 3D feature estimates using dual-space object queries that include joint proposals formed based on 2D features resulting from the 2D feature extraction and 3D features resulting from the 3D feature extraction.
12. The apparatus of claim 11, wherein the generating the 3D features from the 2D features comprises applying a back-projection model to the 2D features.
13. The apparatus of claim 11, wherein the refining is performed using a Conv3D or a Conv2D algorithm.
14. The apparatus of claim 11, wherein the refining comprises performing a multi-level refinement wherein, at each layer of the multi-level refinement, a self-attention layer that acts on both the 2D features and the 3D features a first cross-attention layer that acts only on the 2D features and a second cross-attention layer that acts only on the 3D features are used.
15. The apparatus of claim 9, wherein the 2D feature extraction method comprises a 3D objection method.
16. The apparatus of claim 9, wherein the 3D feature extraction method comprises a dense segmentation and/or a detection method.
17. A system for deployment on an autonomous vehicle, comprising: one or more sensors configured to generate sensor data of an environment of the autonomous vehicle; and at least one processor configured to detect objects in the sensor data by: performing, using the sensor data obtained from one or more sensors, a 2-dimensional (2D) feature extraction; performing, a 3-dimensional (3D) feature extraction on the sensor data; detecting objects in the images by fusing detection results from the 2D feature extraction and the 3D feature extraction.
18. The system of claim 17, wherein the one or more processor performs the 3D feature extraction by: generating 3D features from 2D features resulting from the 2D feature extraction; and refining 3D feature estimates using dual-space object queries that include joint proposals formed based on 2D features resulting from the 2D feature extraction and 3D features resulting from the 3D feature extraction.
19. The system of claim 17, wherein hybrid detection proposals are used for querying for the objects.
20. The system of claim 17, wherein the one or more sensors include a camera and a lidar.
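Claims 6 and 14 describe each refinement layer as a self-attention step plus two cross-attention steps, one restricted to 2D features and one restricted to 3D features. The sketch below is one possible reading of that structure, not the applicant's implementation. It assumes identity projections (no learned weights), a single attention head, residual sums to combine the three outputs, and that the self-attention step runs over the dual-space queries themselves. None of these details is stated in the claims, and the function names `attend`, `refine_layer`, and `refine` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(queries, keys_values):
    """Scaled dot-product attention; the same vectors serve as keys and values."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, kv) / scale for kv in keys_values])
        out.append([sum(w * kv[i] for w, kv in zip(weights, keys_values))
                    for i in range(len(q))])
    return out


def add(a, b):
    """Element-wise sum of two equally shaped lists of vectors."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def refine_layer(queries, feats_2d, feats_3d):
    """One refinement layer over dual-space object queries (assumed layout)."""
    # Self-attention: every query attends to all queries, which carry
    # information from both the 2D and the 3D proposals.
    queries = add(queries, attend(queries, queries))
    # First cross-attention: queries look only at 2D (perspective view) features.
    from_2d = attend(queries, feats_2d)
    # Second cross-attention: queries look only at 3D (bird's eye view) features.
    from_3d = attend(queries, feats_3d)
    # Assumption: the two cross-attention outputs are merged by residual sums.
    return add(add(queries, from_2d), from_3d)


def refine(queries, feats_2d, feats_3d, num_layers=3):
    """Multi-level refinement: apply the same layer structure repeatedly."""
    for _ in range(num_layers):
        queries = refine_layer(queries, feats_2d, feats_3d)
    return queries
```

A real model would add learned query, key, and value projections, normalization, and feed-forward blocks; the sketch keeps only the attention wiring that the claims name.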
exterior to a vehicle by using sensors mounted on the vehicle · CPC title
of extracted features · CPC title
using neural networks · CPC title
of extracted features · CPC title
Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads · CPC title