Multi-modal encoder channel fusion with cross-modality awareness

US2025029355A1 · US · A1

Patent metadata
Field               Value
Publication number  US-2025029355-A1
Application number  US-202318354074-A
Country             US
Kind code           A1
Filing date         Jul 18, 2023
Priority date       Jul 18, 2023
Publication date    Jan 23, 2025
Grant date          (none listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

This disclosure provides systems, methods, and devices for vehicle driving assistance systems that support image processing. In a first aspect, a method includes receiving an image frame representing a scene; receiving point cloud data representing the scene; determining first sets of image frame features; determining second sets of point cloud data features based on a plurality of voxels representing the point cloud data; determining a third set of features of the image frame based on a first set of features of the plurality of first sets of features of the image frame and a second set of features of the plurality of second sets of features of the point cloud data; and outputting fused data that combines the third set of features of the image frame and a fourth set of features of the point cloud data. Other aspects and features are also claimed and described.
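The data flow in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the method disclosed in the patent: the abstract does not say how the third set of features is computed or how the fused data is combined. The function names (`cross_modal_features`, `fused_output`), the multiplicative gating, the choice of the last point cloud stage as the "fourth set", and the use of concatenation are all assumptions made for the example.

```python
from typing import List

Vector = List[float]


def cross_modal_features(image_feats: Vector, cloud_feats: Vector) -> Vector:
    """Hypothetical cross-modality step (an assumption, not from the patent):
    scale each image feature by (1 + the matching point cloud feature)."""
    if len(image_feats) != len(cloud_feats):
        raise ValueError("feature sets must have the same length")
    return [i * (1.0 + c) for i, c in zip(image_feats, cloud_feats)]


def fused_output(image_stages: List[Vector],
                 cloud_stages: List[Vector],
                 stage: int) -> Vector:
    """Follow the abstract's steps on toy per-stage feature vectors."""
    first = image_stages[stage]    # one of the first sets (image frame)
    second = cloud_stages[stage]   # one of the second sets (point cloud)
    third = cross_modal_features(first, second)
    # Assumption: the "fourth set" is the last-stage point cloud features.
    fourth = cloud_stages[-1]
    # Assumption: "combines" is modelled as simple concatenation.
    return third + fourth


if __name__ == "__main__":
    image_stages = [[1.0, 2.0], [0.5, 4.0]]
    cloud_stages = [[0.0, 1.0], [1.0, 0.5]]
    print(fused_output(image_stages, cloud_stages, stage=1))
    # prints [1.0, 6.0, 1.0, 0.5]
```

Here each "set of features" is reduced to a short list of floats so that the order of operations is easy to follow; a real encoder would produce multi-channel feature maps at each stage.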

First claim

Opening claim text (preview).

What is claimed is:

1. A method for image processing for use in a vehicle assistance system, comprising: receiving an image frame representing a scene; receiving point cloud data representing the scene; determining a plurality of first sets of features of the image frame; determining a plurality of second sets of features of the point cloud data based on a plurality of voxels representing the point cloud data; determining a third set of features of the image frame based on a first set of features of the plurality of first sets of features of the image frame and a second set of features of the plurality of second sets of features of the point cloud data; and outputting fused data that combines the third set of features of the image frame and a fourth set of features of the point cloud data.

2. The method of claim 1, further comprising: determining the first set of features and the second set of features based on a first statistical indicator associated with the plurality of first sets of features of the image frame and a second statistical indicator associated with the plurality of second sets of features of the point cloud data.

3. The method of claim 1, wherein each of the plurality of first sets of features of the image frame corresponds to a respective stage of a plurality of first stages of a first encoder, and wherein each of the plurality of second sets of features of the point cloud data corresponds to a respective stage of a plurality of second stages of a second encoder.

4. The method of claim 3, wherein the respective stage corresponding to the first set of features of the image frame corresponds to the respective stage corresponding to the second set of features of the point cloud data.

5. The method of claim 1, wherein the second set of features includes a plurality of pairs of perspective view features of the point cloud data and BEV features of the point cloud data.

6. The method of claim 5, wherein the third set of features of the image frame is determined by an encoder, the method further comprising: adding perspective view features of a pair of the plurality of pairs as a first channel of the encoder; and adding BEV features of the pair as a second channel of the encoder.

7. The method of claim 5, wherein the plurality of perspective view features and the plurality of BEV features are each determined from the plurality of voxels using global max pooling.

8. The method of claim 1, wherein the point cloud data is received from a ranging sensor.

9. The method of claim 1, further comprising detecting an object represented in the image frame based on the fused data.

10. The method of claim 9, further comprising controlling a function of a vehicle based on the object detected.

11. An apparatus, comprising: a memory storing processor-readable code; and at least one processor coupled to the memory, the at least one processor configured to execute the processor-readable code to cause the at least one processor to perform operations including: receiving an image frame representing a scene; receiving point cloud data representing the scene; determining a plurality of first sets of features of the image frame; determining a plurality of second sets of features of the point cloud data based on a plurality of voxels representing the point cloud data; determining a third set of features of the image frame based on a first set of features of the plurality of first sets of features of the image frame and a second set of features of the plurality of second sets of features of the point cloud data; and outputting fused data that combines the third set of features of the image frame and a fourth set of features of the point cloud data.

12. The apparatus of claim 11, the operations further comprising: determining the first set of features and the second set of features based on first statistical indicators associated with the plurality of first sets of features of the image frame and second statistical indicators associated with the plurality of second sets of features of the point cloud data.

13. The apparatus of claim 11, wherein each of the plurality of first sets of features of the image frame corresponds to a respective stage of a plurality of first stages of a first encoder, and wherein each of the plurality of second sets of features of the point cloud data corresponds to a respective stage of a plurality of second stages of a second encoder.

14. The apparatus of claim 13, wherein the respective stage corresponding to the first set of features of the image frame corresponds to the respective stage corresponding to the second set of features of the point cloud data.

15. The apparatus of claim 12, wherein the second set of features includes a plurality of pairs of perspective view features of the point cloud data and BEV features of the point cloud data.

16. The apparatus of claim 15, wherein the third set of features of the image frame is determined by an encoder, the operations further comprising: adding perspective view features of a pair of the plurality of pairs as a first channel of the encoder; and adding BEV features of the pair as a second channel of the encoder.

17. The apparatus of claim 15, wherein the plurality of perspective view features and the plurality of BEV features are each determined from the plurality of voxels using global max pooling.

18. The apparatus of claim 11, wherein the point cloud data is received from a LiDAR sensor or a radar sensor.

19. The apparatus of claim 11, wherein the operations further including detecting an object represented in the image frame based on the fused data.

20. The apparatus of claim 19, wherein the operations further include controlling a function of a vehicle based on the object detected.

21. A non-transitory computer-readable medium storing instructions that, when executed by a processor, cause the processor to perform operations comprising: receiving an image frame representing a scene; receiving point cloud data representing the scene; determining a plurality of first sets of features of the image frame; determining a plurality of second sets of features of the point cloud data based on a plurality of voxels representing the point cloud data; determining a third set of features of the image frame based on a first set of features of the plurality of first sets of features of the image frame and a second set of features of the plurality of second sets of features of the point cloud data; and outputting fused data that combines the third set of features of the image frame and a fourth set of features of the point cloud data.

22. The non-transitory, computer-readable medium of claim 21, the operations further comprising: determining the first set of features and the second set of features based on first statistical indicators associated with the plurality of first sets of features of the image frame and second statistical indicators associated with the plurality of second sets of features of the point cloud data.

23. The non-transitory, computer-readable medium of claim 21, wherein each of the plurality of third sets of features of the image frame corresponds to a respective stage of a plurality of first stages of a first encoder, and wherein each of the plurality of fourth sets of features of the point cloud data corresponds to a respective stage of a plurality of second stag
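Claims 5 to 7 describe pairs of perspective view and bird's-eye-view (BEV) features derived from voxels by max pooling, and claim 6 adds each as a channel of the image encoder. The sketch below shows one possible reading: collapsing one axis of a voxel grid with a max to obtain each 2D view. The claim text shown here does not define the axes or the exact pooling scope, so the `[x][y][z]` layout (forward, lateral, height), the function names, and the assumption that both views already match the image feature map size are all hypothetical.

```python
from typing import List

Map2D = List[List[float]]
Grid3D = List[List[List[float]]]  # assumed layout: voxels[x][y][z]


def bev_features(voxels: Grid3D) -> Map2D:
    """Max over the height axis z, giving a top-down map indexed [x][y]."""
    return [[max(column) for column in plane] for plane in voxels]


def perspective_features(voxels: Grid3D) -> Map2D:
    """Max over the forward axis x, giving a front-view map indexed [y][z]."""
    ny, nz = len(voxels[0]), len(voxels[0][0])
    return [[max(plane[y][z] for plane in voxels) for z in range(nz)]
            for y in range(ny)]


def add_point_cloud_channels(image_channels: List[Map2D],
                             pv: Map2D, bev: Map2D) -> List[Map2D]:
    """Append the perspective view map and the BEV map as two extra channels.
    Assumes all maps share one spatial size (no resampling is done here)."""
    return image_channels + [pv, bev]


if __name__ == "__main__":
    # 2 x 2 x 2 toy grid; empty voxels are assumed to hold 0.0.
    voxels = [[[0.0, 1.0], [2.0, 0.0]],
              [[3.0, 0.0], [0.0, 4.0]]]
    bev = bev_features(voxels)           # [[1.0, 2.0], [3.0, 4.0]]
    pv = perspective_features(voxels)    # [[3.0, 1.0], [2.0, 4.0]]
    image_channels = [[[0.5, 0.5], [0.5, 0.5]]]
    stacked = add_point_cloud_channels(image_channels, pv, bev)
    print(len(stacked))  # prints 3
```

A cubic toy grid is used only so that both projections have the same shape as the image channel; in practice the two views would generally need to be resampled or projected to the encoder's resolution first.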

Assignees

Qualcomm Inc

Inventors

Classifications

  • Target detection · CPC title

  • Involving statistics of pixels or of feature values, e.g. histogram matching · CPC title

  • of extracted features · CPC title

  • G06V10/44 · Primary

    Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components · CPC title

  • exterior to a vehicle by using sensors mounted on the vehicle · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2025029355A1 cover?
This disclosure provides systems, methods, and devices for vehicle driving assistance systems that support image processing. In a first aspect, a method includes receiving an image frame representing a scene; receiving point cloud data representing the scene; determining first sets of image frame features; determining second sets of point cloud data features based on a plurality of voxels repre…
Who is the assignee on this patent?
Qualcomm Inc
What technology area does this patent fall under?
Primary CPC classification G06V10/44. Mapped technology areas include Physics.
When was this patent published?
Published Jan 23, 2025 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).