Method and apparatus with multi-modal feature fusion

US12586362B2 · US · B2

Patent metadata
Publication number: US-12586362-B2
Application number: US-202217987060-A
Country: US
Kind code: B2
Filing date: Nov 15, 2022
Priority date: Nov 15, 2021
Publication date: Mar 24, 2026
Grant date: Mar 24, 2026

Abstract

A method, apparatus, electronic device, and non-transitory computer-readable storage medium with multi-modal feature fusion are provided. The method includes generating three-dimensional (3D) feature information and two-dimensional (2D) feature information based on a color image and a depth image, generating fused feature information by fusing the 3D feature information and the 2D feature information based on an attention mechanism, and generating predicted image information by performing image processing based on the fused feature information.
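
As a rough illustration of the attention-based fusion step summarized above, the following Python sketch lets each 3D feature vector attend to all 2D feature vectors using scaled dot-product attention, then adds the attended result back to the 3D feature. It is not the patented implementation. The function names (`attention`, `fuse`), the omission of learned query/key/value projections, the residual addition, and the toy feature values are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def fuse(feat_3d, feat_2d):
    """Cross-attention fusion: 3D features query the 2D features.

    Assumption: a residual addition combines the attended 2D information
    with the original 3D feature; the source does not specify this.
    """
    attended = attention(feat_3d, feat_2d, feat_2d)
    return [[a + b for a, b in zip(p, c)] for p, c in zip(feat_3d, attended)]


# Hypothetical toy inputs: three depth-derived vectors, two color-derived vectors.
feat_3d = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
feat_2d = [[0.5, 0.5], [2.0, 0.0]]
fused = fuse(feat_3d, feat_2d)
```

The output has one fused vector per 3D feature, so a downstream head (for example pose or size estimation) would consume `fused` in place of the raw 3D features.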

Claims

What is claimed is:

1. A processor-implemented method performed by a computing apparatus, the method comprising: generating three-dimensional (3D) feature information and two-dimensional (2D) feature information based on a color image and a depth image; generating fused feature information by fusing the 3D feature information and the 2D feature information based on an attention mechanism; and performing at least one of estimating a six-dimensional (6D) pose of an object, estimating a size of the object, reconstructing a shape of the object, or segmenting the object based on the fused feature information, wherein the attention mechanism includes a self-attention mechanism and cross-attention mechanism, wherein the fused feature information is generated by fusing the 3D feature information of at least one scale and the 2D feature information of at least one scale, and wherein the generating of the fused feature information comprises, for the 3D feature information of one scale and the 2D feature information of one scale, generating fused feature information of a current scale by performing a feature fusion on 3D feature information of the current scale and 2D feature information of the current scale based on the attention mechanism, the 3D feature information of the current scale being determined based on fused feature information of a previous scale and 3D feature information of the previous scale, and the 2D feature information of the current scale being determined based on 2D feature information of the previous scale.

2. The method of claim 1, wherein the generating of the fused feature information comprises: acquiring point cloud voxel feature information and/or voxel position feature information based on the 3D feature information; generating first image voxel feature information based on the 2D feature information; and performing the generating of the fused feature information based on the point cloud voxel feature information, the voxel position feature information, and/or the first image voxel feature information, based on the attention mechanism.

3. The method of claim 2, wherein the performing of the generating of the fused feature information based on the point cloud voxel feature information, the voxel position feature information, and/or the first image voxel feature information, based on the attention mechanism comprises one of: generating the fused feature information by fusing features using the cross-attention mechanism that is dependent on the first image voxel feature information and feature information output by the self-attention mechanism, that is dependent on the voxel position feature information, the point cloud voxel feature information, and the first image voxel feature information; or generating the fused feature information by fusing features using the cross-attention mechanism that is dependent on the first image voxel feature information and another feature information output by the self-attention mechanism, that is dependent on the point cloud voxel feature information.

4. A non-transitory computer-readable storage medium storing instructions that, when executed in one or more processors of the computing apparatus, configure the one or more processors to perform the method of claim 1.

5. An apparatus comprising: one or more processors comprising processing circuitry; memory comprising one or more storage media storing instructions that, when executed by the one or more processors individually or collectively, cause the apparatus to: generate three-dimensional (3D) feature information based on a depth image; generate two-dimensional (2D) feature information based on a color image; fuse the 3D feature information and the 2D feature information using an attention mechanism to generate fused feature information; and perform at least one of an estimation of a six-dimensional (6D) pose of an object, an estimation of a size of the object, a reconstruction of a shape of the object, or a segmenting of the object based on the fused feature information, wherein the attention mechanism includes a self-attention mechanism and cross-attention mechanism, wherein the fused feature information is generated through a fusing of the 3D feature information of at least one scale and the 2D feature information of at least one scale, and wherein the generation of the fused feature information comprises, for the 3D feature information of one scale and the 2D feature information of one scale, generation of fused feature information of a current scale by performing a feature fusion on 3D feature information of the current scale and 2D feature information of the current scale based on the attention mechanism, the 3D feature information of the current scale being determined based on fused feature information of a previous scale and 3D feature information of the previous scale, and the 2D feature information of the current scale being determined based on 2D feature information of the previous scale.

6. The apparatus of claim 5, wherein the instructions, when executed by the one or more processors individually or collectively, further cause the apparatus to: generate point cloud voxel feature information and/or voxel position feature information based on the 3D feature information; generate first image voxel feature information based on the 2D feature information; and perform the fusing of the 3D feature information and the 2D feature information based on the point cloud voxel feature information, the voxel position feature information, and/or the first image voxel feature information, based on the attention mechanism.

7. The apparatus of claim 5, wherein the apparatus is an AR device that further comprises one or more cameras configured to respectively capture the depth image and the color image, and one or more displays to display AR image information based on the predicted image information.
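
The multi-scale data flow recited in claims 1 and 5 can be sketched in Python as a loop over scales. This is only a sketch under stated assumptions, not the claimed implementation. The fusion step here follows one reading of the second alternative of claim 3 (self-attention over the 3D features, then cross-attention against the 2D features). The helper names (`self_then_cross`, `next_3d`, `next_2d`, `multi_scale_fusion`), the element-wise averaging used to form the next-scale 3D features, the pairwise pooling used to form the next-scale 2D features, and the toy inputs are all hypothetical; learned projections are omitted.

```python
import math


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([sum((e / total) * v[j] for e, v in zip(exps, values))
                    for j in range(len(values[0]))])
    return out


def self_then_cross(feat_3d, feat_2d):
    """Self-attention on 3D features, then cross-attention against 2D features."""
    refined = attention(feat_3d, feat_3d, feat_3d)
    return attention(refined, feat_2d, feat_2d)


def next_3d(fused_prev, feat_3d_prev):
    """Hypothetical update: average previous fused and previous 3D features."""
    return [[(f + p) / 2 for f, p in zip(fv, pv)]
            for fv, pv in zip(fused_prev, feat_3d_prev)]


def next_2d(feat_2d_prev):
    """Hypothetical update: average adjacent pairs of 2D feature vectors."""
    if len(feat_2d_prev) < 2:
        return feat_2d_prev
    return [[(a + b) / 2 for a, b in zip(feat_2d_prev[i], feat_2d_prev[i + 1])]
            for i in range(0, len(feat_2d_prev) - 1, 2)]


def multi_scale_fusion(feat_3d, feat_2d, num_scales):
    """Return the fused features produced at each scale."""
    fused_per_scale = []
    for _ in range(num_scales):
        fused = self_then_cross(feat_3d, feat_2d)
        fused_per_scale.append(fused)
        # Current-scale 3D input depends on previous fused and previous 3D features.
        feat_3d = next_3d(fused, feat_3d)
        # Current-scale 2D input depends only on previous 2D features.
        feat_2d = next_2d(feat_2d)
    return fused_per_scale


# Hypothetical toy inputs.
feat_3d = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
feat_2d = [[0.5, 0.5], [2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
outputs = multi_scale_fusion(feat_3d, feat_2d, num_scales=2)
```

The asymmetry in the loop mirrors the claim language: fused information feeds forward into the 3D branch only, while the 2D branch evolves independently of the fusion result.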

Assignees

  • Samsung Electronics Co Ltd

Inventors

Classifications

  • Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform

  • Range image; Depth image; 3D point clouds

  • Color image

  • Three-dimensional [3D] modelling for computer graphics

  • Determining position or orientation of objects or cameras (camera calibration G06T7/80)

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Samsung Electronics Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06V10/806. Mapped technology areas include Physics.
When was this patent published?
Publication date: Mar 24, 2026 (kind code B2). Legal status and post-grant events are not shown on this page.