What technology area does this patent fall under?

Primary CPC classification G06V10/806. Mapped technology areas include Physics.

When was this patent published?

Publication date: Mar 24, 2026 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Method and apparatus with multi-modal feature fusion

US12586362B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12586362-B2
Application number	US-202217987060-A
Country	US
Kind code	B2
Filing date	Nov 15, 2022
Priority date	Nov 15, 2021
Publication date	Mar 24, 2026
Grant date	Mar 24, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method, apparatus, electronic device, and non-transitory computer-readable storage medium with multi-modal feature fusion are provided. The method includes generating three-dimensional (3D) feature information and two-dimensional (2D) feature information based on a color image and a depth image, generating fused feature information by fusing the 3D feature information and the 2D feature information based on an attention mechanism, and generating predicted image information by performing image processing based on the fused feature information.
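As a rough illustration of what "fusing the 3D feature information and the 2D feature information based on an attention mechanism" could look like, the sketch below applies self-attention to a set of 3D feature vectors and then cross-attention against a set of 2D feature vectors. This is a toy example and not the patented implementation: the function names (`attention`, `fuse`), the choice of 3D features as queries and 2D features as keys and values, the absence of learned projection weights, and the sample numbers are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


def fuse(feat_3d, feat_2d):
    """Hypothetical fusion step: self-attention, then cross-attention."""
    # Self-attention: each 3D feature attends to all 3D features.
    refined_3d = attention(feat_3d, feat_3d, feat_3d)
    # Cross-attention (assumed direction): 3D queries, 2D keys and values.
    return attention(refined_3d, feat_2d, feat_2d)


# Made-up features: three 3D feature vectors and two 2D feature vectors.
feat_3d = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
feat_2d = [[0.5, 0.5], [2.0, -1.0]]
fused = fuse(feat_3d, feat_2d)
```

With this assumed direction, `fused` contains one vector per 3D feature, and each vector is a weighted average of the 2D feature vectors. A real system would typically add learned query, key, and value projections, which the page does not describe.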

First claim

Opening claim text (preview).

What is claimed is:

1. A processor-implemented method performed by a computing apparatus, the method comprising: generating three-dimensional (3D) feature information and two-dimensional (2D) feature information based on a color image and a depth image; generating fused feature information by fusing the 3D feature information and the 2D feature information based on an attention mechanism; and performing at least one of estimating a six-dimensional (6D) pose of an object, estimating a size of the object, reconstructing a shape of the object, or segmenting the object based on the fused feature information, wherein the attention mechanism includes a self-attention mechanism and cross-attention mechanism, wherein the fused feature information is generated by fusing the 3D feature information of at least one scale and the 2D feature information of at least one scale, and wherein the generating of the fused feature information comprises, for the 3D feature information of one scale and the 2D feature information of one scale, generating fused feature information of a current scale by performing a feature fusion on 3D feature information of the current scale and 2D feature information of the current scale based on the attention mechanism, the 3D feature information of the current scale being determined based on fused feature information of a previous scale and 3D feature information of the previous scale, and the 2D feature information of the current scale being determined based on 2D feature information of the previous scale.

2. The method of claim 1, wherein the generating of the fused feature information comprises: acquiring point cloud voxel feature information and/or voxel position feature information based on the 3D feature information; generating first image voxel feature information based on the 2D feature information; and performing the generating of the fused feature information based on the point cloud voxel feature information, the voxel position feature information, and/or the first image voxel feature information, based on the attention mechanism.

3. The method of claim 2, wherein the performing of the generating of the fused feature information based on the point cloud voxel feature information, the voxel position feature information, and/or the first image voxel feature information, based on the attention mechanism comprises one of: generating the fused feature information by fusing features using the cross-attention mechanism that is dependent on the first image voxel feature information and feature information output by the self-attention mechanism, that is dependent on the voxel position feature information, the point cloud voxel feature information, and the first image voxel feature information; or generating the fused feature information by fusing features using the cross-attention mechanism that is dependent on the first image voxel feature information and another feature information output by the self-attention mechanism, that is dependent on the point cloud voxel feature information.

4. A non-transitory computer-readable storage medium storing instructions that, when executed in one or more processors of the computing apparatus, configure the one or more processors to perform the method of claim 1.

5. An apparatus comprising: one or more processors comprising processing circuitry; memory comprising one or more storage media storing instructions that, when executed by the one or more processors individually or collectively, cause the apparatus to: generate three-dimensional (3D) feature information based on a depth image; generate two-dimensional (2D) feature information based on a color image; fuse the 3D feature information and the 2D feature information using an attention mechanism to generate fused feature information; and perform at least one of an estimation of a six-dimensional (6D) pose of an object, an estimation of a size of the object, a reconstruction of a shape of the object, or a segmenting of the object based on the fused feature information, wherein the attention mechanism includes a self-attention mechanism and cross-attention mechanism, wherein the fused feature information is generated through a fusing of the 3D feature information of at least one scale and the 2D feature information of at least one scale, and wherein the generation of the fused feature information comprises, for the 3D feature information of one scale and the 2D feature information of one scale, generation of fused feature information of a current scale by performing a feature fusion on 3D feature information of the current scale and 2D feature information of the current scale based on the attention mechanism, the 3D feature information of the current scale being determined based on fused feature information of a previous scale and 3D feature information of the previous scale, and the 2D feature information of the current scale being determined based on 2D feature information of the previous scale.

6. The apparatus of claim 5, wherein the instructions, when executed by the one or more processors individually or collectively, further cause the apparatus to: generate point cloud voxel feature information and/or voxel position feature information based on the 3D feature information; generate first image voxel feature information based on the 2D feature information; and perform the fusing of the 3D feature information and the 2D feature information based on the point cloud voxel feature information, the voxel position feature information, and/or the first image voxel feature information, based on the attention mechanism.

7. The apparatus of claim 5, wherein the apparatus is an AR device that further comprises one or more cameras configured to respectively capture the depth image and the color image, and one or more displays to display AR image information based on the predicted image information.
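The multi-scale wording in claims 1 and 5 describes a data dependency that may be easier to follow as a loop: at each scale, the 3D features come from the previous scale's fused features and 3D features, the 2D features come from the previous scale's 2D features, and the two are then fused. The sketch below shows only that dependency. Everything in it is a hypothetical stand-in: `downsample` (pairwise averaging), `combine` (element-wise averaging), and `fuse` (element-wise sum in place of attention) are placeholders chosen so the example stays short, not operations taken from the patent.

```python
def downsample(features):
    """Stand-in scale change: average adjacent pairs of feature vectors."""
    return [[(a + b) / 2 for a, b in zip(features[i], features[i + 1])]
            for i in range(0, len(features) - 1, 2)]


def combine(fused_prev, feat_3d_prev):
    """Stand-in merge of previous fused and previous 3D features."""
    return [[(a + b) / 2 for a, b in zip(f, p)]
            for f, p in zip(fused_prev, feat_3d_prev)]


def fuse(feat_3d, feat_2d):
    """Placeholder for attention-based fusion: element-wise sum."""
    return [[a + b for a, b in zip(p, q)] for p, q in zip(feat_3d, feat_2d)]


def multi_scale_fusion(feat_3d, feat_2d, num_scales):
    """Return the fused features produced at each scale, first scale first."""
    fused = fuse(feat_3d, feat_2d)
    fused_per_scale = [fused]
    for _ in range(num_scales - 1):
        # Current-scale 3D features depend on the previous fused features
        # and the previous 3D features.
        feat_3d = downsample(combine(fused, feat_3d))
        # Current-scale 2D features depend only on the previous 2D features.
        feat_2d = downsample(feat_2d)
        fused = fuse(feat_3d, feat_2d)
        fused_per_scale.append(fused)
    return fused_per_scale


# Made-up one-dimensional features at the first scale.
pyramid = multi_scale_fusion(
    feat_3d=[[1.0], [2.0], [3.0], [4.0]],
    feat_2d=[[0.0], [0.0], [2.0], [2.0]],
    num_scales=2,
)
```

The asymmetry is the point of the sketch: fused information feeds back into the 3D stream at the next scale, while the 2D stream evolves on its own.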

Assignees

Samsung Electronics Co Ltd

Inventors

Classifications

G06T2207/20016 · Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
G06T2207/10028 · Range image; Depth image; 3D point clouds
G06T2207/10024 · Color image
G06T17/00 · Three-dimensional [3D] modelling for computer graphics
G06T7/70 · Determining position or orientation of objects or cameras (camera calibration G06T7/80)

Patent family

Related publications grouped by family.

View patent family 84331648

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12586362B2 cover?: A method, apparatus, electronic device, and non-transitory computer-readable storage medium with multi-modal feature fusion are provided. The method includes generating three-dimensional (3D) feature information and two-dimensional (2D) feature information based on a color image and a depth image, generating fused feature information by fusing the 3D feature information and the 2D feature infor…
Who is the assignee on this patent?: Samsung Electronics Co Ltd

Related patents

Robotic meal-assembly systems and robotic methods for real-time object pose estimation of high-resemblance random food items

Sensor data fusion using cross-modal transformer

Methods and systems for semantic segmentation of a point cloud

Systems and methods for picking objects using 3-d geometry and segmentation

Object pose estimation
