Robotic meal-assembly systems and robotic methods for real-time object pose estimation of high-resemblance random food items
US-2024246240-A1 · Jul 25, 2024 · US
US12586362B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12586362-B2 |
| Application number | US-202217987060-A |
| Country | US |
| Kind code | B2 |
| Filing date | Nov 15, 2022 |
| Priority date | Nov 15, 2021 |
| Publication date | Mar 24, 2026 |
| Grant date | Mar 24, 2026 |
Official abstract text for this publication.
A method, apparatus, electronic device, and non-transitory computer-readable storage medium with multi-modal feature fusion are provided. The method includes generating three-dimensional (3D) feature information and two-dimensional (2D) feature information based on a color image and a depth image, generating fused feature information by fusing the 3D feature information and the 2D feature information based on an attention mechanism, and generating predicted image information by performing image processing based on the fused feature information.
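The fusion step named in the abstract can be illustrated with a minimal sketch, assuming scaled dot-product cross-attention in which each 3D feature queries the 2D features. The publication does not specify this formulation here; the function names (`softmax`, `attention`, `fuse`), the residual addition, and the toy feature values are hypothetical and chosen only for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-width vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def fuse(feat_3d, feat_2d):
    """Assumed fusion: each 3D feature attends to all 2D features, and the
    attended result is added back to the 3D feature (a residual connection)."""
    attended = attention(feat_3d, feat_2d, feat_2d)
    return [[a + b for a, b in zip(p, c)] for p, c in zip(feat_3d, attended)]

# Toy inputs: three 3D features (as if from a depth image) and
# two 2D features (as if from a color image), each of width 2.
feat_3d = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
feat_2d = [[0.5, 0.5], [2.0, -1.0]]
fused = fuse(feat_3d, feat_2d)
```

The output has one fused vector per 3D feature, so a downstream head (pose, size, shape, or segmentation) could consume it in place of the raw 3D features. A practical system would use learned query, key, and value projections, which this sketch omits.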
Opening claim text (preview).
What is claimed is:
1. A processor-implemented method performed by a computing apparatus, the method comprising: generating three-dimensional (3D) feature information and two-dimensional (2D) feature information based on a color image and a depth image; generating fused feature information by fusing the 3D feature information and the 2D feature information based on an attention mechanism; and performing at least one of estimating a six-dimensional (6D) pose of an object, estimating a size of the object, reconstructing a shape of the object, or segmenting the object based on the fused feature information, wherein the attention mechanism includes a self-attention mechanism and cross-attention mechanism, wherein the fused feature information is generated by fusing the 3D feature information of at least one scale and the 2D feature information of at least one scale, and wherein the generating of the fused feature information comprises, for the 3D feature information of one scale and the 2D feature information of one scale, generating fused feature information of a current scale by performing a feature fusion on 3D feature information of the current scale and 2D feature information of the current scale based on the attention mechanism, the 3D feature information of the current scale being determined based on fused feature information of a previous scale and 3D feature information of the previous scale, and the 2D feature information of the current scale being determined based on 2D feature information of the previous scale.
2. The method of claim 1, wherein the generating of the fused feature information comprises: acquiring point cloud voxel feature information and/or voxel position feature information based on the 3D feature information; generating first image voxel feature information based on the 2D feature information; and performing the generating of the fused feature information based on the point cloud voxel feature information, the voxel position feature information, and/or the first image voxel feature information, based on the attention mechanism.
3. The method of claim 2, wherein the performing of the generating of the fused feature information based on the point cloud voxel feature information, the voxel position feature information, and/or the first image voxel feature information, based on the attention mechanism comprises one of: generating the fused feature information by fusing features using the cross-attention mechanism that is dependent on the first image voxel feature information and feature information output by the self-attention mechanism, that is dependent on the voxel position feature information, the point cloud voxel feature information, and the first image voxel feature information; or generating the fused feature information by fusing features using the cross-attention mechanism that is dependent on the first image voxel feature information and another feature information output by the self-attention mechanism, that is dependent on the point cloud voxel feature information.
4. A non-transitory computer-readable storage medium storing instructions that, when executed in one or more processors of the computing apparatus, configure the one or more processors to perform the method of claim 1.
5. An apparatus comprising: one or more processors comprising processing circuitry; memory comprising one or more storage media storing instructions that, when executed by the one or more processors individually or collectively, cause the apparatus to: generate three-dimensional (3D) feature information based on a depth image; generate two-dimensional (2D) feature information based on a color image; fuse the 3D feature information and the 2D feature information using an attention mechanism to generate fused feature information; and perform at least one of an estimation of a six-dimensional (6D) pose of an object, an estimation of a size of the object, a reconstruction of a shape of the object, or a segmenting of the object based on the fused feature information, wherein the attention mechanism includes a self-attention mechanism and cross-attention mechanism, wherein the fused feature information is generated through a fusing of the 3D feature information of at least one scale and the 2D feature information of at least one scale, and wherein the generation of the fused feature information comprises, for the 3D feature information of one scale and the 2D feature information of one scale, generation of fused feature information of a current scale by performing a feature fusion on 3D feature information of the current scale and 2D feature information of the current scale based on the attention mechanism, the 3D feature information of the current scale being determined based on fused feature information of a previous scale and 3D feature information of the previous scale, and the 2D feature information of the current scale being determined based on 2D feature information of the previous scale.
6. The apparatus of claim 5, wherein the instructions, when executed by the one or more processors individually or collectively, further cause the apparatus to: generate point cloud voxel feature information and/or voxel position feature information based on the 3D feature information; generate first image voxel feature information based on the 2D feature information; and perform the fusing of the 3D feature information and the 2D feature information based on the point cloud voxel feature information, the voxel position feature information, and/or the first image voxel feature information, based on the attention mechanism.
7. The apparatus of claim 5, wherein the apparatus is an AR device that further comprises one or more cameras configured to respectively capture the depth image and the color image, and one or more displays to display AR image information based on the predicted image information.
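The multi-scale recurrence recited in claims 1 and 5 can be sketched as a loop, shown here under several assumptions. Within each scale, the sketch applies self-attention to the 3D features and then cross-attention against the 2D features, loosely following the second alternative of claim 3; which side supplies the queries is an assumption. The downsampling operator `pool_pairs`, the element-wise sum used to combine fused and 3D features, and all function names are hypothetical; the claims only state which quantities each scale depends on.

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-width vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append([sum((e / total) * v[d] for e, v in zip(exps, values))
                    for d in range(len(values[0]))])
    return out

def pool_pairs(feats):
    """Hypothetical downsampling: average adjacent pairs of feature vectors."""
    return [[(a + b) / 2 for a, b in zip(feats[i], feats[i + 1])]
            for i in range(0, len(feats) - 1, 2)]

def fuse_scale(f3d, f2d):
    """Self-attention over the 3D features, then cross-attention in which
    the result queries the 2D features (query side is an assumption)."""
    self_out = attention(f3d, f3d, f3d)
    return attention(self_out, f2d, f2d)

def multiscale_fuse(f3d, f2d, num_scales):
    """Return the fused features produced at each scale."""
    fused_per_scale = []
    for _ in range(num_scales):
        fused = fuse_scale(f3d, f2d)
        fused_per_scale.append(fused)
        # Next-scale 3D features depend on this scale's fused and 3D features.
        combined = [[a + b for a, b in zip(x, y)] for x, y in zip(fused, f3d)]
        f3d = pool_pairs(combined)
        # Next-scale 2D features depend only on this scale's 2D features.
        f2d = pool_pairs(f2d)
    return fused_per_scale

# Toy inputs: eight feature vectors of width 2 per modality.
f3d = [[float(i), 1.0] for i in range(8)]
f2d = [[1.0, float(i % 2)] for i in range(8)]
out = multiscale_fuse(f3d, f2d, 3)
```

The point of the sketch is the data dependency: the 2D branch evolves independently of the fusion result, while the 3D branch is fed back with fused information at every scale. The number of feature vectors halves at each scale only because of the assumed pooling operator.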
Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform · CPC title
Range image; Depth image; 3D point clouds · CPC title
Color image · CPC title
Three-dimensional [3D] modelling for computer graphics · CPC title
Determining position or orientation of objects or cameras (camera calibration G06T7/80) · CPC title