Systems and methods for attention mechanism in three-dimensional object detection

US12394220B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12394220-B2
Application number: US-202318161661-A
Country: US
Kind code: B2
Filing date: Jan 30, 2023
Priority date: Nov 10, 2022
Publication date: Aug 19, 2025
Grant date: Aug 19, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments described herein provide a system for three-dimensional (3D) object detection. The system includes an input interface configured to obtain 3D point data describing spatial information of a plurality of points, and a memory storing a neural network based 3D object detection model having an encoder and a decoder. The system also includes processors to perform operations including: encoding, by the encoder, a first set of coordinates into a first set of point features and a set of object features; sampling a second set of point features from the first set of point features; generating, by attention layers at the decoder, a set of attention weights by applying cross-attention over at least the set of object features and the second set of point feature, and generate, by the decoder, a predicted bounding box among the plurality of points based on at least in part on the set of attention weights.
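The cross-attention step named in the abstract can be illustrated with a minimal NumPy sketch. It is not the patented model: it assumes plain scaled dot-product attention with the object features as queries and the sampled point features as keys and values, and it omits the learned projections, multiple heads, and the box-prediction head. The function name `cross_attention_weights` and the array sizes are hypothetical.

```python
import numpy as np

def cross_attention_weights(object_feats, point_feats):
    """Scaled dot-product attention weights (assumed form, no learned projections).

    object_feats: (num_objects, d) array used as queries.
    point_feats:  (num_points, d) array used as keys.
    Returns a (num_objects, num_points) array whose rows sum to 1.
    """
    d = object_feats.shape[-1]
    scores = object_feats @ point_feats.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# Hypothetical sizes: 4 object features, 32 sampled point features, 16 channels.
rng = np.random.default_rng(0)
object_feats = rng.normal(size=(4, 16))
point_feats = rng.normal(size=(32, 16))

weights = cross_attention_weights(object_feats, point_feats)
attended = weights @ point_feats  # each object feature aggregates point features
```

In a full detector, a decoder head would map updated object features such as `attended` to bounding box parameters; that head is left out here.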

First claim

Opening claim text (preview).

What is claimed is:

1. A system for three-dimensional (3D) object detection, the system comprising: an input interface configured to obtain 3D point data including a plurality of coordinates describing spatial information of a plurality of points; a memory storing a neural network based 3D object detection model comprising an encoder and a decoder, and a plurality of processor-executable instructions; and one or more processors executing the plurality of processor-executable instructions to perform operations comprising: encoding, by the encoder, a first set of coordinates into a first set of point features and a set of object features; sampling a second set of point features from the first set of point features, wherein the second set of point features are obtained by: upsampling the first set of coordinates into a second set of coordinates that contains more sample points than the first set of coordinates; determining, for each sampled point in the second set of coordinates, a respective subset of nearest neighbors from the first set of point features; and computing a corresponding point feature for the each sampled point in the second set of coordinates based on an interpolation of the respective subset of nearest neighbors; generating, by one or more attention layers at the decoder, a set of attention weights by applying cross-attention over at least the set of object features and the second set of point feature, and generate, by the decoder, a predicted bounding box among the plurality of points based on at least in part on the set of attention weights.

2. The system of claim 1, wherein the determining of the second set of point features comprises: determining, by the encoder, three nearest neighbor points of the each sampled point in the second set of coordinates; determining, by the encoder, point features of the three nearest neighbor points in the first set of point features; performing, by the encoder, a weighted interpolation of the point features of the three nearest neighbor points; and projecting, by the encoder, the interpolated point feature into a feature representation of the each sampled point in the second set of coordinates.

3. The system of claim 2, wherein the weighted interpolation comprises weighting each of the point features of the three nearest neighbor points by an inverse of the respective Euclidean distance to the each sampled point in the second set of coordinates.

4. The system of claim 1, wherein the generating of the set of attention weights comprises: generating a first attention weight using the first set of point features and the set of object features; generating a second attention weight using the second set of point features and the set of object features; and concatenating the first attention weight and the second attention weight to form the set of attention weights.

5. The system of claim 1, wherein the second set of coordinates contains at least twice a number of sampled points than the first set of coordinates.

6. The system of claim 1, wherein the second set of point features are obtained by: predicting, by the decoder, an intermediate bounding box proposal based on the set of object features; performing cross-attention between the set of object features and candidate points in the intermediate bounding box proposal; and determining, from the first set of point features, a sampled point feature that belongs to the intermediate bounding box proposal based on the cross-attention.

7. The system of claim 6, wherein the set of attention weights are obtained by: performing multi-head attention between a batch of object features from the set of object features and a batch of point features from the second set of point features.

8. The system of claim 7, wherein the batch of point features are obtained by processing the second set of point features to have a same token length through padding or truncating tokens.

9. A method of three-dimensional (3D) object detection, the method comprising: receiving, via a data interface, 3D point data including a plurality of coordinates describing spatial information of a plurality of points; encoding, by an encoder, a first set of coordinates into a first set of point features and a set of object features; sampling a second set of point features from the first set of point features, wherein the second set of point features are obtained by: upsampling the first set of coordinates into a second set of coordinates that contains more sample points than the first set of coordinates; determining, for each sampled point in the second set of coordinates, a respective subset of nearest neighbors from the first set of point features; and computing a corresponding point feature for the each sampled point in the second set of coordinates based on an interpolation of the respective subset of nearest neighbors; generating, by one or more attention layers at a decoder, a set of attention weights by applying cross-attention over at least the set of object features and the second set of point feature, and generate, by the decoder, a predicted bounding box among the plurality of points based on at least in part on the set of attention weights.

10. The method of claim 9, wherein the determining of the second set of point features comprises: determining, by the encoder, three nearest neighbor points of the each sampled point in the second set of coordinates; determining, by the encoder, point features of the three nearest neighbor points in the first set of point features; performing, by the encoder, a weighted interpolation of the point features of the three nearest neighbor points; and projecting, by the encoder, the interpolated point feature into a feature representation of the each sampled point in the second set of coordinates.

11. The method of claim 10, wherein the performing of the weighted interpolation comprises weighting each of the point features of the three nearest neighbor points by an inverse of the respective Euclidean distance to the each sampled point in the second set of coordinates.

12. The method of claim 9, wherein the generating of the set of attention weights comprises: generating a first attention weight using the first set of point features and the set of object features; generating a second attention weight using the second set of point features and the set of object features; and concatenating the first attention weight and the second attention weight to form the set of attention weights.

13. The method of claim 9, wherein the second set of coordinates contains at least twice a number of sampled points than the first set of coordinates.

14. The method of claim 9, wherein the second set of point features are obtained by: predicting, by the decoder, an intermediate bounding box proposal based on the set of object features; performing cross-attention between the set of object features and candidate points in the intermediate bounding box proposal; and determining, from the first set of point features, a sampled point feature that belongs to the intermediate bounding box proposal based on the cross-attention.

15. The method of claim 14, wherein the set of attention weights are obtained by: performing multi-head attention between a batch of object features from the set of object features and a batch of point features from the second set of point features.

16. The method of claim 15, wherein the batch of point features are obtained by processing the second set of point features to have a same token length through padding or truncating tokens.
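The interpolation recited in claims 1 to 3 (and 9 to 11) can be sketched in NumPy as follows. This is an illustrative reading under stated assumptions, not the patented implementation: the function name `interpolate_features`, the small `eps` term that avoids division by zero, the normalization of the weights to sum to 1, and the way the denser coordinates are generated are all hypothetical, and the learned projection step of claims 2 and 10 is omitted.

```python
import numpy as np

def interpolate_features(coarse_xyz, coarse_feats, dense_xyz, k=3, eps=1e-8):
    """Inverse-distance-weighted interpolation from k nearest coarse points.

    coarse_xyz:   (N, 3) first set of coordinates.
    coarse_feats: (N, C) first set of point features.
    dense_xyz:    (M, 3) second (upsampled) set of coordinates, M > N.
    Returns an (M, C) array of interpolated point features.
    """
    # Pairwise Euclidean distances between dense and coarse points: (M, N).
    diff = dense_xyz[:, None, :] - coarse_xyz[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    # Indices and distances of the k nearest coarse points for each dense point.
    idx = np.argsort(dist, axis=1)[:, :k]              # (M, k)
    nn_dist = np.take_along_axis(dist, idx, axis=1)    # (M, k)

    # Weights are inverse distances, normalized per dense point (assumption).
    w = 1.0 / (nn_dist + eps)
    w = w / w.sum(axis=1, keepdims=True)

    # Weighted sum of the neighbors' features: (M, k, C) -> (M, C).
    return (w[:, :, None] * coarse_feats[idx]).sum(axis=1)

# Hypothetical data: 8 coarse points, upsampled to 16 (twice as many, as in claim 5).
rng = np.random.default_rng(0)
coarse_xyz = rng.uniform(size=(8, 3))
coarse_feats = rng.normal(size=(8, 4))
dense_xyz = np.concatenate([coarse_xyz, rng.uniform(size=(8, 3))])

dense_feats = interpolate_features(coarse_xyz, coarse_feats, dense_xyz)
```

A dense point that coincides with a coarse point gets almost all of its weight from that point, so its interpolated feature is approximately the original feature. The dense pairwise distance matrix is only practical for small point sets; a spatial index would be a more likely choice at scale.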

Assignees

Salesforce Inc

Inventors

Classifications

  • Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features (colour feature extraction G06V10/56) · CPC title

  • using neural networks · CPC title

  • based on interpolation, e.g. bilinear interpolation (image demosaicing G06T3/4015; edge-driven or edge-based scaling G06T3/403) · CPC title

  • Combinations of networks · CPC title

  • Depth or shape recovery · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12394220B2 cover?
Embodiments described herein provide a system for three-dimensional (3D) object detection. The system includes an input interface configured to obtain 3D point data describing spatial information of a plurality of points, and a memory storing a neural network based 3D object detection model having an encoder and a decoder. The system also includes processors to perform operations including: enc…
Who is the assignee on this patent?
Salesforce Inc
What technology area does this patent fall under?
Primary CPC classification G06V20/64. Mapped technology areas include Physics.
When was this patent published?
Publication date Aug 19, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).