Systems and methods for learning unified representations of language, image, and point cloud for three-dimensional recognition

US12417385B2 · US · B2

Patent metadata
Publication number: US-12417385-B2
Application number: US-202318182952-A
Country: US
Kind code: B2
Filing date: Mar 13, 2023
Priority date: Nov 11, 2022
Publication date: Sep 16, 2025
Grant date: Sep 16, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods for training a neural network based three-dimensional (3D) encoder for 3D classification are provided. A training dataset including a plurality of samples is received, wherein a first sample includes an image, a text, and a point cloud. An image encoder of a pretrained vision and language model is used to generate image representations for the image of the first sample. A text encoder of the pretrained vision and language model is used to generate text representations for the text of the first sample. The neural network based 3D encoder is used to generate 3D representations for the point cloud of the first sample. A loss objective is computed based on the image representations, text representations, and 3D representations. Parameters of the neural network based 3D encoder are updated based on the computed loss objective via backpropagation.
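The loss objective described in the abstract can be illustrated with a small sketch. The abstract does not give a formula, so the code below assumes a symmetric InfoNCE-style contrastive loss, which is one common way to align embeddings from different modalities. The function names, the temperature of 0.07, and the weights `alpha` and `beta` are hypothetical choices for illustration and are not taken from the patent.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE-style loss between two batches of embeddings.

    Row i of `a` and row i of `b` are treated as a matching pair; every
    other row in the batch acts as a negative. The temperature value is
    an assumption, not a value from the patent.
    """
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    n = len(a)
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_on_diagonal(rows):
        # Mean cross-entropy where the correct class of row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(columns))


def trimodal_loss(z_3d, z_image, z_text, alpha=1.0, beta=1.0):
    """Combine a 3D-to-text and a 3D-to-image contrastive term.

    `alpha` and `beta` are hypothetical weights. In the setting the
    abstract describes, only the 3D encoder that produced `z_3d` would
    receive gradient updates from this loss.
    """
    return alpha * contrastive_loss(z_3d, z_text) + beta * contrastive_loss(z_3d, z_image)
```

When the three batches of embeddings already agree row by row, this loss is close to zero; when matching rows are shuffled, it grows large. That is the signal a training loop would backpropagate into the 3D encoder.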

First claim

Opening claim text (preview).

What is claimed is:

1. A method of training a neural network based three-dimensional (3D) encoder for 3D classification, the method comprising: receiving, via a data interface, a training dataset including a plurality of samples, wherein a first sample includes an image, a text, and a point cloud; generating, using an image encoder of a pretrained vision and language model, image representations for the image of the first sample; generating, using a text encoder of the pretrained vision and language model, text representations for the text of the first sample; generating, using the neural network based 3D encoder, 3D representations for the point cloud of the first sample; computing a loss objective based on the image representations, text representations, and 3D representations; and updating parameters of the neural network based 3D encoder based on the computed loss objective via backpropagation.

2. The method of claim 1, wherein the loss objective is used to align the 3D representations with at least one of the image representations and the text representations.

3. The method of claim 2, wherein the loss objective is used to align the 3D representations with the image representations and the text representations.

4. The method of claim 1, wherein the parameters of the neural network based 3D encoder are updated based on the computed loss objective while the pretrained vision and language model is frozen.

5. The method of claim 1, wherein the loss objective includes at least one of a first contrastive loss between the 3D representations and the text representations and a second contrastive loss between the 3D representations and the image representations.

6. The method of claim 1, wherein the trained 3D encoder is further finetuned with a classification head to perform a 3D classification task.

7. The method of claim 1, wherein the trained 3D encoder is used with one of the text encoder and the image encoder to perform a zero shot 3D classification task.

8. A system for training a three-dimensional (3D) encoder for 3D classification, the system comprising: a memory that stores the neural network based 3D encoder and a plurality of processor-executable instructions; a communication interface that receives a training dataset including a plurality of samples, wherein a first sample includes an image, a text, and a point cloud; and one or more hardware processors that read and execute the plurality of processor-executable instructions from the memory to perform operations comprising: generating, using an image encoder of a pretrained vision and language model, image representations for the image of the first sample; generating, using a text encoder of the pretrained vision and language model, text representations for the text of the first sample; generating, using the neural network based 3D encoder, 3D representations for the point cloud of the first sample; computing a loss objective based on the image representations, text representations, and 3D representations; and updating parameters of the neural network based 3D encoder based on the computed loss objective via backpropagation.

9. The system of claim 8, wherein the loss objective is used to align the 3D representations with at least one of the image representations and the text representations.

10. The system of claim 9, wherein the loss objective is used to align the 3D representations with both the image representations and the text representations.

11. The system of claim 8, wherein the parameters of the neural network based 3D encoder are updated based on the computed loss objective while the pretrained vision and language model is frozen.

12. The system of claim 8, wherein the loss objective includes at least one of a first contrastive loss between the 3D representations and the text representations and a second contrastive loss between the 3D representations and the image representations.

13. The system of claim 8, wherein the trained 3D encoder is further finetuned to with a classification head to perform a 3D classification task.

14. The system of claim 8, wherein the trained 3D encoder is used with one or more of the text encoder and the image encoder to perform a zero shot 3D classification task.

15. A non-transitory machine-readable medium comprising a plurality of machine-executable instructions which, when executed by one or more processors, are adapted to cause the one or more processors to perform operations comprising: receiving, via a data interface, a training dataset including a plurality of samples, wherein a first sample includes an image, a text, and a point cloud; generating, using an image encoder of a pretrained vision and language model, image representations for the image of the first sample; generating, using a text encoder of the pretrained vision and language model, text representations for the text of the first sample; generating, using a 3D encoder, 3D representations for the point cloud of the first sample; computing a loss objective based on the image representations, text representations, and 3D representations; and updating parameters of the neural network based 3D encoder based on the computed loss function via backpropagation.

16. The non-transitory machine-readable medium of claim 15, wherein the loss objective is used to align the 3D representations with at least one of the image representations and the text representations.

17. The non-transitory machine-readable medium of claim 15, wherein the parameters of the neural network based 3D encoder are updated based on the computed loss function while the pretrained vision and language model is frozen.

18. The non-transitory machine-readable medium of claim 15, wherein the loss objective includes at least one of a first contrastive loss between the 3D representations and the text representations and a second contrastive loss between the 3D representations and the image representations.

19. The non-transitory machine-readable medium of claim 15, wherein the trained neural network based 3D encoder is further finetuned to with a classification head to perform a 3D classification task.

20. The non-transitory machine-readable medium of claim 15, wherein the trained neural network based 3D encoder is used with one or more of the text encoder and the image encoder to perform a zero shot 3D classification task.
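The zero shot 3D classification use recited in claims 7, 14, and 20 can be sketched as follows. The claims do not say how the 3D encoder and the text encoder are combined at inference time, so this example assumes the common approach of picking the class whose text embedding has the highest cosine similarity to the point cloud's 3D embedding. The function names and the toy embedding values are hypothetical; in practice the vectors would come from the trained 3D encoder and the pretrained text encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(point_cloud_embedding, class_text_embeddings):
    """Return the best label and all similarity scores.

    `class_text_embeddings` maps each class label to the text embedding
    of a prompt for that class (hypothetical inputs for illustration).
    """
    scores = {
        label: cosine(point_cloud_embedding, text_embedding)
        for label, text_embedding in class_text_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy embeddings standing in for real encoder outputs.
class_embeddings = {
    "chair": [1.0, 0.1, 0.0],
    "airplane": [0.0, 1.0, 0.2],
}
label, scores = zero_shot_classify([0.9, 0.2, 0.1], class_embeddings)
```

No classification head is trained in this sketch. The class set can be changed simply by supplying different text embeddings, which is what makes the approach zero shot.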

Assignees

Salesforce Inc

Classifications

  • using classification, e.g. of video objects

  • Validation; Performance evaluation

  • Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting

  • using neural networks

  • Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30)

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12417385B2 cover?
It covers systems and methods for training a neural network based three-dimensional (3D) encoder for 3D classification. Each training sample includes an image, a text, and a point cloud. The image encoder and text encoder of a pretrained vision and language model generate image and text representations, the 3D encoder generates 3D representations for the point cloud, and a loss objective computed from all three is used to update the 3D encoder's parameters via backpropagation.
Who is the assignee on this patent?
Salesforce Inc
What technology area does this patent fall under?
Primary CPC classification G06N3/08. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 16, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).