What technology area does this patent fall under?

Primary CPC classification G06N3/08. Mapped technology areas include Physics.

When was this patent published?

Publication date Sep 16, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).

Systems and methods for learning unified representations of language, image, and point cloud for three-dimensional recognition

US12417385B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12417385-B2
Application number	US-202318182952-A
Country	US
Kind code	B2
Filing date	Mar 13, 2023
Priority date	Nov 11, 2022
Publication date	Sep 16, 2025
Grant date	Sep 16, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods for training a neural network based three-dimensional (3D) encoder for 3D classification are provided. A training dataset including a plurality of samples is received, wherein a first sample includes an image, a text, and a point cloud. An image encoder of a pretrained vision and language model is used to generate image representations for the image of the first sample. A text encoder of the pretrained vision and language model is used to generate text representations for the text of the first sample. The neural network based 3D encoder is used to generate 3D representations for the point cloud of the first sample. A loss objective is computed based on the image representations, text representations, and 3D representations. Parameters of the neural network based 3D encoder are updated based on the computed loss objective via backpropagation.
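The loss objective described in the abstract can be illustrated with a small sketch. The abstract says only that a loss is computed from the image, text, and 3D representations; the claims mention contrastive losses between the 3D representations and each of the other two. The symmetric InfoNCE-style form below, the `temperature` value, the equal weighting of the two terms, and all function names are assumptions made for illustration, not details taken from the patent. Embeddings are plain Python lists standing in for encoder outputs, and the backpropagation step that would update the 3D encoder is omitted.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def contrastive_loss(a_batch, b_batch, temperature=0.07):
    """Symmetric InfoNCE-style loss between two batches of paired embeddings.

    Row i of a_batch is assumed to be paired with row i of b_batch.
    The temperature value is an illustrative assumption.
    """
    a = [normalize(v) for v in a_batch]
    b = [normalize(v) for v in b_batch]
    n = len(a)
    # sim[i][j] is the temperature-scaled cosine similarity of a[i] and b[j].
    sim = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_on_diagonal(rows):
        # Cross-entropy where the correct "class" for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(c) for c in zip(*sim)]
    return 0.5 * (cross_entropy_on_diagonal(sim) + cross_entropy_on_diagonal(columns))


def tri_modal_loss(pc_emb, img_emb, txt_emb, temperature=0.07):
    """Sum of a 3D-to-text term and a 3D-to-image term (equal weights assumed)."""
    return (
        contrastive_loss(pc_emb, txt_emb, temperature)
        + contrastive_loss(pc_emb, img_emb, temperature)
    )


# Toy batch of two samples with 2-dimensional embeddings.
pc = [[1.0, 0.0], [0.0, 1.0]]
img = [[1.0, 0.0], [0.0, 1.0]]
txt_aligned = [[1.0, 0.0], [0.0, 1.0]]
txt_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = tri_modal_loss(pc, img, txt_aligned)
loss_swapped = tri_modal_loss(pc, img, txt_swapped)
```

In this toy batch the loss is lower when each point-cloud embedding points in the same direction as its paired text embedding than when the pairs are swapped. In a real training loop only the 3D encoder's parameters would receive gradients from this value; claim 4 describes the pretrained vision and language model as frozen.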

First claim

Opening claim text (preview).

What is claimed is:
1. A method of training a neural network based three-dimensional (3D) encoder for 3D classification, the method comprising: receiving, via a data interface, a training dataset including a plurality of samples, wherein a first sample includes an image, a text, and a point cloud; generating, using an image encoder of a pretrained vision and language model, image representations for the image of the first sample; generating, using a text encoder of the pretrained vision and language model, text representations for the text of the first sample; generating, using the neural network based 3D encoder, 3D representations for the point cloud of the first sample; computing a loss objective based on the image representations, text representations, and 3D representations; and updating parameters of the neural network based 3D encoder based on the computed loss objective via backpropagation.
2. The method of claim 1, wherein the loss objective is used to align the 3D representations with at least one of the image representations and the text representations.
3. The method of claim 2, wherein the loss objective is used to align the 3D representations with the image representations and the text representations.
4. The method of claim 1, wherein the parameters of the neural network based 3D encoder are updated based on the computed loss objective while the pretrained vision and language model is frozen.
5. The method of claim 1, wherein the loss objective includes at least one of a first contrastive loss between the 3D representations and the text representations and a second contrastive loss between the 3D representations and the image representations.
6. The method of claim 1, wherein the trained 3D encoder is further finetuned with a classification head to perform a 3D classification task.
7. The method of claim 1, wherein the trained 3D encoder is used with one of the text encoder and the image encoder to perform a zero shot 3D classification task.
8. A system for training a three-dimensional (3D) encoder for 3D classification, the system comprising: a memory that stores the neural network based 3D encoder and a plurality of processor-executable instructions; a communication interface that receives a training dataset including a plurality of samples, wherein a first sample includes an image, a text, and a point cloud; and one or more hardware processors that read and execute the plurality of processor-executable instructions from the memory to perform operations comprising: generating, using an image encoder of a pretrained vision and language model, image representations for the image of the first sample; generating, using a text encoder of the pretrained vision and language model, text representations for the text of the first sample; generating, using the neural network based 3D encoder, 3D representations for the point cloud of the first sample; computing a loss objective based on the image representations, text representations, and 3D representations; and updating parameters of the neural network based 3D encoder based on the computed loss objective via backpropagation.
9. The system of claim 8, wherein the loss objective is used to align the 3D representations with at least one of the image representations and the text representations.
10. The system of claim 9, wherein the loss objective is used to align the 3D representations with both the image representations and the text representations.
11. The system of claim 8, wherein the parameters of the neural network based 3D encoder are updated based on the computed loss objective while the pretrained vision and language model is frozen.
12. The system of claim 8, wherein the loss objective includes at least one of a first contrastive loss between the 3D representations and the text representations and a second contrastive loss between the 3D representations and the image representations.
13. The system of claim 8, wherein the trained 3D encoder is further finetuned to with a classification head to perform a 3D classification task.
14. The system of claim 8, wherein the trained 3D encoder is used with one or more of the text encoder and the image encoder to perform a zero shot 3D classification task.
15. A non-transitory machine-readable medium comprising a plurality of machine-executable instructions which, when executed by one or more processors, are adapted to cause the one or more processors to perform operations comprising: receiving, via a data interface, a training dataset including a plurality of samples, wherein a first sample includes an image, a text, and a point cloud; generating, using an image encoder of a pretrained vision and language model, image representations for the image of the first sample; generating, using a text encoder of the pretrained vision and language model, text representations for the text of the first sample; generating, using a 3D encoder, 3D representations for the point cloud of the first sample; computing a loss objective based on the image representations, text representations, and 3D representations; and updating parameters of the neural network based 3D encoder based on the computed loss function via backpropagation.
16. The non-transitory machine-readable medium of claim 15, wherein the loss objective is used to align the 3D representations with at least one of the image representations and the text representations.
17. The non-transitory machine-readable medium of claim 15, wherein the parameters of the neural network based 3D encoder are updated based on the computed loss function while the pretrained vision and language model is frozen.
18. The non-transitory machine-readable medium of claim 15, wherein the loss objective includes at least one of a first contrastive loss between the 3D representations and the text representations and a second contrastive loss between the 3D representations and the image representations.
19. The non-transitory machine-readable medium of claim 15, wherein the trained neural network based 3D encoder is further finetuned to with a classification head to perform a 3D classification task.
20. The non-transitory machine-readable medium of claim 15, wherein the trained neural network based 3D encoder is used with one or more of the text encoder and the image encoder to perform a zero shot 3D classification task.
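Claims 7, 14, and 20 refer to using the trained 3D encoder together with the text encoder or image encoder for zero-shot 3D classification. One common way to do this, sketched below, is to embed each candidate class name with the text encoder, embed the point cloud with the 3D encoder, and pick the class whose text embedding has the highest cosine similarity. The claims do not spell out this procedure, so the function names, the use of cosine similarity, and the toy embeddings are all assumptions for illustration; the hard-coded vectors stand in for encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(point_cloud_embedding, class_text_embeddings):
    """Return the best-matching label and the per-class similarity scores.

    point_cloud_embedding: output of a (hypothetical) trained 3D encoder.
    class_text_embeddings: maps each class name to the output of a
        (hypothetical) frozen text encoder applied to that name.
    """
    scores = {
        label: cosine(point_cloud_embedding, emb)
        for label, emb in class_text_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy stand-ins for encoder outputs (not real model embeddings).
class_text_embeddings = {
    "chair": [0.9, 0.1, 0.0],
    "airplane": [0.0, 0.2, 0.95],
}
query_embedding = [0.8, 0.2, 0.1]

label, scores = zero_shot_classify(query_embedding, class_text_embeddings)
```

No classification head is trained here: the class set is defined entirely by the text prompts, which is what distinguishes this use from the finetuning described in claims 6, 13, and 19.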

Assignees

Salesforce Inc

Inventors

Classifications

G06V10/764
using classification, e.g. of video objects · CPC title
G06V10/776
Validation; Performance evaluation · CPC title
G06V10/774
Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title
G06V10/82
using neural networks · CPC title
G06F40/40
Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title

Patent family

Related publications grouped by family.

View patent family 91028145

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12417385B2 cover?: Systems and methods for training a neural network based three-dimensional (3D) encoder for 3D classification are provided. A training dataset including a plurality of samples is received, wherein a first sample includes an image, a text, and a point cloud. An image encoder of a pretrained vision and language model is used to generate image representations for the image of the first sample. A te…
Who is the assignee on this patent?: Salesforce Inc