What technology area does this patent fall under?

Primary CPC classification G06T17/00. Mapped technology areas include Physics.

When was this patent published?

Publication date Sep 30, 2025 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Systems and methods for multimodal pretraining for three-dimensional understanding models

US12430849B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12430849-B2
Application number	US-202318493035-A
Country	US
Kind code	B2
Filing date	Oct 24, 2023
Priority date	Mar 13, 2023
Publication date	Sep 30, 2025
Grant date	Sep 30, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method of training a neural network based three-dimensional (3D) encoder is provided. A first plurality of samples of a training dataset are generated using a first 3D model. An image generator with multi-view rendering is used to generate a plurality of two-dimensional (2D) images having different viewpoints of the first 3D model. A first language model is used to generate a plurality of texts corresponding to the plurality of 2D images respectively. A first text for a first image is generated by using one or more text descriptions generated by the first language model. A point cloud is generated by randomly sampling points in the 3D model. The first plurality of samples are generated using the plurality of 2D images, the corresponding plurality of texts, and the point cloud. The neural network based 3D encoder is trained using the training dataset including the first plurality of samples.
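The sample-construction step in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the 3D model is a triangle mesh and that "randomly sampling points" means area-weighted sampling on the mesh surface, neither of which the abstract states. The function names `sample_point_cloud` and `build_samples` and the dictionary keys are hypothetical.

```python
import numpy as np


def sample_point_cloud(vertices, faces, n_points, rng):
    """Sample n_points at random on the surface of a triangle mesh.

    Assumption: the 3D model is a mesh given as (V, 3) vertices and
    (F, 3) integer face indices; the abstract does not specify this.
    """
    tri = vertices[faces]  # (F, 3, 3): three corner coordinates per face
    # Face areas serve as sampling weights so larger faces receive more points.
    areas = 0.5 * np.linalg.norm(
        np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0]), axis=1
    )
    face_idx = rng.choice(len(faces), size=n_points, p=areas / areas.sum())
    # Uniform barycentric weights inside each chosen triangle (square-root trick).
    r1 = np.sqrt(rng.random(n_points))
    r2 = rng.random(n_points)
    weights = np.stack([1.0 - r1, r1 * (1.0 - r2), r1 * r2], axis=1)  # (N, 3)
    return np.einsum("nk,nkd->nd", weights, tri[face_idx])


def build_samples(images, texts, point_cloud):
    """Pair each rendered view with its text and the shared point cloud."""
    if len(images) != len(texts):
        raise ValueError("expected exactly one text per rendered image")
    return [
        {"image": image, "text": text, "point_cloud": point_cloud}
        for image, text in zip(images, texts)
    ]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # A unit square in the z = 0 plane, split into two triangles.
    vertices = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=float)
    faces = np.array([[0, 1, 2], [0, 2, 3]])
    cloud = sample_point_cloud(vertices, faces, 1024, rng)
    # Placeholder strings stand in for rendered views and generated texts.
    samples = build_samples(["view_0", "view_1"], ["text_0", "text_1"], cloud)
    print(cloud.shape, len(samples))
```

Each sample shares the same point cloud and differs only in its image and text, which matches the abstract's description of one sample per rendered view.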

First claim

Opening claim text (preview).

What is claimed is:
1. A method of training a neural network based three-dimensional (3D) encoder, the method comprising: generating a first plurality of samples of a training dataset using a first 3D model of a 3D model dataset, wherein the generating the first plurality of samples includes: generating, using an image generator with multi-view rendering, a plurality of two-dimensional (2D) images having different viewpoints of the first 3D model; generating, using a first language model, a plurality of texts corresponding to the plurality of 2D images respectively, wherein the generating the plurality of texts includes: generating a first number of text descriptions for a first image of the plurality of 2D images; generating a first text based on one or more text descriptions selected from the first number of text descriptions; generating, a point cloud by randomly sampling points in the first 3D model; and generating the first plurality of samples using the plurality of 2D images, the plurality of texts, and the point cloud, wherein a first sample includes the first image, the first text corresponding to the first image, and the point cloud; and training the neural network based 3D encoder using the training dataset including the first plurality of samples.
2. The method of claim 1, wherein the first number of text descriptions are generated automatically without using metadata or a human language annotation associated with the first 3D model.
3. The method of claim 1, wherein the generating the plurality of texts includes: generating a third number of text descriptions using metadata or a human language annotation associated with the first 3D model; and generating a second text based on the first number of text descriptions and the third number of text descriptions; wherein a second sample includes the first image, the second text, and the point cloud.
4. The method of claim 1, wherein viewpoints of the plurality of 2D images of the first 3D model are spaced equally around a center of a 3D object of the first 3D model.
5. The method of claim 1, wherein the first language model includes a first generative model trained via multimodal learning.
6. The method of claim 1, wherein the neural network based 3D encoder is trained using a loss objective, and wherein the loss objective includes a 3D-to-image alignment contrastive loss and a 3D-to-text alignment contrastive loss.
7. The method of claim 1, wherein the training the neural network based 3D encoder using the training dataset including the first plurality of samples includes: generating image representations using the first image of a first sample of the first plurality of samples; generating text representations using the first text of the first sample; wherein the image representations and the text representations are generated using a pretrained vision and language model; generating 3D representations using the point cloud of the first sample; and updating parameters of the neural network based 3D encoder using a loss objective to align the 3D representations with the image representations and the text representations.
8. A system for providing a trained neural network based three-dimensional (3D) encoder, the system comprising: a memory that stores a neural network based 3D encoder and a plurality of processor-executable instructions; a communication interface that receives a 3D model dataset including a plurality of 3D models; and one or more hardware processors that read and execute the plurality of processor-executable instructions from the memory to perform operations comprising: generating a first plurality of samples of the training dataset using a first 3D model of a 3D model dataset, wherein the generating the first plurality of samples includes: generating, using an image generator with multi-view rendering, a plurality of two-dimensional (2D) images having different viewpoints of the first 3D model; generate, using a first language model, a plurality of texts corresponding to the plurality of 2D images respectively, wherein the generating the plurality of texts includes: generating a first number of text descriptions for a first image of the plurality of 2D images; generating a first text based on one or more text descriptions selected from the first number of text descriptions; generating, a point cloud by randomly sampling points in the first 3D model; and generating the first plurality of samples using the plurality of 2D images, the plurality of texts, and the point cloud, wherein a first sample includes the first image, the first text corresponding to the first image, and the point cloud; and training the neural network based 3D encoder using the training dataset including the first plurality of samples.
9. The system of claim 8, wherein the first number of text descriptions are generated automatically without using metadata or a human language annotation associated with the first 3D model.
10. The system of claim 9, wherein the generating the plurality of texts includes: generating a third number of text descriptions using metadata or a human language annotation associated with the first 3D model; and generating a second text based on the first number of text descriptions and the third number of text descriptions; wherein a second sample includes the first image, the second text, and the point cloud.
11. The system of claim 8, wherein viewpoints of the plurality of 2D images include: a first plurality of viewpoints spaced equally on a first 360-degree circle around a center of a 3D object of the first 3D model; and a second plurality of viewpoints spaced equally on a second 360-degree circle around the center of the 3D object.
12. The system of claim 8, wherein the first language model includes a first generative model trained via multimodal learning.
13. The system of claim 8, wherein the neural network based 3D encoder is trained using a loss objective, and wherein the loss objective includes a 3D-to-image alignment contrastive loss and a 3D-to-text alignment contrastive loss.
14. The system of claim 8, wherein the training the neural network based 3D encoder using the training dataset including the first plurality of samples includes: generating image representations using the first image of a first sample of the first plurality of samples; generating text representations using the first text of the first sample; wherein the image representations and the text representations are generated using a pretrained vision and language model; generating 3D representations using the point cloud of the first sample; and updating parameters of the neural network based 3D encoder using a loss objective to align the 3D representations with the image representations and the text representations.
15. A non-transitory machine-readable medium comprising a plurality of machine-executable instructions which, when executed by one or more processors, are adapted to cause the one or more processors to perform operations comprising: receiving, via a data interface, a 3D model dataset including a plurality of 3D models; generating a first plurality of samples of the training dataset using a first 3D model of the 3D model dataset, wherein the generating the first plurality of samples
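Claims 6 and 13 recite a loss objective that includes a 3D-to-image and a 3D-to-text alignment contrastive loss. One common way to write such an objective is sketched below in NumPy. The symmetric InfoNCE form, the temperature value of 0.07, the equal weighting of the two terms, and the function names are all assumptions made for illustration; the claims do not specify them.

```python
import numpy as np


def _log_softmax(logits):
    """Row-wise log-softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE-style loss between two batches of embeddings.

    Row i of `a` and row i of `b` are treated as a matching pair; every
    other row in the batch acts as a negative. (Assumed formulation.)
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # (B, B) scaled cosine similarities
    idx = np.arange(len(a))
    loss_ab = -_log_softmax(logits)[idx, idx].mean()    # a -> b direction
    loss_ba = -_log_softmax(logits.T)[idx, idx].mean()  # b -> a direction
    return 0.5 * (loss_ab + loss_ba)


def alignment_loss(z_3d, z_image, z_text, temperature=0.07):
    """Sum of a 3D-to-image term and a 3D-to-text term (equal weights assumed)."""
    return (
        contrastive_loss(z_3d, z_image, temperature)
        + contrastive_loss(z_3d, z_text, temperature)
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z_image = rng.normal(size=(8, 16))
    z_text = rng.normal(size=(8, 16))
    z_3d = rng.normal(size=(8, 16))
    print(float(alignment_loss(z_3d, z_image, z_text)))
```

In the setup described by claims 7 and 14, the image and text representations come from a pretrained vision and language model, and only the 3D encoder's parameters are updated; the sketch above computes the loss value only and leaves the gradient step to whichever training framework is used.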

Assignees

Salesforce Inc

Inventors

Classifications

G06F40/40
Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title
G06N3/045
Combinations of networks · CPC title
G06N3/00
Computing arrangements based on biological models · CPC title
G06T2210/56
Particle system, point based geometry or rendering · CPC title
G06T17/00 (Primary)
Three-dimensional [3D] modelling for computer graphics · CPC title

Patent family

Related publications grouped by family.

View patent family 92714231

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12430849B2 cover?: A method of training a neural network based three-dimensional (3D) encoder is provided. A first plurality of samples of a training dataset are generated using a first 3D model. An image generator with multi-view rendering is used to generate a plurality of two-dimensional (2D) images having different viewpoints of the first 3D model. A first language model is used to generate a plurality of tex…
Who is the assignee on this patent?: Salesforce Inc