Systems and methods for learning unified representations of language, image, and point cloud for three-dimensional recognition

US12417384B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12417384-B2
Application number  US-202318182939-A
Country             US
Kind code           B2
Filing date         Mar 13, 2023
Priority date       Nov 11, 2022
Publication date    Sep 16, 2025
Grant date          Sep 16, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method of training a neural network based three-dimensional (3D) encoder is provided. A training dataset is generated using a plurality of 3D models of a 3D model dataset. To generate a first sample of the training dataset, an image generator with multi-view rendering is used to generate a plurality of image candidates of a first 3D model. A word is chosen from metadata associated with the first 3D model. A language model is used to generate one or more text descriptions using the selected word and a plurality of prompts. A point cloud is generated by randomly sampling points in the 3D model. The first sample is generated to include a first image randomly selected from the plurality of image candidates, one or more text descriptions, and the point cloud is generated. The 3D encoder is trained using the training dataset including the first sample.
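The sample-generation steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. Several pieces are assumptions made only for illustration: the 3D model is taken to be a triangle mesh, `render_views` is a hypothetical stand-in that returns view labels instead of rendered images, the prompt templates in `PROMPTS` are invented, and simple template filling stands in for the language model.

```python
import math
import random

# Hypothetical prompt templates; the claims only require that at least one
# prompt indicates a 3D modality (the last two do so here).
PROMPTS = ["a picture of {}.", "a point cloud of {}.", "a 3D model of {}."]


def render_views(model_id, num_views=4):
    """Stand-in for a multi-view renderer: returns one label per viewpoint."""
    step = 360 // num_views
    return [f"{model_id}_view_{i * step:03d}deg" for i in range(num_views)]


def triangle_area(a, b, c):
    """Area of triangle (a, b, c) from the cross product of two edges."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    cross = (u[1] * v[2] - u[2] * v[1],
             u[2] * v[0] - u[0] * v[2],
             u[0] * v[1] - u[1] * v[0])
    return 0.5 * math.sqrt(sum(x * x for x in cross))


def sample_surface_points(triangles, num_points, rng):
    """Assumed sampling scheme: area-weighted uniform points on the mesh."""
    areas = [triangle_area(*t) for t in triangles]
    points = []
    for a, b, c in rng.choices(triangles, weights=areas, k=num_points):
        u, v = rng.random(), rng.random()
        if u + v > 1.0:  # reflect the draw back into the triangle
            u, v = 1.0 - u, 1.0 - v
        w = 1.0 - u - v
        points.append(tuple(w * a[i] + u * b[i] + v * c[i] for i in range(3)))
    return points


def build_sample(model, rng, num_points=1024):
    """Assemble one (image, text descriptions, point cloud) training sample."""
    candidates = render_views(model["id"])       # image candidates
    image = rng.choice(candidates)               # randomly selected first image
    word = rng.choice(model["metadata"])         # randomly chosen metadata word
    texts = [p.format(word) for p in PROMPTS]    # placeholder for the language model
    points = sample_surface_points(model["triangles"], num_points, rng)
    return {"image": image, "texts": texts, "points": points}


# Usage with a hypothetical one-triangle model.
model = {
    "id": "chair_01",
    "metadata": ["chair", "furniture"],
    "triangles": [((0, 0, 0), (1, 0, 0), (0, 1, 0))],
}
sample = build_sample(model, random.Random(0), num_points=8)
```

Passing an explicit `random.Random` instance keeps the three random choices (image, word, points) reproducible, which is a convenience of this sketch and not something the patent text specifies.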

First claim

Opening claim text (preview).

What is claimed is:

1. A method of training a neural network based three-dimensional (3D) encoder, the method comprising generating a plurality of samples of the training dataset using a plurality of 3D models of a 3D model dataset, wherein the generating the plurality of samples includes: generating, using an image generator with multi-view rendering, a plurality of image candidates having different viewpoints of a first 3D model; randomly selecting a first image from the plurality of image candidates; randomly choosing a word from metadata associated with the first 3D model; generating, using a language model, one or more text descriptions using the selected word and a plurality of prompts, wherein the plurality of prompts include a prompt indicating a 3D modality; generating a point cloud by randomly sampling points in the 3D model; and generating a first sample including the first image, one or more text descriptions, and the point cloud; and training the neural network based 3D encoder using the training dataset including the first sample.

2. The method of claim 1, wherein the generating the point cloud includes: performing an augmentation to the point cloud to generate an augmented point cloud; wherein the point cloud of the first sample includes the augmented point cloud.

3. The method of claim 2, wherein the augmentation performed to the point cloud includes one of a random point drop augmentation, a random scaling point cloud augmentation, a shift point cloud augmentation, and a rotate perturbation augmentation.

4. The method of claim 1, wherein the first image includes an RGB image.

5. The method of claim 1, wherein the first image includes a depth map.

6. The method of claim 1, wherein the training the neural network based 3D encoder using the training dataset including the first sample includes: generating image representations using the first image of the first sample; generating text representations using the one or more text descriptions of the first sample; generating 3D representations using the point cloud of the first sample; and updating parameters of the neural network based 3D encoder using a loss objective to align the 3D representations with the image representations and the text representations.

7. The method of claim 6, wherein the image representations and the text representations are generated using a pretrained vision and language model.

8. A system for providing a trained neural network based three-dimensional (3D) encoder, the system comprising: a memory that stores a neural network based 3D encoder and a plurality of processor-executable instructions; a communication interface that receives a 3D model dataset including a plurality of 3D models; and one or more hardware processors that read and execute the plurality of processor-executable instructions from the memory to perform operations comprising: generating a training dataset including a plurality of samples of a training dataset using the plurality of 3D models of the 3D model dataset, wherein the generating a first sample of the training dataset includes: generating, using an image generator with multi-view rendering, a plurality of image candidates having different viewpoints of a first 3D model; randomly selecting a first image from the plurality of image candidates; generating, using a language model, one or more text descriptions using the selected word and a plurality of prompts, wherein the plurality of prompts include a prompt indicating a 3D modality; generating a point cloud by randomly sampling points in the 3D model; and generating a first sample including the first image, one or more text descriptions, and the point cloud; and training the neural network based 3D encoder using the training dataset including the first sample.

9. The system of claim 8, wherein the generating the point cloud includes: performing augmentation to the point cloud to generate an augmented point cloud; wherein the point cloud of the first sample includes the augmented point cloud.

10. The system of claim 9, wherein the augmentation performed to the point cloud includes one of a random point drop augmentation, a random scaling point cloud augmentation, a shift point cloud augmentation, and a rotate perturbation augmentation.

11. The system of claim 8, wherein the first image includes an RGB image.

12. The system of claim 8, wherein the first image includes a depth map.

13. The system of claim 8, wherein the training the neural network based 3D encoder using the training dataset including the first sample includes: generating image representations using the first image of the first sample; generating text representations using the one or more text descriptions of the first sample; generating 3D representations using the point cloud of the first sample; and updating parameters of the neural network based 3D encoder using a loss objective to align the 3D representations with the image representations and the text representations.

14. The system of claim 13, wherein the image representations and the text representations are generated using a pretrained vision and language model.

15. A non-transitory machine-readable medium comprising a plurality of machine-executable instructions which, when executed by one or more processors, are adapted to cause the one or more processors to perform operations comprising: receiving, via a data interface, a 3D model dataset including a plurality of 3D models; generating a plurality of samples of the training dataset using the plurality of 3D models of the 3D model dataset, wherein the generating the plurality of samples includes: generating, using an image generator with multi-view rendering, a plurality of image candidates having different viewpoints of a first 3D model; randomly selecting a first image from the plurality of image candidates; randomly choosing a word from metadata associated with the first 3D model; generating, using a language model, one or more text descriptions using the selected word and a plurality of prompts, wherein the plurality of prompts include a prompt indicating a 3D modality; generating a point cloud by randomly sampling points in the 3D model; generating a first sample including the first image, one or more text descriptions, and the point cloud; and training a neural network based 3D encoder using the training dataset including the first sample.

16. The non-transitory machine-readable medium of claim 15, wherein the generating the point cloud includes: performing augmentation to the point cloud to generate an augmented point cloud; wherein the point cloud of the first sample includes the augmented point cloud.

17. The non-transitory machine-readable medium of claim 16, wherein the augmentation performed to the point cloud includes one of a random point drop augmentation, a random scaling point cloud augmentation, a shift point cloud augmentation, and a rotate perturbation augmentation.

18. The non-transitory machine-readable medium of claim 15, wherein the first image includes an RGB image or a depth map.

19. The non-transitory machine-readable medium of claim 15, wherein the training the neural network based 3D encoder using the training dataset including the first sample includes: generating image representations using the first image of the first sample; generating text representations using the one or more text descriptions of the first sample; generating 3D representations using the point cloud of the first sam
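The claims name four point cloud augmentations and a loss objective that aligns 3D representations with image and text representations, without giving formulas in this preview. The Python sketch below shows one plausible reading of each. All parameter values (`max_ratio`, `low`, `high`, `max_shift`, `sigma`, `clip`, `temperature`) are illustrative assumptions, and the symmetric contrastive (InfoNCE-style) loss is a common choice for cross-modal alignment that is assumed here, not confirmed by the claim text.

```python
import math
import random


def random_point_drop(points, rng, max_ratio=0.5):
    """Drop each point with a randomly drawn probability; keep at least one."""
    ratio = rng.uniform(0.0, max_ratio)
    kept = [p for p in points if rng.random() >= ratio]
    return kept or [points[0]]


def random_scale(points, rng, low=0.8, high=1.25):
    """Scale the whole cloud by one random factor."""
    s = rng.uniform(low, high)
    return [(s * x, s * y, s * z) for x, y, z in points]


def random_shift(points, rng, max_shift=0.1):
    """Translate the whole cloud by one random offset per axis."""
    dx, dy, dz = (rng.uniform(-max_shift, max_shift) for _ in range(3))
    return [(x + dx, y + dy, z + dz) for x, y, z in points]


def rotate_perturbation(points, rng, sigma=0.06, clip=0.18):
    """Apply small clipped Gaussian rotations about the x, y and z axes."""
    ax, ay, az = (max(-clip, min(clip, rng.gauss(0.0, sigma))) for _ in range(3))
    out = []
    for x, y, z in points:
        y, z = y * math.cos(ax) - z * math.sin(ax), y * math.sin(ax) + z * math.cos(ax)
        x, z = x * math.cos(ay) + z * math.sin(ay), -x * math.sin(ay) + z * math.cos(ay)
        x, y = x * math.cos(az) - y * math.sin(az), x * math.sin(az) + y * math.cos(az)
        out.append((x, y, z))
    return out


def augment(points, rng):
    """Apply one augmentation chosen at random from the four listed."""
    fn = rng.choice([random_point_drop, random_scale, random_shift, rotate_perturbation])
    return fn(points, rng)


def _normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE: row i of `a` should match row i of `b`."""
    a = [_normalize(v) for v in a]
    b = [_normalize(v) for v in b]
    logits = [[sum(p * q for p, q in zip(x, y)) / temperature for y in b] for x in a]

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            total += m + math.log(sum(math.exp(z - m) for z in row)) - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))


def alignment_loss(pc_emb, img_emb, txt_emb):
    """Assumed objective: align 3D embeddings with image and text embeddings."""
    return contrastive_loss(pc_emb, img_emb) + contrastive_loss(pc_emb, txt_emb)
```

In this sketch the image and text embeddings are plain inputs, matching the dependent claims in which they come from a pretrained vision and language model. Only the 3D encoder that produces `pc_emb` would receive parameter updates, and that update step is outside the sketch.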

Assignees

Inventors

Classifications

  • using classification, e.g. of video objects · CPC title

  • Validation; Performance evaluation · CPC title

  • Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title

  • using neural networks · CPC title

  • Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12417384B2 cover?
A method of training a neural network based three-dimensional (3D) encoder is provided. A training dataset is generated using a plurality of 3D models of a 3D model dataset. To generate a first sample of the training dataset, an image generator with multi-view rendering is used to generate a plurality of image candidates of a first 3D model. A word is chosen from metadata associated with the fi…
Who is the assignee on this patent?
Salesforce Inc
What technology area does this patent fall under?
Primary CPC classification G06N3/08. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 16, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).