Method and apparatus for three-dimensional object perception
US-2025157230-A1 · May 15, 2025 · US
US12430849B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12430849-B2 |
| Application number | US-202318493035-A |
| Country | US |
| Kind code | B2 |
| Filing date | Oct 24, 2023 |
| Priority date | Mar 13, 2023 |
| Publication date | Sep 30, 2025 |
| Grant date | Sep 30, 2025 |
Official abstract text for this publication.
A method of training a neural network based three-dimensional (3D) encoder is provided. A first plurality of samples of a training dataset are generated using a first 3D model. An image generator with multi-view rendering is used to generate a plurality of two-dimensional (2D) images having different viewpoints of the first 3D model. A first language model is used to generate a plurality of texts corresponding to the plurality of 2D images respectively. A first text for a first image is generated by using one or more text descriptions generated by the first language model. A point cloud is generated by randomly sampling points in the 3D model. The first plurality of samples are generated using the plurality of 2D images, the corresponding plurality of texts, and the point cloud. The neural network based 3D encoder is trained using the training dataset including the first plurality of samples.
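The sample-generation steps in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names (`camera_positions`, `sample_point_cloud`, `build_samples`) are hypothetical, the model is assumed to be a triangle mesh sampled on its surface with area weighting, and the equally spaced viewpoints follow the arrangement recited in claim 4. Rendering and caption generation are left out; the images and texts are passed in as opaque placeholders.

```python
import math
import random


def camera_positions(n_views, radius=2.0, elevation_deg=0.0):
    """Camera positions equally spaced on one circle around the origin.

    Assumption: the object is centred at the origin and z is the up axis.
    """
    elev = math.radians(elevation_deg)
    positions = []
    for k in range(n_views):
        azim = 2.0 * math.pi * k / n_views  # equal angular spacing
        positions.append((
            radius * math.cos(elev) * math.cos(azim),
            radius * math.cos(elev) * math.sin(azim),
            radius * math.sin(elev),
        ))
    return positions


def triangle_area(a, b, c):
    """Area of the triangle (a, b, c): half the cross-product norm."""
    ab = [b[i] - a[i] for i in range(3)]
    ac = [c[i] - a[i] for i in range(3)]
    cross = (
        ab[1] * ac[2] - ab[2] * ac[1],
        ab[2] * ac[0] - ab[0] * ac[2],
        ab[0] * ac[1] - ab[1] * ac[0],
    )
    return 0.5 * math.sqrt(sum(x * x for x in cross))


def sample_point_cloud(vertices, triangles, n_points, rng):
    """Randomly sample n_points on the mesh surface (assumed strategy).

    Triangles are picked with probability proportional to their area, then
    a point is drawn uniformly inside each picked triangle.
    """
    areas = [triangle_area(*(vertices[i] for i in tri)) for tri in triangles]
    chosen = rng.choices(triangles, weights=areas, k=n_points)
    points = []
    for tri in chosen:
        a, b, c = (vertices[i] for i in tri)
        u, v = rng.random(), rng.random()
        if u + v > 1.0:  # reflect back into the triangle
            u, v = 1.0 - u, 1.0 - v
        points.append(tuple(
            a[i] + u * (b[i] - a[i]) + v * (c[i] - a[i]) for i in range(3)
        ))
    return points


def build_samples(images, texts, point_cloud):
    """Pair each view image with its text and the shared point cloud."""
    if len(images) != len(texts):
        raise ValueError("expected one text per image")
    return [(img, txt, point_cloud) for img, txt in zip(images, texts)]
```

A call such as `build_samples(images, texts, sample_point_cloud(verts, tris, 1024, random.Random(0)))` would yield one (image, text, point cloud) triplet per rendered view, all sharing the same point cloud of the model.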
Opening claim text (preview).
What is claimed is:
1. A method of training a neural network based three-dimensional (3D) encoder, the method comprising: generating a first plurality of samples of a training dataset using a first 3D model of a 3D model dataset, wherein the generating the first plurality of samples includes: generating, using an image generator with multi-view rendering, a plurality of two-dimensional (2D) images having different viewpoints of the first 3D model; generating, using a first language model, a plurality of texts corresponding to the plurality of 2D images respectively, wherein the generating the plurality of texts includes: generating a first number of text descriptions for a first image of the plurality of 2D images; generating a first text based on one or more text descriptions selected from the first number of text descriptions; generating, a point cloud by randomly sampling points in the first 3D model; and generating the first plurality of samples using the plurality of 2D images, the plurality of texts, and the point cloud, wherein a first sample includes the first image, the first text corresponding to the first image, and the point cloud; and training the neural network based 3D encoder using the training dataset including the first plurality of samples.
2. The method of claim 1, wherein the first number of text descriptions are generated automatically without using metadata or a human language annotation associated with the first 3D model.
3. The method of claim 1, wherein the generating the plurality of texts includes: generating a third number of text descriptions using metadata or a human language annotation associated with the first 3D model; and generating a second text based on the first number of text descriptions and the third number of text descriptions; wherein a second sample includes the first image, the second text, and the point cloud.
4. The method of claim 1, wherein viewpoints of the plurality of 2D images of the first 3D model are spaced equally around a center of a 3D object of the first 3D model.
5. The method of claim 1, wherein the first language model includes a first generative model trained via multimodal learning.
6. The method of claim 1, wherein the neural network based 3D encoder is trained using a loss objective, and wherein the loss objective includes a 3D-to-image alignment contrastive loss and a 3D-to-text alignment contrastive loss.
7. The method of claim 1, wherein the training the neural network based 3D encoder using the training dataset including the first plurality of samples includes: generating image representations using the first image of a first sample of the first plurality of samples; generating text representations using the first text of the first sample; wherein the image representations and the text representations are generated using a pretrained vision and language model; generating 3D representations using the point cloud of the first sample; and updating parameters of the neural network based 3D encoder using a loss objective to align the 3D representations with the image representations and the text representations.
8. A system for providing a trained neural network based three-dimensional (3D) encoder, the system comprising: a memory that stores a neural network based 3D encoder and a plurality of processor-executable instructions; a communication interface that receives a 3D model dataset including a plurality of 3D models; and one or more hardware processors that read and execute the plurality of processor-executable instructions from the memory to perform operations comprising: generating a first plurality of samples of the training dataset using a first 3D model of a 3D model dataset, wherein the generating the first plurality of samples includes: generating, using an image generator with multi-view rendering, a plurality of two-dimensional (2D) images having different viewpoints of the first 3D model; generate, using a first language model, a plurality of texts corresponding to the plurality of 2D images respectively, wherein the generating the plurality of texts includes: generating a first number of text descriptions for a first image of the plurality of 2D images; generating a first text based on one or more text descriptions selected from the first number of text descriptions; generating, a point cloud by randomly sampling points in the first 3D model; and generating the first plurality of samples using the plurality of 2D images, the plurality of texts, and the point cloud, wherein a first sample includes the first image, the first text corresponding to the first image, and the point cloud; and training the neural network based 3D encoder using the training dataset including the first plurality of samples.
9. The system of claim 8, wherein the first number of text descriptions are generated automatically without using metadata or a human language annotation associated with the first 3D model.
10. The system of claim 9, wherein the generating the plurality of texts includes: generating a third number of text descriptions using metadata or a human language annotation associated with the first 3D model; and generating a second text based on the first number of text descriptions and the third number of text descriptions; wherein a second sample includes the first image, the second text, and the point cloud.
11. The system of claim 8, wherein viewpoints of the plurality of 2D images include: a first plurality of viewpoints spaced equally on a first 360-degree circle around a center of a 3D object of the first 3D model; and a second plurality of viewpoints spaced equally on a second 360-degree circle around the center of the 3D object.
12. The system of claim 8, wherein the first language model includes a first generative model trained via multimodal learning.
13. The system of claim 8, wherein the neural network based 3D encoder is trained using a loss objective, and wherein the loss objective includes a 3D-to-image alignment contrastive loss and a 3D-to-text alignment contrastive loss.
14. The system of claim 8, wherein the training the neural network based 3D encoder using the training dataset including the first plurality of samples includes: generating image representations using the first image of a first sample of the first plurality of samples; generating text representations using the first text of the first sample; wherein the image representations and the text representations are generated using a pretrained vision and language model; generating 3D representations using the point cloud of the first sample; and updating parameters of the neural network based 3D encoder using a loss objective to align the 3D representations with the image representations and the text representations.
15. A non-transitory machine-readable medium comprising a plurality of machine-executable instructions which, when executed by one or more processors, are adapted to cause the one or more processors to perform operations comprising: receiving, via a data interface, a 3D model dataset including a plurality of 3D models; generating a first plurality of samples of the training dataset using a first 3D model of the 3D model dataset, wherein the generating the first plurality of samples
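Claims 6 and 13 recite a loss objective that combines a 3D-to-image alignment contrastive loss with a 3D-to-text alignment contrastive loss. One plausible reading is sketched below. The claims do not fix the exact form of the loss, so the symmetric InfoNCE formulation, the temperature value, and the names `info_nce`, `contrastive`, and `alignment_loss` are all assumptions made for illustration. The embeddings are plain lists of floats standing in for the outputs of the 3D encoder and of the pretrained vision and language model; in an actual training loop only the 3D encoder's parameters would be updated.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] among all keys."""
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)


def contrastive(a, b, temperature=0.07):
    """Symmetric contrastive loss between two batches of paired embeddings."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))


def alignment_loss(z_3d, z_image, z_text, temperature=0.07):
    """Sum of the 3D-to-image and 3D-to-text contrastive terms (assumed weighting)."""
    return (contrastive(z_3d, z_image, temperature)
            + contrastive(z_3d, z_text, temperature))
```

With this formulation, a batch whose 3D embeddings point in the same direction as their paired image and text embeddings gives a loss close to zero, while mismatched pairings are penalized heavily.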
Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title
Combinations of networks · CPC title
Computing arrangements based on biological models · CPC title
Particle system, point based geometry or rendering · CPC title
Three-dimensional [3D] modelling for computer graphics · CPC title