Video retrieval method and apparatus, device, and storage medium
US-2023297617-A1 · Sep 21, 2023 · US
US12417384B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12417384-B2 |
| Application number | US-202318182939-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 13, 2023 |
| Priority date | Nov 11, 2022 |
| Publication date | Sep 16, 2025 |
| Grant date | Sep 16, 2025 |
Official abstract text for this publication.
A method of training a neural network based three-dimensional (3D) encoder is provided. A training dataset is generated using a plurality of 3D models of a 3D model dataset. To generate a first sample of the training dataset, an image generator with multi-view rendering is used to generate a plurality of image candidates of a first 3D model. A word is chosen from metadata associated with the first 3D model. A language model is used to generate one or more text descriptions using the selected word and a plurality of prompts. A point cloud is generated by randomly sampling points in the 3D model. The first sample is generated to include a first image randomly selected from the plurality of image candidates, the one or more text descriptions, and the point cloud. The 3D encoder is trained using the training dataset including the first sample.
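The sample-generation steps in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the prompt templates, the `Sample` container, and the helper names are hypothetical; plain template filling stands in for the language model; and points are drawn from a vertex list rather than from rendered geometry.

```python
import random
from dataclasses import dataclass

# Hypothetical prompt templates. The source only states that a plurality of
# prompts is used and that one of them indicates a 3D modality.
PROMPTS = ["a picture of a {}", "a point cloud model of a {}"]


@dataclass
class Sample:
    image: str         # identifier of the randomly selected rendered view
    texts: list        # one or more text descriptions
    point_cloud: list  # list of (x, y, z) tuples


def sample_point_cloud(vertices, n_points, rng):
    """Randomly sample n_points points (with replacement) from the model.

    Assumption: the 3D model is represented here by a bare vertex list.
    """
    return [rng.choice(vertices) for _ in range(n_points)]


def build_sample(image_candidates, metadata_words, vertices, n_points=8, rng=None):
    """Assemble one (image, text, point cloud) training sample."""
    rng = rng or random.Random()
    image = rng.choice(image_candidates)         # random view selection
    word = rng.choice(metadata_words)            # random metadata word
    # Stand-in for the language model: fill each prompt with the chosen word.
    texts = [prompt.format(word) for prompt in PROMPTS]
    cloud = sample_point_cloud(vertices, n_points, rng)
    return Sample(image=image, texts=texts, point_cloud=cloud)
```

Calling `build_sample(["view_0.png", "view_1.png"], ["chair", "seat"], vertices)` would return one triplet. Repeating the call over every model in the dataset yields a training set in which each 3D model may contribute different views and words across samples.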
Opening claim text (preview).
What is claimed is:
1. A method of training a neural network based three-dimensional (3D) encoder, the method comprising generating a plurality of samples of the training dataset using a plurality of 3D models of a 3D model dataset, wherein the generating the plurality of samples includes: generating, using an image generator with multi-view rendering, a plurality of image candidates having different viewpoints of a first 3D model; randomly selecting a first image from the plurality of image candidates; randomly choosing a word from metadata associated with the first 3D model; generating, using a language model, one or more text descriptions using the selected word and a plurality of prompts, wherein the plurality of prompts include a prompt indicating a 3D modality; generating a point cloud by randomly sampling points in the 3D model; and generating a first sample including the first image, one or more text descriptions, and the point cloud; and training the neural network based 3D encoder using the training dataset including the first sample.
2. The method of claim 1, wherein the generating the point cloud includes: performing an augmentation to the point cloud to generate an augmented point cloud; wherein the point cloud of the first sample includes the augmented point cloud.
3. The method of claim 2, wherein the augmentation performed to the point cloud includes one of a random point drop augmentation, a random scaling point cloud augmentation, a shift point cloud augmentation, and a rotate perturbation augmentation.
4. The method of claim 1, wherein the first image includes an RGB image.
5. The method of claim 1, wherein the first image includes a depth map.
6. The method of claim 1, wherein the training the neural network based 3D encoder using the training dataset including the first sample includes: generating image representations using the first image of the first sample; generating text representations using the one or more text descriptions of the first sample; generating 3D representations using the point cloud of the first sample; and updating parameters of the neural network based 3D encoder using a loss objective to align the 3D representations with the image representations and the text representations.
7. The method of claim 6, wherein the image representations and the text representations are generated using a pretrained vision and language model.
8. A system for providing a trained neural network based three-dimensional (3D) encoder, the system comprising: a memory that stores a neural network based 3D encoder and a plurality of processor-executable instructions; a communication interface that receives a 3D model dataset including a plurality of 3D models; and one or more hardware processors that read and execute the plurality of processor-executable instructions from the memory to perform operations comprising: generating a training dataset including a plurality of samples of a training dataset using the plurality of 3D models of the 3D model dataset, wherein the generating a first sample of the training dataset includes: generating, using an image generator with multi-view rendering, a plurality of image candidates having different viewpoints of a first 3D model; randomly selecting a first image from the plurality of image candidates; generating, using a language model, one or more text descriptions using the selected word and a plurality of prompts, wherein the plurality of prompts include a prompt indicating a 3D modality; generating a point cloud by randomly sampling points in the 3D model; and generating a first sample including the first image, one or more text descriptions, and the point cloud; and training the neural network based 3D encoder using the training dataset including the first sample.
9. The system of claim 8, wherein the generating the point cloud includes: performing augmentation to the point cloud to generate an augmented point cloud; wherein the point cloud of the first sample includes the augmented point cloud.
10. The system of claim 9, wherein the augmentation performed to the point cloud includes one of a random point drop augmentation, a random scaling point cloud augmentation, a shift point cloud augmentation, and a rotate perturbation augmentation.
11. The system of claim 8, wherein the first image includes an RGB image.
12. The system of claim 8, wherein the first image includes a depth map.
13. The system of claim 8, wherein the training the neural network based 3D encoder using the training dataset including the first sample includes: generating image representations using the first image of the first sample; generating text representations using the one or more text descriptions of the first sample; generating 3D representations using the point cloud of the first sample; and updating parameters of the neural network based 3D encoder using a loss objective to align the 3D representations with the image representations and the text representations.
14. The system of claim 13, wherein the image representations and the text representations are generated using a pretrained vision and language model.
15. A non-transitory machine-readable medium comprising a plurality of machine-executable instructions which, when executed by one or more processors, are adapted to cause the one or more processors to perform operations comprising: receiving, via a data interface, a 3D model dataset including a plurality of 3D models; generating a plurality of samples of the training dataset using the plurality of 3D models of the 3D model dataset, wherein the generating the plurality of samples includes: generating, using an image generator with multi-view rendering, a plurality of image candidates having different viewpoints of a first 3D model; randomly selecting a first image from the plurality of image candidates; randomly choosing a word from metadata associated with the first 3D model; generating, using a language model, one or more text descriptions using the selected word and a plurality of prompts, wherein the plurality of prompts include a prompt indicating a 3D modality; generating a point cloud by randomly sampling points in the 3D model; generating a first sample including the first image, one or more text descriptions, and the point cloud; and training a neural network based 3D encoder using the training dataset including the first sample.
16. The non-transitory machine-readable medium of claim 15, wherein the generating the point cloud includes: performing augmentation to the point cloud to generate an augmented point cloud; wherein the point cloud of the first sample includes the augmented point cloud.
17. The non-transitory machine-readable medium of claim 16, wherein the augmentation performed to the point cloud includes one of a random point drop augmentation, a random scaling point cloud augmentation, a shift point cloud augmentation, and a rotate perturbation augmentation.
18. The non-transitory machine-readable medium of claim 15, wherein the first image includes an RGB image or a depth map.
19. The non-transitory machine-readable medium of claim 15, wherein the training the neural network based 3D encoder using the training dataset including the first sample includes: generating image representations using the first image of the first sample; generating text representations using the one or more text descriptions of the first sample; generating 3D representations using the point cloud of the first sam
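The point cloud augmentations named in the claims, and the alignment objective described for the training step, might look roughly like the sketch below. Several choices here are assumptions rather than anything the claims specify: the parameter ranges, the rotation being restricted to the z axis, and the use of a symmetric InfoNCE-style contrastive loss with a temperature of 0.07. The claims only require "a loss objective" that aligns the 3D representations with the image and text representations. All function names are hypothetical.

```python
import math
import random


def random_point_drop(cloud, drop_ratio, rng):
    """Drop each point with probability drop_ratio; keep at least one point.

    Assumes a non-empty cloud.
    """
    kept = [p for p in cloud if rng.random() >= drop_ratio]
    return kept or [cloud[0]]


def random_scale(cloud, lo, hi, rng):
    """Scale every point by one factor drawn uniformly from [lo, hi]."""
    s = rng.uniform(lo, hi)
    return [(x * s, y * s, z * s) for x, y, z in cloud]


def shift(cloud, max_shift, rng):
    """Translate the whole cloud by one random offset per axis."""
    dx, dy, dz = (rng.uniform(-max_shift, max_shift) for _ in range(3))
    return [(x + dx, y + dy, z + dz) for x, y, z in cloud]


def rotate_perturb(cloud, max_angle, rng):
    """Apply a small random rotation about the z axis (a simplification)."""
    a = rng.uniform(-max_angle, max_angle)
    c, s = math.cos(a), math.sin(a)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in cloud]


def _normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def _cross_entropy_diag(rows):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(rows):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(rows)


def contrastive_loss(a_batch, b_batch, temperature=0.07):
    """Symmetric InfoNCE-style loss; matched pairs share a batch index."""
    a = [_normalize(v) for v in a_batch]
    b = [_normalize(v) for v in b_batch]
    logits = [[sum(x * y for x, y in zip(u, w)) / temperature for w in b] for u in a]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(transposed))


def alignment_loss(z_3d, z_image, z_text):
    """Align 3D embeddings with both image and text embeddings."""
    return contrastive_loss(z_3d, z_image) + contrastive_loss(z_3d, z_text)
```

In a training loop, `z_image` and `z_text` would come from the pretrained vision and language model, and only the 3D encoder that produces `z_3d` would receive gradient updates. This sketch computes the loss value only and leaves the optimizer step to whatever framework is in use.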
using classification, e.g. of video objects · CPC title
Validation; Performance evaluation · CPC title
Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title
using neural networks · CPC title
Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title