Contextual image translation using neural networks
US-2021374947-A1 · Dec 2, 2021 · US
US12417385B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12417385-B2 |
| Application number | US-202318182952-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 13, 2023 |
| Priority date | Nov 11, 2022 |
| Publication date | Sep 16, 2025 |
| Grant date | Sep 16, 2025 |
Official abstract text for this publication.
Systems and methods for training a neural network based three-dimensional (3D) encoder for 3D classification are provided. A training dataset including a plurality of samples is received, wherein a first sample includes an image, a text, and a point cloud. An image encoder of a pretrained vision and language model is used to generate image representations for the image of the first sample. A text encoder of the pretrained vision and language model is used to generate text representations for the text of the first sample. The neural network based 3D encoder is used to generate 3D representations for the point cloud of the first sample. A loss objective is computed based on the image representations, text representations, and 3D representations. Parameters of the neural network based 3D encoder are updated based on the computed loss objective via backpropagation.
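The loss objective described in the abstract can be illustrated with a small, dependency-free sketch. It assumes a symmetric InfoNCE-style contrastive loss, a temperature of 0.07, and an unweighted sum of the 3D-to-image and 3D-to-text terms; none of these specifics are stated in the abstract, and the function names (`l2_normalize`, `contrastive_loss`, `alignment_loss`) are hypothetical. In practice the image and text embeddings would come from the pretrained vision and language model, and an autodiff framework would backpropagate the loss into the 3D encoder only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def contrastive_loss(batch_a, batch_b, temperature=0.07):
    """Symmetric InfoNCE-style loss between two batches of embeddings.

    Row i of batch_a and row i of batch_b are treated as the positive pair;
    every other row in the batch acts as a negative. (Assumed formulation.)
    """
    a = [l2_normalize(v) for v in batch_a]
    b = [l2_normalize(v) for v in batch_b]
    n = len(a)
    # Pairwise cosine similarities scaled by the temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_on_diagonal(rows):
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)  # subtract the max for numerical stability
            log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(columns))


def alignment_loss(emb_3d, emb_image, emb_text, temperature=0.07):
    """Sum of the 3D-to-image and 3D-to-text contrastive terms (equal weights assumed)."""
    return (
        contrastive_loss(emb_3d, emb_image, temperature)
        + contrastive_loss(emb_3d, emb_text, temperature)
    )


# Toy embeddings: two samples, two dimensions.
toy = [[1.0, 0.0], [0.0, 1.0]]
aligned = alignment_loss(toy, toy, toy)
misaligned = alignment_loss(toy, toy[::-1], toy[::-1])
```

With these toy inputs, `aligned` is much smaller than `misaligned`, which is the behavior a training loop would exploit: lowering the loss pulls each point cloud's representation toward the image and text representations of the same sample.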
Opening claim text (preview).
What is claimed is:
1. A method of training a neural network based three-dimensional (3D) encoder for 3D classification, the method comprising: receiving, via a data interface, a training dataset including a plurality of samples, wherein a first sample includes an image, a text, and a point cloud; generating, using an image encoder of a pretrained vision and language model, image representations for the image of the first sample; generating, using a text encoder of the pretrained vision and language model, text representations for the text of the first sample; generating, using the neural network based 3D encoder, 3D representations for the point cloud of the first sample; computing a loss objective based on the image representations, text representations, and 3D representations; and updating parameters of the neural network based 3D encoder based on the computed loss objective via backpropagation.
2. The method of claim 1, wherein the loss objective is used to align the 3D representations with at least one of the image representations and the text representations.
3. The method of claim 2, wherein the loss objective is used to align the 3D representations with the image representations and the text representations.
4. The method of claim 1, wherein the parameters of the neural network based 3D encoder are updated based on the computed loss objective while the pretrained vision and language model is frozen.
5. The method of claim 1, wherein the loss objective includes at least one of a first contrastive loss between the 3D representations and the text representations and a second contrastive loss between the 3D representations and the image representations.
6. The method of claim 1, wherein the trained 3D encoder is further finetuned with a classification head to perform a 3D classification task.
7. The method of claim 1, wherein the trained 3D encoder is used with one of the text encoder and the image encoder to perform a zero shot 3D classification task.
8. A system for training a three-dimensional (3D) encoder for 3D classification, the system comprising: a memory that stores the neural network based 3D encoder and a plurality of processor-executable instructions; a communication interface that receives a training dataset including a plurality of samples, wherein a first sample includes an image, a text, and a point cloud; and one or more hardware processors that read and execute the plurality of processor-executable instructions from the memory to perform operations comprising: generating, using an image encoder of a pretrained vision and language model, image representations for the image of the first sample; generating, using a text encoder of the pretrained vision and language model, text representations for the text of the first sample; generating, using the neural network based 3D encoder, 3D representations for the point cloud of the first sample; computing a loss objective based on the image representations, text representations, and 3D representations; and updating parameters of the neural network based 3D encoder based on the computed loss objective via backpropagation.
9. The system of claim 8, wherein the loss objective is used to align the 3D representations with at least one of the image representations and the text representations.
10. The system of claim 9, wherein the loss objective is used to align the 3D representations with both the image representations and the text representations.
11. The system of claim 8, wherein the parameters of the neural network based 3D encoder are updated based on the computed loss objective while the pretrained vision and language model is frozen.
12. The system of claim 8, wherein the loss objective includes at least one of a first contrastive loss between the 3D representations and the text representations and a second contrastive loss between the 3D representations and the image representations.
13. The system of claim 8, wherein the trained 3D encoder is further finetuned to with a classification head to perform a 3D classification task.
14. The system of claim 8, wherein the trained 3D encoder is used with one or more of the text encoder and the image encoder to perform a zero shot 3D classification task.
15. A non-transitory machine-readable medium comprising a plurality of machine-executable instructions which, when executed by one or more processors, are adapted to cause the one or more processors to perform operations comprising: receiving, via a data interface, a training dataset including a plurality of samples, wherein a first sample includes an image, a text, and a point cloud; generating, using an image encoder of a pretrained vision and language model, image representations for the image of the first sample; generating, using a text encoder of the pretrained vision and language model, text representations for the text of the first sample; generating, using a 3D encoder, 3D representations for the point cloud of the first sample; computing a loss objective based on the image representations, text representations, and 3D representations; and updating parameters of the neural network based 3D encoder based on the computed loss function via backpropagation.
16. The non-transitory machine-readable medium of claim 15, wherein the loss objective is used to align the 3D representations with at least one of the image representations and the text representations.
17. The non-transitory machine-readable medium of claim 15, wherein the parameters of the neural network based 3D encoder are updated based on the computed loss function while the pretrained vision and language model is frozen.
18. The non-transitory machine-readable medium of claim 15, wherein the loss objective includes at least one of a first contrastive loss between the 3D representations and the text representations and a second contrastive loss between the 3D representations and the image representations.
19. The non-transitory machine-readable medium of claim 15, wherein the trained neural network based 3D encoder is further finetuned to with a classification head to perform a 3D classification task.
20. The non-transitory machine-readable medium of claim 15, wherein the trained neural network based 3D encoder is used with one or more of the text encoder and the image encoder to perform a zero shot 3D classification task.
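The zero shot 3D classification recited in the claims pairs the trained 3D encoder with the text encoder or the image encoder, but the claims do not say how a label is chosen. One common approach, assumed here, is to embed a text prompt per class and pick the class whose text embedding has the highest cosine similarity to the point cloud's 3D embedding. The names `cosine_similarity` and `zero_shot_classify`, and the toy embeddings, are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(embedding_3d, class_text_embeddings):
    """Return the class name whose text embedding is closest to the 3D embedding.

    embedding_3d: output of the trained 3D encoder for one point cloud.
    class_text_embeddings: dict mapping class name -> text-encoder output
    for a prompt describing that class (assumed setup).
    """
    return max(
        class_text_embeddings,
        key=lambda name: cosine_similarity(embedding_3d, class_text_embeddings[name]),
    )


# Toy stand-ins for real encoder outputs.
class_embeddings = {"chair": [1.0, 0.0, 0.0], "table": [0.0, 1.0, 0.0]}
predicted = zero_shot_classify([0.9, 0.1, 0.0], class_embeddings)
```

No classification head or labeled 3D data is involved in this sketch; the class set is defined entirely by the text prompts, which is what makes the procedure zero shot.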
using classification, e.g. of video objects · CPC title
Validation; Performance evaluation · CPC title
Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title
using neural networks · CPC title
Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title