Similarity retrieval
US-2024403625-A1 · Dec 5, 2024 · US
US12400432B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12400432-B2 |
| Application number | US-202217988655-A |
| Country | US |
| Kind code | B2 |
| Filing date | Nov 16, 2022 |
| Priority date | Nov 16, 2021 |
| Publication date | Aug 26, 2025 |
| Grant date | Aug 26, 2025 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for using memory-optimized contrastive learning to train image encoder and text encoder neural networks.
Opening claim text (preview).
What is claimed is: 1. A method performed by one or more computers and for training an image encoder neural network having image encoder neural network parameters and configured to process an image to generate an image embedding of the image in an embedding space and a text encoder neural network having text encoder neural network parameters and configured to process a text segment to generate a text embedding of the text segment in the embedding space, the method comprising: obtaining a batch of training pairs, each training pair including an input image and an input text segment; obtaining data partitioning the batch of training pairs into a plurality of chunks of training pairs; for each chunk: performing, on a set of one or more computing devices, a first forward pass through the image encoder neural network in accordance with current values of the image encoder neural network parameters on the input images in the training pairs in the chunk to generate a respective image embedding of each input image; performing, on the set of one or more computing devices, a first forward pass through the text encoder neural network in accordance with current values of the text encoder neural network parameters on the input text segments in the training pairs in the chunk to generate a respective text embedding of each text segment; storing, in memory of the set of one or more computing devices, the respective image embeddings and the respective text embeddings without storing intermediate hidden states generated by performing the first forward passes through the image encoder neural network and the text encoder neural network; for each training pair in the batch and using the respective image embeddings and the respective text embeddings for the plurality of chunks stored in the memory of the set of one or more computing devices, generating a respective similarity between the image embedding of the input image in the training pair and the respective text embeddings of the input text segments in all of the training pairs in the batch; determining, for each training pair in the batch, a respective gradient with respect to the image embedding of the input image in the training pair of a contrastive loss function that is based on the respective similarities; for each chunk: performing, on the one or more computing devices, a second forward pass through the image encoder neural network in accordance with current values of the image encoder neural network parameters on the input images in the training pairs in the chunk to re-generate the intermediate hidden states of the image encoder neural network; performing a backward pass through the image encoder neural network using the respective gradients with respect to the image embeddings of the input images in the training pairs in the chunk and the re-generated intermediate hidden states to generate a respective chunked gradient of the contrastive loss function with respect to each of the image encoder neural network parameters; and updating the current values of the image encoder neural network parameters using the respective chunked gradients for the chunks. 2. The method of claim 1 , further comprising: determining, for each training pair in the batch, a respective gradient with respect to the text embedding of the input text segment in the training pair of the contrastive loss function that is based on the respective similarities; for each chunk: performing, on the one or more computing devices, a second forward pass through the text encoder neural network in accordance with current values of the text encoder neural network parameters on the input text segments in the training pairs in the chunk to re-generate the intermediate hidden states of the text encoder neural network; and performing a backward pass through the text encoder neural network using the respective gradients with respect to the text embeddings for the text segments in the training pairs in the chunk and the re-generated intermediate hidden states of the text encoder neural network to generate a respective chunked gradient of the contrastive loss function with respect to each of the text encoder neural network parameters; and updating the current values of the text encoder neural network parameters using the respective chunked gradients for the chunks. 3. The method of claim 1 , wherein updating the current values of the image encoder neural network parameters using the respective chunked gradients for the chunks comprises: applying an optimizer to the current values of the image encoder neural network parameters using the respective chunked gradients. 4. The method of claim 3 , wherein the optimizer maintains respective estimates of one or more gradient moments and determines an update to the current values of the image encoder neural network parameters using the respective estimates, and wherein applying the optimizer comprises: for each chunk and for each of the gradient moments, updating the respective estimate of the gradient moment using the respective chunked gradient for the chunk; and determining the update using the respective estimates of each of the gradient moments after the respective estimates have been updated using the respective chunked gradients for all of the chunks. 5. The method of claim 4 , further comprising, for each chunk and after updating, for each of the gradient moments, the respective estimate of the gradient moment using the respective chunked gradient for the chunk, discarding the respective chunked gradient for the chunk. 6. The method of claim 1 , further comprising: for each chunk, after performing a backward pass through the image encoder neural network using the respective gradients with respect to the image embeddings of the training pairs in the chunk and the re-generated intermediate hidden states, discarding the re-generated intermediate hidden states prior to performing a backward pass for any subsequent chunks. 7. The method of claim 1 , wherein generating a respective similarity between the image embedding of the input image in the training pair and the respective text embeddings of the input text segments in all of the training pairs in the batch comprises, for each particular training pair in the batch: computing a dot product between the image embedding of the input image in the training pair and the text embedding of the text embedding of the input text segment in the particular training pair. 8. The method of claim 1 , wherein determining, for each training pair in the batch, a respective gradient of a contrastive loss function that is based on the respective similarities with respect to the image embedding of the input image in the training pair comprises: determining a respective gradient of the contrastive loss function with respect to each respective similarity between any two input image-input text segment pairs in the batch; and determining the respective gradients of the contrastive loss function with respect to the image embeddings for the input images in the training pairs in the batch from the respective gradients of the contrastive loss function with respect to the respective similarities and the image embeddings. 9. The method of claim 1 , further comprising: prior to jointly training the image encoder neural network and the text encoder neural network, training an image classification model that includes the image encoder neural network on an image classification task, wherein the current values of the image encoder neural network parameters are determined based on values of the parameters after training the image classification model. 10. The method of claim 1 , further comprising: after training the image encoder neural n
using neural networks · CPC title
using classification, e.g. of video objects · CPC title
Validation; Performance evaluation · CPC title
using neural networks · CPC title
Character encoding · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.