Memory-optimized contrastive learning

US12400432B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-12400432-B2
Application numberUS-202217988655-A
CountryUS
Kind codeB2
Filing dateNov 16, 2022
Priority dateNov 16, 2021
Publication dateAug 26, 2025
Grant dateAug 26, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for using memory-optimized contrastive learning to train image encoder and text encoder neural networks.

First claim

Opening claim text (preview).

What is claimed is: 1. A method performed by one or more computers and for training an image encoder neural network having image encoder neural network parameters and configured to process an image to generate an image embedding of the image in an embedding space and a text encoder neural network having text encoder neural network parameters and configured to process a text segment to generate a text embedding of the text segment in the embedding space, the method comprising: obtaining a batch of training pairs, each training pair including an input image and an input text segment; obtaining data partitioning the batch of training pairs into a plurality of chunks of training pairs; for each chunk: performing, on a set of one or more computing devices, a first forward pass through the image encoder neural network in accordance with current values of the image encoder neural network parameters on the input images in the training pairs in the chunk to generate a respective image embedding of each input image; performing, on the set of one or more computing devices, a first forward pass through the text encoder neural network in accordance with current values of the text encoder neural network parameters on the input text segments in the training pairs in the chunk to generate a respective text embedding of each text segment; storing, in memory of the set of one or more computing devices, the respective image embeddings and the respective text embeddings without storing intermediate hidden states generated by performing the first forward passes through the image encoder neural network and the text encoder neural network; for each training pair in the batch and using the respective image embeddings and the respective text embeddings for the plurality of chunks stored in the memory of the set of one or more computing devices, generating a respective similarity between the image embedding of the input image in the training pair and the respective text embeddings of the input text segments in all of the training pairs in the batch; determining, for each training pair in the batch, a respective gradient with respect to the image embedding of the input image in the training pair of a contrastive loss function that is based on the respective similarities; for each chunk: performing, on the one or more computing devices, a second forward pass through the image encoder neural network in accordance with current values of the image encoder neural network parameters on the input images in the training pairs in the chunk to re-generate the intermediate hidden states of the image encoder neural network; performing a backward pass through the image encoder neural network using the respective gradients with respect to the image embeddings of the input images in the training pairs in the chunk and the re-generated intermediate hidden states to generate a respective chunked gradient of the contrastive loss function with respect to each of the image encoder neural network parameters; and updating the current values of the image encoder neural network parameters using the respective chunked gradients for the chunks. 2. The method of claim 1 , further comprising: determining, for each training pair in the batch, a respective gradient with respect to the text embedding of the input text segment in the training pair of the contrastive loss function that is based on the respective similarities; for each chunk: performing, on the one or more computing devices, a second forward pass through the text encoder neural network in accordance with current values of the text encoder neural network parameters on the input text segments in the training pairs in the chunk to re-generate the intermediate hidden states of the text encoder neural network; and performing a backward pass through the text encoder neural network using the respective gradients with respect to the text embeddings for the text segments in the training pairs in the chunk and the re-generated intermediate hidden states of the text encoder neural network to generate a respective chunked gradient of the contrastive loss function with respect to each of the text encoder neural network parameters; and updating the current values of the text encoder neural network parameters using the respective chunked gradients for the chunks. 3. The method of claim 1 , wherein updating the current values of the image encoder neural network parameters using the respective chunked gradients for the chunks comprises: applying an optimizer to the current values of the image encoder neural network parameters using the respective chunked gradients. 4. The method of claim 3 , wherein the optimizer maintains respective estimates of one or more gradient moments and determines an update to the current values of the image encoder neural network parameters using the respective estimates, and wherein applying the optimizer comprises: for each chunk and for each of the gradient moments, updating the respective estimate of the gradient moment using the respective chunked gradient for the chunk; and determining the update using the respective estimates of each of the gradient moments after the respective estimates have been updated using the respective chunked gradients for all of the chunks. 5. The method of claim 4 , further comprising, for each chunk and after updating, for each of the gradient moments, the respective estimate of the gradient moment using the respective chunked gradient for the chunk, discarding the respective chunked gradient for the chunk. 6. The method of claim 1 , further comprising: for each chunk, after performing a backward pass through the image encoder neural network using the respective gradients with respect to the image embeddings of the training pairs in the chunk and the re-generated intermediate hidden states, discarding the re-generated intermediate hidden states prior to performing a backward pass for any subsequent chunks. 7. The method of claim 1 , wherein generating a respective similarity between the image embedding of the input image in the training pair and the respective text embeddings of the input text segments in all of the training pairs in the batch comprises, for each particular training pair in the batch: computing a dot product between the image embedding of the input image in the training pair and the text embedding of the text embedding of the input text segment in the particular training pair. 8. The method of claim 1 , wherein determining, for each training pair in the batch, a respective gradient of a contrastive loss function that is based on the respective similarities with respect to the image embedding of the input image in the training pair comprises: determining a respective gradient of the contrastive loss function with respect to each respective similarity between any two input image-input text segment pairs in the batch; and determining the respective gradients of the contrastive loss function with respect to the image embeddings for the input images in the training pairs in the batch from the respective gradients of the contrastive loss function with respect to the respective similarities and the image embeddings. 9. The method of claim 1 , further comprising: prior to jointly training the image encoder neural network and the text encoder neural network, training an image classification model that includes the image encoder neural network on an image classification task, wherein the current values of the image encoder neural network parameters are determined based on values of the parameters after training the image classification model. 10. The method of claim 1 , further comprising: after training the image encoder neural n

Assignees

Inventors

Classifications

  • using neural networks · CPC title

  • using classification, e.g. of video objects · CPC title

  • Validation; Performance evaluation · CPC title

  • using neural networks · CPC title

  • Character encoding · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12400432B2 cover?
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for using memory-optimized contrastive learning to train image encoder and text encoder neural networks.
Who is the assignee on this patent?
Google Llc
What technology area does this patent fall under?
Primary CPC classification G06V10/774. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Aug 26 2025 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).