Memory-optimized contrastive learning

US2026024318A1 · US · A1

Patent metadata
Field               Value
Publication number  US-2026024318-A1
Application number  US-202519284474-A
Country             US
Kind code           A1
Filing date         Jul 29, 2025
Priority date       Nov 16, 2021
Publication date    Jan 22, 2026
Grant date          (none listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for using memory-optimized contrastive learning to train image encoder and text encoder neural networks.
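As a rough illustration of the kind of contrastive objective the abstract refers to, the sketch below scores each image embedding against every text embedding in a batch by dot product (the similarity named in claim 8) and applies a softmax cross-entropy that treats the paired text as the positive. The name `contrastive_loss`, the image-to-text-only direction, and the absence of a temperature are assumptions made for illustration; the exact loss used in the publication is not stated on this page.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(image_embs, text_embs):
    """Mean image-to-text softmax cross-entropy (illustrative assumption).

    Pair i is (image_embs[i], text_embs[i]); every other text in the batch
    acts as a negative for image i.
    """
    n = len(image_embs)
    total = 0.0
    for i, x in enumerate(image_embs):
        sims = [dot(x, y) for y in text_embs]
        m = max(sims)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in sims))
        total += log_z - sims[i]
    return total / n


# Two toy pairs: the loss is lower when each image matches its own text.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Because every image is compared with every text in the batch, the loss couples all pairs together, which is why larger batches cost more memory during training.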

First claim

Opening claim text (preview).

1. (canceled)

2. A method performed by one or more computers and for training an image encoder neural network having image encoder neural network parameters and configured to process an image to generate an image embedding of the image in an embedding space and a text encoder neural network having text encoder neural network parameters and configured to process a text segment to generate a text embedding of the text segment in the embedding space, the method comprising:
    maintaining respective estimates of each of one or more gradient moments;
    obtaining a batch of training pairs, each training pair including an input image and an input text segment;
    obtaining data partitioning the batch of training pairs into a plurality of chunks of training pairs;
    for each chunk:
        performing, on a set of one or more computing devices, a first forward pass through the image encoder neural network in accordance with current values of the image encoder neural network parameters on the input images in the training pairs in the chunk to generate a respective image embedding of each input image;
        performing, on the set of one or more computing devices, a first forward pass through the text encoder neural network in accordance with current values of the text encoder neural network parameters on the input text segments in the training pairs in the chunk to generate a respective text embedding of each text segment;
    for each training pair in the batch and using the respective image embeddings and the respective text embeddings for the plurality of chunks, generating a respective similarity between the image embedding of the input image in the training pair and the respective text embeddings of the input text segments in all of the training pairs in the batch;
    determining, for each training pair in the batch, a respective gradient with respect to the image embedding of the input image in the training pair of a contrastive loss function that is based on the respective similarities;
    for each chunk, generating a respective chunked gradient of the contrastive loss function with respect to each of the image encoder neural network parameters; and
    updating the current values of the image encoder neural network parameters using the respective chunked gradients for the chunks by applying an optimizer to the current values of the image encoder neural network parameters using the respective chunked gradients, comprising:
        for each chunk and for each of the gradient moments, updating the respective estimate of the gradient moment using the respective chunked gradient for the chunk; and
        determining the update using the respective estimates of each of the gradient moments after the respective estimates have been updated using the respective chunked gradients for all of the chunks.

3. The method of claim 2, further comprising:
    determining, for each training pair in the batch, a respective gradient with respect to the text embedding of the input text segment in the training pair of the contrastive loss function that is based on the respective similarities;
    for each chunk:
        performing, on the one or more computing devices, a second forward pass through the text encoder neural network in accordance with current values of the text encoder neural network parameters on the input text segments in the training pairs in the chunk to re-generate the intermediate hidden states of the text encoder neural network; and
        performing a backward pass through the text encoder neural network using the respective gradients with respect to the text embeddings for the text segments in the training pairs in the chunk and the re-generated intermediate hidden states of the text encoder neural network to generate a respective chunked gradient of the contrastive loss function with respect to each of the text encoder neural network parameters; and
    updating the current values of the text encoder neural network parameters using the respective chunked gradients for the chunks.

4. The method of claim 2, further comprising: storing, in memory of the set of one or more computing devices, the respective image embeddings and the respective text embeddings without storing intermediate hidden states generated by performing the first forward passes through the image encoder neural network and the text encoder neural network.

5. The method of claim 4, wherein generating the respective chunked gradient comprises:
    performing, on the one or more computing devices, a second forward pass through the image encoder neural network in accordance with current values of the image encoder neural network parameters on the input images in the training pairs in the chunk to re-generate the intermediate hidden states of the image encoder neural network;
    performing a backward pass through the image encoder neural network using the respective gradients with respect to the image embeddings of the input images in the training pairs in the chunk and the re-generated intermediate hidden states to generate the respective chunked gradient.

6. The method of claim 2, further comprising, for each chunk and after updating, for each of the gradient moments, the respective estimate of the gradient moment using the respective chunked gradient for the chunk, discarding the respective chunked gradient for the chunk.

7. The method of claim 5, further comprising: for each chunk, after performing a backward pass through the image encoder neural network using the respective gradients with respect to the image embeddings of the training pairs in the chunk and the re-generated intermediate hidden states, discarding the re-generated intermediate hidden states prior to performing a backward pass for any subsequent chunks.

8. The method of claim 2, wherein generating a respective similarity between the image embedding of the input image in the training pair and the respective text embeddings of the input text segments in all of the training pairs in the batch comprises, for each particular training pair in the batch: computing a dot product between the image embedding of the input image in the training pair and the text embedding of the text embedding of the input text segment in the particular training pair.

9. The method of claim 2, wherein determining, for each training pair in the batch, a respective gradient of a contrastive loss function that is based on the respective similarities with respect to the image embedding of the input image in the training pair comprises:
    determining a respective gradient of the contrastive loss function with respect to each respective similarity between any two input image-input text segment pairs in the batch; and
    determining the respective gradients of the contrastive loss function with respect to the image embeddings for the input images in the training pairs in the batch from the respective gradients of the contrastive loss function with respect to the respective similarities and the image embeddings.

10. The method of claim 2, further comprising: prior to jointly training the image encoder neural network and the text encoder neural network, training an image classification model that includes the image encoder neural network on an image classification task, wherein the current values of the image encoder neural network parameters are determined based on values of the parameters after training the image classification model.

11. The method of claim 2, further comprising: after training the image encoder neural network and the text encoder neural network, using the trained image encoder neural network and the trained text encoder neural network to perform a downstream task.

12. The method of claim 11, wher
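The image-encoder steps recited in claims 2 and 4 through 7 can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the applicant's implementation: the encoder is a toy `tanh(input @ W)` map on plain Python lists, the loss is an image-to-text softmax cross-entropy, only one gradient moment (a first-moment moving average) is maintained, and the names `encode`, `embedding_grads`, `chunk_grad`, `train_step`, `beta`, and `lr` are hypothetical.

```python
import math


def encode(W, inputs):
    """Toy encoder standing in for an image or text encoder: tanh(input @ W)."""
    return [[math.tanh(sum(x[k] * W[k][j] for k in range(len(W))))
             for j in range(len(W[0]))] for x in inputs]


def embedding_grads(img_embs, txt_embs):
    """Gradient of the assumed loss w.r.t. each image embedding (cf. claim 9)."""
    n = len(img_embs)
    grads = []
    for i, x in enumerate(img_embs):
        sims = [sum(a * b for a, b in zip(x, y)) for y in txt_embs]
        m = max(sims)
        exps = [math.exp(s - m) for s in sims]
        z = sum(exps)
        # d loss / d sims[j] = (softmax_j - [j == i]) / n
        d_sims = [(e / z - (1.0 if j == i else 0.0)) / n
                  for j, e in enumerate(exps)]
        # Chain rule through the dot products back to the image embedding.
        grads.append([sum(d_sims[j] * txt_embs[j][c] for j in range(n))
                      for c in range(len(x))])
    return grads


def chunk_grad(W, inputs, d_embs):
    """Second forward pass plus backward pass for one chunk (cf. claim 5)."""
    embs = encode(W, inputs)  # re-generate activations rather than storing them
    grad = [[0.0] * len(W[0]) for _ in W]
    for x, emb, d in zip(inputs, embs, d_embs):
        for j in range(len(W[0])):
            d_pre = d[j] * (1.0 - emb[j] ** 2)  # derivative of tanh
            for k in range(len(W)):
                grad[k][j] += x[k] * d_pre
    return grad  # `embs` is dropped here, before the next chunk is processed


def train_step(W_img, W_txt, images, texts, moment, chunk_size, beta=0.9, lr=0.1):
    n = len(images)
    chunks = [(s, min(s + chunk_size, n)) for s in range(0, n, chunk_size)]
    # First forward passes, chunk by chunk: keep only the embeddings.
    img_embs, txt_embs = [], []
    for start, stop in chunks:
        img_embs += encode(W_img, images[start:stop])
        txt_embs += encode(W_txt, texts[start:stop])
    # Similarities and embedding gradients use the whole batch.
    d_img = embedding_grads(img_embs, txt_embs)
    # Decay the moment once, then fold in each chunked gradient as it arrives.
    moment = [[beta * v for v in row] for row in moment]
    for start, stop in chunks:
        g = chunk_grad(W_img, images[start:stop], d_img[start:stop])
        moment = [[mv + (1.0 - beta) * gv for mv, gv in zip(mr, gr)]
                  for mr, gr in zip(moment, g)]
        # `g` is not kept after this point (cf. claim 6).
    # The update is computed only after all chunks have been folded in.
    new_W = [[w - lr * mv for w, mv in zip(wr, mr)]
             for wr, mr in zip(W_img, moment)]
    return new_W, moment


images = [[1.0, 0.5], [0.2, -1.0], [-0.7, 0.3], [0.4, 0.9]]
texts = [[0.9, 0.1], [-0.3, 0.8], [0.5, -0.6], [0.1, 0.2]]
W_img = [[0.3, -0.2], [0.1, 0.4]]
W_txt = [[-0.5, 0.2], [0.3, 0.1]]
zero = [[0.0, 0.0], [0.0, 0.0]]
chunked_W, _ = train_step(W_img, W_txt, images, texts, zero, chunk_size=2)
full_W, _ = train_step(W_img, W_txt, images, texts, zero, chunk_size=4)
```

In this sketch, processing the batch as two chunks of two pairs yields the same updated weights as processing it as a single chunk, up to floating-point rounding, because the first moment is linear in the gradient. A second-moment estimate would not decompose this way without further assumptions, so it is left out here.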

Assignees

Google LLC

Inventors

Classifications

  • using neural networks · CPC title

  • using classification, e.g. of video objects · CPC title

  • Validation; Performance evaluation · CPC title

  • using neural networks · CPC title

  • Character encoding · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2026024318A1 cover?
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for using memory-optimized contrastive learning to train image encoder and text encoder neural networks.

Who is the assignee on this patent?
Google LLC

What technology area does this patent fall under?
Primary CPC classification G06V10/774. Mapped technology areas include Physics.

When was this patent published?
Published on Jan 22, 2026 (kind code A1). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).