Training vision models with unified contrastive learning

US12518512B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12518512-B2
Application number: US-202217821596-A
Country: US
Kind code: B2
Filing date: Aug 23, 2022
Priority date: Nov 21, 2021
Publication date: Jan 6, 2026
Grant date: Jan 6, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Examples are provided for pre-training a computer vision foundation model. A representative method comprises curating a pre-training database of image-text pairs from weakly labeled data. Language is encoded of text descriptions from the image-text pairs. The images of the image-text pairs are encoded using a hierarchical vision transformer with shifted windows and convolutional embedding. Based on the encoded images and the encoded language, the computer vision foundation model is pre-trained via unified image-text contrastive learning.
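
The abstract names unified image-text contrastive learning, but this page does not spell out the loss. As a rough illustration only, the sketch below assumes a bidirectional softmax loss in which every text that shares a label with an image counts as a positive, and identical descriptions are mapped to the same label (compare claim 6). The function names, the temperature of 0.07, and the averaging over the batch are assumptions made for this example, not details taken from the patent.

```python
import math


def assign_text_labels(texts):
    """Map identical text descriptions to the same integer label."""
    label_of = {}
    return [label_of.setdefault(t, len(label_of)) for t in texts]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def directional_loss(sim, labels):
    """Mean over rows i of the average negative log-softmax of the positives.

    The positives of row i are all columns k with labels[k] == labels[i].
    """
    total = 0.0
    for i, row in enumerate(sim):
        m = max(row)  # subtract the row maximum for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        positives = [k for k, y in enumerate(labels) if y == labels[i]]
        total += -sum(row[k] - log_z for k in positives) / len(positives)
    return total / len(sim)


def unified_contrastive_loss(image_emb, text_emb, texts, temperature=0.07):
    """Assumed form: image-to-text loss plus text-to-image loss."""
    labels = assign_text_labels(texts)
    u = [normalize(x) for x in image_emb]
    v = [normalize(x) for x in text_emb]
    sim = [[dot(a, b) / temperature for b in v] for a in u]
    sim_t = [list(col) for col in zip(*sim)]  # transpose for text-to-image
    return directional_loss(sim, labels) + directional_loss(sim_t, labels)
```

With this assumed form, a batch whose image and text embeddings line up pair by pair yields a loss near zero, while swapping the text embeddings yields a large loss. Two images captioned with the same string are treated as positives for each other's text rather than as negatives.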

First claim

Opening claim text (preview).

The invention claimed is:
1. A method for pre-training a computer vision foundation model, the method comprising: curating a pre-training database of image-text pairs from weakly labeled data; encoding language of text descriptions from the image-text pairs to obtain encoded language; encoding images of the image-text pairs using a hierarchical vision transformer that generates projection layers using convolutional operations and utilizes shifted windows to determine local attention from the projection layers that are generated to obtain encoded images; and pre-training the computer vision foundation model based on the encoded images and the encoded language via a unified image-text contrastive learning module.
2. The method of claim 1, wherein the hierarchical vision transformer is a modified swin transformer.
3. The method of claim 1, wherein the hierarchical vision transformer utilizes convolutional operations in considering spatial relationships.
4. The method of claim 1, further comprising augmenting text descriptions below a threshold length based on prompt templates.
5. The method of claim 1, wherein pre-training the computer vision foundation model based on the encoded images and the encoded language comprises pre-training the computer vision foundation model in a first stage wherein augmented text descriptions are included, and in a second stage wherein augmented text descriptions are excluded.
6. The method of claim 1, wherein the unified image-text contrastive learning module maps identical language descriptions to a same language label.
7. The method of claim 1, further comprising: providing the pre-trained computer vision foundation model to two or more task-specific adapters.
8. The method of claim 7, wherein providing the pre-trained computer vision foundation model to two or more task-specific adapters includes a providing a plurality of feature pyramids from different scale levels of the hierarchical vision transformer.
9. A system for pre-training a computer vision foundation model, comprising: a data curation engine configured to curate a pre-training database of image-text pairs from weakly labeled data; and a pre-training model comprising: a language encoder configured to encode language of text descriptions from the image-text pairs to obtain encoded language; an image encoder configured to encode images of the image-text pairs using a hierarchical vision transformer by generating projection layers using convolutional operations and utilizing shifted windows in determining local attention from the projection layers that are generated to obtain encoded images; and a unified image-text contrastive learning module configured to pre-train the computer vision foundation model based on the encoded images and the encoded language.
10. The system of claim 9, wherein the hierarchical vision transformer is a modified swin transformer.
11. The system of claim 9, wherein the hierarchical vision transformer utilizes convolutional operations in considering spatial relationships.
12. The system of claim 9, wherein the unified image-text contrastive learning module is further configured to augmenting text descriptions below a threshold length based on prompt templates.
13. The system of claim 9, wherein pre-training the computer vision foundation model based on the encoded images and the encoded language comprises pre-training the computer vision foundation model in a first stage wherein augmented text descriptions are included, and in a second stage wherein augmented text descriptions are excluded.
14. The system of claim 9, wherein the unified image-text contrastive learning module is further configured to map identical language descriptions to a same language label.
15. The system of claim 9, wherein the resulting pre-trained computer vision foundation model is provided to two or more task-specific adapters.
16. The system of claim 9, wherein providing the pre-trained computer vision foundation model to two or more task-specific adapters includes a providing a plurality of feature pyramids from different scale levels of the hierarchical vision transformer.
17. The method of claim 1, wherein the encoded images include information that represents features of images of the image-text pairs in one or more dimensions.
18. The method of claim 1, wherein the encoded language includes information that represents features of textual data of the image-text pairs in one or more dimensions.
19. The method of claim 1, wherein the projection layers are a subset of a set of layers used to pre-train the computer vision foundation model, and wherein the local attention is not determined from layers outside of the subset.
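
Claims 4 and 5 describe augmenting short text descriptions with prompt templates and training in two stages. The following sketch shows one possible reading, not the patented implementation. It assumes that length is measured in words, that the threshold is three words, and that the second stage simply uses the raw descriptions without template expansion. The templates and the names `PROMPT_TEMPLATES`, `augment_short_description`, and `build_stage_pairs` are hypothetical.

```python
# Hypothetical templates; the page does not list the ones actually used.
PROMPT_TEMPLATES = ("a photo of a {}.", "an image of a {}.")


def augment_short_description(text, templates=PROMPT_TEMPLATES, min_words=3):
    """Expand a description with templates if it has fewer than min_words words.

    Descriptions at or above the threshold are returned unchanged.
    """
    if len(text.split()) >= min_words:
        return [text]
    return [template.format(text) for template in templates]


def build_stage_pairs(pairs, stage, min_words=3):
    """Build the (image_id, text) pairs for one training stage.

    Stage 1 includes template-augmented descriptions for short texts.
    Stage 2 (assumed reading of "excluded") keeps the raw descriptions only.
    """
    if stage not in (1, 2):
        raise ValueError("stage must be 1 or 2")
    out = []
    for image_id, text in pairs:
        if stage == 1:
            variants = augment_short_description(text, min_words=min_words)
        else:
            variants = [text]
        out.extend((image_id, variant) for variant in variants)
    return out


pairs = [("img1", "dog"), ("img2", "a dog on a beach")]
stage_one = build_stage_pairs(pairs, stage=1)
stage_two = build_stage_pairs(pairs, stage=2)
```

Here `stage_one` holds two templated captions for `img1` plus the unchanged caption for `img2`, and `stage_two` holds the two original pairs. Another reading of claim 5 would drop short-caption pairs entirely in the second stage; the claim text on this page does not settle which is meant.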

Assignees

Microsoft Technology Licensing LLC

Classifications

  • Character encoding · CPC title

  • G06T9/00 · Primary

    Image coding (bandwidth or redundancy reduction for static pictures H04N1/41; coding or decoding of static colour picture signals H04N1/64; methods or arrangements for coding, decoding, compressing or decompressing digital video signals H04N19/00) · CPC title

  • Templates · CPC title

  • using neural networks · CPC title

  • G06V10/774 · Primary

    Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12518512B2 cover?
Examples are provided for pre-training a computer vision foundation model. A representative method comprises curating a pre-training database of image-text pairs from weakly labeled data. Language is encoded of text descriptions from the image-text pairs. The images of the image-text pairs are encoded using a hierarchical vision transformer with shifted windows and convolutional embedding. Base…
Who is the assignee on this patent?
Microsoft Technology Licensing LLC
What technology area does this patent fall under?
Primary CPC classification G06T9/00. Mapped technology areas include Physics.
When was this patent published?
Publication date Jan 6, 2026 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).