Contextualized speech to text conversion
US-2022360668-A1 · Nov 10, 2022 · US
US12518512B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12518512-B2 |
| Application number | US-202217821596-A |
| Country | US |
| Kind code | B2 |
| Filing date | Aug 23, 2022 |
| Priority date | Nov 21, 2021 |
| Publication date | Jan 6, 2026 |
| Grant date | Jan 6, 2026 |
Official abstract text for this publication.
Examples are provided for pre-training a computer vision foundation model. A representative method comprises curating a pre-training database of image-text pairs from weakly labeled data. Language is encoded of text descriptions from the image-text pairs. The images of the image-text pairs are encoded using a hierarchical vision transformer with shifted windows and convolutional embedding. Based on the encoded images and the encoded language, the computer vision foundation model is pre-trained via unified image-text contrastive learning.
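The contrastive objective named in the abstract can be illustrated with a small sketch. This is not the patent's implementation: the loss form (a bidirectional, temperature-scaled softmax over cosine similarities, averaged over all samples that share a label), the `temperature` default, and every function name below are assumptions chosen for illustration. The one detail taken from the claims is that identical text descriptions map to the same label, so every image-text pair sharing a description is treated as a positive.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def label_texts(texts):
    """Assign one integer label per distinct description (identical text, same label)."""
    ids = {}
    return [ids.setdefault(text, len(ids)) for text in texts]


def directional_loss(queries, keys, labels, temperature):
    """Mean over queries of the negative log-probability of their positive keys."""
    total = 0.0
    for i, query in enumerate(queries):
        logits = [sum(a * b for a, b in zip(query, key)) / temperature for key in keys]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(z - peak) for z in logits))
        # Positives are all keys whose label matches the query's label.
        positives = [j for j, label in enumerate(labels) if label == labels[i]]
        total -= sum(logits[j] - log_denom for j in positives) / len(positives)
    return total / len(queries)


def unified_contrastive_loss(image_emb, text_emb, texts, temperature=0.07):
    """Image-to-text plus text-to-image loss over label-defined positives (assumed form)."""
    labels = label_texts(texts)
    images = [l2_normalize(v) for v in image_emb]
    language = [l2_normalize(v) for v in text_emb]
    return (directional_loss(images, language, labels, temperature)
            + directional_loss(language, images, labels, temperature))


# Toy batch: the first and third samples share the description "a dog".
texts = ["a dog", "a cat", "a dog"]
image_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
text_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(unified_contrastive_loss(image_emb, text_emb, texts))
```

In this toy batch the two "a dog" samples act as positives for each other, which is the behaviour that distinguishes a label-aware contrastive loss from one that treats each image-text pair as its own class.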
Opening claim text (preview).
The invention claimed is:
1. A method for pre-training a computer vision foundation model, the method comprising: curating a pre-training database of image-text pairs from weakly labeled data; encoding language of text descriptions from the image-text pairs to obtain encoded language; encoding images of the image-text pairs using a hierarchical vision transformer that generates projection layers using convolutional operations and utilizes shifted windows to determine local attention from the projection layers that are generated to obtain encoded images; and pre-training the computer vision foundation model based on the encoded images and the encoded language via a unified image-text contrastive learning module.
2. The method of claim 1, wherein the hierarchical vision transformer is a modified swin transformer.
3. The method of claim 1, wherein the hierarchical vision transformer utilizes convolutional operations in considering spatial relationships.
4. The method of claim 1, further comprising augmenting text descriptions below a threshold length based on prompt templates.
5. The method of claim 1, wherein pre-training the computer vision foundation model based on the encoded images and the encoded language comprises pre-training the computer vision foundation model in a first stage wherein augmented text descriptions are included, and in a second stage wherein augmented text descriptions are excluded.
6. The method of claim 1, wherein the unified image-text contrastive learning module maps identical language descriptions to a same language label.
7. The method of claim 1, further comprising: providing the pre-trained computer vision foundation model to two or more task-specific adapters.
8. The method of claim 7, wherein providing the pre-trained computer vision foundation model to two or more task-specific adapters includes a providing a plurality of feature pyramids from different scale levels of the hierarchical vision transformer.
9. A system for pre-training a computer vision foundation model, comprising: a data curation engine configured to curate a pre-training database of image-text pairs from weakly labeled data; and a pre-training model comprising: a language encoder configured to encode language of text descriptions from the image-text pairs to obtain encoded language; an image encoder configured to encode images of the image-text pairs using a hierarchical vision transformer by generating projection layers using convolutional operations and utilizing shifted windows in determining local attention from the projection layers that are generated to obtain encoded images; and a unified image-text contrastive learning module configured to pre-train the computer vision foundation model based on the encoded images and the encoded language.
10. The system of claim 9, wherein the hierarchical vision transformer is a modified swin transformer.
11. The system of claim 9, wherein the hierarchical vision transformer utilizes convolutional operations in considering spatial relationships.
12. The system of claim 9, wherein the unified image-text contrastive learning module is further configured to augmenting text descriptions below a threshold length based on prompt templates.
13. The system of claim 9, wherein pre-training the computer vision foundation model based on the encoded images and the encoded language comprises pre-training the computer vision foundation model in a first stage wherein augmented text descriptions are included, and in a second stage wherein augmented text descriptions are excluded.
14. The system of claim 9, wherein the unified image-text contrastive learning module is further configured to map identical language descriptions to a same language label.
15. The system of claim 9, wherein the resulting pre-trained computer vision foundation model is provided to two or more task-specific adapters.
16. The system of claim 9, wherein providing the pre-trained computer vision foundation model to two or more task-specific adapters includes a providing a plurality of feature pyramids from different scale levels of the hierarchical vision transformer.
17. The method of claim 1, wherein the encoded images include information that represents features of images of the image-text pairs in one or more dimensions.
18. The method of claim 1, wherein the encoded language includes information that represents features of textual data of the image-text pairs in one or more dimensions.
19. The method of claim 1, wherein the projection layers are a subset of a set of layers used to pre-train the computer vision foundation model, and wherein the local attention is not determined from layers outside of the subset.
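Claims 4, 5, 12 and 13 describe augmenting short text descriptions with prompt templates and using the augmented text in only the first of two pre-training stages. A minimal sketch of that idea follows. The claims do not specify the templates, the unit of the length threshold, or what the second stage uses in place of augmented text, so the template strings, the word-count threshold `min_words`, and the choice to fall back to the raw description in stage 2 are all assumptions, as are the function names.

```python
import random

# Hypothetical prompt templates; the claims do not list any.
TEMPLATES = ["a photo of a {}.", "an image of a {}."]


def augment_text(text, templates=TEMPLATES, min_words=3, rng=random):
    """Wrap a description in a template if it has fewer than min_words words."""
    if len(text.split()) < min_words:
        return rng.choice(templates).format(text)
    return text


def stage_texts(texts, stage, templates=TEMPLATES, min_words=3, rng=random):
    """Return the descriptions used in a given stage.

    Stage 1 includes template-augmented descriptions; stage 2 (assumed here)
    uses the raw descriptions with no augmentation applied.
    """
    if stage == 1:
        return [augment_text(t, templates, min_words, rng) for t in texts]
    if stage == 2:
        return list(texts)
    raise ValueError("stage must be 1 or 2")


captions = ["dog", "a brown dog running on grass"]
rng = random.Random(0)  # seeded so repeated runs pick the same templates
print(stage_texts(captions, stage=1, rng=rng))
print(stage_texts(captions, stage=2))
```

Only the one-word caption falls below the assumed threshold, so it is the only one rewritten in stage 1; the longer caption passes through unchanged in both stages.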
Character encoding · CPC title
Image coding (bandwidth or redundancy reduction for static pictures H04N1/41; coding or decoding of static colour picture signals H04N1/64; methods or arrangements for coding, decoding, compressing or decompressing digital video signals H04N19/00) · CPC title
Templates · CPC title
using neural networks · CPC title
Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title