Self-training on unpaired data for vision-language models

US2026065649A1 · US · A1

Patent metadata
Field: Value
Publication number: US-2026065649-A1
Application number: US-202418825123-A
Country: US
Kind code: A1
Filing date: Sep 5, 2024
Priority date: Sep 5, 2024
Publication date: Mar 5, 2026
Grant date:

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method, apparatus, non-transitory computer readable medium, and system for caption generation includes obtaining training data including an input image depicting a scene and training, using the training data, a captioning model to generate a text caption describing the scene. Training the captioning model comprises training an image encoder of the captioning model to encode the input image to obtain an image embedding representing the scene and training a language decoder of the captioning model to generate the text caption based on the image embedding, wherein the captioning model is trained based on an output of the language decoder.
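
The abstract says the captioning model is trained based on an output of the language decoder but does not name a loss. A minimal sketch, assuming the common choice of token-level cross-entropy between the decoder's per-step vocabulary logits and a target caption, might look like the following. The function names, the toy 3-token vocabulary, and the logit values are illustrative assumptions, not taken from the patent.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def caption_loss(step_logits, target_ids):
    """Mean negative log-likelihood of the target caption tokens.

    step_logits: one list of vocabulary logits per decoding step
                 (a stand-in for the language decoder's output).
    target_ids:  the target token id for each decoding step.
    """
    if len(step_logits) != len(target_ids):
        raise ValueError("one logit vector is required per target token")
    nll = 0.0
    for logits, target in zip(step_logits, target_ids):
        nll -= log_softmax(logits)[target]
    return nll / len(target_ids)


# Hypothetical example: 3-token vocabulary, 2-token caption.
# A decoder with no preference assigns equal logits to every token.
uniform = caption_loss([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0, 2])
# A decoder that favors the target tokens yields a lower loss.
confident = caption_loss([[5.0, 0.0, 0.0], [0.0, 0.0, 5.0]], [0, 2])
```

In a real system the gradient of this loss would flow back through the language decoder into the image encoder, which is one way both components could be trained from the decoder's output.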

First claim

Opening claim text (preview).

What is claimed is:

1. A method of training a machine learning model, the method comprising: obtaining training data including an input image depicting a scene; and training, using the training data, a captioning model to generate a text caption describing the scene, wherein training the captioning model comprises: training an image encoder of the captioning model to encode the input image to obtain an image embedding representing the scene; and training a language decoder of the captioning model to generate the text caption based on the image embedding, wherein the captioning model is trained based on an output of the language decoder.

2. The method of claim 1, wherein training the captioning model further comprises: iteratively generating synthetic captions using the captioning model and updating the captioning model based on the synthetic captions.

3. The method of claim 1, wherein encoding the input image comprises: generating a plurality of local embeddings corresponding to a plurality of regions of the input image, respectively, wherein the image embedding comprises one of the plurality of local embeddings.

4. The method of claim 1, wherein generating the text caption comprises: autoregressively decoding the image embedding.

5. The method of claim 1, further comprising: obtaining an input prompt; and encoding, using a language encoder of the captioning model, the input prompt to obtain a text embedding.

6. The method of claim 4, wherein the image embedding and the text embedding are in a same embedding space.

7. A non-transitory computer readable medium storing code for training a machine learning model, the code comprising instructions executable by at least one processor to perform operations comprising: obtaining training data including an input image; training, using the training data, a first captioning model to generate a synthetic caption based on the input image; generating an augmented caption based on the synthetic caption; and using the synthetic caption and the augmented caption to train a second captioning model.

8. The non-transitory computer readable medium of claim 7, wherein generating the augmented caption comprises: generating the augmented caption using a language generation model.

9. The non-transitory computer readable medium of claim 7, wherein training the second captioning model comprises: identifying a positive pair comprising the input image and the synthetic caption or the augmented caption; and identifying a negative pair comprising the training image and an additional caption corresponding to an additional training image different from the training image.

10. The non-transitory computer readable medium of claim 9, wherein training the second captioning model further comprises: computing a contrastive loss based on the positive pair and the negative pair; and updating parameters of the second captioning model based on the contrastive loss.

11. The method of claim 10, wherein: an image encoder and a language encoder of the second captioning model are updated based on the contrastive loss.

12. The method of claim 6, wherein training the second captioning model comprises: autoregressively generating a predicted caption; computing a caption loss based on the predicted caption; and updating parameters of the second captioning model based on the caption loss.

13. The method of claim 12, wherein: an image encoder and a language decoder of the second captioning model are updated based on the caption loss.

14. The method of claim 6, wherein training the second captioning model comprises: iteratively training the second captioning model, generating synthetic captions, generating augmented captions based on the synthetic captions, and retraining the second captioning model.

15. An apparatus comprising: at least one processor; at least one memory component coupled with the at least one processor; and a captioning model comprising parameters stored in the at least one memory component and trained to generate a text caption describing an input image, wherein the captioning model is trained by generating a synthetic caption, generating an augmented caption based on the synthetic caption, and training the captioning model using the synthetic caption and the augmented caption.

16. The apparatus of claim 15, further comprising: a data engine configured to iteratively generate training data for the captioning model.

17. The apparatus of claim 15, wherein: the captioning model comprises an image encoder configured to encode the input image to obtain an image embedding.

18. The apparatus of claim 16, wherein: the captioning model comprises a language encoder configured to encode an input prompt to obtain a text embedding.

19. The apparatus of claim 16, wherein: the captioning model comprises a language decoder configured to generate a text caption describing the input image.

20. The apparatus of claim 16, further comprising: a language generation model configured to generate the augmented caption.
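
Claims 9 and 10 describe a contrastive loss computed from a positive pair (an image with its own synthetic or augmented caption) and a negative pair (the same image with a caption belonging to a different image), but they do not fix the form of that loss. A minimal sketch, assuming an InfoNCE-style image-to-caption loss over a batch of precomputed embeddings in a shared space, is shown below. The function names, the 2-dimensional toy embeddings, and the temperature of 0.07 are illustrative assumptions rather than details from the patent.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(image_embs, caption_embs, temperature=0.07):
    """Image-to-caption InfoNCE loss averaged over a batch.

    Row i of image_embs and row i of caption_embs form the positive pair.
    Every other caption in the batch acts as a negative for image i.
    The temperature value is an assumed hyperparameter.
    """
    images = [normalize(v) for v in image_embs]
    captions = [normalize(v) for v in caption_embs]
    total = 0.0
    for i, img in enumerate(images):
        # Cosine similarity of image i to every caption, scaled by temperature.
        sims = [dot(img, cap) / temperature for cap in captions]
        m = max(sims)
        log_z = m + math.log(sum(math.exp(s - m) for s in sims))
        # Negative log-probability that image i selects its own caption.
        total -= sims[i] - log_z
    return total / len(images)


# Hypothetical embeddings: each image matches its own caption exactly.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# Captions swapped: each image is closest to the wrong caption.
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when every image is most similar to its own caption and grows when a caption from another image scores higher, which is the behavior that would push an image encoder and a language encoder toward a shared embedding space.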

Assignees

Adobe Inc

Inventors

Classifications

  • Natural language generation · CPC title

  • G06V10/774 · Primary

    Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title

  • Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title

  • G06V10/776 · Primary

    Validation; Performance evaluation · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2026065649A1 cover?
A method, apparatus, non-transitory computer readable medium, and system for caption generation includes obtaining training data including an input image depicting a scene and training, using the training data, a captioning model to generate a text caption describing the scene. Training the captioning model comprises training an image encoder of the captioning model to encode the input image to…
Who is the assignee on this patent?
Adobe Inc
What technology area does this patent fall under?
Primary CPC classification G06V10/774. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 5, 2026 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).