US2026065649A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2026065649-A1 |
| Application number | US-202418825123-A |
| Country | US |
| Kind code | A1 |
| Filing date | Sep 5, 2024 |
| Priority date | Sep 5, 2024 |
| Publication date | Mar 5, 2026 |
| Grant date | — |
Official abstract text for this publication.
A method, apparatus, non-transitory computer readable medium, and system for caption generation includes obtaining training data including an input image depicting a scene and training, using the training data, a captioning model to generate a text caption describing the scene. Training the captioning model comprises training an image encoder of the captioning model to encode the input image to obtain an image embedding representing the scene and training a language decoder of the captioning model to generate the text caption based on the image embedding, wherein the captioning model is trained based on an output of the language decoder.
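The abstract says the captioning model is trained based on an output of the language decoder, but it does not name a loss. As a minimal sketch, assuming a token-level cross-entropy (a common choice for caption training, not something the publication confirms), the loss could be computed as below. The names `caption_loss`, `step_logits`, and `target_ids` are hypothetical and used only for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def caption_loss(step_logits, target_ids):
    """Mean negative log-likelihood of the reference caption tokens.

    step_logits: one list of vocabulary logits per decoding step
                 (assumed output of the language decoder).
    target_ids:  reference caption as token ids, one per step.
    """
    if not target_ids or len(step_logits) != len(target_ids):
        raise ValueError("need one logit vector per target token")
    nll = 0.0
    for logits, target in zip(step_logits, target_ids):
        nll -= log_softmax(logits)[target]
    return nll / len(target_ids)

# Toy example: vocabulary of 3 tokens, caption of 2 tokens.
confident = caption_loss([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]], [0, 1])
uniform = caption_loss([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0, 1])
print(confident < uniform)
```

In a real training loop the gradient of this loss would flow back through the language decoder and into the image encoder, which is one way to read "trained based on an output of the language decoder".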
Opening claim text (preview).
What is claimed is:
1. A method of training a machine learning model, the method comprising: obtaining training data including an input image depicting a scene; and training, using the training data, a captioning model to generate a text caption describing the scene, wherein training the captioning model comprises: training an image encoder of the captioning model to encode the input image to obtain an image embedding representing the scene; and training a language decoder of the captioning model to generate the text caption based on the image embedding, wherein the captioning model is trained based on an output of the language decoder.
2. The method of claim 1, wherein training the captioning model further comprises: iteratively generating synthetic captions using the captioning model and updating the captioning model based on the synthetic captions.
3. The method of claim 1, wherein encoding the input image comprises: generating a plurality of local embeddings corresponding to a plurality of regions of the input image, respectively, wherein the image embedding comprises one of the plurality of local embeddings.
4. The method of claim 1, wherein generating the text caption comprises: autoregressively decoding the image embedding.
5. The method of claim 1, further comprising: obtaining an input prompt; and encoding, using a language encoder of the captioning model, the input prompt to obtain a text embedding.
6. The method of claim 4, wherein the image embedding and the text embedding are in a same embedding space.
7. A non-transitory computer readable medium storing code for training a machine learning model, the code comprising instructions executable by at least one processor to perform operations comprising: obtaining training data including an input image; training, using the training data, a first captioning model to generate a synthetic caption based on the input image; generating an augmented caption based on the synthetic caption; and using the synthetic caption and the augmented caption to train a second captioning model.
8. The non-transitory computer readable medium of claim 7, wherein generating the augmented caption comprises: generating the augmented caption using a language generation model.
9. The non-transitory computer readable medium of claim 7, wherein training the second captioning model comprises: identifying a positive pair comprising the input image and the synthetic caption or the augmented caption; and identifying a negative pair comprising the training image and an additional caption corresponding to an additional training image different from the training image.
10. The non-transitory computer readable medium of claim 9, wherein training the second captioning model further comprises: computing a contrastive loss based on the positive pair and the negative pair; and updating parameters of the second captioning model based on the contrastive loss.
11. The method of claim 10, wherein: an image encoder and a language encoder of the second captioning model are updated based on the contrastive loss.
12. The method of claim 6, wherein training the second captioning model comprises: autoregressively generating a predicted caption; computing a caption loss based on the predicted caption; and updating parameters of the second captioning model based on the caption loss.
13. The method of claim 12, wherein: an image encoder and a language decoder of the second captioning model are updated based on the caption loss.
14. The method of claim 6, wherein training the second captioning model comprises: iteratively training the second captioning model, generating synthetic captions, generating augmented captions based on the synthetic captions, and retraining the second captioning model.
15. An apparatus comprising: at least one processor; at least one memory component coupled with the at least one processor; and a captioning model comprising parameters stored in the at least one memory component and trained to generate a text caption describing an input image, wherein the captioning model is trained by generating a synthetic caption, generating an augmented caption based on the synthetic caption, and training the captioning model using the synthetic caption and the augmented caption.
16. The apparatus of claim 15, further comprising: a data engine configured to iteratively generate training data for the captioning model.
17. The apparatus of claim 15, wherein: the captioning model comprises an image encoder configured to encode the input image to obtain an image embedding.
18. The apparatus of claim 16, wherein: the captioning model comprises a language encoder configured to encode an input prompt to obtain a text embedding.
19. The apparatus of claim 16, wherein: the captioning model comprises a language decoder configured to generate a text caption describing the input image.
20. The apparatus of claim 16, further comprising: a language generation model configured to generate the augmented caption.
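Claims 9 and 10 describe positive pairs (an image with its own synthetic or augmented caption), negative pairs (an image with a caption belonging to a different image), and a contrastive loss computed from them. The claims do not fix the form of that loss. The sketch below assumes an InfoNCE-style image-to-caption loss over a batch of already-computed embeddings; the function name `contrastive_loss`, the `temperature` parameter, and the toy 2-D embeddings are illustrative assumptions, not details from the publication.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def contrastive_loss(image_embs, caption_embs, temperature=0.1):
    """Image-to-caption InfoNCE loss over a batch (assumed loss form).

    Row i of image_embs and row i of caption_embs form the positive pair;
    every caption j != i serves as a negative for image i.
    """
    images = [normalize(v) for v in image_embs]
    captions = [normalize(v) for v in caption_embs]
    total = 0.0
    for i, img in enumerate(images):
        sims = [dot(img, cap) / temperature for cap in captions]
        m = max(sims)
        log_z = m + math.log(sum(math.exp(s - m) for s in sims))
        total += log_z - sims[i]  # negative log-probability of the positive
    return total / len(images)

# Toy batch: each image embedding matches its own caption embedding...
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# ...versus a batch where the captions are swapped between the images.
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Under this reading, the embeddings would come from the image encoder and language encoder that claim 11 says are updated based on the contrastive loss; the loss is small when each image is closest to its own caption and large when it is closer to another image's caption.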
Natural language generation · CPC title
Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title
Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title
Validation; Performance evaluation · CPC title