Image generation using a text and image conditioned machine learning model

US12586259B2 · US · B2

Patent metadata
Publication number: US-12586259-B2
Application number: US-202418426763-A
Country: US
Kind code: B2
Filing date: Jan 30, 2024
Priority date: Mar 20, 2023
Publication date: Mar 24, 2026
Grant date: Mar 24, 2026

Abstract

A method, apparatus, non-transitory computer readable medium, and system for image generation include obtaining a text embedding of a text prompt and an image embedding of an image prompt. Some embodiments map the text embedding into a joint embedding space to obtain a joint text embedding and map the image embedding into the joint embedding space to obtain a joint image embedding. Some embodiments generate a synthetic image based on the joint text embedding and the joint image embedding.
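The mapping step in the abstract can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the assignee's implementation: the two-layer perceptrons follow claims 12 and 13, but the layer sizes, the random initialization, applying the text mapping network to each text token separately, and the names `MLP` and `to_joint_space` are all hypothetical. The image generation model itself is omitted.

```python
import random

Vector = list[float]
Matrix = list[list[float]]


def linear(x: Vector, w: Matrix, b: Vector) -> Vector:
    """Affine layer y = Wx + b, with W stored as out_dim rows of in_dim weights."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi for row, bi in zip(w, b)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


class MLP:
    """Hypothetical two-layer perceptron in_dim -> hidden_dim -> out_dim (random init)."""

    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int, rng: random.Random):
        self.w1 = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.w2 = [[rng.gauss(0.0, 0.1) for _ in range(hidden_dim)] for _ in range(out_dim)]
        self.b2 = [0.0] * out_dim

    def __call__(self, x: Vector) -> Vector:
        if len(x) != len(self.w1[0]):
            raise ValueError(f"expected {len(self.w1[0])} inputs, got {len(x)}")
        return linear(relu(linear(x, self.w1, self.b1)), self.w2, self.b2)


TEXT_DIM, IMAGE_DIM, JOINT_DIM, HIDDEN_DIM = 8, 4, 8, 16  # hypothetical sizes

rng = random.Random(0)
text_mapping = MLP(TEXT_DIM, HIDDEN_DIM, JOINT_DIM, rng)    # text embedding space -> joint space
image_mapping = MLP(IMAGE_DIM, HIDDEN_DIM, JOINT_DIM, rng)  # image embedding space -> joint space


def to_joint_space(text_tokens: list[Vector], image_token: Vector) -> tuple[list[Vector], Vector]:
    """Map each text token and the single image token into the joint embedding space."""
    joint_text = [text_mapping(t) for t in text_tokens]
    joint_image = image_mapping(image_token)
    return joint_text, joint_image


text_tokens = [[0.1 * (i + 1)] * TEXT_DIM for i in range(3)]  # n = 3 dummy text tokens
image_token = [0.5] * IMAGE_DIM                                 # one dummy image token
joint_text, joint_image = to_joint_space(text_tokens, image_token)
print(len(joint_text), len(joint_text[0]), len(joint_image))  # 3 8 8
```

The weights here are random, so only the shapes are meaningful: two inputs of different dimensionality end up as vectors of the same joint dimensionality, which is what lets a single generation model condition on both.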

Claims

What is claimed is:

1. A method for image generation, comprising: obtaining a text embedding of a text prompt in a text embedding space and an image embedding of an image prompt in an image embedding space; mapping, using a text mapping network, the text embedding from the text embedding space into a joint embedding space to obtain a joint text embedding; mapping, using an image mapping network, the image embedding from the image embedding space into the joint embedding space to obtain a joint image embedding; and generating, using an image generation model, a synthetic image based on the joint text embedding and the joint image embedding.

2. The method of claim 1, further comprising: generating, using a generative adversarial network, a high-resolution version of the synthetic image.

3. The method of claim 1, wherein obtaining the text embedding and the image embedding comprises: encoding the text prompt with a text encoder to obtain the text embedding; and encoding the image prompt with an image encoder to obtain the image embedding.

4. The method of claim 1, further comprising: concatenating the joint text embedding and the joint image embedding to obtain a combined embedding.

5. The method of claim 4, wherein: the text embedding comprises n text tokens, where n is greater than one, the image embedding comprises a single image token, and the combined embedding comprises n+1 combined tokens.

6. The method of claim 5, wherein: each of the n text tokens has a dimensionality greater than the single image token.

7. The method of claim 5, wherein: each of the n+1 combined tokens has a same dimensionality as the n text tokens.

8. The method of claim 1, further comprising: learning a default text embedding for a null text prompt.

9. The method of claim 1, further comprising: learning a default image embedding for a null image prompt.

10. A system for image generation, comprising: one or more processors; one or more memory components coupled with the one or more processors; a text mapping network comprising text mapping parameters, the text mapping network trained to map a text embedding from a text embedding space into a joint embedding space to obtain a joint text embedding; an image mapping network comprising image mapping parameters, the image mapping network trained to map an image embedding from an image embedding space into the joint embedding space to obtain a joint image embedding; and an image generation model comprising image generation parameters, the image generation model trained to generate a synthetic image based on the joint text embedding and the joint image embedding.

11. The system of claim 10, the system further comprising: a generative adversarial network (GAN) comprising GAN parameters, the GAN trained to generate a high-resolution version of the synthetic image.

12. The system of claim 10, wherein: the text mapping network comprises a multi-layer perceptron (MLP) architecture.

13. The system of claim 10, wherein: the image mapping network comprises a multi-layer perceptron (MLP) architecture.

14. The system of claim 10, the system further comprising: a text encoder comprising text encoding parameters, the text encoder trained to encode a text prompt to obtain the text embedding.

15. The system of claim 14, wherein: the text encoder is configured to learn a default text embedding for a null text prompt.

16. The system of claim 10, the system further comprising: an image encoder comprising image encoding parameters, the image encoder trained to encode an image prompt to obtain the image embedding.

17. The system of claim 16, wherein: the image encoder is configured to learn a default image embedding for a null image prompt.

18. A non-transitory computer readable medium storing instructions that, when executed by a processor, cause the processor to: obtain a text embedding of a text prompt in a text embedding space and an image embedding of an image prompt in an image embedding space; map, using a text mapping network, the text embedding from the text embedding space into a joint embedding space to obtain a joint text embedding; map, using an image mapping network, the image embedding from the image embedding space into the joint embedding space to obtain a joint image embedding; and generate, using an image generation model, a synthetic image based on the joint text embedding and the joint image embedding.

19. The non-transitory computer readable medium of claim 18, wherein the instructions further cause the processor to: generate a high-resolution version of the synthetic image using a generative adversarial network (GAN).

20. The non-transitory computer readable medium of claim 18, wherein the instructions further cause the processor to: concatenate the joint text embedding and the joint image embedding to obtain a combined embedding.
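Claims 4 to 9 add two details: concatenation into n+1 tokens and learned defaults for null prompts. The sketch below is hypothetical throughout: the dimensions, the zero-valued placeholders standing in for the learned default embeddings, the toy encoders, the zero-padding used in place of the image mapping network, and the identity used in place of the text mapping network are assumptions chosen only so the shapes match claims 5 to 7 (n text tokens plus one image token give n+1 tokens, each with the text-token dimensionality).

```python
from typing import Callable, Optional, TypeVar

Vector = list[float]
P = TypeVar("P")  # prompt type (text or image)
E = TypeVar("E")  # embedding type

TEXT_DIM = 8         # hypothetical per-token text dimensionality
IMAGE_DIM = 4        # hypothetical image-token dimensionality (smaller, as in claim 6)
NUM_TEXT_TOKENS = 3  # hypothetical text sequence length n

# Zero placeholders for the learned null-prompt defaults of claims 8 and 9;
# a trained system would learn these values instead of fixing them.
DEFAULT_TEXT_EMBEDDING: list[Vector] = [[0.0] * TEXT_DIM for _ in range(NUM_TEXT_TOKENS)]
DEFAULT_IMAGE_EMBEDDING: Vector = [0.0] * IMAGE_DIM


def embed_or_default(prompt: Optional[P], encoder: Callable[[P], E], default: E) -> E:
    """Encode a real prompt; fall back to the default embedding for a null prompt."""
    return default if prompt is None else encoder(prompt)


def toy_text_encoder(prompt: str) -> list[Vector]:
    # Hypothetical encoder: n identical tokens filled with the prompt length.
    return [[float(len(prompt))] * TEXT_DIM for _ in range(NUM_TEXT_TOKENS)]


def toy_image_encoder(pixels: list[float]) -> Vector:
    # Hypothetical encoder: one image token filled with the mean pixel value.
    return [sum(pixels) / len(pixels)] * IMAGE_DIM


def toy_image_mapping(token: Vector) -> Vector:
    # Stand-in for the image mapping network: zero-pad the image token up to
    # the joint dimensionality, taken here to equal TEXT_DIM (claim 7).
    return token + [0.0] * (TEXT_DIM - len(token))


def combine(joint_text: list[Vector], joint_image: Vector) -> list[Vector]:
    """Concatenate n joint text tokens and one joint image token into n + 1 tokens."""
    if any(len(t) != len(joint_image) for t in joint_text):
        raise ValueError("all tokens must share the joint dimensionality")
    return joint_text + [joint_image]


text = embed_or_default("a red fox", toy_text_encoder, DEFAULT_TEXT_EMBEDDING)
image = embed_or_default(None, toy_image_encoder, DEFAULT_IMAGE_EMBEDDING)  # null image prompt
# The text mapping network is treated as the identity here to keep the sketch short.
combined = combine(text, toy_image_mapping(image))
print(len(combined), len(combined[0]))  # 4 8
```

Because a null prompt still yields an embedding of the expected shape, the downstream generator always receives the same n+1 token layout whether the user supplies text only, an image only, or both.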

Assignees

Adobe Inc

Inventors

Classifications

  • Lexical analysis, e.g. tokenisation or collocates · CPC title

  • Training; Learning · CPC title

  • Artificial neural networks [ANN] · CPC title

  • Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title

  • G06T11/00 (primary): Two-dimensional [2D] image generation · CPC title

Frequently asked questions

What does patent US12586259B2 cover?
It covers generating a synthetic image from both a text prompt and an image prompt. A text embedding and an image embedding are each mapped, by separate mapping networks, into a joint embedding space, and an image generation model produces the image from the resulting joint text embedding and joint image embedding. It claims this as a method, a system, and a non-transitory computer readable medium.
Who is the assignee on this patent?
Adobe Inc
What technology area does this patent fall under?
The primary CPC classification is G06T11/00 (two-dimensional [2D] image generation). Mapped technology areas include Physics.
When was this patent published?
It was published on Mar 24, 2026, as a granted patent (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).