Systems and methods for hierarchical text-conditional image generation
US-11922550-B1 · Mar 5, 2024 · US
US2024355022A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2024355022-A1 |
| Application number | US-202318476504-A |
| Country | US |
| Kind code | A1 |
| Filing date | Sep 28, 2023 |
| Priority date | Apr 20, 2023 |
| Publication date | Oct 24, 2024 |
| Grant date | — |
Official abstract text for this publication.
One or more aspects of a method, apparatus, and non-transitory computer readable medium include obtaining an input description and an input image depicting a subject, encoding the input description using a text encoder of an image generation model to obtain a text embedding, and encoding the input image using a subject encoder of the image generation model to obtain a subject embedding. A guidance embedding is generated by combining the subject embedding and the text embedding, and then an output image is generated based on the guidance embedding using a diffusion model of the image generation model. The output image depicts aspects of the subject and the input description.
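As a rough illustration of how a guidance embedding might be formed from the two embeddings, the sketch below follows one reading of the disclosure, in which an identifier in the text embedding is replaced with the subject embedding. The function name `build_guidance_embedding`, the `<subject>` placeholder token, the per-token list-of-vectors layout, and the toy numbers are all assumptions made for illustration; they are not taken from the publication.

```python
from typing import List, Sequence

Vector = List[float]


def build_guidance_embedding(
    tokens: Sequence[str],
    text_embedding: Sequence[Vector],
    subject_embedding: Vector,
    identifier: str = "<subject>",  # hypothetical placeholder token
) -> List[Vector]:
    """Return a copy of text_embedding with the identifier's row(s) replaced.

    Assumes one embedding row per token; this layout is an illustrative
    assumption, not something the publication specifies.
    """
    if len(tokens) != len(text_embedding):
        raise ValueError("expected one embedding row per token")
    if identifier not in tokens:
        raise ValueError(f"identifier {identifier!r} not found in tokens")

    # Copy every row so the caller's text embedding is left unchanged.
    guidance = [list(row) for row in text_embedding]
    for i, token in enumerate(tokens):
        if token == identifier:
            if len(subject_embedding) != len(guidance[i]):
                raise ValueError("subject and text embedding dimensions differ")
            guidance[i] = list(subject_embedding)
    return guidance


# Toy data standing in for real encoder outputs.
tokens = ["a", "photo", "of", "<subject>", "on", "a", "beach"]
text_embedding = [[float(i), float(i) + 0.5] for i in range(len(tokens))]
subject_embedding = [9.0, -9.0]

guidance = build_guidance_embedding(tokens, text_embedding, subject_embedding)
```

In this sketch only the row at the identifier position changes, so the rest of the description still conditions the diffusion model as ordinary text.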
Opening claim text (preview).
What is claimed is:
1. A method of generating an image, comprising: obtaining an input description and an input image depicting a subject; encoding the input description using a text encoder of an image generation model to obtain a text embedding; encoding the input image using a subject encoder of the image generation model to obtain a subject embedding; generating a guidance embedding by combining the subject embedding and the text embedding; and generating an output image based on the guidance embedding using a diffusion model of the image generation model, wherein the output image depicts one or more aspects of the input image and the input description.
2. The method of claim 1, wherein: the guidance embedding is generated by replacing an identifier in the text embedding with the subject embedding.
3. The method of claim 1, further comprising: encoding the input image to obtain a feature embedding representing a subject identity of the input image, wherein the output image is generated based on the feature embedding.
4. The method of claim 3, further comprising: introducing the feature embedding into an adapter layer of the image generation model to preserve subject identity in the output image.
5. The method of claim 1, wherein: the text embedding is generated using a first multi-modal encoder and the subject embedding is generated using a second multi-modal encoder.
6. The method of claim 1, further comprising: applying a balance factor and a renormalization factor to the subject embedding.
7. The method of claim 6, wherein: the balance factor is less than 1.
8. A method of training an image generation model, comprising: obtaining a training data set including a training image; and training an image generation model including a subject encoder and a diffusion model based on the training set, wherein the subject encoder is trained to encode an input image depicting a subject to obtain a subject embedding and the diffusion model is trained to generate an output image depicting the subject based on the subject embedding.
9. The method of claim 8, further comprising: generating a feature embedding; and applying a balance factor and a renormalization factor to the feature embedding.
10. The method of claim 9, further comprising: setting the balance factor to one.
11. The method of claim 8, further comprising: masking out a background of the training image.
12. The method of claim 11, further comprising: performing augmentations to the training image to obtain additional training data.
13. The method of claim 12, wherein: the diffusion model of the image generation model includes a U-net with adapter layers, wherein parameters of the U-net are fixed during the training.
14. The method of claim 13, further comprising: obtaining a latent noisy image from a ground-truth image, wherein the training is based on the latent noisy image and the ground-truth image.
15. An apparatus comprising: one or more processors; one or more memories including instructions executable by the one or more processors; an image generation model comprising parameters stored in the one or more memories, wherein the image generation model is configured to receive a plurality of images as input, and is trained to generate a new image based on a feature embedding and a text embedding generated from the plurality of images and an input description.
16. The apparatus of claim 15, wherein: the image generation model comprises a diffusion model including a U-net with one or more adapter layers.
17. The apparatus of claim 16, wherein: parameters of cross-attention layers of the U-net are fixed during training, and the adapter layers are trainable during the training.
18. The apparatus of claim 17, wherein: the text embedding is generated using a multi-modal encoder.
19. The apparatus of claim 18, wherein: the image generation model is further configured to apply a balance factor and a renormalization factor to the feature embedding.
20. The apparatus of claim 19, wherein: the diffusion model is pre-trained and the adapter layers are trained using a plurality of training images and a text description.
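Claims 6, 7, 9, 10 and 19 recite a balance factor and a renormalization factor but the claim text does not define how either is computed. The sketch below is one possible interpretation, offered only to make the idea concrete: the renormalization factor is assumed to rescale the embedding to the mean L2 norm of the text token embeddings, and the balance factor is assumed to be a plain scalar multiplier (less than 1 at generation time per claim 7, equal to one during training per claim 10). The function names, the formula for the renormalization factor, and the sample values are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def l2_norm(v: Sequence[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def scale_embedding(
    embedding: Vector,
    text_embedding: Sequence[Vector],
    balance_factor: float,
) -> Vector:
    """Apply an assumed renormalization factor and a balance factor.

    Assumption (not from the publication): the renormalization factor is
    mean text-token norm divided by the embedding's own norm, so that with
    balance_factor == 1 the result has the mean norm of the text tokens.
    """
    if not text_embedding:
        raise ValueError("text_embedding must contain at least one row")
    own_norm = l2_norm(embedding)
    if own_norm == 0.0:
        raise ValueError("cannot renormalize a zero vector")

    mean_text_norm = sum(l2_norm(row) for row in text_embedding) / len(text_embedding)
    renorm_factor = mean_text_norm / own_norm
    return [balance_factor * renorm_factor * x for x in embedding]


# Toy data: both text rows have norm 5, the subject embedding has norm 10.
text_embedding = [[3.0, 4.0], [0.0, 5.0]]
subject_embedding = [6.0, 8.0]

# Balance factor of one, as recited for training in claim 10.
training_scaled = scale_embedding(subject_embedding, text_embedding, 1.0)

# Balance factor below 1, as recited for generation in claim 7 (0.8 is arbitrary).
inference_scaled = scale_embedding(subject_embedding, text_embedding, 0.8)
```

Under these assumptions, lowering the balance factor shrinks the subject embedding relative to the text tokens, which would shift the influence on the output image toward the input description.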
Two-dimensional [2D] image generation · CPC title
Image coding (bandwidth or redundancy reduction for static pictures H04N1/41; coding or decoding of static colour picture signals H04N1/64; methods or arrangements for coding, decoding, compressing or decompressing digital video signals H04N19/00) · CPC title
involving foreground-background segmentation · CPC title
Training; Learning · CPC title
Creating or editing images; Combining images with text · CPC title