Multi-modal image editing
US-2024161462-A1 · May 16, 2024 · US
US2024161462A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2024161462-A1 |
| Application number | US-202218053556-A |
| Country | US |
| Kind code | A1 |
| Filing date | Nov 8, 2022 |
| Priority date | Nov 8, 2022 |
| Publication date | May 16, 2024 |
| Grant date | — |
Official abstract text for this publication.
Systems and methods for image editing are described. Embodiments of the present disclosure include obtaining an image and a prompt for editing the image. A diffusion model is tuned based on the image to generate different versions of the image. The prompt is then encoded to obtain a guidance vector, and the diffusion model generates a modified image based on the image and the encoded text prompt.
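The abstract describes a three-step flow: tune a diffusion model on the input image, encode the prompt into a guidance vector, and generate the modified image from the image and the encoded prompt. A minimal Python sketch of that ordering might look like the following. Everything in it is a hypothetical stand-in and not the method disclosed in the patent: `ToyDiffusionModel`, `encode_prompt`, and `edit_image` are illustrative names, the "model" merely remembers the image instead of learning from it, and the "encoder" is a character-sum toy, not a real text encoder.

```python
from typing import List

Image = List[float]   # toy stand-in: flattened pixel values in [0, 1]
Vector = List[float]  # toy stand-in for the guidance vector


def encode_prompt(prompt: str, dim: int = 4) -> Vector:
    """Hypothetical text encoder: map a prompt to a fixed-length vector."""
    vec = [0.0] * dim
    for i, ch in enumerate(prompt):
        vec[i % dim] += ord(ch) / 1000.0
    return vec


class ToyDiffusionModel:
    """Placeholder for a diffusion model; it is not a real denoiser."""

    def __init__(self) -> None:
        self.reference: Image = []

    def tune(self, image: Image) -> None:
        # Stand-in for fine-tuning on a single image: remember it.
        self.reference = list(image)

    def generate(self, image: Image, guidance: Vector) -> Image:
        # Stand-in for guided sampling: average the tuned reference with
        # the input image, then nudge each pixel by a small offset that
        # depends on the guidance vector. Values are clamped to [0, 1].
        offset = 0.01 * sum(guidance) / len(guidance)
        base = [(r + p) / 2.0 for r, p in zip(self.reference, image)]
        return [min(1.0, max(0.0, b + offset)) for b in base]


def edit_image(image: Image, prompt: str) -> Image:
    model = ToyDiffusionModel()
    model.tune(image)                        # 1. tune on the input image
    guidance = encode_prompt(prompt)         # 2. encode the prompt
    return model.generate(image, guidance)   # 3. generate the modified image


edited = edit_image([0.2, 0.5, 0.9], "add a hat")
```

The point of the sketch is only the order of operations: the model is tuned on the image before the prompt is encoded and used to guide generation.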
Opening claim text (preview).
What is claimed is:
1. A method comprising: obtaining an image and a prompt for editing the image; encoding the prompt to obtain a guidance vector; and generating a modified image based on the image and the prompt using a diffusion model that has been trained on the image to generate different versions of the image.
2. The method of claim 1, further comprising: receiving the prompt from a user via a text field of a user interface; and displaying the modified image to the user via the user interface.
3. The method of claim 1, further comprising: initializing a plurality of noise maps; generating a plurality of intermediate images corresponding to the plurality of noise maps at different noise levels based on the plurality of noise maps using the diffusion model; and computing a loss function by comparing each of the plurality of intermediate images to the image, wherein the diffusion model is based on the loss function.
4. The method of claim 3, further comprising: selecting the plurality of intermediate images at random from a superset of intermediate images generated by the diffusion model.
5. The method of claim 3, further comprising: adding noise at the different noise levels to the image to obtain a plurality of noisy images, wherein the comparison is based on an intermediate image of the plurality of intermediate images and a corresponding noisy image of the plurality of noisy images having a corresponding noise level.
6. The method of claim 1, wherein: the prompt comprises text that describes a modification to the image, wherein the modified image includes the modification.
7. The method of claim 1, wherein: the modified image retains an identity of an object in the image.
8. The method of claim 1, further comprising: combining the guidance vector with image features within the diffusion model, wherein the modified image is based on the guidance vector.
9. The method of claim 1, further comprising: initializing the diffusion model; training the diffusion model based on a diverse training set to obtain a pre-trained diffusion model; and fine-tuning the pre-trained diffusion model based on the image.
10. The method of claim 9, wherein: the fine-tuning configures the diffusion model to generate an output resembling the image based on any input provided.
11. The method of claim 9, wherein: a first weight for a loss function is used for training the diffusion model and a second weight for the loss function that is different from the first weight is used for fine-tuning the pre-trained diffusion model.
12. A non-transitory computer-readable medium comprising instructions, that, when executed by a processor, are configured to perform operations of: fine-tuning a pre-trained diffusion model based on a single image to obtain a tuned diffusion model; receiving a prompt including additional content for the single image; and generating a modified image based on the single image and the prompt using the tuned diffusion model.
13. The non-transitory computer-readable medium of claim 12, wherein the instructions are further configured to perform: initializing a plurality of noise maps; generating a plurality of intermediate images corresponding to the plurality of noise maps at different noise levels based on the plurality of noise maps using the pre-trained diffusion model; and computing a loss function by comparing each of the plurality of intermediate images to the single image, wherein the tuned diffusion model is based on the loss function.
14. The non-transitory computer-readable medium of claim 13, wherein the instructions are further configured to perform: selecting the plurality of intermediate images at random from a superset of intermediate images generated by the pre-trained diffusion model.
15. The non-transitory computer-readable medium of claim 13, wherein the instructions are further configured to perform: adding noise at the different noise levels to the single image to obtain a plurality of noisy images, wherein the comparison is based on an intermediate image of the plurality of intermediate images and a corresponding noisy image of the plurality of noisy images having a corresponding noise level.
16. The non-transitory computer-readable medium of claim 12, wherein the instructions are further configured to perform: encoding the prompt to obtain a guidance vector; and combining the guidance vector with image features within the tuned diffusion model, wherein the modified image is based on the guidance vector.
17. An apparatus for image processing, comprising: one or more processors; and one or more memories including instructions executable by the one or more processors to: obtain an image and a prompt for editing the image; fine-tune a pre-trained diffusion model based on the image to obtain a tuned diffusion model; and generate a modified image based on the image and the prompt using the tuned diffusion model.
18. The apparatus of claim 17, wherein the instructions are further executable by the one or more processors to: encode the prompt to obtain a guidance vector using a text encoder, wherein the modified image is based on the guidance vector.
19. The apparatus of claim 17, wherein the instructions are further executable by the one or more processors to: receive the prompt from a user via a text field of a user interface, and display the modified image to the user.
20. The apparatus of claim 17, wherein: the diffusion model comprises a Denoising Diffusion Probabilistic Model (DDPM).
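Claims 3 to 5 (and their counterparts 13 to 15) describe a loss that compares intermediate images at different noise levels against copies of the input image noised to the corresponding level, with the intermediate images optionally selected at random. The following Python sketch illustrates one way such a comparison could be computed. It rests on several assumptions that the claims do not state: the noising step uses the standard DDPM forward form `sqrt(alpha_bar) * x + sqrt(1 - alpha_bar) * eps`, the comparison is a mean squared error, and the intermediate images are passed in as plain lists because no real diffusion model is run. The names `add_noise`, `pick_levels`, and `fine_tuning_loss` are hypothetical.

```python
import math
import random
from typing import List, Sequence

Image = List[float]  # toy stand-in: flattened pixel values


def add_noise(image: Image, alpha_bar: float, rng: random.Random) -> Image:
    """Noise an image to a given level (assumed DDPM forward form)."""
    keep = math.sqrt(alpha_bar)          # how much of the image survives
    spread = math.sqrt(1.0 - alpha_bar)  # how much Gaussian noise is mixed in
    return [keep * p + spread * rng.gauss(0.0, 1.0) for p in image]


def pick_levels(num_levels: int, k: int, rng: random.Random) -> List[int]:
    """Randomly choose k distinct noise-level indices from a larger set."""
    return sorted(rng.sample(range(num_levels), k))


def mse(a: Image, b: Image) -> float:
    """Mean squared error between two equally sized images."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def fine_tuning_loss(
    image: Image,
    intermediates: Sequence[Image],
    alpha_bars: Sequence[float],
    rng: random.Random,
) -> float:
    """Average MSE between each intermediate image and the input image
    noised to the matching level (MSE is an assumption of this sketch)."""
    total = 0.0
    for intermediate, alpha_bar in zip(intermediates, alpha_bars):
        noisy = add_noise(image, alpha_bar, rng)
        total += mse(intermediate, noisy)
    return total / len(intermediates)


rng = random.Random(0)
image = [0.2, 0.5, 0.9]
# With alpha_bar = 1.0 no noise is added, so a perfect intermediate has zero loss.
zero_loss = fine_tuning_loss(image, [image, image], [1.0, 1.0], rng)
levels = pick_levels(10, 3, rng)
```

In an actual fine-tuning loop the intermediate images would come from the diffusion model's denoising trajectory, and the loss would be back-propagated into the model weights; neither step is modelled here.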
Artificial neural networks [ANN] · CPC title
Training; Learning · CPC title
Learning methods · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Semantic analysis · CPC title