Multi-modal image editing
US-2024169622-A1 · May 23, 2024 · US
US2024386627A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2024386627-A1 |
| Application number | US-202318319808-A |
| Country | US |
| Kind code | A1 |
| Filing date | May 18, 2023 |
| Priority date | May 18, 2023 |
| Publication date | Nov 21, 2024 |
| Grant date | — |
Official abstract text for this publication.
In accordance with the described techniques, an image transformation system receives an input image and a text prompt, and leverages a generator network to edit the input image based on the text prompt. The generator network includes a plurality of layers configured to perform respective edits. A plurality of masks are generated based on the text prompt that define local edit regions, respectively, of the input image for respective layers of the generator network. Further, the generator network generates an edited image by editing the input image based on the plurality of masks, the respective edits of the respective layers, and the text prompt.
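The role of the per-layer masks can be illustrated with a small sketch. It follows the wording of claim 7, where each layer's blended features contain edited features inside the local edit region and unedited features outside it. The convex-combination formula, the toy 2x2 feature maps, and the names `blend_features`, `edited`, `unedited`, and `mask` are assumptions made for illustration; the publication does not state that blending is implemented this way.

```python
from typing import List

# A toy 2-D feature map (height x width); real generator features would be tensors.
Feature = List[List[float]]


def blend_features(edited: Feature, unedited: Feature, mask: Feature) -> Feature:
    """Blend one layer's features: edited where mask is 1, unedited where mask is 0.

    Assumed formula (not taken from the publication):
        blended = mask * edited + (1 - mask) * unedited
    """
    return [
        [m * e + (1.0 - m) * u for e, u, m in zip(e_row, u_row, m_row)]
        for e_row, u_row, m_row in zip(edited, unedited, mask)
    ]


# Hypothetical features output by a single layer.
unedited = [[1.0, 1.0], [1.0, 1.0]]
edited = [[5.0, 5.0], [5.0, 5.0]]

# The mask marks only the top-left position as the local edit region.
mask = [[1.0, 0.0], [0.0, 0.0]]
blended = blend_features(edited, unedited, mask)

# A zero mask leaves the layer's output unedited (compare claim 9).
zero_mask = [[0.0, 0.0], [0.0, 0.0]]
untouched = blend_features(edited, unedited, zero_mask)
```

In this sketch a layer whose mask is all zeros contributes nothing to the edit, which matches the zero-mask behavior recited in claim 9.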
Opening claim text (preview).
What is claimed is:

1. A method, comprising: receiving, by a processing device, a text prompt and an input image by a generator network, the generator network including a plurality of layers configured to perform respective edits; generating, by the processing device, a plurality of masks defining local edit regions, respectively, of the input image for respective layers of the plurality of layers, the plurality of masks based on the text prompt; generating, by the processing device using the generator network, an edited image by editing the input image based on the plurality of masks and the respective edits of the respective layers based on the text prompt; and outputting, by the processing device, the edited image.
2. The method of claim 1, wherein the generating the plurality of masks includes segmenting, using a segmentation network, the input image into multiple semantic segments that each identify a different portion of a subject depicted in the input image.
3. The method of claim 2, wherein the generating the plurality of masks includes generating a matrix having columns that represent different layers of the generator network, rows that represent different semantic segments of the multiple semantic segments, and entries populated with confidence values indicating degrees of likelihood that the respective layers affect corresponding semantic segments based on the text prompt.
4. The method of claim 3, wherein the generating the plurality of masks includes selecting, as the local edit regions for the respective layers, one or more semantic segments having confidence values in respective columns of the matrix that exceed a threshold.
5. The method of claim 1, wherein the generating the plurality of masks is performed using convolutional neural networks associated with the respective layers, the generating the plurality of masks further including conditioning the convolutional neural networks on the text prompt and unedited features output by the respective layers.
6. The method of claim 1, wherein the generating the edited image includes: determining latent edit vectors for the respective layers based on the text prompt; generating combined latent vectors for the respective layers by combining the latent edit vectors with a latent vector that defines the input image; and editing, by the respective layers, the input image based on the combined latent vectors.
7. The method of claim 6, wherein the generating the edited image includes: outputting, by the plurality of layers, unedited features based on the latent vector; outputting, by the plurality of layers, edited features based on respective combined latent vectors of the combined latent vectors; and generating blended features for the plurality of layers by blending the edited features and the unedited features based on the plurality of masks, the blended features including respective edited features in the local edit regions and respective unedited features outside the local edit regions, the edited image incorporating the blended features.
8. The method of claim 7, wherein the outputting the unedited features and the outputting the edited features includes conditioning the plurality of layers on the blended features output by previous layers of the generator network.
9. The method of claim 7, wherein one or more masks generated for one or more layers are zero masks indicating that the one or more layers do not affect the input image based on the text prompt, and the blended features generated for the one or more layers are the unedited features output by the one or more layers.
10. The method of claim 6, wherein the determining the latent edit vectors includes determining, using one or more machine learning mapper models, the latent edit vectors based on the text prompt and the latent vector, the latent edit vectors being dependent on the input image.
11. The method of claim 6, wherein the determining the latent edit vectors includes determining a global direction for the latent edit vectors, the latent edit vectors being independent of the input image.
12. The method of claim 6, wherein the generating the plurality of masks and the determining the latent edit vectors is performed using one or more machine learning models.
13. The method of claim 12, further comprising: generating an additional edited image by editing the input image based on the respective edits of the plurality of layers and the text prompt without using the plurality of masks; determining, using a contrastive language-image pre-training model, a first measure of similarity between the edited image and the text prompt and a second measure of similarity between the additional edited image and the text prompt; and training the one or more machine learning models based on the first and second measures of similarity.
14. The method of claim 12, further comprising training the one or more machine learning models based on squared Euclidean norms of the latent edit vectors.
15. The method of claim 12, further comprising training the one or more machine learning models based on a size of the local edit regions in the plurality of masks.
16. A system, comprising: a processing device; and a computer-readable media storing instructions that, responsive to execution by the processing device, cause the processing device to perform operations including: receiving a text prompt and an input image by a generator network, the generator network including a plurality of layers configured to perform respective edits; generating, for a layer of the plurality of layers, a mask defining a local edit region of the input image by segmenting the input image into semantic segments and selecting at least one semantic segment as the local edit region based on the text prompt and the respective edits of the layer; generating a feature of an edited image by editing, using the layer, the input image based on the text prompt and the mask; and generating the edited image by incorporating the feature into the edited image.
17. The system of claim 16, wherein the generating the mask includes generating a matrix having columns that represent different layers of the generator network, rows that represent different semantic segments, and entries populated with confidence values indicating degrees of likelihood that respective layers affect corresponding semantic segments based on the text prompt and the respective edits of the respective layers.
18. The system of claim 17, wherein the selecting the at least one semantic segment includes selecting at least one entry from among the entries in a column associated with the layer, the at least one semantic segment having a confidence value that exceeds a threshold.
19. A non-transitory computer-readable medium storing executable instructions, which when executed by a processing device, cause the processing device to perform operations comprising: receiving a text prompt and an input image by a generator network, the generator network including a plurality of layers; generating, for a layer of the plurality of layers, a mask defining a local edit region of the input image by conditioning a convolutional neural network associated with the layer on the text prompt and an unedited feature output using the layer; generating a feature of an edited image by editing, using the layer, the input image based on the text prompt and the mask; and generating the edited image by incorporating the feature into the edited imag
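The segment selection recited in claims 3 and 4 (and again in claims 17 and 18) can be pictured with a minimal sketch: a matrix whose rows are semantic segments and whose columns are generator layers, with each layer's local edit region taken from the segments whose confidence exceeds a threshold. The function name `select_segments`, the segment labels, the confidence values, and the threshold of 0.5 are all hypothetical; the claims do not specify how the confidence values are computed.

```python
from typing import Dict, List


def select_segments(
    confidence: List[List[float]], threshold: float
) -> Dict[int, List[int]]:
    """Map each layer (column) to the segments (rows) whose confidence exceeds the threshold."""
    num_layers = len(confidence[0]) if confidence else 0
    return {
        layer: [seg for seg, row in enumerate(confidence) if row[layer] > threshold]
        for layer in range(num_layers)
    }


# Rows are semantic segments (hypothetical labels: 0 = hair, 1 = eyes, 2 = background).
# Columns are generator layers 0, 1, and 2. All values are made up for illustration.
confidence = [
    [0.9, 0.2, 0.1],
    [0.1, 0.7, 0.3],
    [0.0, 0.1, 0.2],
]

selected = select_segments(confidence, threshold=0.5)
```

Here layer 0 would edit only segment 0, layer 1 only segment 1, and layer 2 selects no segment at all, which corresponds to the zero-mask case in claim 9.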
Texturing; Colouring; Generation of textures or colours (retouching, inpainting or scratch removal G06T5/77) · CPC title
Region-based segmentation · CPC title
Creating or editing images; Combining images with text · CPC title
Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title
Labelling scene content, e.g. deriving syntactic or semantic representations · CPC title