Mask conditioned image transformation based on a text prompt

US2024386627A1 · US · A1

Patent metadata
Field	Value
Publication number	US-2024386627-A1
Application number	US-202318319808-A
Country	US
Kind code	A1
Filing date	May 18, 2023
Priority date	May 18, 2023
Publication date	Nov 21, 2024
Grant date	—

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In accordance with the described techniques, an image transformation system receives an input image and a text prompt, and leverages a generator network to edit the input image based on the text prompt. The generator network includes a plurality of layers configured to perform respective edits. A plurality of masks are generated based on the text prompt that define local edit regions, respectively, of the input image for respective layers of the generator network. Further, the generator network generates an edited image by editing the input image based on the plurality of masks, the respective edits of the respective layers, and the text prompt.
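
The per-layer masking described in the abstract can be illustrated with a small sketch. This is not the patented implementation: it assumes each layer's features are a 2-D grid of numbers and that blending is a simple per-cell linear mix, mask * edited + (1 - mask) * unedited. The function name blend_features and all values are hypothetical.

```python
from typing import List

Feature = List[List[float]]  # one 2-D feature map (height x width)


def blend_features(edited: Feature, unedited: Feature, mask: Feature) -> Feature:
    """Keep edited values where the mask is 1 and unedited values where it is 0.

    Assumption: a linear per-cell blend; the source only says the features
    are blended "based on" the masks.
    """
    return [
        [m * e + (1.0 - m) * u for e, u, m in zip(e_row, u_row, m_row)]
        for e_row, u_row, m_row in zip(edited, unedited, mask)
    ]


# Toy 2x2 feature maps for a single layer (illustrative values only).
unedited = [[1.0, 1.0], [1.0, 1.0]]
edited = [[5.0, 5.0], [5.0, 5.0]]
mask = [[1.0, 0.0], [0.0, 0.0]]       # local edit region: top-left cell only
zero_mask = [[0.0, 0.0], [0.0, 0.0]]  # layer treated as irrelevant to the prompt

blended = blend_features(edited, unedited, mask)
untouched = blend_features(edited, unedited, zero_mask)
```

With a zero mask the layer's output equals its unedited features, which matches the behavior the claims describe for layers that do not affect the image for a given prompt.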

First claim

Opening claim text (preview).

What is claimed is:
1. A method, comprising: receiving, by a processing device, a text prompt and an input image by a generator network, the generator network including a plurality of layers configured to perform respective edits; generating, by the processing device, a plurality of masks defining local edit regions, respectively, of the input image for respective layers of the plurality of layers, the plurality of masks based on the text prompt; generating, by the processing device using the generator network, an edited image by editing the input image based on the plurality of masks and the respective edits of the respective layers based on the text prompt; and outputting, by the processing device, the edited image.
2. The method of claim 1, wherein the generating the plurality of masks includes segmenting, using a segmentation network, the input image into multiple semantic segments that each identify a different portion of a subject depicted in the input image.
3. The method of claim 2, wherein the generating the plurality of masks includes generating a matrix having columns that represent different layers of the generator network, rows that represent different semantic segments of the multiple semantic segments, and entries populated with confidence values indicating degrees of likelihood that the respective layers affect corresponding semantic segments based on the text prompt.
4. The method of claim 3, wherein the generating the plurality of masks includes selecting, as the local edit regions for the respective layers, one or more semantic segments having confidence values in respective columns of the matrix that exceed a threshold.
5. The method of claim 1, wherein the generating the plurality of masks is performed using convolutional neural networks associated with the respective layers, the generating the plurality of masks further including conditioning the convolutional neural networks on the text prompt and unedited features output by the respective layers.
6. The method of claim 1, wherein the generating the edited image includes: determining latent edit vectors for the respective layers based on the text prompt; generating combined latent vectors for the respective layers by combining the latent edit vectors with a latent vector that defines the input image; and editing, by the respective layers, the input image based on the combined latent vectors.
7. The method of claim 6, wherein the generating the edited image includes: outputting, by the plurality of layers, unedited features based on the latent vector; outputting, by the plurality of layers, edited features based on respective combined latent vectors of the combined latent vectors; and generating blended features for the plurality of layers by blending the edited features and the unedited features based on the plurality of masks, the blended features including respective edited features in the local edit regions and respective unedited features outside the local edit regions, the edited image incorporating the blended features.
8. The method of claim 7, wherein the outputting the unedited features and the outputting the edited features includes conditioning the plurality of layers on the blended features output by previous layers of the generator network.
9. The method of claim 7, wherein one or more masks generated for one or more layers are zero masks indicating that the one or more layers do not affect the input image based on the text prompt, and the blended features generated for the one or more layers are the unedited features output by the one or more layers.
10. The method of claim 6, wherein the determining the latent edit vectors includes determining, using one or more machine learning mapper models, the latent edit vectors based on the text prompt and the latent vector, the latent edit vectors being dependent on the input image.
11. The method of claim 6, wherein the determining the latent edit vectors includes determining a global direction for the latent edit vectors, the latent edit vectors being independent of the input image.
12. The method of claim 6, wherein the generating the plurality of masks and the determining the latent edit vectors is performed using one or more machine learning models.
13. The method of claim 12, further comprising: generating an additional edited image by editing the input image based on the respective edits of the plurality of layers and the text prompt without using the plurality of masks; determining, using a contrastive language-image pre-training model, a first measure of similarity between the edited image and the text prompt and a second measure of similarity between the additional edited image and the text prompt; and training the one or more machine learning models based on the first and second measures of similarity.
14. The method of claim 12, further comprising training the one or more machine learning models based on squared Euclidean norms of the latent edit vectors.
15. The method of claim 12, further comprising training the one or more machine learning models based on a size of the local edit regions in the plurality of masks.
16. A system, comprising: a processing device; and a computer-readable media storing instructions that, responsive to execution by the processing device, cause the processing device to perform operations including: receiving a text prompt and an input image by a generator network, the generator network including a plurality of layers configured to perform respective edits; generating, for a layer of the plurality of layers, a mask defining a local edit region of the input image by segmenting the input image into semantic segments and selecting at least one semantic segment as the local edit region based on the text prompt and the respective edits of the layer; generating a feature of an edited image by editing, using the layer, the input image based on the text prompt and the mask; and generating the edited image by incorporating the feature into the edited image.
17. The system of claim 16, wherein the generating the mask includes generating a matrix having columns that represent different layers of the generator network, rows that represent different semantic segments, and entries populated with confidence values indicating degrees of likelihood that respective layers affect corresponding semantic segments based on the text prompt and the respective edits of the respective layers.
18. The system of claim 17, wherein the selecting the at least one semantic segment includes selecting at least one entry from among the entries in a column associated with the layer, the at least one semantic segment having a confidence value that exceeds a threshold.
19. A non-transitory computer-readable medium storing executable instructions, which when executed by a processing device, cause the processing device to perform operations comprising: receiving a text prompt and an input image by a generator network, the generator network including a plurality of layers; generating, for a layer of the plurality of layers, a mask defining a local edit region of the input image by conditioning a convolutional neural network associated with the layer on the text prompt and an unedited feature output using the layer; generating a feature of an edited image by editing, using the layer, the input image based on the text prompt and the mask; and generating the edited image by incorporating the feature into the edited imag
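
The matrix-and-threshold selection recited in the claims (rows for semantic segments, columns for generator layers, entries holding confidence values) might look roughly like the sketch below. Everything concrete here is assumed for illustration: the segment names, the confidence values, the 0.5 threshold, and the helper names select_segments and build_mask do not come from the publication.

```python
from typing import Dict, List

# Hypothetical confidence matrix: rows are semantic segments, columns are layers.
SEGMENTS = ["hair", "eyes", "mouth"]
CONFIDENCE = [
    [0.9, 0.1],  # hair
    [0.2, 0.7],  # eyes
    [0.1, 0.3],  # mouth
]


def select_segments(
    confidence: List[List[float]], segments: List[str], threshold: float
) -> Dict[int, List[str]]:
    """For each layer (column), list segments whose confidence exceeds the threshold."""
    n_layers = len(confidence[0])
    return {
        layer: [seg for seg, row in zip(segments, confidence) if row[layer] > threshold]
        for layer in range(n_layers)
    }


def build_mask(segmentation: List[List[str]], selected: List[str]) -> List[List[int]]:
    """Binary mask: 1 where a pixel's segment label was selected, else 0."""
    return [[1 if label in selected else 0 for label in row] for row in segmentation]


# Toy 2x3 segmentation map that labels each pixel with a segment name.
segmentation = [["hair", "hair", "eyes"], ["mouth", "eyes", "mouth"]]

selected = select_segments(CONFIDENCE, SEGMENTS, threshold=0.5)
masks = {layer: build_mask(segmentation, names) for layer, names in selected.items()}
```

In this toy setup layer 0 is masked to the hair pixels and layer 1 to the eye pixels. A layer with no entry above the threshold would receive an all-zero mask.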

Assignees

Adobe Inc

Inventors

Classifications

G06T11/10 · Primary
Texturing; Colouring; Generation of textures or colours (retouching, inpainting or scratch removal G06T5/77) · CPC title
G06T7/11
Region-based segmentation · CPC title
G06T11/60
Creating or editing images; Combining images with text · CPC title
G06F40/40
Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title
G06V20/70
Labelling scene content, e.g. deriving syntactic or semantic representations · CPC title

Patent family

Related publications grouped by family.

View patent family 93464839

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2024386627A1 cover?: In accordance with the described techniques, an image transformation system receives an input image and a text prompt, and leverages a generator network to edit the input image based on the text prompt. The generator network includes a plurality of layers configured to perform respective edits. A plurality of masks are generated based on the text prompt that define local edit regions, respectiv…
Who is the assignee on this patent?: Adobe Inc
What technology area does this patent fall under?: The primary CPC classification is G06T11/10. Mapped technology areas include Physics.
When was this patent published?: Published Nov 21, 2024 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Related patents

Multi-modal image editing

Image manipulation by text instruction

Semantic image synthesis for generating substantially photorealistic images using neural networks
