Utilizing a generative neural network to interactively create and modify digital images based on natural language feedback
US-2023230198-A1 · Jul 20, 2023 · US
US11978141B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11978141-B2 |
| Application number | US-202318199883-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 19, 2023 |
| Priority date | May 19, 2022 |
| Publication date | May 7, 2024 |
| Grant date | May 7, 2024 |
Official abstract text for this publication.
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for generating images. In one aspect, a method includes: receiving an input text prompt including a sequence of text tokens in a natural language; processing the input text prompt using a text encoder neural network to generate a set of contextual embeddings of the input text prompt; and processing the contextual embeddings through a sequence of generative neural networks to generate a final output image that depicts a scene that is described by the input text prompt.
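The data flow described in the abstract can be sketched as follows. This is a minimal illustration only: `encode_text`, `base_model`, and `super_resolution_model` are hypothetical stand-ins that return placeholder values rather than trained networks, and the default base resolution of 64 is an assumption. The 4x upscaling per stage mirrors the k×k to 4k×4k example in the claims.

```python
def encode_text(prompt):
    """Hypothetical text encoder: one toy embedding vector per whitespace token."""
    return [[float(len(token))] for token in prompt.split()]


def base_model(embeddings, resolution=64):
    """Hypothetical first network: returns a blank resolution x resolution image."""
    return [[0.0] * resolution for _ in range(resolution)]


def super_resolution_model(embeddings, image, scale=4):
    """Hypothetical later network: nearest-neighbour upsampling as a placeholder."""
    return [
        [pixel for pixel in row for _ in range(scale)]
        for row in image
        for _ in range(scale)
    ]


def generate(prompt, base_resolution=64, num_upsamplers=2):
    # Text prompt -> contextual embeddings (shared by every stage).
    embeddings = encode_text(prompt)
    # First network produces the initial low-resolution image.
    image = base_model(embeddings, base_resolution)
    # Each later network takes the embeddings plus the previous image
    # and returns a higher-resolution image.
    for _ in range(num_upsamplers):
        image = super_resolution_model(embeddings, image)
    return image


image = generate("a corgi riding a bicycle", base_resolution=8)
print(len(image), len(image[0]))  # prints "128 128"
```

Passing the same embeddings to every stage reflects the claim language, in which each subsequent network receives both the contextual embeddings and the preceding network's output image.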
Opening claim text (preview).
What is claimed is:

1. A method performed by one or more computers, the method comprising: receiving an input text prompt comprising a sequence of text tokens in a natural language; processing the input text prompt using a text encoder neural network to generate a set of contextual embeddings of the input text prompt; and processing the contextual embeddings through a sequence of diffusion-based generative neural networks to generate a final output image that depicts a scene that is described by the input text prompt, wherein the sequence of diffusion-based generative neural networks comprises: an initial diffusion-based generative neural network configured to: receive the contextual embeddings; and process the contextual embeddings to generate, as output, an initial output image having an initial resolution; and one or more subsequent diffusion-based generative neural networks each configured to: receive a respective input comprising: (i) the contextual embeddings, and (ii) a respective input image having a respective input resolution and generated as output by a preceding diffusion-based generative neural network in the sequence; and process the respective input to generate, as output, a respective output image having a respective output resolution that is higher than the respective input resolution, wherein for each subsequent diffusion-based generative neural network, processing the respective input to generate the respective output image comprises: sampling a latent image having the respective output resolution; and denoising the latent image over a sequence of steps into the respective output image, comprising, for each step that is not a final step in the sequence of steps: receiving a latent image for the step; processing the respective input and the latent image for the step to generate an estimated image for the step; dynamically thresholding pixel values of the estimated image for the step; and generating a latent image for a next step using at least the estimated image for the step.

2. The method of claim 1, wherein the text encoder neural network is a self-attention encoder neural network.

3. The method of claim 1, wherein: the diffusion-based generative neural networks in the sequence have been trained jointly on a set of training examples that each include: (i) a respective training text prompt, and (ii) a respective ground truth image that depicts a scene described by the respective training text prompt; and the text encoder neural network has been pre-trained and was held frozen during the joint training of the diffusion-based generative neural networks in the sequence.

4. The method of claim 1, wherein the diffusion-based generative neural networks have been trained using classifier-free guidance.

5. The method of claim 1, wherein each diffusion-based generative neural network uses a v-prediction parametrization to generate the respective output image.

6. The method of claim 1, wherein each diffusion-based generative neural network uses progressive distillation to generate the respective output image.

7. The method of claim 1, wherein denoising the latent image over the sequence of steps into the respective output image further comprises, for the final step in the sequence of steps: receiving a latent image for the final step; and processing the respective input and the latent image for the final step to generate the respective output image.

8. The method of claim 1, wherein processing the respective input and the latent image for the step to generate the estimated image for the step comprises: resizing the respective input image to generate a respective resized input image having the respective output resolution; concatenating the latent image for the step with the respective resized input image to generate a concatenated image for the step; and processing the concatenated image for the step with cross-attention on the contextual embeddings to generate the estimated image for the step.

9. The method of claim 1, wherein dynamically thresholding the pixel values of the estimated image for the step comprises: determining a clipping threshold based on the pixel values of the estimated image for the step; and thresholding the pixel values of the estimated image for the step using the clipping threshold.

10. The method of claim 9, wherein determining the clipping threshold based on the pixel values of the estimated image for the step comprises: determining the clipping threshold based on a particular percentile absolute pixel value in the estimated image for the step.

11. The method of claim 9, wherein thresholding the pixel values of the estimated image for the step using the clipping threshold comprises: clipping the pixel values of the estimated image for the step to a range defined by [−κ,κ], wherein κ is the clipping threshold.

12. The method of claim 11, wherein thresholding the pixel values of the estimated image for the step using the clipping threshold further comprises: after clipping the pixel values of the estimated image for the step, dividing the pixel values of the estimated image for the step by the clipping threshold.

13. The method of claim 1, wherein each subsequent diffusion-based generative neural network applies noise conditioning augmentation to the respective input image.

14. The method of claim 1, wherein the final output image is the respective output image of a final diffusion-based generative neural network in the sequence.

15. The method of claim 1, wherein each subsequent diffusion-based generative neural network receives a respective k×k input image and generates a respective 4k×4k output image.

16. The method of claim 1, wherein the one or more subsequent diffusion-based generative neural networks comprise a plurality of subsequent diffusion-based generative neural networks.

17. A system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations comprising: receiving an input text prompt comprising a sequence of text tokens in a natural language; processing the input text prompt using a text encoder neural network to generate a set of contextual embeddings of the input text prompt; and processing the contextual embeddings through a sequence of diffusion-based generative neural networks to generate a final output image that depicts a scene that is described by the input text prompt, wherein the sequence of diffusion-based generative neural networks comprises: an initial diffusion-based generative neural network configured to: receive the contextual embeddings; and process the contextual embeddings to generate, as output, an initial output image having an initial resolution; and one or more subsequent diffusion-based generative neural networks each configured to: receive a respective input comprising: (i) the contextual embeddings, and (ii) a respective input image having a respective input resolution and generated as output by a preceding diffusion-based generative neural network in the sequence; and process the respective input to generate, as output, a respective output image having a respective output resolution that is higher than the respective input resolution, wherein for each subsequent diffusion-based generative neural network, processing the respective input to generate the respective output image comprises: sampling a latent image having the respective output resolution; and denoising the latent image over a sequence of s
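The dynamic thresholding step in claims 9 to 12 (pick a clipping threshold κ from a percentile of the absolute pixel values, clip to [−κ, κ], then divide by κ) can be sketched as below. This is a minimal sketch under stated assumptions, not the patented implementation: the image is a flat list of floats, the percentile uses a simple nearest-rank rule, and the default percentile of 99.5 and the `floor` of 1.0 on κ are illustrative choices that the claims do not specify.

```python
import math


def dynamic_threshold(pixels, percentile=99.5, floor=1.0):
    """Clip pixel values to [-kappa, kappa] and rescale by kappa.

    kappa is a percentile of the absolute pixel values. The `floor`
    argument (an assumption, not taken from the claims) keeps kappa
    from dropping below a minimum, so images already inside
    [-floor, floor] are not stretched.
    """
    if not pixels:
        raise ValueError("pixels must be non-empty")

    # Claim 10: threshold from a percentile of the absolute pixel values
    # (nearest-rank percentile, chosen here for simplicity).
    magnitudes = sorted(abs(p) for p in pixels)
    rank = max(1, math.ceil(percentile / 100.0 * len(magnitudes)))
    kappa = max(magnitudes[rank - 1], floor)

    # Claim 11: clip to [-kappa, kappa]. Claim 12: divide by kappa.
    return [max(-kappa, min(kappa, p)) / kappa for p in pixels]


print(dynamic_threshold([0.5, -2.0, 4.0, -0.25], percentile=100.0))
# prints [0.125, -0.5, 1.0, -0.0625]
```

Because κ is recomputed from each estimated image, the output always lies in [−1, 1] regardless of how far the raw estimate overshoots.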
using neural networks · CPC title
Syntactic or structural pattern recognition, e.g. symbolic string recognition · CPC title
AI-based methods, deep learning or artificial neural networks · CPC title
Lexical analysis, e.g. tokenisation or collocates · CPC title
Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title