Generating images using sequences of generative neural networks

US11978141B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11978141-B2
Application number: US-202318199883-A
Country: US
Kind code: B2
Filing date: May 19, 2023
Priority date: May 19, 2022
Publication date: May 7, 2024
Grant date: May 7, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for generating images. In one aspect, a method includes: receiving an input text prompt including a sequence of text tokens in a natural language; processing the input text prompt using a text encoder neural network to generate a set of contextual embeddings of the input text prompt; and processing the contextual embeddings through a sequence of generative neural networks to generate a final output image that depicts a scene that is described by the input text prompt.
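
The flow in the abstract (encode the prompt once, then pass the embeddings through a sequence of generative networks) can be sketched as follows. This is a minimal illustration, not the patented implementation: `generate_image` and the three `toy_*` stand-ins are hypothetical names, the "networks" are plain functions rather than trained models, and the 4x upscaling factor is borrowed from claim 15 rather than from the abstract.

```python
from typing import Callable, List, Sequence

Embeddings = List[List[float]]  # one vector per text token (assumed layout)
Image = List[List[float]]       # H x W grid of pixel values (placeholder type)


def generate_image(
    prompt: str,
    text_encoder: Callable[[str], Embeddings],
    base_model: Callable[[Embeddings], Image],
    later_models: Sequence[Callable[[Embeddings, Image], Image]],
) -> Image:
    """Run the prompt through the encoder and then the model sequence."""
    # The prompt is encoded once; every stage reuses the same embeddings.
    embeddings = text_encoder(prompt)
    # The initial network sees only the embeddings.
    image = base_model(embeddings)
    # Each later network sees the embeddings plus the previous output image.
    for model in later_models:
        image = model(embeddings, image)
    return image


# Toy stand-ins (hypothetical) so the sketch can be executed end to end.
def toy_text_encoder(prompt: str) -> Embeddings:
    return [[float(len(token))] for token in prompt.split()]


def toy_base_model(embeddings: Embeddings) -> Image:
    # A 2 x 2 "image" whose pixels record the number of tokens.
    return [[float(len(embeddings))] * 2 for _ in range(2)]


def toy_upsampler(embeddings: Embeddings, image: Image, factor: int = 4) -> Image:
    # Nearest-neighbour upscaling in place of a real super-resolution model.
    return [
        [value for value in row for _ in range(factor)]
        for row in image
        for _ in range(factor)
    ]


final = generate_image(
    "a corgi on a beach",
    toy_text_encoder,
    toy_base_model,
    [toy_upsampler, toy_upsampler],
)
print(len(final), len(final[0]))
```

With two 4x stages, the toy 2 x 2 starting image grows to 8 x 8 and then 32 x 32. Only the last stage's output is returned, which matches the claim language that the final output image is the output of the final network in the sequence.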

First claim

Opening claim text (preview).

What is claimed is:

1. A method performed by one or more computers, the method comprising: receiving an input text prompt comprising a sequence of text tokens in a natural language; processing the input text prompt using a text encoder neural network to generate a set of contextual embeddings of the input text prompt; and processing the contextual embeddings through a sequence of diffusion-based generative neural networks to generate a final output image that depicts a scene that is described by the input text prompt, wherein the sequence of diffusion-based generative neural networks comprises: an initial diffusion-based generative neural network configured to: receive the contextual embeddings; and process the contextual embeddings to generate, as output, an initial output image having an initial resolution; and one or more subsequent diffusion-based generative neural networks each configured to: receive a respective input comprising: (i) the contextual embeddings, and (ii) a respective input image having a respective input resolution and generated as output by a preceding diffusion-based generative neural network in the sequence; and process the respective input to generate, as output, a respective output image having a respective output resolution that is higher than the respective input resolution, wherein for each subsequent diffusion-based generative neural network, processing the respective input to generate the respective output image comprises: sampling a latent image having the respective output resolution; and denoising the latent image over a sequence of steps into the respective output image, comprising, for each step that is not a final step in the sequence of steps: receiving a latent image for the step; processing the respective input and the latent image for the step to generate an estimated image for the step; dynamically thresholding pixel values of the estimated image for the step; and generating a latent image for a next step using at least the estimated image for the step.

2. The method of claim 1, wherein the text encoder neural network is a self-attention encoder neural network.

3. The method of claim 1, wherein: the diffusion-based generative neural networks in the sequence have been trained jointly on a set of training examples that each include: (i) a respective training text prompt, and (ii) a respective ground truth image that depicts a scene described by the respective training text prompt; and the text encoder neural network has been pre-trained and was held frozen during the joint training of the diffusion-based generative neural networks in the sequence.

4. The method of claim 1, wherein the diffusion-based generative neural networks have been trained using classifier-free guidance.

5. The method of claim 1, wherein each diffusion-based generative neural network uses a v-prediction parametrization to generate the respective output image.

6. The method of claim 1, wherein each diffusion-based generative neural network uses progressive distillation to generate the respective output image.

7. The method of claim 1, wherein denoising the latent image over the sequence of steps into the respective output image further comprises, for the final step in the sequence of steps: receiving a latent image for the final step; and processing the respective input and the latent image for the final step to generate the respective output image.

8. The method of claim 1, wherein processing the respective input and the latent image for the step to generate the estimated image for the step comprises: resizing the respective input image to generate a respective resized input image having the respective output resolution; concatenating the latent image for the step with the respective resized input image to generate a concatenated image for the step; and processing the concatenated image for the step with cross-attention on the contextual embeddings to generate the estimated image for the step.

9. The method of claim 1, wherein dynamically thresholding the pixel values of the estimated image for the step comprises: determining a clipping threshold based on the pixel values of the estimated image for the step; and thresholding the pixel values of the estimated image for the step using the clipping threshold.

10. The method of claim 9, wherein determining the clipping threshold based on the pixel values of the estimated image for the step comprises: determining the clipping threshold based on a particular percentile absolute pixel value in the estimated image for the step.

11. The method of claim 9, wherein thresholding the pixel values of the estimated image for the step using the clipping threshold comprises: clipping the pixel values of the estimated image for the step to a range defined by [−κ,κ], wherein κ is the clipping threshold.

12. The method of claim 11, wherein thresholding the pixel values of the estimated image for the step using the clipping threshold further comprises: after clipping the pixel values of the estimated image for the step, dividing the pixel values of the estimated image for the step by the clipping threshold.

13. The method of claim 1, wherein each subsequent diffusion-based generative neural network applies noise conditioning augmentation to the respective input image.

14. The method of claim 1, wherein the final output image is the respective output image of a final diffusion-based generative neural network in the sequence.

15. The method claim 1, wherein each subsequent diffusion-based generative neural network receives a respective k×k input image and generates a respective 4k×4k output image.

16. The method of claim 1, wherein the one or more subsequent diffusion-based generative neural networks comprise a plurality of subsequent diffusion-based generative neural networks.

17. A system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations comprising: receiving an input text prompt comprising a sequence of text tokens in a natural language; processing the input text prompt using a text encoder neural network to generate a set of contextual embeddings of the input text prompt; and processing the contextual embeddings through a sequence of diffusion-based generative neural networks to generate a final output image that depicts a scene that is described by the input text prompt, wherein the sequence of diffusion-based generative neural networks comprises: an initial diffusion-based generative neural network configured to: receive the contextual embeddings; and process the contextual embeddings to generate, as output, an initial output image having an initial resolution; and one or more subsequent diffusion-based generative neural networks each configured to: receive a respective input comprising: (i) the contextual embeddings, and (ii) a respective input image having a respective input resolution and generated as output by a preceding diffusion-based generative neural network in the sequence; and process the respective input to generate, as output, a respective output image having a respective output resolution that is higher than the respective input resolution, wherein for each subsequent diffusion-based generative neural network, processing the respective input to generate the respective output image comprises: sampling a latent image having the respective output resolution; and denoising the latent image over a sequence of s
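
The dynamic thresholding recited in claims 9 to 12 (pick a clipping threshold κ from a percentile of the absolute pixel values, clip to [−κ, κ], then divide by κ) can be illustrated with a short sketch. Several details here are assumptions and do not come from the claim text: the function name `dynamic_threshold`, the default percentile of 99.5, the nearest-rank percentile method, the flat list of pixel values, and the `floor` parameter, which is added only so that κ never drops low enough to cause division by zero or to amplify an already small image.

```python
import math
from typing import List


def dynamic_threshold(
    pixels: List[float],
    percentile: float = 99.5,  # assumed default; the claims fix no value
    floor: float = 1.0,        # assumed lower bound on kappa; not in the claims
) -> List[float]:
    """Clip an estimated image to [-kappa, kappa] and rescale by kappa."""
    if not pixels:
        return []
    # Claim 10: base the threshold on a percentile of absolute pixel values.
    magnitudes = sorted(abs(p) for p in pixels)
    # Nearest-rank percentile (an assumed choice of percentile definition).
    rank = max(1, math.ceil(percentile / 100.0 * len(magnitudes)))
    kappa = max(magnitudes[rank - 1], floor)
    # Claim 11: clip to [-kappa, kappa]. Claim 12: then divide by kappa.
    return [max(-kappa, min(kappa, p)) / kappa for p in pixels]


estimate = [0.5, -2.0, 4.0, -1.0]
print(dynamic_threshold(estimate, percentile=75.0))
```

In this example the 75th-percentile magnitude is 2.0, so the value 4.0 is clipped to 2.0 and every pixel is then divided by 2.0, which leaves all outputs inside [−1, 1]. Because κ is recomputed from each estimate, the clipping range adapts at every denoising step instead of being fixed in advance.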

Assignees

Google LLC

Inventors

Classifications

  • G06V10/82 (Primary)

    using neural networks · CPC title

  • Syntactic or structural pattern recognition, e.g. symbolic string recognition · CPC title

  • AI-based methods, deep learning or artificial neural networks · CPC title

  • Lexical analysis, e.g. tokenisation or collocates · CPC title

  • Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11978141B2 cover?
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for generating images. In one aspect, a method includes: receiving an input text prompt including a sequence of text tokens in a natural language; processing the input text prompt using a text encoder neural network to generate a set of contextual embeddings of the input text prompt; and processin…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G06V10/82. Mapped technology areas include Physics.
When was this patent published?
Publication date May 7, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).