Generating images using sequences of generative neural networks

US11978141B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11978141-B2
Application number	US-202318199883-A
Country	US
Kind code	B2
Filing date	May 19, 2023
Priority date	May 19, 2022
Publication date	May 7, 2024
Grant date	May 7, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for generating images. In one aspect, a method includes: receiving an input text prompt including a sequence of text tokens in a natural language; processing the input text prompt using a text encoder neural network to generate a set of contextual embeddings of the input text prompt; and processing the contextual embeddings through a sequence of generative neural networks to generate a final output image that depicts a scene that is described by the input text prompt.
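The data flow in the abstract (encode the prompt once, then pass the embeddings through a sequence of generators) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function `generate_image`, the type aliases, and the toy stand-ins are all hypothetical names, and the split into an initial model plus later models that also receive the previous image is taken from claim 1 rather than from the abstract.

```python
from typing import Callable, List, Sequence

Embeddings = List[List[float]]   # one vector per text token (toy type, assumed)
Image = List[List[float]]        # rows of grayscale pixels (toy type, assumed)


def generate_image(
    prompt: str,
    text_encoder: Callable[[str], Embeddings],
    initial_model: Callable[[Embeddings], Image],
    subsequent_models: Sequence[Callable[[Embeddings, Image], Image]],
) -> Image:
    """Encode the prompt once, generate a first image, then refine it stage by stage."""
    embeddings = text_encoder(prompt)
    image = initial_model(embeddings)
    for model in subsequent_models:
        # Each later stage receives the same embeddings plus the previous image.
        image = model(embeddings, image)
    return image


# Toy stand-ins so the sketch runs; none of these are neural networks.
def toy_encoder(prompt: str) -> Embeddings:
    return [[float(len(token))] for token in prompt.split()]


def toy_initial(embeddings: Embeddings) -> Image:
    value = sum(vec[0] for vec in embeddings)
    return [[value, value], [value, value]]  # a 2x2 "image"


def toy_upsampler(embeddings: Embeddings, image: Image) -> Image:
    # Nearest-neighbour 2x upscaling stands in for a diffusion model.
    return [[px for px in row for _ in range(2)] for row in image for _ in range(2)]


final = generate_image("a red cube", toy_encoder, toy_initial,
                       [toy_upsampler, toy_upsampler])
print(len(final), len(final[0]))
```

Passing the models in as callables keeps the cascade logic separate from whatever networks fill each slot, which is the only design point this sketch is meant to show.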

First claim

Opening claim text (preview).

What is claimed is:
1. A method performed by one or more computers, the method comprising: receiving an input text prompt comprising a sequence of text tokens in a natural language; processing the input text prompt using a text encoder neural network to generate a set of contextual embeddings of the input text prompt; and processing the contextual embeddings through a sequence of diffusion-based generative neural networks to generate a final output image that depicts a scene that is described by the input text prompt, wherein the sequence of diffusion-based generative neural networks comprises: an initial diffusion-based generative neural network configured to: receive the contextual embeddings; and process the contextual embeddings to generate, as output, an initial output image having an initial resolution; and one or more subsequent diffusion-based generative neural networks each configured to: receive a respective input comprising: (i) the contextual embeddings, and (ii) a respective input image having a respective input resolution and generated as output by a preceding diffusion-based generative neural network in the sequence; and process the respective input to generate, as output, a respective output image having a respective output resolution that is higher than the respective input resolution, wherein for each subsequent diffusion-based generative neural network, processing the respective input to generate the respective output image comprises: sampling a latent image having the respective output resolution; and denoising the latent image over a sequence of steps into the respective output image, comprising, for each step that is not a final step in the sequence of steps: receiving a latent image for the step; processing the respective input and the latent image for the step to generate an estimated image for the step; dynamically thresholding pixel values of the estimated image for the step; and generating a latent image for a next step using at least the estimated image for the step.
2. The method of claim 1, wherein the text encoder neural network is a self-attention encoder neural network.
3. The method of claim 1, wherein: the diffusion-based generative neural networks in the sequence have been trained jointly on a set of training examples that each include: (i) a respective training text prompt, and (ii) a respective ground truth image that depicts a scene described by the respective training text prompt; and the text encoder neural network has been pre-trained and was held frozen during the joint training of the diffusion-based generative neural networks in the sequence.
4. The method of claim 1, wherein the diffusion-based generative neural networks have been trained using classifier-free guidance.
5. The method of claim 1, wherein each diffusion-based generative neural network uses a v-prediction parametrization to generate the respective output image.
6. The method of claim 1, wherein each diffusion-based generative neural network uses progressive distillation to generate the respective output image.
7. The method of claim 1, wherein denoising the latent image over the sequence of steps into the respective output image further comprises, for the final step in the sequence of steps: receiving a latent image for the final step; and processing the respective input and the latent image for the final step to generate the respective output image.
8. The method of claim 1, wherein processing the respective input and the latent image for the step to generate the estimated image for the step comprises: resizing the respective input image to generate a respective resized input image having the respective output resolution; concatenating the latent image for the step with the respective resized input image to generate a concatenated image for the step; and processing the concatenated image for the step with cross-attention on the contextual embeddings to generate the estimated image for the step.
9. The method of claim 1, wherein dynamically thresholding the pixel values of the estimated image for the step comprises: determining a clipping threshold based on the pixel values of the estimated image for the step; and thresholding the pixel values of the estimated image for the step using the clipping threshold.
10. The method of claim 9, wherein determining the clipping threshold based on the pixel values of the estimated image for the step comprises: determining the clipping threshold based on a particular percentile absolute pixel value in the estimated image for the step.
11. The method of claim 9, wherein thresholding the pixel values of the estimated image for the step using the clipping threshold comprises: clipping the pixel values of the estimated image for the step to a range defined by [−κ,κ], wherein κ is the clipping threshold.
12. The method of claim 11, wherein thresholding the pixel values of the estimated image for the step using the clipping threshold further comprises: after clipping the pixel values of the estimated image for the step, dividing the pixel values of the estimated image for the step by the clipping threshold.
13. The method of claim 1, wherein each subsequent diffusion-based generative neural network applies noise conditioning augmentation to the respective input image.
14. The method of claim 1, wherein the final output image is the respective output image of a final diffusion-based generative neural network in the sequence.
15. The method claim 1, wherein each subsequent diffusion-based generative neural network receives a respective k×k input image and generates a respective 4k×4k output image.
16. The method of claim 1, wherein the one or more subsequent diffusion-based generative neural networks comprise a plurality of subsequent diffusion-based generative neural networks.
17. A system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations comprising: receiving an input text prompt comprising a sequence of text tokens in a natural language; processing the input text prompt using a text encoder neural network to generate a set of contextual embeddings of the input text prompt; and processing the contextual embeddings through a sequence of diffusion-based generative neural networks to generate a final output image that depicts a scene that is described by the input text prompt, wherein the sequence of diffusion-based generative neural networks comprises: an initial diffusion-based generative neural network configured to: receive the contextual embeddings; and process the contextual embeddings to generate, as output, an initial output image having an initial resolution; and one or more subsequent diffusion-based generative neural networks each configured to: receive a respective input comprising: (i) the contextual embeddings, and (ii) a respective input image having a respective input resolution and generated as output by a preceding diffusion-based generative neural network in the sequence; and process the respective input to generate, as output, a respective output image having a respective output resolution that is higher than the respective input resolution, wherein for each subsequent diffusion-based generative neural network, processing the respective input to generate the respective output image comprises: sampling a latent image having the respective output resolution; and denoising the latent image over a sequence of s…
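Claims 9 to 12 describe dynamic thresholding as three operations on the estimated image: pick a clipping threshold κ from a percentile of the absolute pixel values, clip every pixel to [−κ, κ], then divide by κ. A minimal sketch in plain Python is given below. It rests on several assumptions that the claims do not fix: the image is a flat list of floats, the percentile uses the nearest-rank method, the default of 99.5 is an arbitrary illustrative value, and an all-zero image is returned unchanged to avoid dividing by zero. The function name `dynamic_threshold` is hypothetical.

```python
import math
from typing import List, Tuple


def dynamic_threshold(pixels: List[float], percentile: float = 99.5) -> Tuple[List[float], float]:
    """Clip pixels to [-kappa, kappa] and rescale by kappa.

    kappa is the `percentile`-th percentile of the absolute pixel values,
    computed with the nearest-rank method (an assumption of this sketch).
    Returns the thresholded pixels and the kappa that was used.
    """
    if not pixels:
        return [], 0.0
    magnitudes = sorted(abs(p) for p in pixels)
    # Nearest rank: the smallest magnitude with at least `percentile`
    # percent of all magnitudes at or below it.
    rank = max(1, math.ceil(percentile / 100 * len(magnitudes)))
    kappa = magnitudes[rank - 1]
    if kappa == 0:
        # All-zero image: nothing to clip, and dividing would fail (assumed guard).
        return list(pixels), kappa
    clipped = [max(-kappa, min(kappa, p)) for p in pixels]  # claim 11: clip to [-kappa, kappa]
    return [p / kappa for p in clipped], kappa              # claim 12: divide by kappa


# Four pixels, 75th percentile: magnitudes are [0.5, 1.0, 2.0, 4.0], so kappa = 2.0.
scaled, kappa = dynamic_threshold([0.5, -2.0, 1.0, 4.0], percentile=75)
print(kappa, scaled)
```

Because the threshold is recomputed from each estimated image, the clipping range adapts per step instead of being a fixed constant, which is what distinguishes it from static clipping.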

Assignees

Google LLC

Inventors

Classifications

Code	CPC title
G06V10/82 (primary)	using neural networks
G06V30/1983	Syntactic or structural pattern recognition, e.g. symbolic string recognition
G06T2211/441	AI-based methods, deep learning or artificial neural networks
G06F40/284	Lexical analysis, e.g. tokenisation or collocates
G06F40/40	Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30)
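A CPC symbol such as G06V10/82 is hierarchical: section (G), class (G06), subclass (G06V), main group (10), and subgroup (82). The sketch below splits a symbol into those parts and looks up the section title, which is presumably how a primary code in section G ends up mapped to "Physics" on this page; that mapping rule is an assumption, not something the page states. The names `parse_cpc`, `CpcSymbol`, and `CPC_SECTIONS` are hypothetical, the regular expression is a simplified pattern rather than an official grammar, and the section titles are abbreviated.

```python
import re
from typing import NamedTuple

# CPC section letters with abbreviated section titles.
CPC_SECTIONS = {
    "A": "Human Necessities",
    "B": "Performing Operations; Transporting",
    "C": "Chemistry; Metallurgy",
    "D": "Textiles; Paper",
    "E": "Fixed Constructions",
    "F": "Mechanical Engineering; Lighting; Heating; Weapons; Blasting",
    "G": "Physics",
    "H": "Electricity",
    "Y": "General Tagging of New Technological Developments",
}


class CpcSymbol(NamedTuple):
    section: str     # e.g. "G"
    cpc_class: str   # e.g. "G06"
    subclass: str    # e.g. "G06V"
    main_group: str  # e.g. "10"
    subgroup: str    # e.g. "82"


# Simplified pattern (an assumption): letter, two digits, letter, group/subgroup.
_CPC_PATTERN = re.compile(r"^([A-HY])(\d{2})([A-Z])(\d{1,4})/(\d{2,6})$")


def parse_cpc(symbol: str) -> CpcSymbol:
    """Split a CPC symbol into its hierarchical parts."""
    match = _CPC_PATTERN.match(symbol.strip())
    if match is None:
        raise ValueError(f"not a CPC symbol: {symbol!r}")
    section, class_digits, subclass_letter, main_group, subgroup = match.groups()
    return CpcSymbol(
        section=section,
        cpc_class=section + class_digits,
        subclass=section + class_digits + subclass_letter,
        main_group=main_group,
        subgroup=subgroup,
    )


primary = parse_cpc("G06V10/82")
print(primary.subclass, primary.main_group, primary.subgroup,
      CPC_SECTIONS[primary.section])
```

Grouping by `subclass` (G06V, G06T, G06F for the codes listed here) is one coarse way to compare this patent's tags with those of related filings.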

Patent family

Related publications grouped by family.

Patent family ID: 86896064

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11978141B2 cover?: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for generating images. In one aspect, a method includes: receiving an input text prompt including a sequence of text tokens in a natural language; processing the input text prompt using a text encoder neural network to generate a set of contextual embeddings of the input text prompt; and processin…
Who is the assignee on this patent?: Google LLC
What technology area does this patent fall under?: Primary CPC classification G06V10/82. Mapped technology areas include Physics.
When was this patent published?: Publication date May 7, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Related patents

Utilizing a generative neural network to interactively create and modify digital images based on natural language feedback

Generating animated digital videos utilizing a character animation neural network informed by pose and motion embeddings

Industrial digital twin systems and methods with echelons of executive, advisory and operations messaging and visualization

Expressive text-to-speech utilizing contextual word-level style tokens

Automatically merging people and objects from multiple digital images to generate a composite digital image

Training a convolutional neural network for image retrieval with a listwise ranking loss function

Controllable conditional image generation