Generating images using sequences of generative neural networks

US12482160B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12482160-B2
Application number: US-202418624960-A
Country: US
Kind code: B2
Filing date: Apr 2, 2024
Priority date: May 19, 2022
Publication date: Nov 25, 2025
Grant date: Nov 25, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for generating images. In one aspect, a method includes: receiving an input text prompt including a sequence of text tokens in a natural language; processing the input text prompt using a text encoder neural network to generate a set of contextual embeddings of the input text prompt; and processing the contextual embeddings through a sequence of generative neural networks to generate a final output image that depicts a scene that is described by the input text prompt.
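The data flow in the abstract (prompt, then contextual embeddings, then a sequence of generative networks) can be illustrated with a toy sketch. Everything below is a hypothetical stand-in and not the patented implementation: `toy_text_encoder`, `toy_base_generator`, and `toy_upsampler` are invented names, no neural network is involved, and the choice to double the resolution at each stage is an assumption taken from the first claim's requirement that each later network outputs a higher-dimensional representation.

```python
import hashlib
from typing import List

Embedding = List[float]
Image = List[List[float]]  # toy grayscale grid standing in for an image


def toy_text_encoder(prompt: str, dim: int = 4) -> List[Embedding]:
    """Hypothetical encoder: one deterministic vector per whitespace token."""
    embeddings = []
    for token in prompt.split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        embeddings.append([byte / 255.0 for byte in digest[:dim]])
    return embeddings


def toy_base_generator(embeddings: List[Embedding], size: int = 4) -> Image:
    """Stand-in for the initial network: a small grid derived from the embeddings."""
    flat = [value for emb in embeddings for value in emb]
    mean = sum(flat) / len(flat) if flat else 0.0
    return [[mean] * size for _ in range(size)]


def toy_upsampler(image: Image, embeddings: List[Embedding], factor: int = 2) -> Image:
    """Stand-in for a subsequent network: nearest-neighbour upsampling.

    A real stage would also condition on `embeddings`; this toy ignores them.
    """
    return [
        [pixel for pixel in row for _ in range(factor)]
        for row in image
        for _ in range(factor)
    ]


def generate(prompt: str, num_upsamplers: int = 2) -> Image:
    """Run the cascade: encode once, generate a small image, then upsample."""
    embeddings = toy_text_encoder(prompt)
    image = toy_base_generator(embeddings)
    for _ in range(num_upsamplers):
        image = toy_upsampler(image, embeddings)
    return image
```

Calling `generate("a cat on a mat")` returns a 16 x 16 grid here (4 x 4 from the base stage, doubled twice); the point is only the ordering of the stages, not the image content.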

First claim

Opening claim text (preview).

What is claimed is:

1. A method performed by one or more computers, the method comprising: receiving a text prompt describing a scene; processing the text prompt, using a frozen text encoder neural network, to generate a set of contextual embeddings of the text prompt; and processing the contextual embeddings, using a sequence of neural networks, to generate an image depicting the scene, wherein the sequence of neural networks comprises: an initial generative neural network configured to: receive the contextual embeddings; and process the contextual embeddings to generate, as output, an initial representation of the image having an initial dimensionality; one or more subsequent generative neural networks proceeding the initial generative neural network, each subsequent generative neural network configured to: receive a respective input comprising an input representation of the image generated as output by a preceding neural network in the sequence; and process the respective input to generate, as output, a respective output representation of the image having higher dimensionality than the input representation; and a final neural network following the one or more subsequent generative neural networks, the final neural network configured to: receive an output representation of the image generated by a subsequent generative neural network in the sequence; and process the output representation of the image to generate, as output, the image depicting the scene; wherein the frozen text encoder neural network was held frozen during training of the initial generative neural network and the one or more subsequent generative neural networks, and wherein the training of the initial generative neural network comprises: obtaining a plurality of training examples that each comprise: (i) a respective training text prompt, and (ii) a respective ground truth image that depicts a scene described by the respective training text prompt; pre-computing contextual embeddings of the training text prompts for the plurality of training examples using the frozen text encoder neural network; and training the initial generative neural network on the plurality of training examples using the pre-computed contextual embeddings of the training text prompts without re-computing the contextual embeddings of the training text prompts or modifying parameter values of the frozen text encoder neural network.

2. The method of claim 1, wherein the image has higher dimensionality than each representation of the image.

3. The method of claim 1, wherein each representation of the image is a respective compressed representation of the image.

4. The method of claim 3, wherein each compressed representation of the image is a respective latent representation of the image.

5. The method of claim 1, wherein each representation of the image is a respective pixel representation of the image.

6. The method of claim 1, wherein the respective input of each subsequent generative neural network further comprises the contextual embeddings.

7. The method of claim 1, wherein the initial generative neural network and each subsequent generative neural network is a diffusion-based generative neural network.

8. The method of claim 7, wherein the initial diffusion-based generative neural network and each subsequent diffusion-based generative neural network is parameterized in continuous time.

9. The method of claim 7, wherein the initial diffusion-based generative neural network and each subsequent diffusion-based generative neural network is a denoising diffusion model.

10. The method of claim 7, wherein the initial diffusion-based generative neural network and each subsequent diffusion-based generative neural network has been trained using classifier-free guidance.

11. The method of claim 7, wherein the initial diffusion-based generative neural network and each subsequent diffusion-based generative neural network has a convolutional neural network architecture.

12. The method of claim 11, wherein the convolutional neural network architecture is a U-Net architecture.

13. The method of claim 1, wherein the initial generative neural network and each subsequent generative neural network has 400 million or more network parameters.

14. The method of claim 1, wherein the final neural network is a decoder neural network.

15. The method of claim 1, wherein the neural networks in the sequence have been trained jointly on the plurality of training examples.

16. The method of claim 1, wherein the initial generative neural network, or the one or more subsequent generative neural networks, or both, are diffusion-based generative neural networks that have been trained by progressive distillation.

17. The method of claim 16, wherein the initial generative neural network, or the one or more subsequent generative neural networks, or both, implement v-parametrization.

18. A system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations comprising: receiving a text prompt describing a scene; processing the text prompt, using a frozen text encoder neural network, to generate a set of contextual embeddings of the text prompt; and processing the contextual embeddings, using a sequence of neural networks, to generate an image depicting the scene, wherein the sequence of neural networks comprises: an initial generative neural network configured to: receive the contextual embeddings; and process the contextual embeddings to generate, as output, an initial representation of the image having an initial dimensionality; one or more subsequent generative neural networks proceeding the initial generative neural network, each subsequent generative neural network configured to: receive a respective input comprising an input representation of the image generated as output by a preceding neural network in the sequence; and process the respective input to generate, as output, a respective output representation of the image having higher dimensionality than the input representation; and a final neural network following the one or more subsequent generative neural networks, the final neural network configured to: receive an output representation of the image generated by a subsequent generative neural network in the sequence; and process the output representation of the image to generate, as output, the image depicting the scene; wherein the frozen text encoder neural network was held frozen during training of the initial generative neural network and the one or more subsequent generative neural networks, and wherein the training of the initial generative neural network comprises: obtaining a plurality of training examples that each comprise: (i) a respective training text prompt, and (ii) a respective ground truth image that depicts a scene described by the respective training text prompt; pre-computing contextual embeddings of the training text prompts for the plurality of training examples using the frozen text encoder neural network; and training the initial generative neural network on the plurality of training examples using the pre-computed contextual embeddings of the training text prompts without re-computing the contextual embeddings of the training text prompts or modifying parameter values of the frozen text encoder neural network.

19. One or more non-transitory computer-rea…
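The training limitation in claims 1 and 18 (embeddings computed once with a frozen encoder, then reused without re-computation or encoder updates) can be illustrated with a small caching sketch. This is a toy under stated assumptions, not the patented training procedure: `FrozenToyEncoder`, `precompute_embeddings`, and `train_generator` are hypothetical names, the "generator" is a two-parameter linear model, and each ground truth image is replaced by a single scalar target.

```python
from typing import Dict, List, Tuple

Example = Tuple[str, float]  # (training prompt, scalar stand-in for the ground truth image)


class FrozenToyEncoder:
    """Hypothetical frozen encoder: fixed weights plus a call counter."""

    def __init__(self) -> None:
        self.weights = (0.5, 0.25)  # immutable tuple: never updated in training
        self.calls = 0

    def encode(self, prompt: str) -> List[float]:
        self.calls += 1
        num_chars = len(prompt)
        num_words = len(prompt.split())
        return [self.weights[0] * num_chars, self.weights[1] * num_words]


def precompute_embeddings(
    encoder: FrozenToyEncoder, examples: List[Example]
) -> Dict[str, List[float]]:
    """Encode every training prompt exactly once, before training starts."""
    return {prompt: encoder.encode(prompt) for prompt, _ in examples}


def train_generator(
    examples: List[Example],
    cache: Dict[str, List[float]],
    epochs: int = 3,
    lr: float = 0.001,
) -> List[float]:
    """Toy SGD on squared error; reads embeddings from the cache only."""
    params = [0.0, 0.0]
    for _ in range(epochs):
        for prompt, target in examples:
            emb = cache[prompt]  # lookup, no encoder call
            pred = sum(p * e for p, e in zip(params, emb))
            err = pred - target
            # Only the generator's parameters receive a gradient step.
            params = [p - lr * err * e for p, e in zip(params, emb)]
    return params
```

With two training examples and any number of epochs, `encoder.calls` stays at 2 after `precompute_embeddings`, since the training loop never touches the encoder. That is the cost saving this sketch is meant to show.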

Assignees

Google LLC

Inventors

Classifications

  • G06V10/82 · Primary

    using neural networks · CPC title

  • Syntactic or structural pattern recognition, e.g. symbolic string recognition · CPC title

  • Denoising; Smoothing · CPC title

  • based on super-resolution, i.e. the output image resolution being higher than the sensor resolution · CPC title

  • Learning methods · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12482160B2 cover?
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for generating images. In one aspect, a method includes: receiving an input text prompt including a sequence of text tokens in a natural language; processing the input text prompt using a text encoder neural network to generate a set of contextual embeddings of the input text prompt; and processin…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G06V10/82. Mapped technology areas include Physics.
When was this patent published?
Publication date Nov 25, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).