Generating videos using sequences of generative neural networks

US12277758B2 · US · B2

Patent metadata
Publication number: US-12277758-B2
Application number: US-202318400856-A
Country: US
Kind code: B2
Filing date: Dec 29, 2023
Priority date: Mar 24, 2023
Publication date: Apr 15, 2025
Grant date: Apr 15, 2025

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium. In one aspect, a method includes receiving a text prompt describing a scene; processing the text prompt using a text encoder neural network to generate a contextual embedding of the text prompt; and processing the contextual embedding using a sequence of generative neural networks to generate a final video depicting the scene.
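The data flow described in the abstract can be sketched as follows. This is a minimal illustration that only tracks video shapes through the sequence, not an implementation of the patented networks. All names (`VideoShape`, `UpsamplingStage`, `encode_text`, `run_cascade`), the toy encoder, and the resolutions and factors are hypothetical and do not come from the patent text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class VideoShape:
    frames: int  # frame count for a fixed-duration clip (temporal resolution)
    height: int  # per-frame pixel height (spatial resolution)
    width: int   # per-frame pixel width (spatial resolution)


@dataclass(frozen=True)
class UpsamplingStage:
    """Stand-in for one subsequent generative network in the sequence."""
    name: str
    space_factor: int = 1
    time_factor: int = 1

    def __post_init__(self) -> None:
        if self.space_factor < 1 or self.time_factor < 1:
            raise ValueError("factors must be >= 1")
        # Each subsequent stage must raise spatial or temporal resolution.
        if self.space_factor == 1 and self.time_factor == 1:
            raise ValueError("stage must increase at least one resolution")

    def output_shape(self, shape: VideoShape) -> VideoShape:
        return VideoShape(
            frames=shape.frames * self.time_factor,
            height=shape.height * self.space_factor,
            width=shape.width * self.space_factor,
        )


def encode_text(prompt: str, dim: int = 8) -> List[float]:
    """Toy stand-in for the text encoder: a deterministic fixed-length vector."""
    codes = [ord(c) for c in prompt]
    return [(sum(codes[i::dim]) % 97) / 97.0 for i in range(dim)]


def run_cascade(
    prompt: str, initial_shape: VideoShape, stages: List[UpsamplingStage]
) -> Tuple[List[float], List[VideoShape]]:
    """Return the prompt embedding and the video shape after every stage."""
    embedding = encode_text(prompt)  # would condition the initial network
    shapes = [initial_shape]         # output of the initial network
    for stage in stages:
        shapes.append(stage.output_shape(shapes[-1]))
    return embedding, shapes


# Hypothetical configuration, chosen only for illustration.
embedding, shapes = run_cascade(
    "a short scene description",
    VideoShape(frames=16, height=24, width=40),
    [
        UpsamplingStage("spatial-1", space_factor=2),
        UpsamplingStage("temporal-1", time_factor=2),
        UpsamplingStage("spatial-2", space_factor=4),
    ],
)
for shape in shapes:
    print(shape)
```

The last entry in `shapes` corresponds to the final video, which in this sketch is the output of the last stage in the sequence.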

First claim

Opening claim text (preview).

What is claimed is:

1. A method performed by one or more computers, the method comprising: receiving a text prompt describing a scene; processing the text prompt, using a text encoder neural network, to generate a contextual embedding of the text prompt; and processing the contextual embedding, using a sequence of diffusion-based generative neural networks, to generate a final video depicting the scene, wherein the sequence of diffusion-based generative neural networks comprises: an initial diffusion-based generative neural network configured to: receive the contextual embedding; and process the contextual embedding to generate, as output, an initial output video having: (i) an initial spatial resolution, and (ii) an initial temporal resolution; and one or more subsequent diffusion-based generative neural networks each configured to: receive a respective input comprising an input video generated as output by a preceding diffusion-based generative neural network in the sequence; and process the respective input to generate, as output, a respective output video having at least one of: (i) a higher spatial resolution, or (ii) a higher temporal resolution, than the input video, wherein the diffusion-based generative neural networks have been jointly trained on training data comprising a plurality of training examples that each include: (i) respective input text describing a respective scene, and (ii) a corresponding target video depicting the respective scene, wherein the text encoder neural network is pre-trained and was held frozen during the joint training of the diffusion-based generated neural networks, and wherein training each subsequent diffusion-based generative neural network on the training data comprised: obtaining a resized version of the respective target video of each training example to generate: (i) a respective training input video, and (ii) a corresponding target output video; and training the subsequent diffusion-based generative neural network using the training input and target output videos of each training example.

2. The method of claim 1, wherein training the subsequent diffusion-based generative neural network using the training input and target output videos of each training example comprised: processing the respective training input video of each training example, using the subsequent diffusion-based generative neural network, to generate a respective training output video; calculating a gradient of an objective function that depends on the training and target output videos of each training example; and updating a set of network parameters of the subsequent diffusion-based generative neural network according to the gradient of the objective function.

3. The method of claim 1, wherein training the initial diffusion-based generative neural network on the training data comprised: processing the respective input text of each training example, using the text encoder neural network, to generate a respective contextual embedding of the respective input text; resizing the respective target video of each training example to generate a corresponding initial target output video; and training the initial diffusion-based generative neural network using the contextual embedding and initial target output video of each training example.

4. The method of claim 3, wherein training the initial diffusion-based generative neural network using the contextual embedding and initial target output video of each training example comprised: processing the respective contextual embedding of each training example, using the initial diffusion-based generative neural network, to generate a respective initial training output video; calculating a gradient of an objective function that depends on the initial training and target output videos of each training example; and updating a set of network parameters of the initial diffusion-based generative neural network according to the gradient of the objective function.

5. The method of claim 3, wherein: the respective input of each subsequent diffusion-based generative neural network further comprises the contextual embedding of the text prompt, and training each subsequent diffusion-based generative neural network on the training data further comprised: training the subsequent diffusion-based generative neural network using the contextual embedding of each training example.

6. The method of claim 5, wherein training the subsequent generative neural network using the contextual embedding, training input video, and target output video of each training example comprised: generating, for each training example, a respective input comprising: (i) the respective training input video, and (ii) the respective input contextual embedding; processing the respective input of each training example, using the subsequent diffusion-based generative neural network, to generate a respective training output video; calculating a gradient of an objective function that depends on the training and target output videos of each training example; and updating a set of network parameters of the subsequent diffusion-based generative neural network according to the gradient of the objective function.

7. The method of claim 1, wherein: the one or more subsequent diffusion-based generative neural networks are a plurality of subsequent diffusion-based generative neural networks, and the respective output video of each of the plurality of subsequent diffusion-based generative neural networks has one of: (i) a higher spatial resolution, or (ii) a higher temporal resolution, than the input video.

8. The method of claim 7, wherein: the initial diffusion-based generative neural network implements spatial self-attention and temporal self-attention, and each subsequent diffusion-based generative neural network implements spatial convolution and temporal convolution.

9. The method of claim 8, wherein: the initial diffusion-based generative neural network further implements spatial convolution, and each subsequent diffusion-based generative neural network that is not a final diffusion-based generative neural network in the sequence further implements spatial self-attention.

10. The method of claim 1, wherein the diffusion-based generative neural networks were trained using classifier-free guidance.

11. The method of claim 1, wherein each diffusion-based generative neural network uses a v-prediction parametrization to generate the respective output video.

12. The method of claim 11, wherein each diffusion-based generative neural network further uses progressive distillation to generate the respective output video.

13. The method of claim 1, wherein each subsequent diffusion-based generative neural network applies noise conditioning augmentation on the input video.

14. The method of claim 1, wherein the final video is the respective output video of a final diffusion-based generative neural network in the sequence.

15. The method of claim 1, wherein the initial spatial resolution of the initial output video corresponds to an initial per frame pixel resolution, with higher spatial resolutions corresponding to higher per frame pixel resolutions.

16. The method of claim 1, wherein the initial temporal resolution of the initial output video corresponds to an initial framerate, with higher temporal resolutions corresponding to higher framerates.

17. A system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more comput
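Claim 1 says each subsequent network is trained on pairs obtained by resizing the target video of a training example. A minimal sketch of one way to build such a pair is shown here. The claim text does not specify the resizing method, so average pooling for spatial downsampling and frame skipping for temporal downsampling are assumptions, and the names `downsample_space`, `downsample_time`, and `make_training_pair` are hypothetical.

```python
from typing import List, Tuple

Frame = List[List[float]]  # one frame: rows of pixel intensities
Video = List[Frame]        # a video: a list of frames


def downsample_space(video: Video, factor: int) -> Video:
    """Average-pool every frame over factor x factor pixel blocks."""
    pooled_video = []
    for frame in video:
        height, width = len(frame), len(frame[0])
        if height % factor or width % factor:
            raise ValueError("frame size must be divisible by factor")
        pooled_video.append([
            [
                sum(
                    frame[y * factor + dy][x * factor + dx]
                    for dy in range(factor)
                    for dx in range(factor)
                ) / factor ** 2
                for x in range(width // factor)
            ]
            for y in range(height // factor)
        ])
    return pooled_video


def downsample_time(video: Video, factor: int) -> Video:
    """Keep every factor-th frame, which lowers the frame rate."""
    return video[::factor]


def make_training_pair(
    target_video: Video, space_factor: int = 1, time_factor: int = 1
) -> Tuple[Video, Video]:
    """Build (training input video, target output video) for one stage.

    Assumes target_video is already at the stage's output resolution.
    """
    target_output = target_video
    training_input = downsample_time(
        downsample_space(target_output, space_factor), time_factor
    )
    return training_input, target_output


# Toy target video: 4 frames of 4x4 pixels with distinct values.
video = [
    [[float(t * 16 + y * 4 + x) for x in range(4)] for y in range(4)]
    for t in range(4)
]
train_in, train_out = make_training_pair(video, space_factor=2, time_factor=2)
print(len(train_in), len(train_in[0]), len(train_in[0][0]))  # 2 2 2
print(train_in[0])  # [[2.5, 4.5], [10.5, 12.5]]
```

A stage trained on such pairs learns to map the lower-resolution input back to the higher-resolution target, which matches the role of the subsequent networks at inference time.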

Assignees

Inventors

Classifications

  • based on super-resolution, i.e. the output image resolution being higher than the sensor resolution · CPC title

  • Combinations of networks · CPC title

  • G06V10/82 · Primary

    using neural networks · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12277758B2 cover?
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium. In one aspect, a method includes receiving a text prompt describing a scene; processing the text prompt using a text encoder neural network to generate a contextual embedding of the text prompt; and processing the contextual embedding using a sequence of generative neural networks to generate a fi…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G06V10/82. Mapped technology areas include Physics.
When was this patent published?
Publication date: Apr 15, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).