What technology area does this patent fall under?

Primary CPC classification G06V10/82. Mapped technology areas include Physics.

When was this patent published?

Publication date: Apr 15, 2025 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Generating videos using sequences of generative neural networks

US12277758B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12277758-B2
Application number	US-202318400856-A
Country	US
Kind code	B2
Filing date	Dec 29, 2023
Priority date	Mar 24, 2023
Publication date	Apr 15, 2025
Grant date	Apr 15, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium. In one aspect, a method includes receiving a text prompt describing a scene; processing the text prompt using a text encoder neural network to generate a contextual embedding of the text prompt; and processing the contextual embedding using a sequence of generative neural networks to generate a final video depicting the scene.
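The cascade described in the abstract can be pictured as shape bookkeeping: an initial model produces a low-resolution video, and each later model raises either the per-frame pixel resolution or the frame count. The sketch below is illustrative only. It tracks shapes and omits the text encoder and the diffusion sampling. The names `VideoShape`, `spatial_sr`, `temporal_sr`, and `run_cascade`, the stage order, and all numbers are assumptions made for this example and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class VideoShape:
    """Shape of a video: number of frames and per-frame pixel resolution."""
    frames: int
    height: int
    width: int


# A stage maps the shape of its input video to the shape of its output video.
Stage = Callable[[VideoShape], VideoShape]


def spatial_sr(factor: int) -> Stage:
    """Hypothetical stage that raises the per-frame pixel resolution."""
    return lambda v: VideoShape(v.frames, v.height * factor, v.width * factor)


def temporal_sr(factor: int) -> Stage:
    """Hypothetical stage that raises the number of frames."""
    return lambda v: VideoShape(v.frames * factor, v.height, v.width)


def run_cascade(initial: VideoShape, stages: List[Stage]) -> List[VideoShape]:
    """Feed each stage the output of the preceding one; return every shape."""
    shapes = [initial]
    for stage in stages:
        shapes.append(stage(shapes[-1]))
    return shapes


# Illustrative numbers only; the patent text on this page gives none.
initial = VideoShape(frames=16, height=24, width=40)
stages = [temporal_sr(2), spatial_sr(2), temporal_sr(2), spatial_sr(4)]
trace = run_cascade(initial, stages)
final = trace[-1]
print(final)
```

Each entry in `trace` is the output of one stage, so the last entry plays the role of the final video. Every stage changes exactly one axis here, which is one possible arrangement rather than a requirement stated in the abstract.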

First claim

Opening claim text (preview).

What is claimed is:

1. A method performed by one or more computers, the method comprising: receiving a text prompt describing a scene; processing the text prompt, using a text encoder neural network, to generate a contextual embedding of the text prompt; and processing the contextual embedding, using a sequence of diffusion-based generative neural networks, to generate a final video depicting the scene, wherein the sequence of diffusion-based generative neural networks comprises: an initial diffusion-based generative neural network configured to: receive the contextual embedding; and process the contextual embedding to generate, as output, an initial output video having: (i) an initial spatial resolution, and (ii) an initial temporal resolution; and one or more subsequent diffusion-based generative neural networks each configured to: receive a respective input comprising an input video generated as output by a preceding diffusion-based generative neural network in the sequence; and process the respective input to generate, as output, a respective output video having at least one of: (i) a higher spatial resolution, or (ii) a higher temporal resolution, than the input video, wherein the diffusion-based generative neural networks have been jointly trained on training data comprising a plurality of training examples that each include: (i) respective input text describing a respective scene, and (ii) a corresponding target video depicting the respective scene, wherein the text encoder neural network is pre-trained and was held frozen during the joint training of the diffusion-based generated neural networks, and wherein training each subsequent diffusion-based generative neural network on the training data comprised: obtaining a resized version of the respective target video of each training example to generate: (i) a respective training input video, and (ii) a corresponding target output video; and training the subsequent diffusion-based generative neural network using the training input and target output videos of each training example.

2. The method of claim 1, wherein training the subsequent diffusion-based generative neural network using the training input and target output videos of each training example comprised: processing the respective training input video of each training example, using the subsequent diffusion-based generative neural network, to generate a respective training output video; calculating a gradient of an objective function that depends on the training and target output videos of each training example; and updating a set of network parameters of the subsequent diffusion-based generative neural network according to the gradient of the objective function.

3. The method of claim 1, wherein training the initial diffusion-based generative neural network on the training data comprised: processing the respective input text of each training example, using the text encoder neural network, to generate a respective contextual embedding of the respective input text; resizing the respective target video of each training example to generate a corresponding initial target output video; and training the initial diffusion-based generative neural network using the contextual embedding and initial target output video of each training example.

4. The method of claim 3, wherein training the initial diffusion-based generative neural network using the contextual embedding and initial target output video of each training example comprised: processing the respective contextual embedding of each training example, using the initial diffusion-based generative neural network, to generate a respective initial training output video; calculating a gradient of an objective function that depends on the initial training and target output videos of each training example; and updating a set of network parameters of the initial diffusion-based generative neural network according to the gradient of the objective function.

5. The method of claim 3, wherein: the respective input of each subsequent diffusion-based generative neural network further comprises the contextual embedding of the text prompt, and training each subsequent diffusion-based generative neural network on the training data further comprised: training the subsequent diffusion-based generative neural network using the contextual embedding of each training example.

6. The method of claim 5, wherein training the subsequent generative neural network using the contextual embedding, training input video, and target output video of each training example comprised: generating, for each training example, a respective input comprising: (i) the respective training input video, and (ii) the respective input contextual embedding; processing the respective input of each training example, using the subsequent diffusion-based generative neural network, to generate a respective training output video; calculating a gradient of an objective function that depends on the training and target output videos of each training example; and updating a set of network parameters of the subsequent diffusion-based generative neural network according to the gradient of the objective function.

7. The method of claim 1, wherein: the one or more subsequent diffusion-based generative neural networks are a plurality of subsequent diffusion-based generative neural networks, and the respective output video of each of the plurality of subsequent diffusion-based generative neural networks has one of: (i) a higher spatial resolution, or (ii) a higher temporal resolution, than the input video.

8. The method of claim 7, wherein: the initial diffusion-based generative neural network implements spatial self-attention and temporal self-attention, and each subsequent diffusion-based generative neural network implements spatial convolution and temporal convolution.

9. The method of claim 8, wherein: the initial diffusion-based generative neural network further implements spatial convolution, and each subsequent diffusion-based generative neural network that is not a final diffusion-based generative neural network in the sequence further implements spatial self-attention.

10. The method of claim 1, wherein the diffusion-based generative neural networks were trained using classifier-free guidance.

11. The method of claim 1, wherein each diffusion-based generative neural network uses a v-prediction parametrization to generate the respective output video.

12. The method of claim 11, wherein each diffusion-based generative neural network further uses progressive distillation to generate the respective output video.

13. The method of claim 1, wherein each subsequent diffusion-based generative neural network applies noise conditioning augmentation on the input video.

14. The method of claim 1, wherein the final video is the respective output video of a final diffusion-based generative neural network in the sequence.

15. The method of claim 1, wherein the initial spatial resolution of the initial output video corresponds to an initial per frame pixel resolution, with higher spatial resolutions corresponding to higher per frame pixel resolutions.

16. The method of claim 1, wherein the initial temporal resolution of the initial output video corresponds to an initial framerate, with higher temporal resolutions corresponding to higher framerates.

17. A system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more comput
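Claim 1 describes training each subsequent network on pairs derived by resizing the target video of a training example. A minimal sketch of that idea follows, assuming the target output video is the original target video and the training input is a strided subsample of it. The resize method is not specified in the text on this page, so the strided subsampling and the names `downsample_spatial`, `downsample_temporal`, `make_training_pair`, and `shape` are hypothetical choices made for illustration.

```python
from typing import List, Tuple

Frame = List[List[float]]  # height x width grid of pixel values
Video = List[Frame]        # sequence of frames


def downsample_spatial(video: Video, factor: int) -> Video:
    """Keep every `factor`-th row and column of each frame (assumed resize)."""
    return [[row[::factor] for row in frame[::factor]] for frame in video]


def downsample_temporal(video: Video, factor: int) -> Video:
    """Keep every `factor`-th frame (assumed resize)."""
    return video[::factor]


def make_training_pair(
    target_video: Video, spatial_factor: int = 1, temporal_factor: int = 1
) -> Tuple[Video, Video]:
    """Return (training input video, target output video) for one example."""
    target_output = target_video  # assumption: the target is used at full size
    training_input = downsample_temporal(
        downsample_spatial(target_video, spatial_factor), temporal_factor
    )
    return training_input, target_output


def shape(video: Video) -> Tuple[int, int, int]:
    """Return (frames, height, width) of a non-empty video."""
    return (len(video), len(video[0]), len(video[0][0]))


# Toy target video: 8 frames of 4 x 6 pixels; pixel value equals frame index.
toy = [[[float(t) for _ in range(6)] for _ in range(4)] for t in range(8)]
train_in, target_out = make_training_pair(toy, spatial_factor=2, temporal_factor=2)
print(shape(train_in), shape(target_out))
```

A network trained on such pairs learns to map the lower-resolution input to the higher-resolution target, which matches the role the claim assigns to each subsequent network in the sequence.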

Assignees

Google LLC

Inventors

Classifications

G06T3/4053
based on super-resolution, i.e. the output image resolution being higher than the sensor resolution · CPC title
G06N3/045
Combinations of networks · CPC title
G06V10/82 (Primary)
using neural networks · CPC title

Patent family

Related publications grouped by family.

View patent family 89908411

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12277758B2 cover?: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium. In one aspect, a method includes receiving a text prompt describing a scene; processing the text prompt using a text encoder neural network to generate a contextual embedding of the text prompt; and processing the contextual embedding using a sequence of generative neural networks to generate a fi…
Who is the assignee on this patent?: Google LLC

Related patents

Technique for sensor data based medical examination report generation

Image generation using one or more neural networks

Controllable conditional image generation
