Generating videos using diffusion models

US12555367B2 · US · B2

Patent metadata

Field               Value
Publication number  US-12555367-B2
Application number  US-202318296938-A
Country             US
Kind code           B2
Filing date         Apr 6, 2023
Priority date       Apr 6, 2023
Publication date    Feb 17, 2026
Grant date          Feb 17, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating an output video conditioned on an input. In one aspect, a method comprises receiving the input; initializing a current intermediate representation; generating an output video by updating the current intermediate representation at each of a plurality of iterations, wherein the updating comprises, at each iteration: processing an intermediate input for the iteration comprising the current intermediate representation using a diffusion model that is configured to process the intermediate input to generate a noise output; and updating the current intermediate representation using the noise output for the iteration.
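The iterative update described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the cosine noise schedule, the ancestral update formula, the (frames, height, width, channels) layout, and the names `cosine_schedule`, `toy_noise_model`, and `sample_video` are all assumptions taken from common diffusion-sampling practice. The toy model stands in for a trained network.

```python
import numpy as np


def cosine_schedule(t):
    """Assumed variance-preserving schedule with alpha**2 + sigma**2 == 1."""
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)


def toy_noise_model(z, t):
    """Hypothetical stand-in for the trained diffusion model.

    Returns the noise estimate that is exact when the clean video is all zeros.
    """
    _, sigma = cosine_schedule(t)
    return z / sigma


def sample_video(noise_model, shape, num_steps=50, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize the current intermediate representation from pure noise.
    z = rng.standard_normal(shape)
    # Noise levels from high to low; start just below 1 so alpha stays nonzero.
    times = np.linspace(0.99, 0.0, num_steps + 1)
    for i in range(num_steps):
        t, s = times[i], times[i + 1]
        alpha_t, sigma_t = cosine_schedule(t)
        alpha_s, sigma_s = cosine_schedule(s)
        # The diffusion model maps the intermediate input to a noise output.
        eps = noise_model(z, t)
        # Prediction of the output video from the representation and the noise.
        x_hat = (z - sigma_t * eps) / alpha_t
        if i == num_steps - 1:
            return x_hat  # last iteration: no sampler step is applied
        # Assumed ancestral step: sample z_s given z_t and the prediction x_hat.
        alpha_ts = alpha_t / alpha_s
        var_ts = sigma_t**2 - alpha_ts**2 * sigma_s**2
        mean = (alpha_ts * sigma_s**2 * z + alpha_s * var_ts * x_hat) / sigma_t**2
        std = np.sqrt(max(var_ts, 0.0)) * sigma_s / sigma_t
        z = mean + std * rng.standard_normal(shape)


video = sample_video(toy_noise_model, shape=(4, 8, 8, 3), num_steps=10)
```

With the toy model the loop converges to an all-zero video, which is a convenient sanity check; a real system would pass a trained network as `noise_model`.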

First claim

Opening claim text (preview).

What is claimed is:

1. A method of generating an output video conditioned on an input, the method comprising: receiving the input, wherein the input comprises a plurality of video frames that each have a respective plurality of pixels, wherein each pixel in each video frame has one or more intensity values, and wherein at least a subset of the intensity values for the pixels in the video frames are sampled from a noise distribution; initializing a current intermediate representation based on the input; generating an output video by updating the current intermediate representation at each of a plurality of iterations, wherein the updating comprises, at each iteration: processing an intermediate input for the iteration comprising the current intermediate representation using a diffusion model that is configured to process the intermediate input to generate a noise output; and updating the current intermediate representation using the noise output for the iteration, comprising: generating a prediction of the output video using the current intermediate representation and the noise output; and for each iteration other than the last iteration: applying a diffusion sampler to the current intermediate representation and the prediction of the output video; and updating the current intermediate representation by applying the diffusion sampler.

2. The method of claim 1, wherein each iteration corresponds to a respective noise level, and wherein the intermediate input for the iteration further comprises the noise level for the iteration.

3. The method of claim 1, wherein applying the diffusion sampler comprises using a discrete time ancestral sampler.

4. The method of claim 1, wherein applying the diffusion sampler comprises alternating between an ancestral sampler step and a Langevin correction step.

5. The method of claim 1, wherein the input further comprises a conditioning signal and wherein the diffusion model is a conditional diffusion model that is conditioned on the conditioning signal.

6. The method of claim 5, wherein updating the current intermediate representation using the noise output for the iteration comprises generating a prediction of the output video using the current intermediate representation and the noise output, and wherein for each iteration other than the last iteration, updating the current intermediate representation using the noise output for the iteration further comprises: applying a diffusion sampler to the current intermediate representation and the prediction of the output video, wherein applying the diffusion sampler comprises alternating between an adjusted ancestral sampler step and an adjusted Langevin correction step.

7. The method of claim 6, wherein the output video is a longer video conditioned on the video frames in the input, wherein updating the current intermediate representation using the noise output for the iteration comprises generating an adjusted prediction of the output video that is adjusted by guidance from the video frames in the input.

8. The method of claim 6, wherein the output video is a video with a higher frame rate conditioned on the video frames in the input, wherein updating the current intermediate representation using the noise output for the iteration comprises generating an adjusted prediction of the output video that is adjusted by guidance from the video frames in the input.

9. The method of claim 6, wherein the output video is a higher resolution video conditioned on the video frames in the input, wherein updating the current intermediate representation using the noise output for the iteration comprises generating an adjusted prediction of the output video that is adjusted to account for the video frames in the input.

10. The method of claim 1, wherein the diffusion model is configured to process the current intermediate representation through a sequence comprising a plurality of convolutional network blocks to generate the noise output.

11. The method of claim 10, wherein the plurality of convolutional network blocks comprise: one or more downsampling blocks that each downsample an input to the downsampling block at each of a plurality of downsampling iterations, wherein the one or more downsampling blocks are followed in the sequence by one or more upsampling blocks that each upsample an input to the upsampling blocks at each of a plurality of upsampling iterations.

12. The method of claim 11, wherein the input to the downsampling block at a first downsampling iteration of the plurality of downsampling iterations comprises the current intermediate representation, and wherein the noise output comprises an output of the upsampling block at a last upsampling iteration of the plurality of upsampling iterations.

13. The method of claim 10, wherein the sequence further comprises network blocks that perform attention.

14. The method of claim 10, wherein each convolutional network block is configured to apply a space-only three-dimensional convolution so that a first axis indexes video frames, a second axis indexes a spatial height, and a third axis indexes a spatial width.

15. The method of claim 11, wherein at each of the plurality of upsampling iterations, the diffusion model is configured to perform operations comprising: maintaining, for the upsampling block for the upsampling iteration, a feature map from a corresponding downsampling block; applying a spatial attention block over the feature map to generate a spatial attention feature map; applying a temporal attention block over the spatial attention feature map to generate a spatial temporal attention feature map; and applying the spatial temporal attention feature map to the output of the upsampling block.

16. The method of claim 15, wherein applying a spatial attention block over the feature map to generate a spatial attention feature map comprises applying spatial attention over the values within each video frame.

17. The method of claim 15, wherein applying a temporal attention block over the spatial attention feature map to generate a spatial temporal attention feature map comprises applying temporal attention over corresponding patches across the video frames.

18. A method for training a diffusion model configured to process an intermediate input to generate a noise output, the method comprising repeatedly performing the following operations: obtaining a training example from training data as a ground-truth training output; adding noise to the training example to create a training input; generating a training noise output from the training input by processing an intermediate input including the training input using the diffusion model in accordance with current values of the parameters of the diffusion model, wherein the diffusion model comprises a sequence comprising a plurality of convolutional network blocks, and wherein the plurality of convolutional network blocks comprise one or more downsampling blocks that each downsample an input to the downsampling block at each of a plurality of downsampling iterations, wherein the one or more downsampling blocks are followed in the sequence by one or more upsampling blocks that each upsample an input to the upsampling blocks at each of a plurality of upsampling iterations; and determining updates to the parameters of the diffusion model that optimize a loss function.

19. A system comprising one or more computers and one or more storage devices storing instructions that when executed by t
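The factorized attention in claims 15 to 17 (spatial attention within each frame, then temporal attention across frames) can be illustrated with plain reshapes. This is a hedged sketch only: it uses single-head dot-product attention with no learned projections, treats every spatial position as one "patch", and omits the skip-connection and upsampling context of claim 15. The function names and the (frames, height, width, channels) layout are assumptions, not taken from the patent.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    """Dot-product self-attention over tokens of shape (batch, length, channels)."""
    scale = 1.0 / np.sqrt(tokens.shape[-1])
    weights = softmax(tokens @ tokens.transpose(0, 2, 1) * scale, axis=-1)
    return weights @ tokens


def spatial_then_temporal_attention(features):
    """features has shape (frames, height, width, channels)."""
    f, h, w, c = features.shape
    # Spatial attention: each frame is a batch item, its h*w positions attend
    # to one another, so no information crosses frame boundaries here.
    x = self_attention(features.reshape(f, h * w, c)).reshape(f, h, w, c)
    # Temporal attention: each spatial position is a batch item, and the f
    # frames at that position attend to one another.
    x = x.transpose(1, 2, 0, 3).reshape(h * w, f, c)
    x = self_attention(x).reshape(h, w, f, c).transpose(2, 0, 1, 3)
    return x


rng = np.random.default_rng(0)
feature_map = rng.standard_normal((4, 6, 6, 8))
mixed = spatial_then_temporal_attention(feature_map)
```

Splitting attention this way keeps the cost at roughly frames × (h·w)² plus (h·w) × frames², rather than (frames·h·w)² for joint attention over all positions and frames.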

Assignees

Google LLC

Inventors

Classifications

  • G06V10/771 (Primary)

    Feature selection, e.g. selecting representative features from a multi-dimensional feature space · CPC title

  • involving conversion of the spatial resolution of the incoming video signal (for graphics images G09G2340/0407) · CPC title

  • the incoming video signal comprising different parts having originally different frame rate, e.g. video and graphics · CPC title

  • Combinations of networks · CPC title

  • G06V10/82 (Primary)

    using neural networks · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12555367B2 cover?
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating an output video conditioned on an input. In one aspect, a method comprises receiving the input; initializing a current intermediate representation; generating an output video by updating the current intermediate representation at each of a plurality of iterations, wherein the updating …
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G06V10/771. Mapped technology areas include Physics.
When was this patent published?
Publication date: Feb 17, 2026 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).