What technology area does this patent fall under?

Primary CPC classification G06T5/60. Mapped technology areas include Physics.

When was this patent published?

Publication date Jan 13, 2026 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Hierarchical patch-wise diffusion models for high-resolution video generation

US12524925B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12524925-B2
Application number	US-202418437790-A
Country	US
Kind code	B2
Filing date	Feb 9, 2024
Priority date	Feb 9, 2024
Publication date	Jan 13, 2026
Grant date	Jan 13, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Hierarchical patch-wise diffusion models (HPDMs) use a diffusion paradigm that learns a hierarchical distribution of patches instead of whole videos for efficient patch-wise training of diffusion models. To enforce consistency between the patches, deep context fusion may be used to propagate the context information from low-scale to high-scale patches in a hierarchical manner. To accelerate patch-wise training and inference, adaptive computation also may be used to allocate more computational resources and network capacity towards coarse image details and to cheapen synthesis of high-frequency texture details. All the processing stages are jointly trained to provide spatially aligned global context to the higher levels of the cascade. As a result, the model does not operate on the full-resolution inputs, which allows the model to be trained on high-resolution video datasets in an end-to-end fashion.
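The idea of learning a hierarchy of patches rather than whole frames can be illustrated with a small sampling sketch. It is not taken from the patent: the function names, the normalized coordinate convention, the halving factor (the first claim on this page says only that patch scale decreases exponentially), and the nearest-neighbour resampling are all assumptions chosen for illustration.

```python
import random


def sample_patch_hierarchy(num_levels, rng=None):
    """Return nested boxes (x0, y0, x1, y1) in normalized [0, 1] coordinates.

    Level 0 covers the whole frame. Each later box has half the side length
    of the previous one (assumed factor) and lies inside it, so every earlier
    patch can act as context for the later ones.
    """
    rng = rng or random.Random()
    boxes = [(0.0, 0.0, 1.0, 1.0)]
    for _ in range(1, num_levels):
        px0, py0, px1, py1 = boxes[-1]
        side = (px1 - px0) / 2.0
        # Pick a top-left corner that keeps the new box inside its parent.
        x0 = rng.uniform(px0, px1 - side)
        y0 = rng.uniform(py0, py1 - side)
        boxes.append((x0, y0, x0 + side, y0 + side))
    return boxes


def extract_patch(image, box, size):
    """Resample `box` from a 2D list of pixels to a size x size patch.

    Nearest-neighbour lookup is used for brevity (an assumption). Every level
    returns the same resolution, so no stage ever sees the full-resolution
    frame once the image is larger than `size`.
    """
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = box
    patch = []
    for row in range(size):
        v = y0 + (row + 0.5) / size * (y1 - y0)
        src_row = min(int(v * height), height - 1)
        out_row = []
        for col in range(size):
            u = x0 + (col + 0.5) / size * (x1 - x0)
            src_col = min(int(u * width), width - 1)
            out_row.append(image[src_row][src_col])
        patch.append(out_row)
    return patch


# Usage: a 64 x 64 toy frame viewed through three nested 8 x 8 patches.
frame = [[r * 64 + c for c in range(64)] for r in range(64)]
hierarchy = sample_patch_hierarchy(3, random.Random(0))
patches = [extract_patch(frame, box, 8) for box in hierarchy]
```

In this sketch the level-0 patch is a coarse view of the entire frame and the last patch is a sharper view of a small region, which is one way to read "low-scale to high-scale patches" in the abstract.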

First claim

Opening claim text (preview).

What is claimed is:

1. A method of generating images using a hierarchical patch-wise diffusion model (HPDM), comprising: training the HPDM on a dataset of at least one of videos or images, the HPDM having a hierarchical cascade-like structure including pipeline processing stages and patches that scale to decrease exponentially for each subsequent processing stage, wherein each patch comprises a continuous subgrid of pixel values extracted from an image or video that have a same resolution and include global information including a description of the image or video; sampling the image or video to extract a hierarchy of patches that are provided as input to the HPDM in such a way that a patch is located inside any previous patches so that the previous patches provide context information for the patch in each subsequent processing stage of the HPDM; providing a combination of patches and corresponding noise maps to train the HPDM to denoise all patches jointly; upsampling activations of lower resolution images of each processing stage to generate upsampled patches at higher resolutions relative to a previous processing stage to make the patches of a next processing stage of the HPDM globally coherent in a trained HPDM; and generating a synthesized image or video from a processing stage of the trained HPDM patch by patch.

2. The method of claim 1, wherein the sampling comprises grid sampling activations of the image or video with bilinear or trilinear interpolation from any previous processing stages, averaging resulting grid sampled activations, and concatenating averaged grid sampled activations to create spatially aligned features of the image or video.

3. The method of claim 2, further comprising providing the spatially aligned features to a processing stage of a recurrent interface network to create an activation tensor representative of normal network features of the image or video.

4. The method of claim 1, wherein the sampling comprises sampling patches using hierarchical overlapped sampling whereby the patches are sampled such that coordinates overlap between neighboring patches.

5. The method of claim 1, further comprising applying deep context fusion to condition a subsequent processing stage on spatially aligned, globally pooled features of any previous processing stages by pooling context information from previous processing stages into an input of the subsequent processing stage.

6. The method of claim 1, wherein training the HPDM to denoise all patches jointly includes exponentially reducing input noise scaling at each subsequent processing stage.

7. The method of claim 1, further comprising applying adaptive computation to successive processing stages whereby a subset of processing stages operate on high-resolution patches and low-resolution patches are processed by each processing stage.

8. The method of claim 7, wherein applying adaptive computation comprises skipping processing of high-resolution activations in at least one processing stage of the pipeline processing stages.

9. The method of claim 1, further comprising caching activations from previous processing stages during inference.

10. A system for generating images using a trained hierarchical patch-wise diffusion model (HPDM)-mages, comprising: a recurrent interface network (RIN) comprising a linear image tokenizer, followed by a sequence of identical attention-only pipeline processing stages and a linear detokenizer adapted to transform image tokens to red, green, blue (RGB) pixel values; and a processor adapted to provide an input image or video as an input to the RIN during training, the RIN processing the input image or video to create a hierarchical cascade-like structure including processing stages and patches that scale to decrease exponentially for each subsequent processing stage of the RIN, wherein each patch comprises a continuous subgrid of pixel values extracted from the image or video that have a same resolution and include global information including a description of the image or video, the processor further sampling the image or video to extract a hierarchy of patches that are provided as input to subsequent processing stages of the RIN in such a way that a patch is located inside any previous patches so that the previous patches provide context information for the patch in each subsequent processing stage of the RIN, wherein the RIN provides a combination of patches and corresponding noise maps to denoise all patches jointly, upsamples activations of lower resolution images of each processing stage to generate upsampled patches at higher resolutions relative to a previous processing stage to make the patches of a next processing stage of the RIN globally coherent, and generates a synthesized image or video from a processing stage of the RIN patch by patch.

11. The system of claim 10, wherein the processor grid samples activations of the image or video with bilinear or trilinear interpolation from any previous processing stages, averages resulting grid sampled activations, and concatenates averaged grid sampled activations to create spatially aligned features of the image or video.

12. The system of claim 11, wherein the processor provides the spatially aligned features to a processing stage of the RIN to create an activation tensor representative of normal network features of the image or video.

13. The system of claim 10, wherein the processor samples patches using hierarchical overlapped sampling whereby the patches are sampled such that coordinates overlap between neighboring patches.

14. The system of claim 10, wherein the RIN applies deep context fusion to condition a subsequent processing stage of the RIN on spatially aligned, globally pooled features of any previous processing stages by pooling context information from previous processing stages into an input of the subsequent processing stage.

15. The system of claim 10, wherein the RIN denoises all patches jointly during training by exponentially reducing input noise scaling at each subsequent processing stage of the pipeline processing stages.

16. The system of claim 10, wherein the RIN applies adaptive computation to successive processing stages whereby a subset of processing stages operate on high-resolution patches and low-resolution patches are processed by each processing stage.

17. The system of claim 16, wherein the RIN skips processing of high-resolution activations in at least one processing stage of the pipeline processing stages.

18. The system of claim 10, wherein the RIN caches activations from previous processing stages during inference.

19. A non-transitory computer-readable storage medium, the computer-readable storage medium including instructions that when executed by a processor cause the processor to implement a method of generating images using a hierarchical patch-wise diffusion model (HPDM), by performing operations comprising: training the HPDM on a dataset of at least one of videos or images, the HPDM having a hierarchical cascade-like structure including pipeline processing stages and patches that scale to decrease exponentially for each subsequent processing stage, wherein each patch comprises a continuous subgrid of pixel values extracted from an image or video that have a same resolution and include global information including a description of the image or video; sampling the image or video to extract a hierarchy of patches that are provided as input to the HPDM in such a way that a patch is lo
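Claims 2 and 11 describe grid sampling earlier-stage activations with bilinear interpolation, averaging them, and concatenating the result to form spatially aligned features. A minimal sketch of one possible reading follows. Everything in it is an assumption made for illustration: the function names, single-channel 2D activations, normalized box coordinates, pixel-centre sampling with edge clamping, and the choice to concatenate the averaged context with the current stage's own activations.

```python
def bilinear_sample(grid, u, v):
    """Sample a 2D list at normalized (u, v) in [0, 1].

    Values are treated as sitting at pixel centres; coordinates outside the
    outermost centres are clamped to the edge.
    """
    height, width = len(grid), len(grid[0])
    x = min(max(u * width - 0.5, 0.0), width - 1.0)
    y = min(max(v * height - 0.5, 0.0), height - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
    bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def aligned_context(prev_maps, prev_boxes, box, size):
    """Average the earlier-stage maps, each sampled at the pixel centres of `box`.

    `prev_boxes[i]` is the (x0, y0, x1, y1) region of the frame that
    `prev_maps[i]` covers; `box` is the current, smaller patch.
    """
    x0, y0, x1, y1 = box
    context = [[0.0] * size for _ in range(size)]
    for fmap, (px0, py0, px1, py1) in zip(prev_maps, prev_boxes):
        for row in range(size):
            gy = y0 + (row + 0.5) / size * (y1 - y0)  # frame coordinate
            v = (gy - py0) / (py1 - py0)              # coordinate inside the earlier patch
            for col in range(size):
                gx = x0 + (col + 0.5) / size * (x1 - x0)
                u = (gx - px0) / (px1 - px0)
                context[row][col] += bilinear_sample(fmap, u, v) / len(prev_maps)
    return context


def fuse(own, context):
    """Concatenate channel-wise: each pixel becomes [own value, context value]."""
    return [[[a, c] for a, c in zip(own_row, ctx_row)]
            for own_row, ctx_row in zip(own, context)]


# Usage: one 8 x 8 coarse map over the whole frame, fused into a 4 x 4 patch
# covering the central quarter of the frame.
coarse = [[(c + 0.5) / 8 for c in range(8)] for _ in range(8)]
ctx = aligned_context([coarse], [(0.0, 0.0, 1.0, 1.0)], (0.25, 0.25, 0.75, 0.75), 4)
features = fuse([[0.0] * 4 for _ in range(4)], ctx)
```

Because the context is sampled at the same frame coordinates as the current patch's pixels, each pixel of the finer patch is paired with the coarse information that lies directly beneath it. For video, the same idea would extend to trilinear interpolation over a time axis.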

Assignees

Snap Inc

Inventors

Classifications

Code	CPC title
G06T2207/20084	Artificial neural networks [ANN]
G06T2207/20081	Training; Learning
G06T2207/20016	Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
G06T2207/10016	Video; Image sequence
G06T3/4046	using neural networks

Patent family

Related publications grouped by family.

View patent family 94772083

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12524925B2 cover?: Hierarchical patch-wise diffusion models (HPDMs) use a diffusion paradigm that learns a hierarchical distribution of patches instead of whole videos for efficient patch-wise training of diffusion models. To enforce consistency between the patches, deep context fusion may be used to propagate the context information from low-scale to high-scale patches in a hierarchical manner. To accelerate pat…
Who is the assignee on this patent?: Snap Inc