Image processing method and image processing apparatus
US-12169910-B2 · Dec 17, 2024 · US
US12524925B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12524925-B2 |
| Application number | US-202418437790-A |
| Country | US |
| Kind code | B2 |
| Filing date | Feb 9, 2024 |
| Priority date | Feb 9, 2024 |
| Publication date | Jan 13, 2026 |
| Grant date | Jan 13, 2026 |
Official abstract text for this publication.
Hierarchical patch-wise diffusion models (HPDMs) use a diffusion paradigm that learns a hierarchical distribution of patches instead of whole videos for efficient patch-wise training of diffusion models. To enforce consistency between the patches, deep context fusion may be used to propagate the context information from low-scale to high-scale patches in a hierarchical manner. To accelerate patch-wise training and inference, adaptive computation also may be used to allocate more computational resources and network capacity towards coarse image details and to cheapen synthesis of high-frequency texture details. All the processing stages are jointly trained to provide spatially aligned global context to the higher levels of the cascade. As a result, the model does not operate on the full-resolution inputs, which allows the model to be trained on high-resolution video datasets in an end-to-end fashion.
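The nested patch hierarchy described in the abstract can be illustrated with a minimal sketch. The function names, the halving factor of 2 per level, the uniform random placement, and the nearest-neighbour resampling are all assumptions made for illustration; the publication only states that patch scale decreases exponentially, that each patch lies inside the previous ones, and that all patches share the same resolution.

```python
import random

def sample_patch_hierarchy(height, width, levels, rng=None):
    """Return one (top, left, h, w) box per level; each box lies inside the previous one.

    Hypothetical helper: the factor-of-2 shrink per level is an assumption.
    """
    rng = rng or random.Random()
    boxes = [(0, 0, height, width)]           # level 0 covers the whole frame
    for _ in range(1, levels):
        top, left, h, w = boxes[-1]
        new_h, new_w = h // 2, w // 2         # extent halves at every level
        # Place the smaller box uniformly at random inside its parent box.
        new_top = top + rng.randint(0, h - new_h)
        new_left = left + rng.randint(0, w - new_w)
        boxes.append((new_top, new_left, new_h, new_w))
    return boxes

def resample_nearest(image, box, res):
    """Resample the region `box` of `image` (a list of rows) to res x res.

    Nearest-neighbour lookup is a stand-in so that every level ends up
    with the same resolution regardless of how much of the frame it covers.
    """
    top, left, h, w = box
    return [[image[top + (i * h) // res][left + (j * w) // res]
             for j in range(res)]
            for i in range(res)]
```

For example, `sample_patch_hierarchy(64, 48, 4)` yields four boxes with extents 64x48, 32x24, 16x12 and 8x6, and passing each box to `resample_nearest(image, box, 8)` gives four 8x8 patches: the first sees the whole frame coarsely, the last sees a small region in fine detail.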
Opening claim text (preview).
What is claimed is:
1. A method of generating images using a hierarchical patch-wise diffusion model (HPDM), comprising: training the HPDM on a dataset of at least one of videos or images, the HPDM having a hierarchical cascade-like structure including pipeline processing stages and patches that scale to decrease exponentially for each subsequent processing stage, wherein each patch comprises a continuous subgrid of pixel values extracted from an image or video that have a same resolution and include global information including a description of the image or video; sampling the image or video to extract a hierarchy of patches that are provided as input to the HPDM in such a way that a patch is located inside any previous patches so that the previous patches provide context information for the patch in each subsequent processing stage of the HPDM; providing a combination of patches and corresponding noise maps to train the HPDM to denoise all patches jointly; upsampling activations of lower resolution images of each processing stage to generate upsampled patches at higher resolutions relative to a previous processing stage to make the patches of a next processing stage of the HPDM globally coherent in a trained HPDM; and generating a synthesized image or video from a processing stage of the trained HPDM patch by patch.
2. The method of claim 1, wherein the sampling comprises grid sampling activations of the image or video with bilinear or trilinear interpolation from any previous processing stages, averaging resulting grid sampled activations, and concatenating averaged grid sampled activations to create spatially aligned features of the image or video.
3. The method of claim 2, further comprising providing the spatially aligned features to a processing stage of a recurrent interface network to create an activation tensor representative of normal network features of the image or video.
4. The method of claim 1, wherein the sampling comprises sampling patches using hierarchical overlapped sampling whereby the patches are sampled such that coordinates overlap between neighboring patches.
5. The method of claim 1, further comprising applying deep context fusion to condition a subsequent processing stage on spatially aligned, globally pooled features of any previous processing stages by pooling context information from previous processing stages into an input of the subsequent processing stage.
6. The method of claim 1, wherein training the HPDM to denoise all patches jointly includes exponentially reducing input noise scaling at each subsequent processing stage.
7. The method of claim 1, further comprising applying adaptive computation to successive processing stages whereby a subset of processing stages operate on high-resolution patches and low-resolution patches are processed by each processing stage.
8. The method of claim 7, wherein applying adaptive computation comprises skipping processing of high-resolution activations in at least one processing stage of the pipeline processing stages.
9. The method of claim 1, further comprising caching activations from previous processing stages during inference.
10. A system for generating images using a trained hierarchical patch-wise diffusion model (HPDM), comprising: a recurrent interface network (RIN) comprising a linear image tokenizer, followed by a sequence of identical attention-only pipeline processing stages and a linear detokenizer adapted to transform image tokens to red, green, blue (RGB) pixel values; and a processor adapted to provide an input image or video as an input to the RIN during training, the RIN processing the input image or video to create a hierarchical cascade-like structure including processing stages and patches that scale to decrease exponentially for each subsequent processing stage of the RIN, wherein each patch comprises a continuous subgrid of pixel values extracted from the image or video that have a same resolution and include global information including a description of the image or video, the processor further sampling the image or video to extract a hierarchy of patches that are provided as input to subsequent processing stages of the RIN in such a way that a patch is located inside any previous patches so that the previous patches provide context information for the patch in each subsequent processing stage of the RIN, wherein the RIN provides a combination of patches and corresponding noise maps to denoise all patches jointly, upsamples activations of lower resolution images of each processing stage to generate upsampled patches at higher resolutions relative to a previous processing stage to make the patches of a next processing stage of the RIN globally coherent, and generates a synthesized image or video from a processing stage of the RIN patch by patch.
11. The system of claim 10, wherein the processor grid samples activations of the image or video with bilinear or trilinear interpolation from any previous processing stages, averages resulting grid sampled activations, and concatenates averaged grid sampled activations to create spatially aligned features of the image or video.
12. The system of claim 11, wherein the processor provides the spatially aligned features to a processing stage of the RIN to create an activation tensor representative of normal network features of the image or video.
13. The system of claim 10, wherein the processor samples patches using hierarchical overlapped sampling whereby the patches are sampled such that coordinates overlap between neighboring patches.
14. The system of claim 10, wherein the RIN applies deep context fusion to condition a subsequent processing stage of the RIN on spatially aligned, globally pooled features of any previous processing stages by pooling context information from previous processing stages into an input of the subsequent processing stage.
15. The system of claim 10, wherein the RIN denoises all patches jointly during training by exponentially reducing input noise scaling at each subsequent processing stage of the pipeline processing stages.
16. The system of claim 10, wherein the RIN applies adaptive computation to successive processing stages whereby a subset of processing stages operate on high-resolution patches and low-resolution patches are processed by each processing stage.
17. The system of claim 16, wherein the RIN skips processing of high-resolution activations in at least one processing stage of the pipeline processing stages.
18. The system of claim 10, wherein the RIN caches activations from previous processing stages during inference.
19. A non-transitory computer-readable storage medium, the computer-readable storage medium including instructions that when executed by a processor cause the processor to implement a method of generating images using a hierarchical patch-wise diffusion model (HPDM), by performing operations comprising: training the HPDM on a dataset of at least one of videos or images, the HPDM having a hierarchical cascade-like structure including pipeline processing stages and patches that scale to decrease exponentially for each subsequent processing stage, wherein each patch comprises a continuous subgrid of pixel values extracted from an image or video that have a same resolution and include global information including a description of the image or video; sampling the image or video to extract a hierarchy of patches that are provided as input to the HPDM in such a way that a patch is lo
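Claims 2, 5, 11 and 14 describe building spatially aligned context by grid sampling activations of earlier stages with bilinear interpolation, averaging the samples, and concatenating the result. The following is a rough sketch of that idea on single-channel 2-D grids in plain Python. The names `bilinear` and `fuse_context`, the `(top, left, h, w)` box convention, the pixel-centre coordinate mapping, and the clamping at the border are assumptions for illustration, not details taken from the claims.

```python
def bilinear(feat, y, x):
    """Bilinearly interpolate the 2-D grid `feat` at fractional (y, x).

    Coordinates are clamped to the grid (an assumed border policy).
    """
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bot = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bot * dy

def fuse_context(own, own_box, contexts):
    """Concatenate each value of `own` with the averaged, aligned context.

    own:      2-D grid of activations for the current patch
    own_box:  (top, left, h, w) region of the frame that `own` covers
    contexts: non-empty list of (grid, box) pairs from earlier stages
    Returns a grid whose cells are [own_value, averaged_context_value].
    """
    rows, cols = len(own), len(own[0])
    top, left, h, w = own_box
    fused = []
    for i in range(rows):
        fused_row = []
        for j in range(cols):
            # Frame coordinates of the centre of cell (i, j).
            fy = top + (i + 0.5) * h / rows
            fx = left + (j + 0.5) * w / cols
            samples = []
            for feat, (ctop, cleft, ch, cw) in contexts:
                # Map frame coordinates into the context grid's index space.
                gy = (fy - ctop) / ch * len(feat) - 0.5
                gx = (fx - cleft) / cw * len(feat[0]) - 0.5
                samples.append(bilinear(feat, gy, gx))
            averaged = sum(samples) / len(samples)
            fused_row.append([own[i][j], averaged])
        fused.append(fused_row)
    return fused
```

A tensor library would normally do this in one batched call over multi-channel activations; the loops here only make the coordinate alignment explicit. Each cell of the current patch picks up the context value that sits at the same frame location, which is what keeps neighbouring patches consistent with the coarser stages.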
Artificial neural networks [ANN] · CPC title
Training; Learning · CPC title
Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform · CPC title
Video; Image sequence · CPC title
using neural networks · CPC title