Utilizing regularized forward diffusion for improved inversion of digital images
US-2024338799-A1 · Oct 10, 2024 · US
US12333636B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12333636-B2 |
| Application number | US-202318178194-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 3, 2023 |
| Priority date | Mar 3, 2023 |
| Publication date | Jun 17, 2025 |
| Grant date | Jun 17, 2025 |
Official abstract text for this publication.
The present disclosure relates to systems, non-transitory computer-readable media, and methods for utilizing machine learning models to generate modified digital images. In particular, in some embodiments, the disclosed systems generate image editing directions between textual identifiers of two visual features utilizing a language prediction machine learning model and a text encoder. In some embodiments, the disclosed systems generate an inversion of a digital image utilizing a regularized inversion model to guide forward diffusion of the digital image. In some embodiments, the disclosed systems utilize cross-attention guidance to preserve structural details of a source digital image when generating a modified digital image with a diffusion neural network.
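The regularized inversion mentioned in the abstract is tied in claim 3 to an auto-correlation regularization loss on the noise maps produced during forward diffusion. The text shown here does not give the formula, so the following is only a minimal sketch under an assumed form: the loss sums the squared correlation between a noise map and circularly shifted copies of itself, which is near zero for white Gaussian noise and grows when neighbouring pixels are correlated. The function name `autocorrelation_loss` and the `max_shift` parameter are hypothetical.

```python
import numpy as np

def autocorrelation_loss(noise, max_shift=4):
    """Penalise correlation between a noise map and shifted copies of itself.

    noise: array of shape (channels, height, width).
    Assumed form of the loss; the patent text shown here does not define it.
    """
    loss = 0.0
    for shift in range(1, max_shift + 1):
        for axis in (1, 2):  # vertical and horizontal shifts
            # np.roll wraps around, so this is a circular shift.
            shifted = np.roll(noise, shift, axis=axis)
            # Per-channel mean product estimates the auto-correlation at this shift.
            corr = (noise * shifted).mean(axis=(1, 2))
            loss += float((corr ** 2).sum())
    return loss

rng = np.random.default_rng(0)
white = rng.standard_normal((4, 32, 32))

# Build a spatially correlated map by mixing each pixel with two neighbours,
# then rescale it to unit standard deviation.
smooth = white + np.roll(white, 1, axis=2) + np.roll(white, 1, axis=1)
smooth = smooth / smooth.std()

white_loss = autocorrelation_loss(white)
smooth_loss = autocorrelation_loss(smooth)
print(white_loss, smooth_loss)
```

In an inversion loop, a penalty of this kind could be applied to each predicted noise map so that the final inversion stays close to uncorrelated Gaussian noise; how the disclosed system weights or schedules it is not stated in the text shown here.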
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method comprising: generating a reference cross-attention map between a reference encoding of a source digital image and an intermediate image reconstruction prediction generated utilizing a reconstruction denoising layer of a diffusion neural network; generating an editing cross-attention map between an image editing encoding and an intermediate edited image prediction generated utilizing an image editing denoising layer of the diffusion neural network; and generating a modified digital image, utilizing the diffusion neural network, by updating the intermediate edited image prediction to reduce a cross-attention loss determined by comparing the editing cross-attention map and the reference cross-attention map.
2. The computer-implemented method of claim 1, further comprising: generating an inversion of the source digital image utilizing diffusion layers of the diffusion neural network; and generating, utilizing the reconstruction denoising layer of the diffusion neural network, the intermediate image reconstruction prediction by denoising the inversion of the source digital image utilizing the reference encoding.
3. The computer-implemented method of claim 2, wherein generating the inversion of the source digital image utilizing the diffusion layers of the diffusion neural network comprises iteratively generating a plurality of subsequent noise maps from the source digital image based on an auto-correlation regularization loss.
4. The computer-implemented method of claim 2, further comprising generating, utilizing the image editing denoising layer of the diffusion neural network, the intermediate edited image prediction by denoising the inversion of the source digital image utilizing the image editing encoding.
5. The computer-implemented method of claim 1, further comprising generating, utilizing a text encoder, the reference encoding from an image caption describing the source digital image.
6. The computer-implemented method of claim 5, wherein generating the reference encoding comprises generating the image caption for the source digital image utilizing a vision-language machine learning model.
7. The computer-implemented method of claim 5, further comprising generating the image editing encoding by combining the reference encoding and an image editing direction.
8. The computer-implemented method of claim 1, wherein updating the intermediate edited image prediction to reduce the cross-attention loss comprises reducing a difference between the editing cross-attention map and the reference cross-attention map.
9. A system comprising: one or more memory devices; and one or more processors coupled to the one or more memory devices that cause the system to perform operations comprising: generating, utilizing a reconstruction denoising layer of a diffusion neural network, an intermediate image reconstruction prediction from an inversion of a source digital image; creating a reference cross-attention map between a reference encoding of the source digital image and the intermediate image reconstruction prediction; generating, utilizing an image editing denoising layer of the diffusion neural network, an intermediate edited image prediction from the inversion and an image editing encoding; creating an editing cross-attention map between the image editing encoding and the intermediate edited image prediction; and generating a modified digital image, utilizing the diffusion neural network, by updating the intermediate edited image prediction to reduce a cross-attention loss determined by comparing the editing cross-attention map and the reference cross-attention map.
10. The system of claim 9, wherein generating the intermediate image reconstruction prediction comprises denoising the inversion conditioned on the reference encoding utilizing a conditioning mechanism with the reconstruction denoising layer of the diffusion neural network.
11. The system of claim 9, wherein the operations further comprise: generating, utilizing a language prediction machine learning model, an embedded image editing direction between a source visual feature portrayed within the source digital image and a target visual feature; and generating the image editing encoding based on the embedded image editing direction.
12. The system of claim 11, wherein the operations further comprise: generating, utilizing a text encoder, the reference encoding from an image caption describing the source digital image; and generating the image editing encoding by combining the embedded image editing direction with the reference encoding.
13. The system of claim 9, wherein generating the modified digital image comprises: determining the cross-attention loss based on a difference between the editing cross-attention map and the reference cross-attention map; generating a modified intermediate edited image prediction by modifying the intermediate edited image prediction to reduce the cross-attention loss; and generating the modified digital image from the modified intermediate edited image prediction utilizing additional denoising layers of the diffusion neural network.
14. The system of claim 13, wherein generating the modified digital image comprises: generating, utilizing an additional image editing denoising layer of the additional denoising layers, an additional intermediate edited image prediction from the modified intermediate edited image prediction; and generating, utilizing an additional reconstruction denoising layer of the additional denoising layers, an additional intermediate reconstruction prediction from the intermediate image reconstruction prediction.
15. The system of claim 14, wherein generating the modified digital image comprises: creating an additional reference cross-attention map between the reference encoding and the additional intermediate image reconstruction prediction; creating an additional editing cross-attention map between the image editing encoding and the additional intermediate edited image prediction; and generating the modified digital image by modifying the additional intermediate edited image prediction by comparing the additional editing cross-attention map and the additional reference cross-attention map.
16. A non-transitory computer readable medium storing instructions thereon that, when executed by at least one processor, cause the at least one processor to perform operations comprising: generating a reference cross-attention map between a reference encoding of a source digital image and an intermediate image reconstruction prediction generated utilizing a reconstruction denoising layer of a diffusion neural network; generating an editing cross-attention map between an image editing encoding and an intermediate edited image prediction generated utilizing an image editing denoising layer of the diffusion neural network; and generating a modified digital image, utilizing the diffusion neural network, by updating the intermediate edited image prediction to reduce a cross-attention loss determined by comparing the editing cross-attention map and the reference cross-attention map.
17. The non-transitory computer readable medium of claim 16, wherein the operations further comprise: generating, utilizing the reconstruction denoising layer of the diffusion neural network, the intermediate image reconstruction prediction from an inversion of the source digital image; and generating, utilizing the image editing denoising layer of the diffusion neural network, the intermediate edited image predict
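The guidance step recited in claim 1 (compare an editing cross-attention map with a reference cross-attention map, then update the intermediate edited image prediction to reduce the difference) can be illustrated with a toy sketch. It is not the claimed implementation: a single softmax attention with random projections `w_q` and `w_k` stands in for a denoising layer of a diffusion neural network, the loss is assumed to be a sum of squared differences, "combining" the reference encoding with an editing direction is assumed to be addition, and all names, sizes, and the step size are hypothetical.

```python
import numpy as np

def attention_map(latent, encoding, w_q, w_k):
    """Softmax cross-attention of each latent position over the text tokens."""
    keys = encoding @ w_k
    logits = (latent @ w_q) @ keys.T / np.sqrt(latent.shape[1])
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True), keys

def attention_loss_and_grad(latent, encoding, reference_map, w_q, w_k):
    """Squared difference to the reference map and its gradient w.r.t. latent."""
    attn, keys = attention_map(latent, encoding, w_q, w_k)
    diff = attn - reference_map
    loss = float((diff ** 2).sum())
    grad_attn = 2.0 * diff
    # Back-propagate through the row-wise softmax.
    grad_logits = attn * (grad_attn - (grad_attn * attn).sum(axis=1, keepdims=True))
    grad_latent = grad_logits @ keys @ w_q.T / np.sqrt(latent.shape[1])
    return loss, grad_latent

rng = np.random.default_rng(0)
positions, tokens, dim = 16, 5, 8  # toy sizes, chosen arbitrarily
w_q = rng.standard_normal((dim, dim)) / np.sqrt(dim)
w_k = rng.standard_normal((dim, dim)) / np.sqrt(dim)

reference_encoding = rng.standard_normal((tokens, dim))
edit_direction = 0.5 * rng.standard_normal((tokens, dim))  # hypothetical direction
editing_encoding = reference_encoding + edit_direction     # assumed: combine by addition

# Stand-ins for the intermediate predictions; both branches start from the same inversion.
reconstruction = rng.standard_normal((positions, dim))
edited = reconstruction.copy()

reference_map, _ = attention_map(reconstruction, reference_encoding, w_q, w_k)
initial_loss, _ = attention_loss_and_grad(
    edited, editing_encoding, reference_map, w_q, w_k)

step_size = 0.05
for _ in range(100):
    _, grad = attention_loss_and_grad(
        edited, editing_encoding, reference_map, w_q, w_k)
    edited -= step_size * grad  # update the edited prediction to reduce the loss

final_loss, _ = attention_loss_and_grad(
    edited, editing_encoding, reference_map, w_q, w_k)
print(initial_loss, final_loss)
```

Only the edited branch is updated; the reference map is held fixed, which is what lets it act as a structural anchor for the source image. Claims 13 to 15 describe repeating this comparison at additional denoising layers, which in this sketch would correspond to recomputing both maps at each subsequent step.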
Denoising; Smoothing · CPC title
Proximity, similarity or dissimilarity measures · CPC title
Labelling scene content, e.g. deriving syntactic or semantic representations · CPC title
using neural networks · CPC title
Noise reduction or smoothing in the temporal domain; Spatio-temporal filtering · CPC title