Utilizing cross-attention guidance to preserve content in diffusion-based image modifications

US12333636B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12333636-B2
Application number: US-202318178194-A
Country: US
Kind code: B2
Filing date: Mar 3, 2023
Priority date: Mar 3, 2023
Publication date: Jun 17, 2025
Grant date: Jun 17, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present disclosure relates to systems, non-transitory computer-readable media, and methods for utilizing machine learning models to generate modified digital images. In particular, in some embodiments, the disclosed systems generate image editing directions between textual identifiers of two visual features utilizing a language prediction machine learning model and a text encoder. In some embodiments, the disclosed systems generate an inversion of a digital image utilizing a regularized inversion model to guide forward diffusion of the digital image. In some embodiments, the disclosed systems utilize cross-attention guidance to preserve structural details of a source digital image when generating a modified digital image with a diffusion neural network.
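The abstract mentions image editing directions computed between textual identifiers of two visual features, and claim 7 describes combining such a direction with a reference encoding. The sketch below is one possible reading, not the patented implementation: it assumes the direction is the difference between the mean encodings of two groups of sentences, and that "combining" means vector addition. The function `toy_text_encoder` is a hypothetical stand-in for a real text encoder, and the sentences are hand-written where a language prediction model would normally supply them.

```python
import hashlib

DIM = 8  # toy embedding width (assumption, real encoders are far wider)


def toy_text_encoder(sentence):
    """Hypothetical stand-in for a text encoder: averages hash-derived word vectors."""
    words = sentence.lower().split()
    vec = [0.0] * DIM
    for word in words:
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        for i in range(DIM):
            vec[i] += digest[i] / 255.0 - 0.5
    return [v / len(words) for v in vec]


def mean_encoding(sentences):
    """Average the encodings of several sentences, dimension by dimension."""
    encodings = [toy_text_encoder(s) for s in sentences]
    return [sum(column) / len(encodings) for column in zip(*encodings)]


def editing_direction(source_sentences, target_sentences):
    """Assumed definition: mean target encoding minus mean source encoding."""
    source_mean = mean_encoding(source_sentences)
    target_mean = mean_encoding(target_sentences)
    return [t - s for s, t in zip(source_mean, target_mean)]


# Hand-written sentences; a language model would generate these in practice.
source_sentences = ["a photo of a cat", "a cat sitting on a sofa"]
target_sentences = ["a photo of a dog", "a dog sitting on a sofa"]
direction = editing_direction(source_sentences, target_sentences)

# Reference encoding from an image caption, then the editing encoding
# obtained by adding the direction (addition is an assumption).
reference_encoding = toy_text_encoder("a cat lying on the grass")
editing_encoding = [r + d for r, d in zip(reference_encoding, direction)]
```

Averaging over several sentences rather than encoding the two words alone is a design choice of this sketch: it makes the direction less sensitive to the wording of any single sentence.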

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising: generating a reference cross-attention map between a reference encoding of a source digital image and an intermediate image reconstruction prediction generated utilizing a reconstruction denoising layer of a diffusion neural network; generating an editing cross-attention map between an image editing encoding and an intermediate edited image prediction generated utilizing an image editing denoising layer of the diffusion neural network; and generating a modified digital image, utilizing the diffusion neural network, by updating the intermediate edited image prediction to reduce a cross-attention loss determined by comparing the editing cross-attention map and the reference cross-attention map.

2. The computer-implemented method of claim 1, further comprising: generating an inversion of the source digital image utilizing diffusion layers of the diffusion neural network; and generating, utilizing the reconstruction denoising layer of the diffusion neural network, the intermediate image reconstruction prediction by denoising the inversion of the source digital image utilizing the reference encoding.

3. The computer-implemented method of claim 2, wherein generating the inversion of the source digital image utilizing the diffusion layers of the diffusion neural network comprises iteratively generating a plurality of subsequent noise maps from the source digital image based on an auto-correlation regularization loss.

4. The computer-implemented method of claim 2, further comprising generating, utilizing the image editing denoising layer of the diffusion neural network, the intermediate edited image prediction by denoising the inversion of the source digital image utilizing the image editing encoding.

5. The computer-implemented method of claim 1, further comprising generating, utilizing a text encoder, the reference encoding from an image caption describing the source digital image.

6. The computer-implemented method of claim 5, wherein generating the reference encoding comprises generating the image caption for the source digital image utilizing a vision-language machine learning model.

7. The computer-implemented method of claim 5, further comprising generating the image editing encoding by combining the reference encoding and an image editing direction.

8. The computer-implemented method of claim 1, wherein updating the intermediate edited image prediction to reduce the cross-attention loss comprises reducing a difference between the editing cross-attention map and the reference cross-attention map.

9. A system comprising: one or more memory devices; and one or more processors coupled to the one or more memory devices that cause the system to perform operations comprising: generating, utilizing a reconstruction denoising layer of a diffusion neural network, an intermediate image reconstruction prediction from an inversion of a source digital image; creating a reference cross-attention map between a reference encoding of the source digital image and the intermediate image reconstruction prediction; generating, utilizing an image editing denoising layer of the diffusion neural network, an intermediate edited image prediction from the inversion and an image editing encoding; creating an editing cross-attention map between the image editing encoding and the intermediate edited image prediction; and generating a modified digital image, utilizing the diffusion neural network, by updating the intermediate edited image prediction to reduce a cross-attention loss determined by comparing the editing cross-attention map and the reference cross-attention map.

10. The system of claim 9, wherein generating the intermediate image reconstruction prediction comprises denoising the inversion conditioned on the reference encoding utilizing a conditioning mechanism with the reconstruction denoising layer of the diffusion neural network.

11. The system of claim 9, wherein the operations further comprise: generating, utilizing a language prediction machine learning model, an embedded image editing direction between a source visual feature portrayed within the source digital image and a target visual feature; and generating the image editing encoding based on the embedded image editing direction.

12. The system of claim 11, wherein the operations further comprise: generating, utilizing a text encoder, the reference encoding from an image caption describing the source digital image; and generating the image editing encoding by combining the embedded image editing direction with the reference encoding.

13. The system of claim 9, wherein generating the modified digital image comprises: determining the cross-attention loss based on a difference between the editing cross-attention map and the reference cross-attention map; generating a modified intermediate edited image prediction by modifying the intermediate edited image prediction to reduce the cross-attention loss; and generating the modified digital image from the modified intermediate edited image prediction utilizing additional denoising layers of the diffusion neural network.

14. The system of claim 13, wherein generating the modified digital image comprises: generating, utilizing an additional image editing denoising layer of the additional denoising layers, an additional intermediate edited image prediction from the modified intermediate edited image prediction; and generating, utilizing an additional reconstruction denoising layer of the additional denoising layers, an additional intermediate reconstruction prediction from the intermediate image reconstruction prediction.

15. The system of claim 14, wherein generating the modified digital image comprises: creating an additional reference cross-attention map between the reference encoding and the additional intermediate image reconstruction prediction; creating an additional editing cross-attention map between the image editing encoding and the additional intermediate edited image prediction; and generating the modified digital image by modifying the additional intermediate edited image prediction by comparing the additional editing cross-attention map and the additional reference cross-attention map.

16. A non-transitory computer readable medium storing instructions thereon that, when executed by at least one processor, cause the at least one processor to perform operations comprising: generating a reference cross-attention map between a reference encoding of a source digital image and an intermediate image reconstruction prediction generated utilizing a reconstruction denoising layer of a diffusion neural network; generating an editing cross-attention map between an image editing encoding and an intermediate edited image prediction generated utilizing an image editing denoising layer of the diffusion neural network; and generating a modified digital image, utilizing the diffusion neural network, by updating the intermediate edited image prediction to reduce a cross-attention loss determined by comparing the editing cross-attention map and the reference cross-attention map.

17. The non-transitory computer readable medium of claim 16, wherein the operations further comprise: generating, utilizing the reconstruction denoising layer of the diffusion neural network, the intermediate image reconstruction prediction from an inversion of the source digital image; and generating, utilizing the image editing denoising layer of the diffusion neural network, the intermediate edited image predict
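The independent claims describe comparing an editing cross-attention map against a reference cross-attention map and updating the intermediate edited prediction to reduce the resulting loss. The following is a minimal sketch of that single update step under strong simplifying assumptions, and is not the patented implementation: there is no diffusion network or denoising here, the attention map is a plain softmax over scaled dot products with identity query and key projections, the loss is a squared difference, and the update is one gradient-descent step. All function names, the step size, and the toy values are illustrative. Plain Python lists are used so the example runs without third-party libraries.

```python
import math


def attention_map(latent, encoding):
    """Softmax over text tokens of scaled dot products, one row per latent position."""
    scale = math.sqrt(len(encoding[0]))
    rows = []
    for position in latent:
        logits = [sum(p * t for p, t in zip(position, token)) / scale
                  for token in encoding]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(z - peak) for z in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


def cross_attention_loss(edit_map, ref_map):
    """Assumed loss: sum of squared differences between the two maps."""
    return sum((e - r) ** 2
               for e_row, r_row in zip(edit_map, ref_map)
               for e, r in zip(e_row, r_row))


def loss_gradient(latent, encoding, ref_map):
    """Analytic gradient of the loss with respect to the edited latent."""
    scale = math.sqrt(len(encoding[0]))
    edit_map = attention_map(latent, encoding)
    grad = []
    for m_row, r_row in zip(edit_map, ref_map):
        d_map = [2.0 * (m - r) for m, r in zip(m_row, r_row)]
        inner = sum(g * m for g, m in zip(d_map, m_row))
        # Backpropagate through the softmax to the logits.
        d_logits = [m * (g - inner) for m, g in zip(m_row, d_map)]
        grad.append([sum(dz * token[k] for dz, token in zip(d_logits, encoding)) / scale
                     for k in range(len(encoding[0]))])
    return grad


def guidance_step(latent, encoding, ref_map, step_size=0.5):
    """Move the edited latent one gradient step toward the reference attention map."""
    grad = loss_gradient(latent, encoding, ref_map)
    return [[x - step_size * g for x, g in zip(x_row, g_row)]
            for x_row, g_row in zip(latent, grad)]


# Toy data: 3 latent positions, 2 text tokens, 2 feature dimensions.
reference_encoding = [[1.0, 0.0], [0.0, 1.0]]
editing_encoding = [[0.8, 0.3], [0.1, 1.1]]
reconstruction_latent = [[0.9, 0.1], [0.2, 0.7], [0.5, 0.5]]
edited_latent = [[0.2, 0.9], [0.8, 0.1], [0.4, 0.6]]

ref_map = attention_map(reconstruction_latent, reference_encoding)
loss_before = cross_attention_loss(attention_map(edited_latent, editing_encoding), ref_map)
updated_latent = guidance_step(edited_latent, editing_encoding, ref_map)
loss_after = cross_attention_loss(attention_map(updated_latent, editing_encoding), ref_map)
print(loss_before, loss_after)
```

In a full pipeline of the kind claims 13 to 15 describe, a step like `guidance_step` would be repeated at successive denoising layers, with a fresh reference map computed from the reconstruction branch each time.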

Assignees

Adobe Inc

Inventors

Classifications

  • Denoising; Smoothing · CPC title

  • Proximity, similarity or dissimilarity measures · CPC title

  • Labelling scene content, e.g. deriving syntactic or semantic representations · CPC title

  • using neural networks · CPC title

  • Noise reduction or smoothing in the temporal domain; Spatio-temporal filtering · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12333636B2 cover?
The present disclosure relates to systems, non-transitory computer-readable media, and methods for utilizing machine learning models to generate modified digital images. In particular, in some embodiments, the disclosed systems generate image editing directions between textual identifiers of two visual features utilizing a language prediction machine learning model and a text encoder. In some e…
Who is the assignee on this patent?
Adobe Inc
What technology area does this patent fall under?
Primary CPC classification G06T11/60. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jun 17, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).