Digital video editing based on a target digital image

US2025265752A1 · US · A1

Patent metadata
Publication number: US-2025265752-A1
Application number: US-202418583067-A
Country: US
Kind code: A1
Filing date: Feb 21, 2024
Priority date: Feb 21, 2024
Publication date: Aug 21, 2025
Grant date: (none listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Digital video editing techniques are described that are based on a target digital image. In one or more implementations, inputs are received. The inputs include a target text prompt, a target digital image depicting a target object, and a source digital video having a plurality of frames depicting a source object. Regions-of-interest are identified in the plurality of frames of the source digital video, respectively, based on the target text prompt and the target digital image using a machine-learning model, e.g., a diffusion model. A plurality of frames of a target digital video are generated as having the target object using a generative machine-learning model. The generating is based on the regions-of-interest, the target digital image, the source digital video, and a source text prompt describing the source digital video.
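
The two stages in the abstract (region-of-interest identification, then frame generation) can be pictured as a simple data flow. The following is only a minimal sketch under stated assumptions: `EditInputs`, `edit_video`, `identify_rois`, and `generate_frames` are hypothetical names that do not appear in the patent, frames are stood in for by nested lists of numbers, and both models are passed in as plain callables rather than implemented.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical stand-ins: a frame or image is a 2-D grid of numbers,
# and a mask is a 2-D grid of 0/1 values with the same shape.
Frame = Sequence[Sequence[float]]
Mask = Sequence[Sequence[int]]


@dataclass(frozen=True)
class EditInputs:
    """The four inputs named in the abstract (field names are assumptions)."""
    target_prompt: str
    target_image: Frame
    source_video: Sequence[Frame]
    source_prompt: str


def edit_video(
    inputs: EditInputs,
    identify_rois: Callable[[EditInputs], Sequence[Mask]],
    generate_frames: Callable[[EditInputs, Sequence[Mask]], Sequence[Frame]],
) -> Sequence[Frame]:
    """Run stage 1 (one region-of-interest mask per source frame),
    then stage 2 (generation conditioned on those masks)."""
    masks = identify_rois(inputs)
    if len(masks) != len(inputs.source_video):
        raise ValueError("expected one region-of-interest mask per source frame")
    return generate_frames(inputs, masks)
```

Passing the models in as callables keeps the sketch independent of any particular diffusion library; the only structural point it illustrates is that masks are produced per source frame before generation begins.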

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: receiving, by a processing device, a target text prompt, a target digital image depicting a target object, and a source digital video having a plurality of frames depicting a source object; identifying, by the processing device, regions-of-interest in the plurality of frames of the source digital video, respectively, based on the target text prompt and the target digital image using a machine-learning model; generating, by the processing device, a plurality of frames of a target digital video having the target object using a generative machine-learning model, the generating based on the regions-of-interest, the target digital image, the source digital video, and a source text prompt describing the source digital video; and outputting, by the processing device, the target digital video.

2. The method as described in claim 1, wherein the plurality of frames of the target digital video depicts the target object as following motion exhibited by the source object in the source digital video.

3. The method as described in claim 1, wherein the identifying the regions-of-interest includes forming a plurality of masks defining, respectively, the regions-of-interest.

4. The method as described in claim 3, wherein the forming the plurality of masks is based, at least in part, on the target text prompt and the target digital image.

5. The method as described in claim 1, wherein the machine-learning model, utilized to perform the identifying of the regions-of-interest, is configured as one or more diffusion models.

6. The method as described in claim 5, wherein the one or more diffusion models include: a source denoising branch configured to process the source text prompt; and a target denoising branch configured to process the target text prompt and the target object of the target digital image.

7. The method as described in claim 6, wherein the identifying includes comparing noise differences as a reconstruction loss across respective timesteps between the source denoising branch and the target denoising branch.

8. The method as described in claim 7, wherein the identifying further comprises averaging and binarizing the noise differences to form a plurality of masks defining, respectively, the regions-of-interest.

9. The method as described in claim 1, wherein the generative machine-learning model, utilized to generate the plurality of frames, is configured as one or more diffusion models.

10. The method as described in claim 1, wherein the generating of the plurality of frames of the target digital video includes calculating a latent correction during inference involving inter-frame temporal consistency.

11. The method as described in claim 10, wherein the calculating includes computing inter-frame latent fields by mapping spatial locations of features between the plurality of frames of the target digital video.

12. The method as described in claim 11, further comprising blending the computed inter-frame latent fields at a plurality of timesteps corresponding to the plurality of frames of the target digital video.

13. The method as described in claim 1, wherein the generating of the plurality of frames of the target digital video includes preserving a background of the source digital video by correcting latent noise corresponding to the background based on the regions-of-interest.

14. A computing device comprising: a processing device; and a computer-readable storage medium storing instructions that, in response to execution by the processing device, causes the processing device to perform operations including: receiving a target text prompt, a target digital image depicting a target object, a source digital video having a plurality of frames depicting a source object, and a source text prompt describing the source digital video; generating a plurality of masks defining regions-of-interest in the plurality of frames of the source digital video using a machine-learning model, the generating based on the source digital video, the target object, the target text prompt, and the source text prompt; and generating a plurality of frames of a target digital video having the target object as following motion of the source object using a generative machine-learning model based on the plurality of masks.

15. The computing device as described in claim 14, wherein the machine-learning model utilized to perform the generating of the plurality of masks is configured as one or more diffusion models.

16. The computing device as described in claim 15, wherein the generating of the plurality of masks includes comparing noise differences across respective timesteps between: a source denoising branch of the one or more diffusion models configured to process the source text prompt and frames from the source digital video; and a target denoising branch of the one or more diffusion models configured to process the target text prompt and the target object of the target digital image.

17. The computing device as described in claim 14, wherein the generating the plurality of frames of the target digital video is performed using a generative machine-learning model based on the regions-of-interest, the target digital image, the source digital video, and the source text prompt describing the source digital video.

18. One or more computer-readable storage media storing instructions that, in response to execution by a processing device, causes the processing device to perform operations comprising: receiving a target text prompt, a target digital image depicting a target object, a source digital video having a plurality of frames depicting a source object, and a source text prompt describing the source digital video; generating a plurality of masks defining regions-of-interest in the plurality of frames of the source digital video; and generating a plurality of frames of a target digital video having the target object using a generative machine-learning model, the generating based on the regions-of-interest, the target digital image, the source digital video, and a source text prompt describing the source digital video.

19. The one or more computer-readable storage media as described in claim 18, wherein the generating a plurality of masks is performed using one or more diffusion models by comparing noise differences across respective timesteps between: a source denoising branch of the one or more diffusion models configured to process the source text prompt and frames from the source digital video; and a target denoising branch of the one or more diffusion models configured to process the target text prompt and the target object of the target digital image.

20. The one or more computer-readable storage media as described in claim 18, wherein the generative machine-learning model is configured as a diffusion model.
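
Claims 7, 8, and 13 describe forming masks by averaging and binarizing noise differences between the two denoising branches, and preserving the background based on those masks. A rough illustration of that idea for a single frame follows. It is a sketch under assumptions, not the claimed implementation: the function names `roi_mask` and `preserve_background` are hypothetical, the noise predictions are given as plain nested lists rather than produced by a diffusion model, and the absolute difference, min-max normalization, 0.5 threshold, and hard mask-based replacement are choices made here for illustration that the claims do not specify.

```python
from statistics import fmean

Grid = list[list[float]]  # one 2-D latent-sized map for a single frame


def roi_mask(source_noise: list[Grid], target_noise: list[Grid],
             threshold: float = 0.5) -> list[list[int]]:
    """Average per-timestep noise differences, then binarize them.

    source_noise[t] and target_noise[t] stand for the noise predicted by the
    source and target branches at timestep t. The absolute difference,
    min-max normalization, and default threshold are assumptions.
    """
    steps = len(source_noise)
    rows, cols = len(source_noise[0]), len(source_noise[0][0])
    # Mean absolute difference across timesteps at every spatial location.
    avg = [[fmean(abs(source_noise[t][r][c] - target_noise[t][r][c])
                  for t in range(steps))
            for c in range(cols)]
           for r in range(rows)]
    lo = min(min(row) for row in avg)
    hi = max(max(row) for row in avg)
    span = (hi - lo) or 1.0  # avoid dividing by zero on a flat map
    # 1 marks a location where the two branches disagree strongly.
    return [[1 if (v - lo) / span > threshold else 0 for v in row]
            for row in avg]


def preserve_background(target_latent: Grid, source_latent: Grid,
                        mask: list[list[int]]) -> Grid:
    """Keep the generated latent inside the mask, the source latent outside."""
    return [[t if m else s for t, s, m in zip(t_row, s_row, m_row)]
            for t_row, s_row, m_row in zip(target_latent, source_latent, mask)]
```

In this toy form, locations where the branches' noise predictions diverge are treated as the region to edit, and everything else is copied from the source latent so the background is left unchanged.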

Assignees

Adobe Inc

Inventors

Classifications

  • Training; Learning · CPC title

  • Artificial neural networks [ANN] · CPC title

  • Denoising; Smoothing · CPC title

  • using machine learning, e.g. neural networks · CPC title

  • G06T11/60 · Primary

    Creating or editing images; Combining images with text · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2025265752A1 cover?
Digital video editing techniques are described that are based on a target digital image. In one or more implementations, inputs are received. The inputs include a target text prompt, a target digital image depicting a target object, and a source digital video having a plurality of frames depicting a source object. Regions-of-interest are identified in the plurality of frames of the source digit…
Who is the assignee on this patent?
Adobe Inc
What technology area does this patent fall under?
Primary CPC classification G06T11/60. Mapped technology areas include Physics.
When was this patent published?
Publication date: Aug 21, 2025 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).