Generating videos using sequences of generative neural networks
US-2024320965-A1 · Sep 26, 2024 · US
US2025265752A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2025265752-A1 |
| Application number | US-202418583067-A |
| Country | US |
| Kind code | A1 |
| Filing date | Feb 21, 2024 |
| Priority date | Feb 21, 2024 |
| Publication date | Aug 21, 2025 |
| Grant date | — |
Official abstract text for this publication.
Digital video editing techniques are described that are based on a target digital image. In one or more implementations, inputs are received. The inputs include a target text prompt, a target digital image depicting a target object, and a source digital video having a plurality of frames depicting a source object. Regions-of-interest are identified in the plurality of frames of the source digital video, respectively, based on the target text prompt and the target digital image using a machine-learning model, e.g., a diffusion model. A plurality of frames of a target digital video are generated as having the target object using a generative machine-learning model. The generating is based on the regions-of-interest, the target digital image, the source digital video, and a source text prompt describing the source digital video.
Opening claim text (preview).
What is claimed is:
1. A method comprising: receiving, by a processing device, a target text prompt, a target digital image depicting a target object, and a source digital video having a plurality of frames depicting a source object; identifying, by the processing device, regions-of-interest in the plurality of frames of the source digital video, respectively, based on the target text prompt and the target digital image using a machine-learning model; generating, by the processing device, a plurality of frames of a target digital video having the target object using a generative machine-learning model, the generating based on the regions-of-interest, the target digital image, the source digital video, and a source text prompt describing the source digital video; and outputting, by the processing device, the target digital video.
2. The method as described in claim 1, wherein the plurality of frames of the target digital video depicts the target object as following motion exhibited by the source object in the source digital video.
3. The method as described in claim 1, wherein the identifying the regions-of-interest includes forming a plurality of masks defining, respectively, the regions-of-interest.
4. The method as described in claim 3, wherein the forming the plurality of masks is based, at least in part, on the target text prompt and the target digital image.
5. The method as described in claim 1, wherein the machine-learning model, utilized to perform the identifying of the regions-of-interest, is configured as one or more diffusion models.
6. The method as described in claim 5, wherein the one or more diffusion models include: a source denoising branch configured to process the source text prompt; and a target denoising branch configured to process the target text prompt and the target object of the target digital image.
7. The method as described in claim 6, wherein the identifying includes comparing noise differences as a reconstruction loss across respective timesteps between the source denoising branch and the target denoising branch.
8. The method as described in claim 7, wherein the identifying further comprises averaging and binarizing the noise differences to form a plurality of masks defining, respectively, the regions-of-interest.
9. The method as described in claim 1, wherein the generative machine-learning model, utilized to generate the plurality of frames, is configured as one or more diffusion models.
10. The method as described in claim 1, wherein the generating of the plurality of frames of the target digital video includes calculating a latent correction during inference involving inter-frame temporal consistency.
11. The method as described in claim 10, wherein the calculating includes computing inter-frame latent fields by mapping spatial locations of features between the plurality of frames of the target digital video.
12. The method as described in claim 11, further comprising blending the computed inter-frame latent fields at a plurality of timesteps corresponding to the plurality of frames of the target digital video.
13. The method as described in claim 1, wherein the generating of the plurality of frames of the target digital video includes preserving a background of the source digital video by correcting latent noise corresponding to the background based on the regions-of-interest.
14. A computing device comprising: a processing device; and a computer-readable storage medium storing instructions that, in response to execution by the processing device, causes the processing device to perform operations including: receiving a target text prompt, a target digital image depicting a target object, a source digital video having a plurality of frames depicting a source object, and a source text prompt describing the source digital video; generating a plurality of masks defining regions-of-interest in the plurality of frames of the source digital video using a machine-learning model, the generating based on the source digital video, the target object, the target text prompt, and the source text prompt; and generating a plurality of frames of a target digital video having the target object as following motion of the source object using a generative machine-learning model based on the plurality of masks.
15. The computing device as described in claim 14, wherein the machine-learning model utilized to perform the generating of the plurality of masks is configured as one or more diffusion models.
16. The computing device as described in claim 15, wherein the generating of the plurality of masks includes comparing noise differences across respective timesteps between: a source denoising branch of the one or more diffusion models configured to process the source text prompt and frames from the source digital video; and a target denoising branch of the one or more diffusion models configured to process the target text prompt and the target object of the target digital image.
17. The computing device as described in claim 14, wherein the generating the plurality of frames of the target digital video is performed using a generative machine-learning model based on the regions-of-interest, the target digital image, the source digital video, and the source text prompt describing the source digital video.
18. One or more computer-readable storage media storing instructions that, in response to execution by a processing device, causes the processing device to perform operations comprising: receiving a target text prompt, a target digital image depicting a target object, a source digital video having a plurality of frames depicting a source object, and a source text prompt describing the source digital video; generating a plurality of masks defining regions-of-interest in the plurality of frames of the source digital video; and generating a plurality of frames of a target digital video having the target object using a generative machine-learning model, the generating based on the regions-of-interest, the target digital image, the source digital video, and a source text prompt describing the source digital video.
19. The one or more computer-readable storage media as described in claim 18, wherein the generating a plurality of masks is performed using one or more diffusion models by comparing noise differences across respective timesteps between: a source denoising branch of the one or more diffusion models configured to process the source text prompt and frames from the source digital video; and a target denoising branch of the one or more diffusion models configured to process the target text prompt and the target object of the target digital image.
20. The one or more computer-readable storage media as described in claim 18, wherein the generative machine-learning model is configured as a diffusion model.
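As an illustration only, the mask formation recited in claims 7 and 8 (comparing noise predictions from a source branch and a target branch across timesteps, then averaging and binarizing) and the background preservation recited in claim 13 might be sketched as follows. This is a minimal NumPy sketch, not the applicant's implementation: the function names `roi_masks_from_noise` and `preserve_background`, the array layouts, the use of an absolute difference, the per-frame min-max normalization, and the 0.5 threshold are all assumptions introduced here. The noise predictions are taken as given arrays; no diffusion model is run.

```python
import numpy as np


def roi_masks_from_noise(source_noise, target_noise, threshold=0.5):
    """Form one binary region-of-interest mask per frame.

    Both inputs are assumed to have shape (T, F, C, H, W):
    T denoising timesteps, F frames, C latent channels, H x W latent grid.
    """
    # Per-element disagreement between the two (hypothetical) denoising branches.
    diff = np.abs(target_noise - source_noise)
    # Average over timesteps and channels, leaving one map per frame: (F, H, W).
    avg = diff.mean(axis=(0, 2))
    # Assumed normalization: rescale each frame's map to [0, 1].
    lo = avg.min(axis=(1, 2), keepdims=True)
    hi = avg.max(axis=(1, 2), keepdims=True)
    norm = (avg - lo) / np.maximum(hi - lo, 1e-8)
    # Binarize: True marks locations where the branches disagree strongly.
    return norm > threshold


def preserve_background(target_latents, source_latents, masks):
    """Keep target latents inside the masks and source latents elsewhere.

    Latents are assumed to have shape (F, C, H, W); masks have shape (F, H, W).
    """
    # Add a channel axis so the mask broadcasts over latent channels.
    m = masks[:, None, :, :]
    return np.where(m, target_latents, source_latents)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, F, C, H, W = 4, 3, 2, 8, 8
    src = rng.normal(size=(T, F, C, H, W))
    tgt = src.copy()
    tgt[..., 2:5, 3:6] += 2.0  # synthetic edit confined to a 3x3 patch
    masks = roi_masks_from_noise(src, tgt)
    print(masks.shape, int(masks.sum()))
```

In this toy setup the two branches differ only inside the 3x3 patch, so the mask isolates that patch in every frame; with real noise predictions the threshold and normalization would need tuning.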
Training; Learning · CPC title
Artificial neural networks [ANN] · CPC title
Denoising; Smoothing · CPC title
using machine learning, e.g. neural networks · CPC title
Creating or editing images; Combining images with text · CPC title