Method, electronic device, and computer program product for generating video

US2026004495A1 · US · A1

Patent metadata
Field               Value
Publication number  US-2026004495-A1
Application number  US-202418782969-A
Country             US
Kind code           A1
Filing date         Jul 24, 2024
Priority date       Jun 28, 2024
Publication date    Jan 1, 2026
Grant date

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method includes obtaining a reference image and a reference speech, the reference image specifying a head of a target object in the video, and the reference speech specifying a voice of the target object; and generating, based on the reference image and the reference speech, a fusion vector by combining a feature of the head and a feature of the voice. The method further includes generating, based on the fusion vector, a plurality of video frames in a video that represents the target object speaking in a timbre of the reference speech by denoising a plurality of initial frames including noise; and generating the video based on the plurality of video frames. In embodiments of the present disclosure, a video in which a semantic feature and a speaking style of the target object are merged can be generated, and the resolution and quality of the generated video are enhanced.
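
Claim 2 describes one way to build the fusion vector: concatenate an image-derived vector with a speech-derived vector, then apply a linear projection. The sketch below illustrates only that arithmetic in plain Python. The function names, the tiny feature values, and the projection weights are all assumptions made for illustration; the publication does not specify encoders, dimensions, or weights.

```python
from typing import List

Vector = List[float]


def concatenate(head_vec: Vector, voice_vec: Vector) -> Vector:
    """Join the image-derived and speech-derived vectors end to end."""
    return list(head_vec) + list(voice_vec)


def linear_projection(vec: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """Compute W @ vec + b using plain lists (one weight row per output entry)."""
    if any(len(row) != len(vec) for row in weights):
        raise ValueError("each weight row must match the input length")
    if len(bias) != len(weights):
        raise ValueError("bias must have one entry per weight row")
    return [
        sum(w * x for w, x in zip(row, vec)) + b
        for row, b in zip(weights, bias)
    ]


def fuse(head_vec: Vector, voice_vec: Vector,
         weights: List[Vector], bias: Vector) -> Vector:
    """Concatenate the two feature vectors, then project the result."""
    return linear_projection(concatenate(head_vec, voice_vec), weights, bias)


# Hypothetical toy features; a real system would obtain these from encoders.
head_feature = [1.0, 2.0]
voice_feature = [3.0, 4.0]

# Hypothetical 2x4 projection that adds matching head and voice entries.
projection = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
]
offset = [0.0, 0.5]

fusion_vector = fuse(head_feature, voice_feature, projection, offset)
print(fusion_vector)  # [4.0, 6.5]
```

In a trained model the projection weights would be learned rather than hand-written; the fixed matrix here only makes the result easy to check by hand.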

First claim

Opening claim text (preview).

What is claimed is:

1. A method for generating a video, comprising: obtaining a reference image and a reference speech, the reference image specifying a head of a target object in the video, and the reference speech specifying a voice of the target object; generating, based on the reference image and the reference speech, a fusion vector by combining a feature of the head and a feature of the voice; generating, based on the fusion vector, a plurality of video frames in the video that represents the target object speaking in a timbre of the reference speech by denoising a plurality of initial frames comprising noise; and generating the video based on the plurality of video frames.

2. The method according to claim 1, wherein combining the feature of the head and the feature of the voice comprises: generating a first vector based on the reference image; generating a second vector based on the reference speech; generating a third vector by concatenating the first vector and the second vector; and generating the fusion vector based on a linear projection of the third vector.

3. The method according to claim 1, wherein generating, based on the reference image and the reference speech, the fusion vector by combining the feature of the head and the feature of the voice comprises: generating corresponding text based on the reference speech; and generating the fusion vector comprising a semantic feature of the text and a style feature of the reference image based on the reference image and the corresponding text.

4. The method according to claim 3, wherein the style feature comprises at least one of face shape, hair style, skin color, and facial expressions.

5. The method according to claim 1, wherein generating, based on the fusion vector, the plurality of video frames in the video that represents the target object speaking in the timbre of the reference speech by denoising the plurality of initial frames comprising noise comprises: sampling a first image based on a predetermined data distribution and Gaussian noise; predicting first noise in the first image under a constraint of the fusion vector; and acquiring a second image by removing the first noise from the first image.

6. The method according to claim 5, further comprising: predicting, based on the second image and the Gaussian noise, second noise of the second image under the constraint of the fusion vector; acquiring a de-noised second image by removing the second noise from the second image; and determining the plurality of video frames by maximizing a log likelihood associated with a combination of the first image, the second image, the first noise, and the second noise.

7. The method according to claim 1, wherein a resolution of the video is higher than a resolution of the plurality of video frames, and wherein generating the video with the resolution higher than the resolution of the plurality of video frames based on the plurality of video frames comprises: generating a video frame set with increased resolution based on each video frame of the plurality of video frames; and generating, based on the video frame set with increased resolution, the video that is consistent with at least one of expression, action, and texture comprised in the reference image and the reference speech, wherein the resolution and frame rate of the video are higher than those of the video frame set.

8. The method according to claim 1, wherein the method is performed in a multi-modal machine learning model, and training the multi-modal machine learning model comprises: determining a first loss function associated with a difference between features of training frames and sample frames; determining a second loss function associated with a difference between distribution of the training frames and distribution of the sample frames; determining a third loss function associated with a difference between a lip motion of the training frames and a sample speech; and training the multi-modal machine learning model based on the first loss function, the second loss function, and the third loss function.

9. The method according to claim 8, wherein training the multi-modal machine learning model based on the first loss function, the second loss function, and the third loss function comprises: determining a first weight of the first loss function, a second weight of the second loss function, and a third weight of the third loss function respectively; determining a final loss function based on the first weight, the first weight, and the third weight; and training the multi-modal machine learning model based on the final loss function.

10. The method according to claim 9, further comprising: balancing at least one of lip motion similarity, face similarity, and style similarity in the generated video by adjusting the first weight, the first weight, and the third weight.

11. An electronic device, comprising: at least one processor; and a memory coupled to the at least one processor, wherein the memory has instructions stored therein, and the instructions, when executed by the at least one processor, cause the electronic device to perform actions comprising: obtaining a reference image and a reference speech, the reference image specifying the head of a target object in a video, and the reference speech specifying voice of the target object; generating, based on the reference image and the reference speech, a fusion vector by combining a feature of the head and a feature of the voice; generating, based on the fusion vector, a plurality of video frames in the video that represents the target object speaking in a timbre of the reference speech by denoising a plurality of initial frames comprising noise; and generating the video based on the plurality of video frames.

12. The electronic device according to claim 11, wherein combining the feature of the head and the feature of the voice comprises: generating a first vector based on the reference image; generating a second vector based on the reference speech; generating a third vector by concatenating the first vector and the second vector; and generating the fusion vector based on a linear projection of the third vector.

13. The electronic device according to claim 11, wherein generating, based on the reference image and the reference speech, the fusion vector by combining the feature of the head and the feature of the voice comprises: generating corresponding text based on the reference speech; and generating the fusion vector comprising a semantic feature of the text and a style feature of the reference image based on the reference image and the corresponding text.

14. The electronic device according to claim 13, wherein the style feature comprises at least one of face shape, hair style, skin color, and facial expressions.

15. The electronic device according to claim 11, wherein generating, based on the fusion vector, the plurality of video frames in the video that represents the target object speaking in the timbre of the reference speech by denoising the plurality of initial frames comprising noise comprises: sampling a first image based on a predetermined data distribution and Gaussian noise; predicting first noise in the first image under a constraint of the fusion vector; and acquiring a second image by removing the first noise from the first image.

16. The electronic device according to claim 15, wherein the actions further comprise: predicting, based on the second image and the Gaussian noise, second nois
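
Claims 5 and 15 outline a single denoising step: sample a noisy first image, predict its noise under the constraint of the fusion vector, and remove that noise to obtain a second image. The following is a minimal sketch of that loop shape only. The noise predictor is a hypothetical stand-in for a trained network, and the `strength` parameter, the flattened four-value "image", and the example fusion vector are assumptions that do not come from the publication.

```python
import random
from typing import Callable, List

Image = List[float]  # a flattened toy "frame"
Predictor = Callable[[Image, List[float]], Image]


def sample_noisy_image(size: int, rng: random.Random) -> Image:
    """Draw an initial frame from a standard Gaussian (assumed distribution)."""
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


def toy_noise_predictor(image: Image, fusion_vector: List[float]) -> Image:
    """Hypothetical stand-in for a trained model.

    Treats the gap between each pixel and the mean of the fusion vector
    as the noise to remove, so the fusion vector steers the result.
    """
    target = sum(fusion_vector) / len(fusion_vector)
    return [pixel - target for pixel in image]


def denoise_step(image: Image, fusion_vector: List[float],
                 predictor: Predictor, strength: float = 0.5) -> Image:
    """Predict noise under the fusion-vector constraint and subtract part of it."""
    noise = predictor(image, fusion_vector)
    return [pixel - strength * n for pixel, n in zip(image, noise)]


rng = random.Random(0)              # fixed seed so runs are repeatable
fusion_vector = [4.0, 6.5]          # hypothetical conditioning vector
first_image = sample_noisy_image(4, rng)
second_image = denoise_step(first_image, fusion_vector, toy_noise_predictor)
third_image = denoise_step(second_image, fusion_vector, toy_noise_predictor)

print(first_image)
print(second_image)
print(third_image)
```

With this toy predictor each step halves every pixel's distance to the conditioning mean, which makes the effect of repeated denoising easy to see; a real diffusion model would use a learned predictor and a noise schedule instead.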

Assignees

Dell Products Lp

Inventors

Classifications (CPC titles)

  • Feature extraction for speech recognition; Selection of recognition unit

  • Video; Image sequence

  • for processing of video signals

  • Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning

  • Two-dimensional [2D] animation, e.g. using sprites

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2026004495A1 cover?
A method includes obtaining a reference image and a reference speech, the reference image specifying a head of a target object in the video, and the reference speech specifying a voice of the target object; and generating, based on the reference image and the reference speech, a fusion vector by combining a feature of the head and a feature of the voice. The method further includes generating, …
Who is the assignee on this patent?
Dell Products Lp
What technology area does this patent fall under?
Primary CPC classification G06T13/40. Mapped technology areas include Physics.
When was this patent published?
Publication date Jan 1, 2026 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).