Systems and methods for generating dynamic virtual representations of an object or event
US-2024420395-A1 · Dec 19, 2024 · US
US12494010B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12494010-B2 |
| Application number | US-202318339341-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 22, 2023 |
| Priority date | Aug 24, 2022 |
| Publication date | Dec 9, 2025 |
| Grant date | Dec 9, 2025 |
Official abstract text for this publication.
A method, an electronic device, and a computer program product for video processing are provided in embodiments of the present disclosure. The method generates an avatar image using a reference image and image data for a first frame in a video stream, and generates an avatar video using the avatar image and image data, audio data, and text data in the video stream. Through this solution, a user-defined avatar video adapted to a user of a real video and actions thereof can be generated more accurately and with high quality.
Opening claim text (preview).
What is claimed is:

1. A method for video processing, comprising: acquiring a video stream, the video stream comprising image data, audio data, and text data corresponding to video frames, and the video frames comprising a first frame; generating a first avatar image using a reference image and image data for the first frame; obtaining a video integration feature based on the first avatar image, the image data, the audio data, and the text data; and generating an avatar video corresponding to the video stream based on the first avatar image and the video integration feature; wherein obtaining the video integration feature based on the first avatar image, the image data, the audio data, and the text data comprises: converting respective features corresponding to the first avatar image, the image data, the audio data, and the text data to respective vectors in a feature space; and obtaining the video integration feature based on the respective vectors in the feature space converted from the respective features corresponding to the first avatar image, the image data, the audio data, and the text data.

2. The method according to claim 1, wherein obtaining the video integration feature based on the first avatar image, the image data, the audio data, and the text data comprises: obtaining a first avatar image feature, an image difference feature, an audio feature, and a text feature, wherein the first avatar image feature corresponds to the first avatar image, the image difference feature corresponds to image data of adjacent frames in the video frames, the audio feature corresponds to the audio data, and the text feature corresponds to the text data; and performing integration processing on the first avatar image feature, the image difference feature, the audio feature, and the text feature to obtain the video integration feature.

3. The method according to claim 2, wherein performing integration processing on the first avatar image feature, the image difference feature, the audio feature, and the text feature to obtain the video integration feature comprises: converting the first avatar image feature, the image difference feature, the audio feature, and the text feature into a first vector, a second vector, a third vector, and a fourth vector in the feature space, respectively; generating a feature integration vector based on the first vector, the second vector, the third vector, and the fourth vector; generating a residual vector corresponding to the feature integration vector by using an attention mechanism; and obtaining the video integration feature based on the feature integration vector and the residual vector.

4. The method according to claim 1, wherein the method is implemented by an avatar video generation model.

5. The method according to claim 4, further comprising: obtaining a first loss function based on the avatar video, the audio data, and the text data; and training the avatar video generation model by using the first loss function.

6. The method according to claim 5, wherein obtaining a first loss function based on the avatar video, the audio data, and the text data comprises: obtaining a video-audio loss function based on the avatar video and the audio data; obtaining a video-text loss function based on the avatar video and the text data; obtaining an audio-text loss function based on the audio data and the text data; and obtaining the first loss function based on the video-audio loss function, the video-text loss function, and the audio-text loss function.

7. The method according to claim 6, wherein the video frames further comprise a second frame, and the method further comprises: obtaining a second loss function based on the first avatar image and a second avatar image for the second frame; and training the avatar video generation model by using the second loss function.

8. The method according to claim 7, further comprising: obtaining a third loss function based on the image data and the avatar video; and training the avatar video generation model by using the third loss function.

9. The method according to claim 8, wherein the method further comprises: obtaining a fourth loss function based on the first loss function, the second loss function, and the third loss function; and training the avatar video generation model by using the fourth loss function.

10. An electronic device, comprising: at least one processor; and at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the electronic device to perform operations comprising: acquiring a video stream, the video stream comprising image data, audio data, and text data corresponding to video frames, and the video frames comprising a first frame; generating a first avatar image using a reference image and image data for the first frame; obtaining a video integration feature based on the first avatar image, the image data, the audio data, and the text data; and generating an avatar video corresponding to the video stream based on the first avatar image and the video integration feature; wherein obtaining the video integration feature based on the first avatar image, the image data, the audio data, and the text data comprises: converting respective features corresponding to the first avatar image, the image data, the audio data, and the text data to respective vectors in a feature space; and obtaining the video integration feature based on the respective vectors in the feature space converted from the respective features corresponding to the first avatar image, the image data, the audio data, and the text data.

11. The electronic device according to claim 10, wherein obtaining the video integration feature based on the first avatar image, the image data, the audio data, and the text data comprises: obtaining a first avatar image feature, an image difference feature, an audio feature, and a text feature, wherein the first avatar image feature corresponds to the first avatar image, the image difference feature corresponds to image data of adjacent frames in the video frames, the audio feature corresponds to the audio data, and the text feature corresponds to the text data; and performing integration processing on the first avatar image feature, the image difference feature, the audio feature, and the text feature to obtain the video integration feature.

12. The electronic device according to claim 11, wherein performing integration processing on the first avatar image feature, the image difference feature, the audio feature, and the text feature to obtain the video integration feature comprises: converting the first avatar image feature, the image difference feature, the audio feature, and the text feature into a first vector, a second vector, a third vector, and a fourth vector in the feature space, respectively; generating a feature integration vector based on the first vector, the second vector, the third vector, and the fourth vector; generating a residual vector corresponding to the feature integration vector by using an attention mechanism; and obtaining the video integration feature based on the feature integration vector and the residual vector.

13. The electronic device according to claim 10, wherein the operations are implemented by an avatar video generation model.

14. The electronic device according to claim 13, wherein the operations further comprise: obtaining a first loss function based on the avatar video, the audio data, and the text data; and
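
For readers who find the claim language abstract, the following is a minimal Python sketch of one way to read the integration steps of claims 3 and 12 and the loss composition of claims 6 and 9. It is an illustration under stated assumptions, not the patented implementation. The claims do not say how the feature integration vector is formed, which attention mechanism is used, how the residual is combined, or how the individual losses are computed and merged. Here the four vectors are stacked as tokens, the attention is unlearned scaled dot-product self-attention, the residual is added element-wise, each pairwise loss is a cosine distance between embeddings, and the losses are merged by a weighted sum. All function names (`integrate_features`, `first_loss`, `fourth_loss`) and the `weights` parameter are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections.

    Assumption: a trained model would use learned query/key/value weights.
    """
    dim = len(tokens[0])
    attended = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[i] for w, value in zip(weights, tokens)) for i in range(dim)]
        )
    return attended


def integrate_features(avatar_vec, diff_vec, audio_vec, text_vec):
    """Fuse four same-length vectors into one video integration feature."""
    # Feature integration vector: the four vectors stacked as tokens (assumption).
    tokens = [avatar_vec, diff_vec, audio_vec, text_vec]
    if len({len(t) for t in tokens}) != 1:
        raise ValueError("all four vectors must share one feature-space dimension")
    # Residual vector: attention output computed over the stacked tokens.
    residual = self_attention(tokens)
    # Combine by element-wise addition, then flatten (assumption).
    return [x + r for tok, res in zip(tokens, residual) for x, r in zip(tok, res)]


def cosine_distance(u, v):
    """Return 1 - cosine similarity; stands in for an unspecified pairwise loss."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def first_loss(video_emb, audio_emb, text_emb):
    """Sum of video-audio, video-text, and audio-text losses (claim 6 structure)."""
    return (
        cosine_distance(video_emb, audio_emb)
        + cosine_distance(video_emb, text_emb)
        + cosine_distance(audio_emb, text_emb)
    )


def fourth_loss(first, second, third, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the first, second, and third losses (assumed combination)."""
    return weights[0] * first + weights[1] * second + weights[2] * third
```

As a usage example, calling `integrate_features` with four 2-dimensional vectors returns a flat list of 8 values, and `fourth_loss(first_loss(v, a, t), second, third)` yields a single scalar that a training loop could minimize. The second and third losses are passed in as plain numbers because the claims name their inputs but not their form.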