US2026099978A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2026099978-A1 |
| Application number | US-202519348764-A |
| Country | US |
| Kind code | A1 |
| Filing date | Oct 2, 2025 |
| Priority date | Oct 3, 2024 |
| Publication date | Apr 9, 2026 |
| Grant date | — |
Official abstract text for this publication.
A method to generate a video includes receiving an input describing a scene. The method also includes receiving a reference image depicting a character. The method further includes generating, via an encoder, embeddings of identity features of the reference image. The method also includes generating, via a video generation model, the video in which the character appears with consistent likeness across multiple frames in accordance with the embeddings and the text prompt.
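The data flow in the abstract can be sketched as follows. This is a minimal illustration, not the applicant's implementation: `generate_video`, `Frame`, `toy_encode_identity`, and `toy_encode_text` are hypothetical names, and the "video generation model" is reduced to a stub that only records what each frame was conditioned on. The one point it shows is that the identity embeddings are computed once from the reference image and reused for every frame.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]
Image = Sequence[Sequence[float]]  # toy stand-in: rows of pixel values


@dataclass
class Frame:
    index: int
    conditioning: Vector  # what the (stub) model was conditioned on


def toy_encode_identity(image: Image) -> Vector:
    """Hypothetical encoder: column means of the image as 'identity features'."""
    rows = len(image)
    return [sum(row[c] for row in image) / rows for c in range(len(image[0]))]


def toy_encode_text(prompt: str) -> Vector:
    """Hypothetical text encoder: a single feature, the word count."""
    return [float(len(prompt.split()))]


def generate_video(
    scene_input: str,
    reference_image: Image,
    encode_identity: Callable[[Image], Vector],
    encode_text: Callable[[str], Vector],
    num_frames: int = 4,
) -> List[Frame]:
    # Identity embeddings are computed once and shared by all frames,
    # which is the assumed source of the "consistent likeness".
    identity = encode_identity(reference_image)
    text = encode_text(scene_input)
    return [Frame(i, identity + text) for i in range(num_frames)]


if __name__ == "__main__":
    frames = generate_video(
        "a knight walks", [[0.0, 1.0], [1.0, 1.0]],
        toy_encode_identity, toy_encode_text, num_frames=3,
    )
    for frame in frames:
        print(frame.index, frame.conditioning)
```

A real system would replace the stub with an actual generative model; the sketch only fixes the order of the steps and the reuse of one set of embeddings.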
Opening claim text (preview).
What is claimed is:
1. A method to generate a video, the method comprising: receiving an input describing a scene; receiving a reference image depicting a character; generating, via an encoder, embeddings of identity features of the reference image; and generating, via a video generation model, the video in which the character appears with consistent likeness across multiple frames in accordance with the embeddings and the text prompt.
2. The method of claim 1, further comprising: generating, via a transformer, a joint multimodal embedding sequence based on concatenating the embeddings with text prompt embeddings associated with the text prompt.
3. The method of claim 2, further comprising: projecting the embeddings into a common latent space dimension of the video generation model prior to the concatenation with the text prompt embeddings.
4. The method of claim 2, wherein the embeddings are concatenated with the text prompt embeddings via a learned gating mechanism that dynamically weights identity features relative to textual features.
5. The method of claim 1, wherein the embeddings are injected into a cross-attention layer of the video generation model to condition hidden representations derived from the text prompt.
6. The method of claim 1, further comprising: generating multiple scenes with different inputs while maintaining the consistent likeness of the character across all scenes.
7. The method of claim 1, wherein maintaining the consistent likeness comprises preserving one or more of facial expressions, hairstyle, clothing, or other distinguishing features of the reference image.
8. An apparatus to generate a video, the apparatus comprising: one or more processors; and one or more memories coupled with the one or more processors and storing processor-executable code that, when executed by the one or more processors, is configured to cause the apparatus to: receive an input describing a scene; receive a reference image depicting a character; generate, via an encoder, embeddings of identity features of the reference image; and generate, via a video generation model, the video in which the character appears with consistent likeness across multiple frames in accordance with the embeddings and the text prompt.
9. The apparatus of claim 8, wherein execution of the processor-executable code further causes the apparatus to generate, via a transformer, a joint multimodal embedding sequence based on concatenating the embeddings with text prompt embeddings associated with the text prompt.
10. The apparatus of claim 9, wherein execution of the processor-executable code further causes the apparatus to project the embeddings into a common latent space dimension of the video generation model prior to the concatenation with the text prompt embeddings.
11. The apparatus of claim 9, wherein the embeddings are concatenated with the text prompt embeddings via a learned gating mechanism that dynamically weights identity features relative to textual features.
12. The apparatus of claim 8, wherein the embeddings are injected into a cross-attention layer of the video generation model to condition hidden representations derived from the text prompt.
13. The apparatus of claim 8, wherein execution of the processor-executable code further causes the apparatus to generate multiple scenes with different inputs while maintaining the consistent likeness of the character across all scenes.
14. The apparatus of claim 8, wherein execution of the processor-executable code that causes the apparatus to maintain the consistent likeness further causes the apparatus to preserve one or more of facial expressions, hairstyle, clothing, or other distinguishing features of the reference image.
15. A non-transitory computer-readable medium having program code recorded thereon to generate a video, the program code executed by one or more processors and comprising: program code to receive an input describing a scene; program code to receive a reference image depicting a character; program code to generate, via an encoder, embeddings of identity features of the reference image; and program code to generate, via a video generation model, the video in which the character appears with consistent likeness across multiple frames in accordance with the embeddings and the text prompt.
16. The non-transitory computer-readable medium of claim 15, wherein the program code further comprises program code to generate, via a transformer, a joint multimodal embedding sequence based on concatenating the embeddings with text prompt embeddings associated with the text prompt.
17. The non-transitory computer-readable medium of claim 16, wherein the program code further comprises program code to project the embeddings into a common latent space dimension of the video generation model prior to the concatenation with the text prompt embeddings.
18. The non-transitory computer-readable medium of claim 16, wherein the embeddings are concatenated with the text prompt embeddings via a learned gating mechanism that dynamically weights identity features relative to textual features.
19. The non-transitory computer-readable medium of claim 15, wherein the embeddings are injected into a cross-attention layer of the video generation model to condition hidden representations derived from the text prompt.
20. The non-transitory computer-readable medium of claim 15, wherein the program code further comprises program code to generate multiple scenes with different inputs while maintaining the consistent likeness of the character across all scenes.
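The conditioning steps named in claims 2 to 5 (projection into a common latent dimension, gated concatenation with text prompt embeddings, and cross-attention injection) can be illustrated with a small pure-Python sketch. Everything here is an assumption made for illustration: the function names, the sigmoid gate computed from pooled features, and the attention layer without learned query/key/value projections are one plausible reading of the claim language, not the method disclosed in the specification.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]  # a sequence of token embeddings


def project(tokens: Matrix, weight: Matrix) -> Matrix:
    """Claim 3 (assumed form): linear map of each identity token into the
    latent width shared with the text prompt embeddings."""
    return [[sum(w * x for w, x in zip(row, t)) for row in weight] for t in tokens]


def mean_token(tokens: Matrix) -> Vector:
    """Average the tokens of a sequence into one pooled vector."""
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]


def gated_concat(identity: Matrix, text: Matrix, gate_w: Vector, gate_b: float) -> Matrix:
    """Claim 4 (assumed form): a scalar gate g in (0, 1), computed from the
    pooled inputs, scales identity tokens by g and text tokens by 1 - g,
    then the two sequences are concatenated along the sequence axis."""
    pooled = mean_token(identity) + mean_token(text)
    z = sum(w * p for w, p in zip(gate_w, pooled)) + gate_b
    g = 1.0 / (1.0 + math.exp(-z))
    scaled_identity = [[g * v for v in tok] for tok in identity]
    scaled_text = [[(1.0 - g) * v for v in tok] for tok in text]
    return scaled_identity + scaled_text


def softmax(scores: Vector) -> Vector:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(hidden: Matrix, identity: Matrix) -> Matrix:
    """Claim 5 (assumed form): text-derived hidden states act as queries,
    identity tokens act as keys and values, and the attended context is
    added back to each hidden state (residual connection)."""
    d = len(identity[0])
    out = []
    for h in hidden:
        scores = [sum(a * b for a, b in zip(h, k)) / math.sqrt(d) for k in identity]
        weights = softmax(scores)
        context = [sum(w * k[i] for w, k in zip(weights, identity)) for i in range(d)]
        out.append([hv + cv for hv, cv in zip(h, context)])
    return out


if __name__ == "__main__":
    # One identity token of width 3, projected to the text width of 2.
    identity = project([[1.0, 0.0, 2.0]], [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    text = [[0.5, 0.5], [1.0, -1.0]]
    joint = gated_concat(identity, text, gate_w=[0.0] * 4, gate_b=0.0)
    print(joint)                           # joint multimodal sequence (claim 2)
    print(cross_attend(text, identity))    # identity-conditioned hidden states
```

In a trained model the projection matrix and gate parameters would be learned; the zero gate weights in the demo simply weight identity and text equally.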
of characters, e.g. humans, animals or virtual beings · CPC title
Creating or editing images; Combining images with text · CPC title
using neural networks · CPC title
Labelling scene content, e.g. deriving syntactic or semantic representations · CPC title
involving special video data, e.g. 3D video · CPC title