Video processing method and apparatus, device, and medium
US-2024402902-A1 · Dec 5, 2024 · US
US2026100203A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2026100203-A1 |
| Application number | US-202519348769-A |
| Country | US |
| Kind code | A1 |
| Filing date | Oct 2, 2025 |
| Priority date | Oct 3, 2024 |
| Publication date | Apr 9, 2026 |
| Grant date | — |
Official abstract text for this publication.
A method to edit a video includes receiving an input video including a sequence of frames and receiving an editing instruction expressed in natural language. The method also includes generating a multimodal condition based on the textual editing instruction and the input video. The multimodal condition may include an embedding of the input video concatenated with an embedding of the textual editing instruction. The method also includes applying, via a video editing model, the multimodal condition to modify visual content of the input video. The method further includes generating an edited video including visual modifications corresponding to the textual editing instruction. The edited video preserves temporal coherence and overall visual fidelity of the input video.
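The abstract describes the multimodal condition as an embedding of the input video concatenated with an embedding of the textual editing instruction. As a minimal sketch of that one step, the snippet below joins two lists of token vectors. It assumes the concatenation runs along the sequence (token) axis and that both embeddings share the same width; the publication text shown here does not specify either point. The function name `build_multimodal_condition` and the toy values are hypothetical.

```python
from typing import List

Vector = List[float]


def build_multimodal_condition(
    video_tokens: List[Vector], text_tokens: List[Vector]
) -> List[Vector]:
    """Concatenate video and text token embeddings along the sequence axis.

    Assumption (not stated in the source): both inputs use the same
    embedding width, so the result is one longer token sequence.
    """
    if not video_tokens or not text_tokens:
        raise ValueError("both embeddings must contain at least one token")
    dim = len(video_tokens[0])
    for token in video_tokens + text_tokens:
        if len(token) != dim:
            raise ValueError("all tokens must share the same embedding width")
    # Copy each token so the condition does not alias the caller's lists.
    return [list(t) for t in video_tokens] + [list(t) for t in text_tokens]


# Toy data: two video tokens and one instruction token, each of width 3.
video_tokens = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
text_tokens = [[1.0, 0.0, 0.0]]
condition = build_multimodal_condition(video_tokens, text_tokens)
print(len(condition), len(condition[0]))  # 3 tokens, each of width 3
```

In a real system the two embeddings would come from learned encoders (the claims mention a temporal autoencoder and a transformer-based language model); this sketch only illustrates the shape of the joined sequence.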
Opening claim text (preview).
What is claimed is:
1. A method to edit a video, the method comprising: receiving an input video comprising a sequence of frames; receiving an editing instruction expressed in natural language; generating a multimodal condition based on the textual editing instruction and the input video, the multimodal condition comprising an embedding of the input video concatenated with an embedding of the textual editing instruction; applying, via a video editing model, the multimodal condition to modify visual content of the input video; and generating an edited video comprising visual modifications corresponding to the textual editing instruction, the edited video preserving temporal coherence and overall visual fidelity of the input video.
2. The method of claim 1, wherein generating the multimodal condition comprises applying cross-attention between the embedding of the input video and the embedding of the textual editing instruction.
3. The method of claim 1, further comprising: generating the embedding of the input video based on encoding the sequence of frames via a temporal autoencoder.
4. The method of claim 1, further comprising: generating the embedding of the textual editing instruction based on encoding the instruction with a transformer-based language model.
5. The method of claim 1, wherein the video editing model is conditioned on a task embedding corresponding to a type of editing operation comprising one or more of object addition, object removal, background replacement, or attribute modification.
6. The method of claim 1, wherein generating the edited video further comprises animating newly generated content such that spatial and temporal consistency across multiple frames is preserved.
7. The method of claim 1, wherein preserving temporal coherence comprises aligning positional embeddings of the sequence of frames such that edits applied to a first frame are propagated to subsequent frames.
8. The method of claim 1, wherein the visual fidelity is preserved by applying a filtering stage configured to discard edited outputs that are less than a predetermined quality threshold determined by automated image editing metrics.
9. An apparatus to edit a video, comprising: one or more processors; and one or more memories coupled with the one or more processors and storing processor-executable code that, when executed by the one or more processors, is configured to cause the apparatus to: receive an input video comprising a sequence of frames; receive an editing instruction expressed in natural language; generate a multimodal condition based on the textual editing instruction and the input video, the multimodal condition comprising an embedding of the input video concatenated with an embedding of the textual editing instruction; apply, via a video editing model, the multimodal condition to modify visual content of the input video; and generate an edited video comprising visual modifications corresponding to the textual editing instruction, the edited video preserving temporal coherence and overall visual fidelity of the input video.
10. The apparatus of claim 9, wherein execution of the processor-executable code that causes the apparatus to generate the multimodal condition further causes the apparatus to apply cross-attention between the embedding of the input video and the embedding of the textual editing instruction.
11. The apparatus of claim 9, wherein execution of the processor-executable code further causes the apparatus to generate the embedding of the input video based on encoding the sequence of frames via a temporal autoencoder.
12. The apparatus of claim 9, wherein execution of the processor-executable code further causes the apparatus to generate the embedding of the textual editing instruction based on encoding the instruction with a transformer-based language model.
13. The apparatus of claim 9, wherein the video editing model is conditioned on a task embedding corresponding to a type of editing operation comprising one or more of object addition, object removal, background replacement, or attribute modification.
14. The apparatus of claim 9, wherein execution of the processor-executable code that causes the apparatus to generate the edited video further causes the apparatus to animate newly generated content such that spatial and temporal consistency across multiple frames is preserved.
15. The apparatus of claim 9, wherein preserving temporal coherence comprises aligning positional embeddings of the sequence of frames such that edits applied to a first frame are propagated to subsequent frames.
16. The apparatus of claim 9, wherein the visual fidelity is preserved by applying a filtering stage configured to discard edited outputs that are less than a predetermined quality threshold determined by automated image editing metrics.
17. A non-transitory computer-readable medium having program code recorded thereon for editing a video, the program code executed by one or more processors and comprising: program code to receive an input video comprising a sequence of frames; program code to receive an editing instruction expressed in natural language; program code to generate a multimodal condition based on the textual editing instruction and the input video, the multimodal condition comprising an embedding of the input video concatenated with an embedding of the textual editing instruction; program code to apply, via a video editing model, the multimodal condition to modify visual content of the input video; and generate an edited video comprising visual modifications corresponding to the textual editing instruction, the edited video preserving temporal coherence and overall visual fidelity of the input video.
18. The non-transitory computer-readable medium of claim 17, wherein the program code to generate the multimodal condition further comprises program code to apply cross-attention between the embedding of the input video and the embedding of the textual editing instruction.
19. The non-transitory computer-readable medium of claim 17, wherein the program code further comprises program code to generate the embedding of the input video based on encoding the sequence of frames via a temporal autoencoder.
20. The non-transitory computer-readable medium of claim 17, wherein the program code further comprises program code to generate the embedding of the textual editing instruction based on encoding the instruction with a transformer-based language model.
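Claims 2, 10, and 18 recite applying cross-attention between the video embedding and the instruction embedding. The sketch below shows plain scaled dot-product cross-attention in pure Python to illustrate the general technique. It assumes video tokens act as queries and instruction tokens act as both keys and values, and it omits the learned projection matrices a trained model would use; the claims do not state which side is the query, so this is an illustrative choice, not the applicant's implementation. All names and values are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: List[Vector], keys_values: List[Vector]) -> List[Vector]:
    """Scaled dot-product cross-attention without learned projections.

    Each query (assumed here: a video token) is replaced by a weighted
    average of the key/value vectors (assumed here: instruction tokens).
    """
    dim = len(queries[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys_values]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output component at a time.
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, keys_values)) for i in range(dim)]
        )
    return attended


# Toy data: two video tokens attend over two instruction tokens of width 2.
video_tokens = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[2.0, 0.0], [0.0, 2.0]]
attended = cross_attention(video_tokens, text_tokens)
```

Each output row is a convex combination of the instruction tokens, so a video token that is more similar to a given instruction token takes on more of that token's content.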
of characters, e.g. humans, animals or virtual beings · CPC title
Creating or editing images; Combining images with text · CPC title
using neural networks · CPC title
Labelling scene content, e.g. deriving syntactic or semantic representations · CPC title
involving special video data, e.g. 3D video · CPC title