Video processing method and apparatus, device, and medium
US-2024402902-A1 · Dec 5, 2024 · US
US2026100204A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2026100204-A1 |
| Application number | US-202519348778-A |
| Country | US |
| Kind code | A1 |
| Filing date | Oct 2, 2025 |
| Priority date | Oct 3, 2024 |
| Publication date | Apr 9, 2026 |
| Grant date | — |
Official abstract text for this publication.
A method to generate synchronized audio for a video includes receiving the video including a sequence of frames and receiving a text input describing at least one of a scene, an event, or a mood to be reflected in an audio track. The method also includes generating a latent audio representation via an audio generation model conditioned jointly on video embeddings associated with the sequence of frames and text embeddings associated with the text input. The method also includes decoding the latent audio representation to produce an audio track temporally aligned with the video and semantically consistent with the text input.
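The data flow in the abstract (encode the frames, encode the text, generate a latent conditioned on both, decode to audio) can be sketched in a few lines. This is only an illustrative toy and not the implementation disclosed in the publication: every function name, the embedding width, and the arithmetic inside each stage are assumptions chosen so the example runs with the standard library alone.

```python
import math
from typing import List

Vector = List[float]

def encode_frames(frames: List[List[float]], dim: int = 4) -> List[Vector]:
    """Toy stand-in for a vision encoder: average each frame's pixels into `dim` chunks."""
    embeddings = []
    for pixels in frames:
        chunk = max(1, len(pixels) // dim)
        emb = [sum(pixels[i * chunk:(i + 1) * chunk]) / chunk for i in range(dim)]
        embeddings.append(emb)
    return embeddings

def encode_text(text: str, dim: int = 4) -> List[Vector]:
    """Toy stand-in for a language encoder: one one-hot vector per word."""
    embeddings = []
    for word in text.lower().split():
        vec = [0.0] * dim
        vec[sum(ord(c) for c in word) % dim] = 1.0
        embeddings.append(vec)
    return embeddings

def generate_latent(video_emb: List[Vector], text_emb: List[Vector]) -> List[Vector]:
    """Toy generator: one latent step per frame, conditioned on both modalities."""
    dim = len(video_emb[0])
    # Pool the text embeddings into a single conditioning vector.
    text_ctx = [sum(v[d] for v in text_emb) / len(text_emb) for d in range(dim)]
    # Each latent step mixes that frame's embedding with the shared text context.
    return [[math.tanh(frame[d] + text_ctx[d]) for d in range(dim)] for frame in video_emb]

def decode_latent(latent: List[Vector], samples_per_frame: int = 8) -> List[float]:
    """Toy decoder: expand each latent step into a fixed number of audio samples."""
    audio = []
    for step in latent:
        amplitude = sum(step) / len(step)
        for n in range(samples_per_frame):
            audio.append(amplitude * math.sin(2 * math.pi * n / samples_per_frame))
    return audio

# Three hypothetical frames of eight "pixels" each, plus a hypothetical prompt.
frames = [[0.1 * (t + 1)] * 8 for t in range(3)]
text = "calm ocean waves at sunset"

audio = decode_latent(generate_latent(encode_frames(frames), encode_text(text)))
print(len(audio))  # 3 frames x 8 samples per frame = 24
```

Because the toy decoder emits a fixed number of samples per latent step and the generator emits one step per frame, the output length follows the frame count. That is the simplest possible reading of "temporally aligned"; a real system would presumably need a far richer alignment mechanism.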
Opening claim text (preview).
What is claimed is:
1. A method to generate synchronized audio for a video, the method comprising: receiving the video comprising a sequence of frames; receiving a text input describing one or more of a scene, an event, or a mood to be reflected in an audio track; generating a latent audio representation via an audio generation model conditioned jointly on video embeddings associated with the sequence of frames and text embeddings associated with the text input; and decoding the latent audio representation to produce an audio track temporally aligned with the video and semantically consistent with the text input.
2. The method of claim 1, wherein the audio track comprises one or more of instrumental music, ambient sound, or sound effects.
3. The method of claim 1, further comprising: encoding the sequence of frames into video embeddings using a vision encoder, and encoding the text input into text embeddings using a language encoder.
4. The method of claim 1, further comprising: concatenating the video embeddings and text embeddings into a multimodal embedding sequence, wherein the audio generation model is conditioned on the multimodal embedding sequence.
5. The method of claim 1, wherein decoding the latent audio representation comprises applying a variational autoencoder trained to reconstruct audio signals from compressed latent representations.
6. The method of claim 1, wherein the audio context comprises at least one of: audio infilling, audio extension, or audio replacement for one or more frames of the video.
7. The method of claim 1, wherein the semantic consistency is based on the correspondence between the generated audio and the text input using a contrastive audio-video-text pre-training model.
8. An apparatus to generate synchronized audio for a video, the apparatus comprising: one or more processors; and one or more memories coupled with the one or more processors and storing processor-executable code that, when executed by the one or more processors, is configured to cause the apparatus to: receive the video comprising a sequence of frames; receive a text input describing one or more of a scene, an event, or a mood to be reflected in an audio track; generate a latent audio representation via an audio generation model conditioned jointly on video embeddings associated with the sequence of frames and text embeddings associated with the text input; and decode the latent audio representation to produce an audio track temporally aligned with the video and semantically consistent with the text input.
9. The apparatus of claim 8, wherein the audio track comprises one or more of instrumental music, ambient sound, or sound effects.
10. The apparatus of claim 8, wherein execution of the processor-executable code further causes the apparatus to encode the sequence of frames into video embeddings using a vision encoder, and encoding the text input into text embeddings using a language encoder.
11. The apparatus of claim 8, wherein execution of the processor-executable code further causes the apparatus to concatenate the video embeddings and text embeddings into a multimodal embedding sequence, wherein the audio generation model is conditioned on the multimodal embedding sequence.
12. The apparatus of claim 8, wherein execution of the processor-executable code that causes the apparatus to decode the latent audio representation further causes the apparatus to apply a variational autoencoder trained to reconstruct audio signals from compressed latent representations.
13. The apparatus of claim 8, wherein the audio context comprises at least one of: audio infilling, audio extension, or audio replacement for one or more frames of the video.
14. The apparatus of claim 1, wherein the semantic consistency is based on the correspondence between the generated audio and the text input using a contrastive audio-video-text pre-training model.
15. A non-transitory computer-readable medium having program code recorded thereon to generate synchronized audio for a video, the program code executed by one or more processors and comprising: program code to receive the video comprising a sequence of frames; program code to receive a text input describing one or more of a scene, an event, or a mood to be reflected in an audio track; program code to generate a latent audio representation via an audio generation model conditioned jointly on video embeddings associated with the sequence of frames and text embeddings associated with the text input; and program code to decode the latent audio representation to produce an audio track temporally aligned with the video and semantically consistent with the text input.
16. The non-transitory computer-readable medium of claim 15, wherein the audio track comprises one or more of instrumental music, ambient sound, or sound effects.
17. The non-transitory computer-readable medium of claim 15, wherein the program code further comprises program code to encode the sequence of frames into video embeddings using a vision encoder, and encoding the text input into text embeddings using a language encoder.
18. The non-transitory computer-readable medium of claim 15, wherein the program code further comprises program code to concatenate the video embeddings and text embeddings into a multimodal embedding sequence, wherein the audio generation model is conditioned on the multimodal embedding sequence.
19. The non-transitory computer-readable medium of claim 15, wherein the program code to decode the latent audio representation further comprises program code to apply a variational autoencoder trained to reconstruct audio signals from compressed latent representations.
20. The non-transitory computer-readable medium of claim 15, wherein the audio context comprises at least one of: audio infilling, audio extension, or audio replacement for one or more frames of the video.
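Two of the dependent claims describe operations that are easy to picture concretely: joining video and text embeddings into one multimodal sequence, and judging semantic consistency by how well the generated audio corresponds to the text. The sketch below is one possible reading, not the claimed implementation. It assumes both embedding sequences share a common width, and it uses cosine similarity between hand-written vectors as a stand-in for the score a contrastive pre-training model might produce. All names and values are hypothetical.

```python
import math
from typing import List

Vector = List[float]

def concat_multimodal(video_emb: List[Vector], text_emb: List[Vector]) -> List[Vector]:
    """Join the two embedding sequences along the sequence axis (assumed layout)."""
    if video_emb and text_emb and len(video_emb[0]) != len(text_emb[0]):
        raise ValueError("video and text embeddings must share one width")
    return list(video_emb) + list(text_emb)

def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical embeddings: two video frames and one text token, width 3.
video_emb = [[0.2, 0.1, 0.0], [0.3, 0.1, 0.0]]
text_emb = [[0.0, 1.0, 0.0]]
sequence = concat_multimodal(video_emb, text_emb)

# Hypothetical embedding of the generated audio in the same shared space.
audio_emb = [0.1, 0.9, 0.0]
score = cosine_similarity(audio_emb, text_emb[0])
print(len(sequence), round(score, 3))
```

A score near 1.0 means the audio and text vectors point in nearly the same direction. In a real contrastive model those vectors would come from trained encoders rather than being written by hand.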
of characters, e.g. humans, animals or virtual beings · CPC title
Creating or editing images; Combining images with text · CPC title
using neural networks · CPC title
Labelling scene content, e.g. deriving syntactic or semantic representations · CPC title
involving special video data, e.g. 3D video · CPC title