Systems and methods for automated movie generation and editing

US2026100204A1 · US · A1

Patent metadata
Field: Value
Publication number: US-2026100204-A1
Application number: US-202519348778-A
Country: US
Kind code: A1
Filing date: Oct 2, 2025
Priority date: Oct 3, 2024
Publication date: Apr 9, 2026
Grant date: (not listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method to generate synchronized audio for a video includes receiving the video including a sequence of frames and receiving a text input describing at least one of a scene, an event, or a mood to be reflected in an audio track. The method also includes generating a latent audio representation via an audio generation model conditioned jointly on video embeddings associated with the sequence of frames and text embeddings associated with the text input. The method also includes decoding the latent audio representation to produce an audio track temporally aligned with the video and semantically consistent with the text input.
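The data flow described in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the function names (`encode_frames`, `encode_text`, `generate_latent`, `decode_latent`), the embedding size, and the arithmetic inside each stage are hypothetical stand-ins, not the models disclosed in the publication. The only property it tries to show is that a latent sequence conditioned on both video and text embeddings, decoded at a fixed number of samples per frame, yields audio whose length tracks the video.

```python
import random

def encode_frames(frames, dim=4):
    """Hypothetical vision encoder: one embedding per frame (mean pixel value, repeated)."""
    return [[sum(f) / len(f)] * dim for f in frames]

def encode_text(text, dim=4):
    """Hypothetical language encoder: one embedding per word, derived from character codes."""
    return [[(sum(map(ord, w)) % 97) / 97.0] * dim for w in text.split()]

def generate_latent(video_emb, text_emb, latents_per_frame=2, seed=0):
    """Toy stand-in for the audio generation model.

    Each latent vector depends on one frame embedding AND on a pooled text
    context, so the output is conditioned jointly on both inputs.
    """
    if not video_emb or not text_emb:
        raise ValueError("both video and text embeddings are required")
    rng = random.Random(seed)
    dim = len(video_emb[0])
    # Pool the text embeddings into a single context vector (simple mean).
    text_ctx = [sum(e[i] for e in text_emb) / len(text_emb) for i in range(dim)]
    latent = []
    for v in video_emb:
        for _ in range(latents_per_frame):
            latent.append([v[i] + text_ctx[i] + rng.gauss(0, 0.01) for i in range(dim)])
    return latent

def decode_latent(latent, samples_per_latent=8):
    """Toy decoder: expands each latent vector into a fixed number of audio samples."""
    audio = []
    for z in latent:
        level = sum(z) / len(z)
        audio.extend([level] * samples_per_latent)
    return audio

def generate_audio(frames, text):
    """Run the four stages in order: encode, encode, generate, decode."""
    video_emb = encode_frames(frames)
    text_emb = encode_text(text)
    latent = generate_latent(video_emb, text_emb)
    return decode_latent(latent)

# Three tiny "frames" (lists of pixel values) and a text prompt.
frames = [[0.1, 0.2], [0.5, 0.5], [0.9, 0.8]]
audio = generate_audio(frames, "calm ocean waves")
```

Because every frame maps to the same number of latent vectors and every latent vector to the same number of samples, the audio length is a fixed multiple of the frame count. That is one simple way to obtain temporal alignment in a sketch; the publication does not state that alignment is achieved this way.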

First claim

Opening claim text (preview).

What is claimed is:

1. A method to generate synchronized audio for a video, the method comprising: receiving the video comprising a sequence of frames; receiving a text input describing one or more of a scene, an event, or a mood to be reflected in an audio track; generating a latent audio representation via an audio generation model conditioned jointly on video embeddings associated with the sequence of frames and text embeddings associated with the text input; and decoding the latent audio representation to produce an audio track temporally aligned with the video and semantically consistent with the text input.

2. The method of claim 1, wherein the audio track comprises one or more of instrumental music, ambient sound, or sound effects.

3. The method of claim 1, further comprising: encoding the sequence of frames into video embeddings using a vision encoder, and encoding the text input into text embeddings using a language encoder.

4. The method of claim 1, further comprising: concatenating the video embeddings and text embeddings into a multimodal embedding sequence, wherein the audio generation model is conditioned on the multimodal embedding sequence.

5. The method of claim 1, wherein decoding the latent audio representation comprises applying a variational autoencoder trained to reconstruct audio signals from compressed latent representations.

6. The method of claim 1, wherein the audio context comprises at least one of: audio infilling, audio extension, or audio replacement for one or more frames of the video.

7. The method of claim 1, wherein the semantic consistency is based on the correspondence between the generated audio and the text input using a contrastive audio-video-text pre-training model.

8. An apparatus to generate synchronized audio for a video, the apparatus comprising: one or more processors; and one or more memories coupled with the one or more processors and storing processor-executable code that, when executed by the one or more processors, is configured to cause the apparatus to: receive the video comprising a sequence of frames; receive a text input describing one or more of a scene, an event, or a mood to be reflected in an audio track; generate a latent audio representation via an audio generation model conditioned jointly on video embeddings associated with the sequence of frames and text embeddings associated with the text input; and decode the latent audio representation to produce an audio track temporally aligned with the video and semantically consistent with the text input.

9. The apparatus of claim 8, wherein the audio track comprises one or more of instrumental music, ambient sound, or sound effects.

10. The apparatus of claim 8, wherein execution of the processor-executable code further causes the apparatus to encode the sequence of frames into video embeddings using a vision encoder, and encoding the text input into text embeddings using a language encoder.

11. The apparatus of claim 8, wherein execution of the processor-executable code further causes the apparatus to concatenate the video embeddings and text embeddings into a multimodal embedding sequence, wherein the audio generation model is conditioned on the multimodal embedding sequence.

12. The apparatus of claim 8, wherein execution of the processor-executable code that causes the apparatus to decode the latent audio representation further causes the apparatus to apply a variational autoencoder trained to reconstruct audio signals from compressed latent representations.

13. The apparatus of claim 8, wherein the audio context comprises at least one of: audio infilling, audio extension, or audio replacement for one or more frames of the video.

14. The apparatus of claim 1, wherein the semantic consistency is based on the correspondence between the generated audio and the text input using a contrastive audio-video-text pre-training model.

15. A non-transitory computer-readable medium having program code recorded thereon to generate synchronized audio for a video, the program code executed by one or more processors and comprising: program code to receive the video comprising a sequence of frames; program code to receive a text input describing one or more of a scene, an event, or a mood to be reflected in an audio track; program code to generate a latent audio representation via an audio generation model conditioned jointly on video embeddings associated with the sequence of frames and text embeddings associated with the text input; and program code to decode the latent audio representation to produce an audio track temporally aligned with the video and semantically consistent with the text input.

16. The non-transitory computer-readable medium of claim 15, wherein the audio track comprises one or more of instrumental music, ambient sound, or sound effects.

17. The non-transitory computer-readable medium of claim 15, wherein the program code further comprises program code to encode the sequence of frames into video embeddings using a vision encoder, and encoding the text input into text embeddings using a language encoder.

18. The non-transitory computer-readable medium of claim 15, wherein the program code further comprises program code to concatenate the video embeddings and text embeddings into a multimodal embedding sequence, wherein the audio generation model is conditioned on the multimodal embedding sequence.

19. The non-transitory computer-readable medium of claim 15, wherein the program code to decode the latent audio representation further comprises program code to apply a variational autoencoder trained to reconstruct audio signals from compressed latent representations.

20. The non-transitory computer-readable medium of claim 15, wherein the audio context comprises at least one of: audio infilling, audio extension, or audio replacement for one or more frames of the video.
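Two of the dependent claims describe operations that are easy to picture in code: concatenating video and text embeddings into one multimodal sequence, and judging semantic consistency by how well the generated audio corresponds to the text. The sketch below is a minimal illustration under stated assumptions. The helper names `concat_embeddings` and `cosine_similarity` are hypothetical, and cosine similarity in a shared embedding space is a common scoring choice in contrastive models in general; the claims do not say which measure the contrastive audio-video-text model uses.

```python
import math

def concat_embeddings(video_emb, text_emb):
    """Join video and text embeddings along the sequence axis.

    Assumes both inputs are lists of equal-width vectors; the width check
    is an illustrative safeguard, not something stated in the claims.
    """
    if video_emb and text_emb and len(video_emb[0]) != len(text_emb[0]):
        raise ValueError("video and text embeddings must have the same width")
    return list(video_emb) + list(text_emb)

def cosine_similarity(a, b):
    """Cosine similarity between two vectors; returns 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: two frame vectors and one text vector, width 3.
video_emb = [[0.2, 0.1, 0.0], [0.3, 0.1, 0.1]]
text_emb = [[0.9, 0.0, 0.1]]
multimodal = concat_embeddings(video_emb, text_emb)

# Hypothetical audio and text embeddings assumed to share one embedding space.
score = cosine_similarity([1.0, 0.0, 0.5], [0.9, 0.0, 0.1])
```

A higher score would indicate closer correspondence between audio and text in the shared space. How such a score is thresholded or used during generation is not specified on this page.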

Assignees

Meta Platforms Inc

Inventors

Classifications

  • of characters, e.g. humans, animals or virtual beings · CPC title

  • Creating or editing images; Combining images with text · CPC title

  • using neural networks · CPC title

  • Labelling scene content, e.g. deriving syntactic or semantic representations · CPC title

  • involving special video data, e.g. 3D video · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2026100204A1 cover?
A method to generate synchronized audio for a video includes receiving the video including a sequence of frames and receiving a text input describing at least one of a scene, an event, or a mood to be reflected in an audio track. The method also includes generating a latent audio representation via an audio generation model conditioned jointly on video embeddings associated with the sequence of…
Who is the assignee on this patent?
Meta Platforms Inc
What technology area does this patent fall under?
Primary CPC classification G11B27/031. Mapped technology areas include Physics.
When was this patent published?
Publication date: Apr 9, 2026 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).