Systems and methods for automated movie generation and editing

US2026101081A1 · US · A1

Patent metadata
Field               Value
Publication number  US-2026101081-A1
Application number  US-202519348747-A
Country             US
Kind code           A1
Filing date         Oct 2, 2025
Priority date       Oct 3, 2024
Publication date    Apr 9, 2026
Grant date          (not listed)

Abstract

Official abstract text for this publication.

A system and method to generate a video are provided. The method may include generating, based on a user input including a description of a desired video, a structured script including one or more of scene descriptions, dialogue, or explicit shot-level information. The method also includes generating, based on the structured script, a sequence of video frames representing one or more scenes. The method further includes generating, based on the structured script and the sequence of video frames, an audio track including one or more of ambient sounds, sound effects, or music, the generated audio track being temporally synchronized with the sequence of video frames. The method also includes combining the sequence of video frames with the audio track to generate a synchronized video output representing the desired video.
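
The four steps in the abstract (script, frames, audio, combination) can be sketched as a simple pipeline. The Python below is an illustrative sketch only. Every name in it (`Scene`, `Frame`, `AudioEvent`, `generate_script`, `generate_frames`, `generate_audio`, `combine`, `generate_video`), the frame rate, and the fixed scene length are assumptions made for this example. The placeholder bodies stand in for the generative models, which the publication does not describe at this level of detail.

```python
from dataclasses import dataclass, field

FPS = 4              # hypothetical frame rate, chosen only for the example
SCENE_SECONDS = 2.0  # hypothetical fixed scene length


@dataclass
class Scene:
    description: str
    dialogue: list[str] = field(default_factory=list)
    shots: list[str] = field(default_factory=list)


@dataclass
class Frame:
    scene_index: int
    timestamp: float  # seconds from the start of the video


@dataclass
class AudioEvent:
    kind: str     # "ambient", "sfx", or "music"
    start: float  # seconds
    end: float    # seconds


def generate_script(prompt: str) -> list[Scene]:
    """Placeholder for the script step: one scene per sentence of the prompt."""
    sentences = [s.strip() for s in prompt.split(".") if s.strip()]
    return [Scene(description=s) for s in sentences]


def generate_frames(script: list[Scene]) -> list[Frame]:
    """Placeholder for the video step: evenly spaced frames for each scene."""
    per_scene = int(FPS * SCENE_SECONDS)
    return [
        Frame(scene_index=i, timestamp=i * SCENE_SECONDS + k / FPS)
        for i in range(len(script))
        for k in range(per_scene)
    ]


def generate_audio(script: list[Scene], frames: list[Frame]) -> list[AudioEvent]:
    """Placeholder for the audio step: one ambient event per scene, with its
    start and end taken from that scene's frame timestamps."""
    events = []
    for i in range(len(script)):
        times = [f.timestamp for f in frames if f.scene_index == i]
        events.append(AudioEvent("ambient", min(times), max(times) + 1 / FPS))
    return events


def combine(frames: list[Frame], audio: list[AudioEvent]) -> dict:
    """Bundle frames and audio on a shared timeline."""
    duration = frames[-1].timestamp + 1 / FPS if frames else 0.0
    return {"frames": frames, "audio": audio, "duration": duration}


def generate_video(prompt: str) -> dict:
    script = generate_script(prompt)
    frames = generate_frames(script)
    audio = generate_audio(script, frames)  # depends on both script and frames
    return combine(frames, audio)
```

In this sketch the audio step takes both the script and the frames as input, mirroring the abstract's wording that the audio track is generated "based on the structured script and the sequence of video frames".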

First claim

Opening claim text (preview).

What is claimed is:

1. A method to generate a video, comprising: receiving a user input comprising a description of a desired video; generating, based on the user input prompt, a structured script comprising one or more of scene descriptions, dialogue, or explicit shot-level information; generating, based on the structured script, a sequence of video frames representing one or more scenes; generating, based on the structured script and the sequence of video frames, an audio track comprising one or more of ambient sounds, sound effects, or music, the generated audio track being temporally synchronized with the sequence of video frames; and combining the sequence of video frames with the audio track to generate a synchronized video output representing the desired video.

2. The method of claim 1, wherein generating the structured script comprises generating a multi-scene screenplay comprising scene-by-scene breakdowns of one or more of characters, environments, or actions.

3. The method of claim 1, wherein the sequence of video frames is generated via a video foundation model trained jointly on text-to-image and text-to-video tasks.

4. The method of claim 1, wherein generating the sequence of video frames comprises encoding temporal dynamics including object motion, camera motion, and/or subject-object interactions.

5. The method of claim 1, wherein the audio track is generated via an audio generative model conditioned on both textual and visual input to align generated sound events with corresponding visual events.

6. The method of claim 1, wherein combining the sequence of video frames with the audio track comprises synchronizing onset times of sound effects with detected actions in the video frames.

7. The method of claim 1, further comprising: editing the structured script in response to a second user prompt prior to generating the sequence of video frames.

8. The method of claim 1, wherein the desired video is generated at a base resolution, and the method further comprises: applying a spatial upsampler to the video to generate a high-definition video output.

9. The method of claim 1, wherein: the first generative model comprises a large language model trained on screenplay data; and the first generative model is configured to output the structured script in a machine-readable format comprising one or more of scene headers, action descriptions, dialogue lines, or camera directives.

10. An apparatus to generate a video, comprising: one or more processors; and one or more memories coupled with the one or more processors and storing instructions that, when executed by the one or more processors, cause the apparatus to: receive a user input comprising a description of a desired video; generate, based on the user input prompt, a structured script comprising one or more of scene descriptions, dialogue, or explicit shot-level information; generate, based on the structured script, a sequence of video frames representing one or more scenes; generate, based on the structured script and the sequence of video frames, an audio track comprising one or more of ambient sounds, sound effects, or music, the generated audio track being temporally synchronized with the sequence of video frames; and combine the sequence of video frames with the audio track to generate a synchronized video output representing the desired video.

11. The apparatus of claim 10, wherein when the one or more processors further execute the instructions, the apparatus is configured to: generate the structured script by generating a multi-scene screenplay comprising scene-by-scene breakdowns of one or more of characters, environments, or actions.

12. The apparatus of claim 10, wherein the sequence of video frames is generated via a video foundation model trained jointly on text-to-image and text-to-video tasks.

13. The apparatus of claim 10, wherein when the one or more processors further execute the instructions, the apparatus is further configured to: generate the sequence of video frames further causes the apparatus to encode temporal dynamics including object motion, camera motion, and/or subject-object interactions.

14. The apparatus of claim 10, wherein the audio track is generated via an audio generative model conditioned on both textual and visual input to align generated sound events with corresponding visual events.

15. The apparatus of claim 10, wherein when the one or more processors further execute the instructions, the apparatus is further configured to: combine the sequence of video frames with the audio track; and synchronize onset times of sound effects with detected actions in the video frames.

16. The apparatus of claim 10, wherein when the one or more processors further execute the instructions, the apparatus is configured to: edit the structured script in response to a second user prompt prior to the generate the sequence of video frames.

17. The apparatus of claim 10, wherein the desired video is generated at a base resolution, and wherein when the one or more processors further execute the instructions, the apparatus is configured to: apply a spatial upsampler to the video to generate a high-definition video output.

18. The apparatus of claim 10, wherein: the first generative model comprises a large language model trained on screenplay data; and the first generative model is configured to output the structured script in a machine-readable format comprising one or more of scene headers, action descriptions, dialogue lines, or camera directives.

19. A non-transitory computer-readable medium storing instructions that, when executed, cause: receiving a user input comprising a description of a desired video; generating, based on the user input prompt, a structured script comprising one or more of scene descriptions, dialogue, or explicit shot-level information; generating, based on the structured script, a sequence of video frames representing one or more scenes; generating, based on the structured script and the sequence of video frames, an audio track comprising one or more of ambient sounds, sound effects, or music, the generated audio track being temporally synchronized with the sequence of video frames; and combining the sequence of video frames with the audio track to generate a synchronized video output representing the desired video.

20. The non-transitory computer-readable medium of claim 19, wherein the instructions, when executed, further cause: generating the structured script to generate a multi-scene screenplay comprising scene-by-scene breakdowns of one or more of characters, environments, or actions.
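
Claims 6 and 15 recite synchronizing onset times of sound effects with detected actions in the video frames, without saying how. One way this might look, assuming action timestamps are already available from some detector, is to snap each sound-effect onset to the nearest detected action within a tolerance. The function name `snap_onsets`, the default tolerance, and the nearest-neighbour rule are all assumptions made for illustration and are not taken from the publication.

```python
from bisect import bisect_left


def snap_onsets(onsets, action_times, tolerance=0.25):
    """Move each sound-effect onset (seconds) to the nearest detected action
    time if one lies within `tolerance` seconds; otherwise leave it unchanged."""
    actions = sorted(action_times)
    snapped = []
    for t in onsets:
        i = bisect_left(actions, t)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = actions[max(i - 1, 0):i + 1]
        nearest = min(candidates, key=lambda a: abs(a - t), default=None)
        if nearest is not None and abs(nearest - t) <= tolerance:
            snapped.append(nearest)
        else:
            snapped.append(t)
    return snapped
```

For example, with detected actions at 1.0 s, 2.5 s, and 4.0 s, onsets at 0.9 s and 2.6 s would move to 1.0 s and 2.5 s, while an onset at 5.0 s would stay where it is because no action falls within the tolerance.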

Assignees

Meta Platforms Inc

Inventors

Classifications

  • of characters, e.g. humans, animals or virtual beings

  • Creating or editing images; Combining images with text

  • using neural networks

  • Labelling scene content, e.g. deriving syntactic or semantic representations

  • involving special video data, e.g. 3D video

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Meta Platforms Inc
What technology area does this patent fall under?
Primary CPC classification H04N21/43072. Mapped technology areas include Electricity.
When was this patent published?
Published Apr 9, 2026 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).