US2026101081A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2026101081-A1 |
| Application number | US-202519348747-A |
| Country | US |
| Kind code | A1 |
| Filing date | Oct 2, 2025 |
| Priority date | Oct 3, 2024 |
| Publication date | Apr 9, 2026 |
| Grant date | — |
Official abstract text for this publication.
A system and method to generate a video are provided. The method may include generating, based on a user input including a description of a desired video, a structured script including one or more of scene descriptions, dialogue, or explicit shot-level information. The method also includes generating, based on the structured script, a sequence of video frames representing one or more scenes. The method further includes generating, based on the structured script and the sequence of video frames, an audio track including one or more of ambient sounds, sound effects, or music, the generated audio track being temporally synchronized with the sequence of video frames. The method also includes combining the sequence of video frames with the audio track to generate a synchronized video output representing the desired video.
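The four stages named in the abstract (script, frames, audio, combination) can be pictured as a simple data flow. The following is a minimal sketch only, not the applicant's implementation: every class and function name (`Scene`, `StructuredScript`, `generate_script`, `generate_frames`, `generate_audio`, `combine`) is hypothetical, and each generator is a deterministic stub standing in for a generative model the publication does not specify here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Scene:
    description: str
    dialogue: List[str] = field(default_factory=list)
    shot: str = ""            # shot-level information, e.g. "wide shot"
    duration_s: float = 2.0   # assumed fixed scene length for this sketch


@dataclass
class StructuredScript:
    scenes: List[Scene]


@dataclass
class Frame:
    scene_index: int
    timestamp_s: float


@dataclass
class AudioEvent:
    label: str
    start_s: float
    end_s: float


def generate_script(description: str) -> StructuredScript:
    """Stub script generator: one scene per sentence of the user input."""
    sentences = [s.strip() for s in description.split(".") if s.strip()]
    return StructuredScript([Scene(description=s) for s in sentences])


def generate_frames(script: StructuredScript, fps: int = 4) -> List[Frame]:
    """Stub video generator: placeholder frames carrying only timestamps."""
    frames: List[Frame] = []
    scene_start = 0.0
    for i, scene in enumerate(script.scenes):
        count = int(scene.duration_s * fps)
        frames.extend(Frame(i, scene_start + k / fps) for k in range(count))
        scene_start += scene.duration_s
    return frames


def generate_audio(script: StructuredScript, frames: List[Frame]) -> List[AudioEvent]:
    """Stub audio generator: one ambient event per scene, starting at the
    timestamp of that scene's first frame so audio follows the video timing."""
    events: List[AudioEvent] = []
    for i, scene in enumerate(script.scenes):
        times = [f.timestamp_s for f in frames if f.scene_index == i]
        if times:
            start = min(times)
            events.append(AudioEvent(f"ambient: {scene.description}",
                                     start, start + scene.duration_s))
    return events


def combine(frames: List[Frame], audio: List[AudioEvent]) -> Dict[str, object]:
    """Bundle frames and audio events into one output record."""
    return {"frames": frames, "audio": audio}


def generate_video(description: str) -> Dict[str, object]:
    script = generate_script(description)
    frames = generate_frames(script)
    audio = generate_audio(script, frames)   # conditioned on script and frames
    return combine(frames, audio)
```

Calling `generate_video("A dog runs on a beach. Waves crash on the shore.")` yields two scenes' worth of placeholder frames and one audio event per scene. The point of the sketch is the ordering of dependencies: the audio stage receives both the script and the frames, which is what allows its timing to follow the video.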
Opening claim text (preview).
What is claimed is:

1. A method to generate a video, comprising: receiving a user input comprising a description of a desired video; generating, based on the user input prompt, a structured script comprising one or more of scene descriptions, dialogue, or explicit shot-level information; generating, based on the structured script, a sequence of video frames representing one or more scenes; generating, based on the structured script and the sequence of video frames, an audio track comprising one or more of ambient sounds, sound effects, or music, the generated audio track being temporally synchronized with the sequence of video frames; and combining the sequence of video frames with the audio track to generate a synchronized video output representing the desired video.

2. The method of claim 1, wherein generating the structured script comprises generating a multi-scene screenplay comprising scene-by-scene breakdowns of one or more of characters, environments, or actions.

3. The method of claim 1, wherein the sequence of video frames is generated via a video foundation model trained jointly on text-to-image and text-to-video tasks.

4. The method of claim 1, wherein generating the sequence of video frames comprises encoding temporal dynamics including object motion, camera motion, and/or subject-object interactions.

5. The method of claim 1, wherein the audio track is generated via an audio generative model conditioned on both textual and visual input to align generated sound events with corresponding visual events.

6. The method of claim 1, wherein combining the sequence of video frames with the audio track comprises synchronizing onset times of sound effects with detected actions in the video frames.

7. The method of claim 1, further comprising: editing the structured script in response to a second user prompt prior to generating the sequence of video frames.

8. The method of claim 1, wherein the desired video is generated at a base resolution, and the method further comprises: applying a spatial upsampler to the video to generate a high-definition video output.

9. The method of claim 1, wherein: the first generative model comprises a large language model trained on screenplay data; and the first generative model is configured to output the structured script in a machine-readable format comprising one or more of scene headers, action descriptions, dialogue lines, or camera directives.

10. An apparatus to generate a video, comprising: one or more processors; and one or more memories coupled with the one or more processors and storing instructions that, when executed by the one or more processors, cause the apparatus to: receive a user input comprising a description of a desired video; generate, based on the user input prompt, a structured script comprising one or more of scene descriptions, dialogue, or explicit shot-level information; generate, based on the structured script, a sequence of video frames representing one or more scenes; generate, based on the structured script and the sequence of video frames, an audio track comprising one or more of ambient sounds, sound effects, or music, the generated audio track being temporally synchronized with the sequence of video frames; and combine the sequence of video frames with the audio track to generate a synchronized video output representing the desired video.

11. The apparatus of claim 10, wherein when the one or more processors further execute the instructions, the apparatus is configured to: generate the structured script by generating a multi-scene screenplay comprising scene-by-scene breakdowns of one or more of characters, environments, or actions.

12. The apparatus of claim 10, wherein the sequence of video frames is generated via a video foundation model trained jointly on text-to-image and text-to-video tasks.

13. The apparatus of claim 10, wherein when the one or more processors further execute the instructions, the apparatus is further configured to: generate the sequence of video frames further causes the apparatus to encode temporal dynamics including object motion, camera motion, and/or subject-object interactions.

14. The apparatus of claim 10, wherein the audio track is generated via an audio generative model conditioned on both textual and visual input to align generated sound events with corresponding visual events.

15. The apparatus of claim 10, wherein when the one or more processors further execute the instructions, the apparatus is further configured to: combine the sequence of video frames with the audio track; and synchronize onset times of sound effects with detected actions in the video frames.

16. The apparatus of claim 10, wherein when the one or more processors further execute the instructions, the apparatus is configured to: edit the structured script in response to a second user prompt prior to the generate the sequence of video frames.

17. The apparatus of claim 10, wherein the desired video is generated at a base resolution, and wherein when the one or more processors further execute the instructions, the apparatus is configured to: apply a spatial upsampler to the video to generate a high-definition video output.

18. The apparatus of claim 10, wherein: the first generative model comprises a large language model trained on screenplay data; and the first generative model is configured to output the structured script in a machine-readable format comprising one or more of scene headers, action descriptions, dialogue lines, or camera directives.

19. A non-transitory computer-readable medium storing instructions that, when executed, cause: receiving a user input comprising a description of a desired video; generating, based on the user input prompt, a structured script comprising one or more of scene descriptions, dialogue, or explicit shot-level information; generating, based on the structured script, a sequence of video frames representing one or more scenes; generating, based on the structured script and the sequence of video frames, an audio track comprising one or more of ambient sounds, sound effects, or music, the generated audio track being temporally synchronized with the sequence of video frames; and combining the sequence of video frames with the audio track to generate a synchronized video output representing the desired video.

20. The non-transitory computer-readable medium of claim 19, wherein the instructions, when executed, further cause: generating the structured script to generate a multi-scene screenplay comprising scene-by-scene breakdowns of one or more of characters, environments, or actions.
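Claims 6 and 15 recite synchronizing the onset times of sound effects with actions detected in the video frames, without stating how. One plausible reading, offered here purely as an illustration, is to snap each sound effect's onset to the nearest detected action carrying the same label, provided the shift is small. The names `DetectedAction`, `SoundEffect`, `align_onsets`, and the `max_shift_s` tolerance are all assumptions introduced for this sketch and do not appear in the publication.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DetectedAction:
    label: str
    frame_index: int   # frame at which the action was detected


@dataclass
class SoundEffect:
    label: str
    onset_s: float     # onset time proposed by the audio generator


def align_onsets(effects: List[SoundEffect],
                 actions: List[DetectedAction],
                 fps: float,
                 max_shift_s: float = 0.5) -> List[SoundEffect]:
    """Snap each effect onset to the nearest same-label action time.

    An effect is left unchanged when no action shares its label or when
    the nearest one is farther away than max_shift_s (assumed tolerance).
    """
    aligned: List[SoundEffect] = []
    for fx in effects:
        # Convert matching action frames to seconds.
        times = [a.frame_index / fps for a in actions if a.label == fx.label]
        if times:
            nearest = min(times, key=lambda t: abs(t - fx.onset_s))
            if abs(nearest - fx.onset_s) <= max_shift_s:
                aligned.append(SoundEffect(fx.label, nearest))
                continue
        aligned.append(fx)
    return aligned


actions = [DetectedAction("door_slam", 48), DetectedAction("footstep", 12)]
effects = [SoundEffect("door_slam", 2.2),
           SoundEffect("footstep", 1.4),
           SoundEffect("thunder", 3.0)]
result = align_onsets(effects, actions, fps=24)
```

With a frame rate of 24, the door slam detected at frame 48 occurs at 2.0 s, so the 2.2 s onset is moved to 2.0 s. The footstep onset is 0.9 s away from its detected action and the thunder effect has no matching action, so both are left as generated. A tolerance of this kind keeps a mislabelled detection from dragging a sound far from where the audio model placed it.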
of characters, e.g. humans, animals or virtual beings · CPC title
Creating or editing images; Combining images with text · CPC title
using neural networks · CPC title
Labelling scene content, e.g. deriving syntactic or semantic representations · CPC title
involving special video data, e.g. 3D video · CPC title