Video synthesis via multimodal conditioning

US12375766B2 · US · B2

Patent metadata
Publication number: US-12375766-B2
Application number: US-202217957312-A
Country: US
Kind code: B2
Filing date: Sep 30, 2022
Priority date: Feb 14, 2022
Publication date: Jul 29, 2025
Grant date: Jul 29, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A multimodal video generation framework (MMVID) that benefits from text and images provided jointly or separately as input. Quantized representations of videos are utilized with a bidirectional transformer with multiple modalities as inputs to predict a discrete video representation. A new video token trained with self-learning and an improved mask-prediction algorithm for sampling video tokens is used to improve video quality and consistency. Text augmentation is utilized to improve the robustness of the textual representation and diversity of generated videos. The framework incorporates various visual modalities, such as segmentation masks, drawings, and partially occluded images. In addition, the MMVID extracts visual information as suggested by a textual prompt.
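The abstract names a mask-prediction algorithm for sampling video tokens but does not spell it out on this page. The following is a minimal sketch, assuming the basic mask-predict schedule known from non-autoregressive decoding rather than the improved variant the patent refers to: all positions start masked, every round predicts the masked positions in parallel, and the least confident predictions are masked again. `MASK`, `mask_predict`, and `toy_predict` are hypothetical names, and the toy predictor only stands in for the bidirectional transformer.

```python
import random
from typing import Callable, List, Tuple

MASK = -1  # hypothetical id standing in for a [MASK] token

# Hypothetical predictor interface: for each position, return (token, confidence).
Predictor = Callable[[List[int]], List[Tuple[int, float]]]


def mask_predict(predict: Predictor, length: int, steps: int) -> List[int]:
    """Fill `length` masked positions over `steps` rounds of parallel prediction."""
    tokens = [MASK] * length
    confidence = [0.0] * length
    for step in range(steps):
        # Predict every position in parallel, but only fill the masked ones.
        for i, (token, prob) in enumerate(predict(tokens)):
            if tokens[i] == MASK:
                tokens[i], confidence[i] = token, prob
        # Linearly shrink the number of positions to re-mask; zero on the last round.
        n_remask = length * (steps - step - 1) // steps
        # Re-mask the least confident positions so they are predicted again.
        for i in sorted(range(length), key=lambda j: confidence[j])[:n_remask]:
            tokens[i], confidence[i] = MASK, 0.0
    return tokens


def toy_predict(tokens: List[int]) -> List[Tuple[int, float]]:
    """Stand-in for a trained transformer (assumption): fixed tokens, random confidence."""
    rng = random.Random(tokens.count(MASK))
    return [(i % 8, rng.random()) for i in range(len(tokens))]


video_tokens = mask_predict(toy_predict, length=16, steps=4)
print(video_tokens)
```

Because every masked position is refilled at the start of each round and nothing is re-masked in the final round, the returned sequence contains no masked positions. A real sampler would replace `toy_predict` with a model conditioned on the text and visual controls.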

First claim

Opening claim text (preview).

What is claimed is:
1. A conditional video synthesis method, the method comprising: accessing a multimodal video generation framework (MMVID) comprising a pretrained autoencoder, a language model, a mask-predict algorithm, and a pretrained bidirectional transformer; receiving multimodal input signals; and generating a video by applying the MMVID to the multimodal input signals.
2. The method of claim 1, wherein the multimodal input signals comprise a visual control and a textual control.
3. The method of claim 2, wherein the MMVID is a two-stage video generation framework comprising a first stage and a second stage, the method further comprising: quantizing the visual control in the first stage using the pretrained autoencoder; and predicting a video token in the second stage from the multimodal input signals using the pretrained bidirectional transformer.
4. The method of claim 3, wherein the pretrained autoencoder comprises an encoder and a decoder and wherein the method further comprises obtaining a quantized representation of images using the pretrained autoencoder.
5. The method of claim 4, wherein the pretrained bidirectional transformer is non-autoregressive.
6. The method of claim 5, further comprising: pretraining a bidirectional transformer on video tokens by a masked sequence estimation, a relevance estimation, and a video estimation to generate the pretrained bidirectional transformer.
7. The method of claim 1, wherein textual control and visual control are produced by text augmentation of input text by the language model.
8. The method of claim 7, wherein the textual control and the visual control are independent.
9. The method of claim 7, wherein the textual control and the visual control are dependent and wherein the MMVID extracts visual information from the visual control as suggested by the textual control.
10. The method of claim 7, wherein the visual control consists of a combination of images and videos.
11. The method of claim 7, wherein generating the video is done by video interpolation.
12. The method of claim 7, wherein generating the temporally video is done by video extrapolation.
13. A system, comprising; a processor; and a memory storing instructions that, when executed by the processor, configure the system to perform operations comprising; accessing a multimodal video generation framework (MMVID) comprising a pretrained autoencoder, a language model, a mask-predict algorithm, and a pretrained non-autoregressive bidirectional transformer; receiving multimodal input signals; and generating a video by applying the MMVID to the multimodal input signals.
14. The system of claim 13, wherein the pretrained autoencoder comprises an encoder and a decoder, wherein the pretrained autoencoder is configured to obtain a quantized representation of images, and the pretrained non-autoregressive bidirectional transformer is pretrained on video tokens by a masked sequence estimation, a relevance estimation, and a video estimation.
15. The system of claim 13, wherein the multimodal input signals comprise a visual control and a textual control, wherein the textual control is produced by text augmentation of input text by the language model, wherein the textual control and the visual control are independent.
16. The system of claim 13, wherein the multimodal input signals comprise a visual control and a textual control, wherein the textual control is produced by text augmentation of input text by the language model, wherein the textual control and the visual control are dependent.
17. A non-transitory computer-readable storage medium including instruction that when executed by a processor perform operations comprising: accessing a multimodal video generation framework (MMVID) comprising a pretrained autoencoder, a language model, a mask-predict algorithm, and a pretrained non-autoregressive bidirectional transformer; receiving multimodal input signals; and generating a video by applying the MMVID to the multimodal input signals.
18. The non-transitory computer-readable storage medium of claim 17, wherein the pretrained autoencoder comprises an encoder and a decoder, wherein the pretrained autoencoder is configured to obtain a quantized representation for images, and the pretrained non-autoregressive bidirectional transformer is pretrained on video tokens by a masked sequence estimation, a relevance estimation, and a video estimation.
19. The non-transitory computer-readable storage medium of claim 17, wherein the multimodal input signals comprise a visual control and a textual control, wherein the textual control is produced by text augmentation of input text by the language model, wherein the textual control and the visual control are independent.
20. The non-transitory computer-readable storage medium of claim 17, wherein the multimodal input signals comprise a visual control and a textual control, wherein the textual control is produced by text augmentation of input text by the language model, wherein the textual control and the visual control are dependent.
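Claims 3 and 4 describe a first stage in which the pretrained autoencoder quantizes the visual control into a discrete representation. The claims do not say how the quantizer works, so the sketch below assumes a nearest-neighbor codebook lookup in the style of vector-quantized autoencoders. The names `quantize`, `dequantize`, and `codebook`, and the two-dimensional feature vectors, are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each encoder feature to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


def dequantize(tokens: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Look up the codebook vector for each token, as a decoder input would need."""
    return [list(codebook[k]) for k in tokens]


# Hypothetical 3-entry codebook and three encoder outputs (assumed values).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
features = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8)]

tokens = quantize(features, codebook)
print(tokens)  # → [0, 1, 2]
```

In a two-stage setup of this kind, the integer tokens are what the second-stage transformer would predict, and the decoder would turn the looked-up vectors back into frames.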

Assignees

Inventors

Classifications

  • G06T11/00 · Primary

    Two-dimensional [2D] image generation · CPC title

  • involving operations for analysing video streams, e.g. detecting features or characteristics (television picture signal circuitry for scene change detection H04N5/147; filtering for image enhancement G06T5/00; methods or arrangements for recognising scenes G06V20/00; arrangements characterised by components specially adapted for monitoring, identification or recognition of video in broadcast systems H04H60/59) · CPC title

  • for manipulating displayed content, e.g. interacting with MPEG-4 objects, editing locally · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12375766B2 cover?
A multimodal video generation framework (MMVID) that benefits from text and images provided jointly or separately as input. Quantized representations of videos are utilized with a bidirectional transformer with multiple modalities as inputs to predict a discrete video representation. A new video token trained with self-learning and an improved mask-prediction algorithm for sampling video tokens…
Who is the assignee on this patent?
Barbieri Francesco, Han Ligong, Lee Hsin Ying, and 5 more
What technology area does this patent fall under?
Primary CPC classification G06T11/00. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 29, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).