Video captioning generation system and method

US12598360B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12598360-B2
Application number  US-202318314019-A
Country             US
Kind code           B2
Filing date         May 8, 2023
Priority date       May 8, 2023
Publication date    Apr 7, 2026
Grant date          Apr 7, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system and a method are provided that include a processor executing a caption generation program to receive an input video, sample video frames from the input video, extract video frames from the input video, extract video embeddings and audio embeddings from the video frames, including local video tokens and local audio tokens, respectively, input the local video tokens and the local audio tokens into at least a transformer layer of a cross-modal encoder to generate multi-modal embeddings, and generate video captions based on the multi-modal embeddings using a caption decoder.
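The "transformer layer of a cross-modal encoder" step can be pictured as cross-attention between the two token streams. The following is a minimal sketch, not the patented implementation: it follows the query/key/value roles stated in claim 5 (queries from one modality, keys and values from the other), but it assumes a single attention head with no learned projections, and the token values, the dimensions, and the function names `softmax` and `cross_attention` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention.

    ASSUMPTION: no learned projections; the same tokens serve as keys
    and values. A real transformer layer would add projections,
    multiple heads, residual connections, and feed-forward blocks.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        # Similarity of this query to every key.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, keys_values)) for d in range(dim)]
        )
    return outputs


# Toy local tokens (made-up values): 3 video tokens, 2 audio tokens, 4 dims each.
local_video = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
local_audio = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]

# Claim 5, first transformer: queries from audio, keys/values from video.
audio_attends_video = cross_attention(local_audio, local_video)
# Claim 5, second transformer: queries from video, keys/values from audio.
video_attends_audio = cross_attention(local_video, local_audio)

print(len(audio_attends_video), len(video_attends_audio))
```

Each output sequence has one vector per query token, so the audio-query pass yields two vectors and the video-query pass yields three. This is why swapping the query source changes which modality's sequence length is preserved.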

First claim

Opening claim text (preview).

The invention claimed is:

1. A video captioning generation system comprising: a processor and memory of a computing device, the processor being configured to execute a program using portions of memory to: receive an input video; extract video frames from the input video; extract video embeddings and audio embeddings from the video frames, the video embeddings including local video tokens and global video tokens, and the audio embeddings including local audio tokens and global audio tokens; input the local video tokens and the local audio tokens into at least a transformer layer of a cross-modal encoder to generate multi-modal embeddings; input the global video tokens, the global audio tokens, the local video tokens, and the local audio tokens into a global cross fusion module of the cross-modal encoder to generate fused local video tokens, fused global video tokens, fused local audio tokens, and fused global audio tokens; and generate video captions based on the multi-modal embeddings using a caption decoder, the multi-modal embeddings including the fused local video tokens, the fused global video tokens, the fused local audio tokens, and the fused global audio tokens.

2. The video captioning generation system of claim 1, wherein the generation of video captions is initiated by inputting one or more Beginning of Sentence (BOS) tokens into the caption decoder.

3. The video captioning generation system of claim 2, wherein the BOS tokens include: a first BOS token configured to initiate a prediction of current video caption tokens; and a second BOS token configured to initiate a prediction of next video caption tokens.

4. The video captioning generation system of claim 1, wherein the cross-modal encoder comprises a merged fusion module configured to: concatenate the local video tokens and local audio tokens; input the concatenated local video tokens and the concatenated local audio tokens into a first transformer and a second transformer, respectively; and output merged local video tokens and merged local audio tokens.

5. The video captioning generation system of claim 4, wherein keys and values of the first transformer are derived from the local video tokens, and queries of the first transformer are derived from the local audio tokens; and keys and values of the second transformer are derived from the local audio tokens, and queries of the second transformer are derived from the local video tokens.

6. The video captioning generation system of claim 1, wherein the global cross fusion module comprises a video transformer and an audio transformer; the video transformer receives the local video tokens and the global video tokens as queries, the local video tokens as keys, and the global audio tokens as values, and outputs the fused local video tokens and the fused global video tokens; and the audio transformer receives the local audio tokens and the global audio tokens as queries, the local audio tokens as keys, and the global video tokens as values, and outputs the fused local audio tokens and the fused global audio tokens.

7. The video captioning generation system of claim 1, wherein the cross-modal encoder comprises: a merged fusion module configured to: concatenate the local video tokens and local audio tokens, input the concatenated local video tokens and the concatenated local audio tokens into a first transformer and a second transformer, respectively, and output merged local video tokens and merged local audio tokens, wherein the merged local video tokens and the fused local video tokens are averaged to output averaged local video tokens; the merged local audio tokens and the fused local audio tokens are averaged to output averaged local audio tokens; and the averaged local video tokens and the averaged local audio tokens are iteratively inputted into a subsequent fusion layer of the cross-modal encoder.

8. The video captioning generation system of claim 1, wherein a cross-entropy loss function is used to calculate an audio-only decoder loss, a video-only decoder loss, and a multi-modal decoder loss based on the generated video captions and ground truth captions; and trainable parameters of the cross-modal encoder and the caption decoder are updated using the audio-only decoder loss, the video-only decoder loss, and the multi-modal decoder loss.

9. The video captioning generation system of claim 8, wherein an audio discrepancy index is calculated between the audio-only decoder loss and the multi-modal decoder loss; a video discrepancy index is calculated between the video-only decoder loss and the multi-modal decoder loss; and trainable parameters of the cross-modal encoder and the caption decoder are updated using the audio discrepancy index and the video discrepancy index.

10. A video captioning generation method comprising: receiving an input video; extracting video frames from the input video; extracting video embeddings and audio embeddings from the video frames, the video embeddings including local video tokens and global video tokens, and the audio embeddings including local audio tokens and global audio tokens, respectively; inputting the local video tokens and the local audio tokens into at least a transformer layer of a cross-modal encoder to generate multi-modal embeddings; performing global cross fusion on the global video tokens, the global audio tokens, the local video tokens, and the local audio token to generate fused local video tokens, fused global video tokens, fused local audio tokens, and fused global audio tokens; and generating video captions based on the multi-modal embeddings using a caption decoder, the multi-modal embeddings including the fused local video tokens, the fused global video tokens, the fused local audio tokens, and the fused global audio tokens.

11. The video captioning generation method of claim 10, further comprising: initiating the generation of video captions by inputting one or more Beginning of Sentence (BOS) tokens into the caption decoder.

12. The video captioning generation method of claim 11, further comprising: initiating a prediction of current video caption tokens by inputting a first BOS token of the one or more BOS tokens into the caption decoder; and initiating a prediction of next video caption tokens by inputting a second BOS token of the one or more BOS tokens into the caption decoder.

13. The video captioning generation method of claim 10, further comprising performing merged fusion by: concatenating the local video tokens and local audio tokens; inputting the concatenated local video tokens and the concatenated local audio tokens into a first transformer and a second transformer, respectively; and outputting merged local video tokens and merged local audio tokens.

14. The video captioning generation method of claim 10, wherein at a video transformer, the local video tokens and the global video tokens are received as queries, the local video tokens are received as keys, and the global audio tokens are received as values, and the fused local video tokens and the fused global video tokens are outputted by the video transformer; and at an audio transformer, the local audio tokens and the global audio tokens are received as queries, the local audio tokens as keys, and the global video tokens as values, and the fused local audio tokens and the fused global audio tokens are outputted by the audio transformer.

15. The video captioning generation method of claim 10, further comprising: performing merged fusion by: concatenating the local
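The training signals named in claims 8 and 9 can be sketched as follows. This is an illustration under stated assumptions, not the patent's formula: the claims say only that each discrepancy index is calculated "between" a single-modality decoder loss and the multi-modal decoder loss, so the plain difference used here is an assumption, as are the toy probabilities and the names `cross_entropy` and `decoder_losses`.

```python
import math


def cross_entropy(prob_rows, target_ids):
    """Mean negative log-likelihood of the ground-truth token per position."""
    picked = [row[t] for row, t in zip(prob_rows, target_ids)]
    return -sum(math.log(p) for p in picked) / len(target_ids)


def decoder_losses(audio_probs, video_probs, multimodal_probs, target_ids):
    """Compute the three decoder losses and two discrepancy indices."""
    audio_loss = cross_entropy(audio_probs, target_ids)
    video_loss = cross_entropy(video_probs, target_ids)
    multimodal_loss = cross_entropy(multimodal_probs, target_ids)
    return {
        "audio": audio_loss,
        "video": video_loss,
        "multimodal": multimodal_loss,
        # ASSUMPTION: discrepancy index taken as a simple difference;
        # the claim text shown on this page does not give the formula.
        "audio_discrepancy": audio_loss - multimodal_loss,
        "video_discrepancy": video_loss - multimodal_loss,
    }


# Toy example: 2 caption positions, vocabulary of 3 tokens (made-up values).
target_ids = [0, 2]
audio_probs = [[0.5, 0.25, 0.25], [0.25, 0.25, 0.5]]
video_probs = [[0.25, 0.5, 0.25], [0.5, 0.25, 0.25]]
multimodal_probs = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]

losses = decoder_losses(audio_probs, video_probs, multimodal_probs, target_ids)
for name, value in losses.items():
    print(f"{name}: {value:.4f}")
```

In this toy setup the multi-modal decoder assigns the highest probability to the ground-truth tokens, so both discrepancy values come out positive. How the claimed system weights or combines these quantities when updating parameters is not specified in the preview above.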

Assignees

Lemon Inc

Inventors

Classifications

  • involving embedding information at multiplex stream level, e.g. embedding a watermark at packet level · CPC title

  • Generation or processing of descriptive data, e.g. content descriptors {(systems specially adapted for using meta-information in broadcast systems H04H60/73)} · CPC title

  • for displaying subtitles · CPC title

  • involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream (arrangements characterised by components specially adapted for monitoring, identification or recognition of video in broadcast systems H04H60/59) · CPC title

  • involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams (arrangements characterised by components specially adapted for monitoring, identification or recognition of audio in broadcast systems H04H60/58) · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12598360B2 cover?
A system and a method are provided that include a processor executing a caption generation program to receive an input video, sample video frames from the input video, extract video frames from the input video, extract video embeddings and audio embeddings from the video frames, including local video tokens and local audio tokens, respectively, input the local video tokens and the local audio t…
Who is the assignee on this patent?
Lemon Inc
What technology area does this patent fall under?
Primary CPC classification H04N21/23892. Mapped technology areas include Electricity.
When was this patent published?
Publication date: Apr 7, 2026 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).