Method and System for Generating a Sequence of Actions for Controlling a Robot
US-2024288870-A1 · Aug 29, 2024 · US
US12598360B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12598360-B2 |
| Application number | US-202318314019-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 8, 2023 |
| Priority date | May 8, 2023 |
| Publication date | Apr 7, 2026 |
| Grant date | Apr 7, 2026 |
Official abstract text for this publication.
A system and a method are provided that include a processor executing a caption generation program to receive an input video, sample video frames from the input video, extract video frames from the input video, extract video embeddings and audio embeddings from the video frames, including local video tokens and local audio tokens, respectively, input the local video tokens and the local audio tokens into at least a transformer layer of a cross-modal encoder to generate multi-modal embeddings, and generate video captions based on the multi-modal embeddings using a caption decoder.
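The cross-modal step in the abstract, where local video tokens and local audio tokens pass through a transformer layer, can be illustrated with a minimal sketch of scaled dot-product cross-attention. It follows the query/key/value assignment recited in claim 5 (one transformer takes queries from audio and keys and values from video, the other the reverse). The function names, the toy token values, the single attention head, and the absence of learned projection weights are all assumptions made for illustration; the patent text here does not specify them.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query token attends over the key/value tokens.

    Simplification (assumption): no learned Q/K/V projections and a single head.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value tokens, one output token per query token.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


# Hypothetical toy embeddings: 3 local video tokens and 2 local audio tokens, dimension 4.
local_video = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
local_audio = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]

# First transformer as in claim 5: queries from audio, keys and values from video.
audio_attends_video = cross_attention(local_audio, local_video, local_video)

# Second transformer as in claim 5: queries from video, keys and values from audio.
video_attends_audio = cross_attention(local_video, local_audio, local_audio)
```

Each output has one token per query token, so `audio_attends_video` holds two tokens and `video_attends_audio` holds three. A production encoder would normally add learned projections, multiple heads, residual connections, and normalization, none of which are shown here.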
Opening claim text (preview).
The invention claimed is:

1. A video captioning generation system comprising: a processor and memory of a computing device, the processor being configured to execute a program using portions of memory to: receive an input video; extract video frames from the input video; extract video embeddings and audio embeddings from the video frames, the video embeddings including local video tokens and global video tokens, and the audio embeddings including local audio tokens and global audio tokens; input the local video tokens and the local audio tokens into at least a transformer layer of a cross-modal encoder to generate multi-modal embeddings; input the global video tokens, the global audio tokens, the local video tokens, and the local audio tokens into a global cross fusion module of the cross-modal encoder to generate fused local video tokens, fused global video tokens, fused local audio tokens, and fused global audio tokens; and generate video captions based on the multi-modal embeddings using a caption decoder, the multi-modal embeddings including the fused local video tokens, the fused global video tokens, the fused local audio tokens, and the fused global audio tokens.

2. The video captioning generation system of claim 1, wherein the generation of video captions is initiated by inputting one or more Beginning of Sentence (BOS) tokens into the caption decoder.

3. The video captioning generation system of claim 2, wherein the BOS tokens include: a first BOS token configured to initiate a prediction of current video caption tokens; and a second BOS token configured to initiate a prediction of next video caption tokens.

4. The video captioning generation system of claim 1, wherein the cross-modal encoder comprises a merged fusion module configured to: concatenate the local video tokens and local audio tokens; input the concatenated local video tokens and the concatenated local audio tokens into a first transformer and a second transformer, respectively; and output merged local video tokens and merged local audio tokens.

5. The video captioning generation system of claim 4, wherein keys and values of the first transformer are derived from the local video tokens, and queries of the first transformer are derived from the local audio tokens; and keys and values of the second transformer are derived from the local audio tokens, and queries of the second transformer are derived from the local video tokens.

6. The video captioning generation system of claim 1, wherein the global cross fusion module comprises a video transformer and an audio transformer; the video transformer receives the local video tokens and the global video tokens as queries, the local video tokens as keys, and the global audio tokens as values, and outputs the fused local video tokens and the fused global video tokens; and the audio transformer receives the local audio tokens and the global audio tokens as queries, the local audio tokens as keys, and the global video tokens as values, and outputs the fused local audio tokens and the fused global audio tokens.

7. The video captioning generation system of claim 1, wherein the cross-modal encoder comprises: a merged fusion module configured to: concatenate the local video tokens and local audio tokens, input the concatenated local video tokens and the concatenated local audio tokens into a first transformer and a second transformer, respectively, and output merged local video tokens and merged local audio tokens, wherein the merged local video tokens and the fused local video tokens are averaged to output averaged local video tokens; the merged local audio tokens and the fused local audio tokens are averaged to output averaged local audio tokens; and the averaged local video tokens and the averaged local audio tokens are iteratively inputted into a subsequent fusion layer of the cross-modal encoder.

8. The video captioning generation system of claim 1, wherein a cross-entropy loss function is used to calculate an audio-only decoder loss, a video-only decoder loss, and a multi-modal decoder loss based on the generated video captions and ground truth captions; and trainable parameters of the cross-modal encoder and the caption decoder are updated using the audio-only decoder loss, the video-only decoder loss, and the multi-modal decoder loss.

9. The video captioning generation system of claim 8, wherein an audio discrepancy index is calculated between the audio-only decoder loss and the multi-modal decoder loss; a video discrepancy index is calculated between the video-only decoder loss and the multi-modal decoder loss; and trainable parameters of the cross-modal encoder and the caption decoder are updated using the audio discrepancy index and the video discrepancy index.

10. A video captioning generation method comprising: receiving an input video; extracting video frames from the input video; extracting video embeddings and audio embeddings from the video frames, the video embeddings including local video tokens and global video tokens, and the audio embeddings including local audio tokens and global audio tokens, respectively; inputting the local video tokens and the local audio tokens into at least a transformer layer of a cross-modal encoder to generate multi-modal embeddings; performing global cross fusion on the global video tokens, the global audio tokens, the local video tokens, and the local audio token to generate fused local video tokens, fused global video tokens, fused local audio tokens, and fused global audio tokens; and generating video captions based on the multi-modal embeddings using a caption decoder, the multi-modal embeddings including the fused local video tokens, the fused global video tokens, the fused local audio tokens, and the fused global audio tokens.

11. The video captioning generation method of claim 10, further comprising: initiating the generation of video captions by inputting one or more Beginning of Sentence (BOS) tokens into the caption decoder.

12. The video captioning generation method of claim 11, further comprising: initiating a prediction of current video caption tokens by inputting a first BOS token of the one or more BOS tokens into the caption decoder; and initiating a prediction of next video caption tokens by inputting a second BOS token of the one or more BOS tokens into the caption decoder.

13. The video captioning generation method of claim 10, further comprising performing merged fusion by: concatenating the local video tokens and local audio tokens; inputting the concatenated local video tokens and the concatenated local audio tokens into a first transformer and a second transformer, respectively; and outputting merged local video tokens and merged local audio tokens.

14. The video captioning generation method of claim 10, wherein at a video transformer, the local video tokens and the global video tokens are received as queries, the local video tokens are received as keys, and the global audio tokens are received as values, and the fused local video tokens and the fused global video tokens are outputted by the video transformer; and at an audio transformer, the local audio tokens and the global audio tokens are received as queries, the local audio tokens as keys, and the global video tokens as values, and the fused local audio tokens and the fused global audio tokens are outputted by the audio transformer.

15. The video captioning generation method of claim 10, further comprising: performing merged fusion by: concatenating the local
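The training objective recited in claims 8 and 9 can be illustrated with a small sketch. Claim 8 states that a cross-entropy loss yields an audio-only decoder loss, a video-only decoder loss, and a multi-modal decoder loss; claim 9 states that discrepancy indices are calculated between each single-modality loss and the multi-modal loss. The claims shown here do not give a formula for the discrepancy index or say how the losses are combined, so the plain difference and the unweighted sum below are assumptions, as are all names and the toy probabilities.

```python
import math


def cross_entropy(predicted_probs, target_ids):
    """Mean negative log-likelihood of the ground-truth token at each caption position."""
    nll = [-math.log(probs[t]) for probs, t in zip(predicted_probs, target_ids)]
    return sum(nll) / len(nll)


# Hypothetical ground-truth caption as token ids over a 3-word vocabulary.
ground_truth = [0, 2, 1]

# Hypothetical per-position probability distributions from three decoder passes.
audio_only = [[0.5, 0.3, 0.2], [0.3, 0.3, 0.4], [0.3, 0.4, 0.3]]
video_only = [[0.7, 0.2, 0.1], [0.2, 0.2, 0.6], [0.2, 0.6, 0.2]]
multi_modal = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8], [0.1, 0.8, 0.1]]

# The three decoder losses named in claim 8.
audio_loss = cross_entropy(audio_only, ground_truth)
video_loss = cross_entropy(video_only, ground_truth)
multi_loss = cross_entropy(multi_modal, ground_truth)

# ASSUMPTION: the discrepancy index is taken as a plain difference of losses.
audio_discrepancy = audio_loss - multi_loss
video_discrepancy = video_loss - multi_loss

# ASSUMPTION: the three losses are combined as an unweighted sum.
total_loss = audio_loss + video_loss + multi_loss
```

In this toy setup the audio-only pass is the least confident about the ground-truth tokens, so its discrepancy from the multi-modal loss is the larger of the two. How such indices would actually weight the parameter updates is not stated in the claim text reproduced on this page.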
involving embedding information at multiplex stream level, e.g. embedding a watermark at packet level · CPC title
Generation or processing of descriptive data, e.g. content descriptors {(systems specially adapted for using meta-information in broadcast systems H04H60/73)} · CPC title
for displaying subtitles · CPC title
involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream (arrangements characterised by components specially adapted for monitoring, identification or recognition of video in broadcast systems H04H60/59) · CPC title
involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams (arrangements characterised by components specially adapted for monitoring, identification or recognition of audio in broadcast systems H04H60/58) · CPC title