What technology area does this patent fall under?

The primary CPC classification is G06F40/166 (Editing, e.g. inserting or deleting). The mapped technology area is Physics.

When was this patent published?

Publication date: Mar 24, 2026 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).

Multimodal video summarization

US12586374B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12586374-B2
Application number	US-202318328597-A
Country	US
Kind code	B2
Filing date	Jun 2, 2023
Priority date	Jun 2, 2023
Publication date	Mar 24, 2026
Grant date	Mar 24, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method includes receiving a video input and a text transcription of the video input. The video input includes a plurality of frames and the text transcription includes a plurality of sentences. The method further includes determining, by a multimodal summarization model, a subset of key frames of the plurality of frames and a subset of key sentences of the plurality of sentences. The method further includes providing a summary of the video input and a summary of the text transcription based on the subset of key frames and the subset of key sentences.
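As a rough illustration of the last two steps in the abstract, the sketch below picks a subset of key frames and key sentences from per-item importance scores. The abstract does not say how the subsets are chosen, so the top-k rule, the score values, and the names `select_top_k` and `summarize` are all assumptions made for this example. The scores stand in for whatever the multimodal summarization model would output.

```python
from typing import List, Sequence, Tuple


def select_top_k(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest scores, in original (temporal) order."""
    if k <= 0:
        return []
    # Rank indices by score, highest first (ties keep their original order).
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore temporal order so the summary reads chronologically.
    return sorted(ranked[:k])


def summarize(
    frame_scores: Sequence[float],
    sentences: Sequence[str],
    sentence_scores: Sequence[float],
    k_frames: int,
    k_sentences: int,
) -> Tuple[List[int], List[str]]:
    """Hypothetical summary step: key frame indices plus key sentences.

    Assumes the scores were already produced by a model; top-k selection
    is an illustrative choice, not something stated in the abstract.
    """
    key_frames = select_top_k(frame_scores, k_frames)
    key_sentence_ids = select_top_k(sentence_scores, k_sentences)
    return key_frames, [sentences[i] for i in key_sentence_ids]


if __name__ == "__main__":
    frames, text = summarize(
        frame_scores=[0.1, 0.9, 0.3, 0.8],
        sentences=["Intro.", "Main point.", "Closing remark."],
        sentence_scores=[0.2, 0.7, 0.5],
        k_frames=2,
        k_sentences=2,
    )
    print(frames, text)
```

Returning indices in temporal order is a design choice for this sketch: it keeps both summaries chronological regardless of score ranking.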

First claim

Opening claim text (preview).

We claim:
1. A method comprising: receiving a video input and a text transcription of the video input, wherein the video input includes a plurality of frames and the text transcription includes a plurality of sentences; generating a segment of a plurality of segments, the segment comprising a video embedding of a video frame of the plurality of frames and a text embedding of a sentence of the plurality of sentences, wherein the video embedding and the text embedding comprise positional information associated with the video frame and the sentence respectively, and wherein a number of video embeddings of the segment is based on a duration between a start time and an end time associated with the sentence of the text embedding; determining, by a multimodal summarization model that inputs the plurality of segments, a subset of key frames of the plurality of frames and a subset of key sentences of the plurality of sentences; and providing a summary of the video input and a summary of the text transcription based on the subset of key frames and the subset of key sentences.
2. The method of claim 1, further comprising: aligning one or more video embeddings corresponding to the plurality of frames with one or more text embeddings corresponding to the plurality of sentences in a temporal domain.
3. The method of claim 2, wherein the aligned one or more video embeddings and the one or more text embeddings is based on a start time and an end time associated with one or more sentences of the plurality of sentences.
4. The method of claim 2, further comprising: performing cross-attention on the aligned one or more video embeddings with the one or more text embeddings to fuse the one or more video embeddings and the one or more text embeddings in the temporal domain.
5. The method of claim 4, wherein the cross-attention is performed using an attention mask that attends one or more video-video embeddings, one or more text-text embeddings, and aligned one or more video embeddings and corresponding one or more text embeddings.
6. The method of claim 1, wherein the multimodal summarization model is trained using dual contrastive losses and a classification loss.
7. The method of claim 6, wherein a contrastive loss of the dual contrastive loss is an inter-sample contrastive loss determined using a first frame embedding determined from a first training video, a first text embedding determined from a first text transcription associated with the first training video, a second frame embedding determined from a second training video, and a second text embedding determined from a second text transcription associated with the second training video.
8. The method of claim 6, wherein a contrastive loss of the dual contrastive loss is an intra-sample contrastive loss determined using a first frame embedding determined from a first temporally aligned one or more frames and corresponding one or more text, a first text embedding determined from the first temporally aligned one or more frames and corresponding one or more text, a second frame embedding determined from a second temporally aligned one or more frames and corresponding one or more text, and a second text embedding determined from the second temporally aligned one or more frames and corresponding one or more text.
9. A non-transitory computer-readable medium storing executable instructions, which when executed by a processing device, cause the processing device to perform operations comprising: receiving a video input and a text transcription of the video input, wherein the video input includes a plurality of frames and the text transcription includes a plurality of sentences; generating a segment of a plurality of segments, the segment comprising a video embedding of a video frame of the plurality of frames and a text embedding of a sentence of the plurality of sentences, wherein the video embedding and the text embedding comprise positional information associated with the video frame and the sentence respectively, and wherein a number of video embeddings of the segment is based on a duration between a start time and an end time associated with the sentence of the text embedding; determining, by a multimodal summarization model that inputs the plurality of segments, a subset of key frames of the plurality of frames and a subset of key sentences of the plurality of sentences; and providing a summary of the video input and a summary of the text transcription based on the subset of key frames and the subset of key sentences.
10. The non-transitory computer-readable medium of claim 9, storing instructions that further cause the processing device to perform operations comprising: aligning one or more video embeddings corresponding to the plurality of frames with one or more text embeddings corresponding to the plurality of sentences in a temporal domain.
11. The non-transitory computer-readable medium of claim 10, wherein the aligned one or more video embeddings and the one or more text embeddings is based on a start time and an end time associated with one or more sentences of the plurality of sentences.
12. The non-transitory computer-readable medium of claim 10, storing instructions that further cause the processing device to perform operations comprising: performing cross-attention on the aligned one or more video embeddings with the one or more text embeddings to fuse the one or more video embeddings and the one or more text embeddings in the temporal domain.
13. The non-transitory computer-readable medium of claim 12, wherein the cross-attention is performed using an attention mask that attends one or more video-video embeddings, one or more text-text embeddings, and aligned one or more video embeddings and corresponding one or more text embeddings.
14. The non-transitory computer-readable medium of claim 9, wherein the multimodal summarization model is trained using dual contrastive losses.
15. The non-transitory computer-readable medium of claim 14, wherein: a first contrastive loss of the dual contrastive loss is an inter-sample contrastive loss determined using a first frame embedding determined from a first training video, a first text embedding determined from a first text transcription associated with the first training video, a second frame embedding determined from a second training video, and a second text embedding determined from a second text transcription associated with the second training video, and a second contrastive loss of the dual contrastive loss is an intra-sample contrastive loss determined using a first frame embedding determined from a first temporally aligned one or more frames and corresponding one or more text, a first text embedding determined from the first temporally aligned one or more frames and corresponding one or more text, a second frame embedding determined from a second temporally aligned one or more frames and corresponding one or more text, and a second text embedding determined from the second temporally aligned one or more frames and corresponding one or more text.
16. A system comprising: a memory component; and a processing device coupled to the memory component, the processing device to perform operations comprising: receiving a video input, a text query, and a text transcription of the video input, wherein the video input includes a plurality of frames, and the text transcription includes a plurality of sentences; generating a segment of a plurality of segments, the segment comprising a video embedding of a video frame of the plurality of frames, a
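The segment construction in claim 1 and the attention mask in claim 5 can be pictured with a small sketch. It is only one possible reading of the claim language, under several assumptions not found in the text: frames are assigned to a sentence when their timestamp falls in the half-open interval [start, end), tokens are laid out as all video tokens followed by all text tokens, and the names `frames_per_sentence` and `build_attention_mask` are hypothetical. Plain Python lists stand in for real embeddings.

```python
from typing import List, Sequence, Tuple


def frames_per_sentence(
    frame_times: Sequence[float],
    sentence_spans: Sequence[Tuple[float, float]],
) -> List[List[int]]:
    """For each sentence (start, end), list the frames with start <= t < end.

    The number of frames in a segment therefore depends on the duration
    of its sentence (assumed half-open interval).
    """
    return [
        [i for i, t in enumerate(frame_times) if start <= t < end]
        for start, end in sentence_spans
    ]


def build_attention_mask(num_frames: int, segments: Sequence[Sequence[int]]) -> List[List[bool]]:
    """Boolean mask over [video tokens..., text tokens...] (assumed layout).

    True means attention is allowed: video-video, text-text, and only
    those video-text pairs that share a segment.
    """
    num_text = len(segments)
    n = num_frames + num_text
    mask = [[False] * n for _ in range(n)]
    # Video tokens attend to all video tokens.
    for i in range(num_frames):
        for j in range(num_frames):
            mask[i][j] = True
    # Text tokens attend to all text tokens.
    for a in range(num_text):
        for b in range(num_text):
            mask[num_frames + a][num_frames + b] = True
    # Cross-modal attention only between temporally aligned pairs.
    for s, frame_ids in enumerate(segments):
        for f in frame_ids:
            mask[f][num_frames + s] = True
            mask[num_frames + s][f] = True
    return mask


if __name__ == "__main__":
    times = [0.0, 1.0, 2.0, 3.0, 4.0]   # one frame per second (made-up data)
    spans = [(0.0, 2.0), (2.0, 5.0)]    # two sentences with start/end times
    segs = frames_per_sentence(times, spans)
    for row in build_attention_mask(len(times), segs):
        print("".join("1" if cell else "0" for cell in row))
```

In this toy input the second sentence spans three seconds and so collects three frames, while the first collects two, which mirrors the claim that the number of video embeddings in a segment depends on the sentence's duration.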

Assignees

Adobe Inc

Inventors

Classifications

G06V10/776
Validation; Performance evaluation · CPC title
G06F40/40
Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title
G06F40/166 (Primary)
Editing, e.g. inserting or deleting · CPC title
G06V10/803
of input or preprocessed data · CPC title
G06V10/774
Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title

Patent family

Related publications grouped by family.

View patent family 93652475

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12586374B2 cover?: A method includes receiving a video input and a text transcription of the video input. The video input includes a plurality of frames and the text transcription includes a plurality of sentences. The method further includes determining, by a multimodal summarization model, a subset of key frames of the plurality of frames and a subset of key sentences of the plurality of sentences. The method f…
Who is the assignee on this patent?: Adobe Inc