US12340563B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12340563-B2 |
| Application number | US-202217742322-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 11, 2022 |
| Priority date | May 11, 2022 |
| Publication date | Jun 24, 2025 |
| Grant date | Jun 24, 2025 |
Official abstract text for this publication.
Embodiments are disclosed for correlating video sequences and audio sequences by a media recommendation system using a trained encoder network. In particular, in one or more embodiments, the disclosed systems and methods comprise receiving a training input including a media sequence, including a video sequence paired with an audio sequence, segmenting the media sequence into a set of video sequence segments and a set of audio sequence segments, extracting visual features for each video sequence segment and audio features for each audio sequence segment, generating, by transformer networks, contextualized visual features from the extracted visual features and contextualized audio features from the extracted audio features, the transformer networks including a visual transformer and an audio transformer, generating predicted video and audio sequence segment pairings based on the contextualized visual and audio features, and training the visual transformer and the audio transformer to generate the contextualized visual and audio features.
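The segmentation and pairing steps described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the function names (`split_into_segments`, `cosine_similarity`, `predict_pairings`) are hypothetical, cosine similarity is an assumed choice of similarity metric, and the plain feature vectors stand in for the contextualized features that the visual and audio transformers would produce.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def split_into_segments(samples: Sequence[float], num_segments: int) -> List[List[float]]:
    """Split a sequence into num_segments contiguous chunks of near-equal length.

    Using the same num_segments for video and audio yields equal segment counts.
    """
    if num_segments <= 0 or num_segments > len(samples):
        raise ValueError("num_segments must be between 1 and len(samples)")
    base, extra = divmod(len(samples), num_segments)
    segments, start = [], 0
    for i in range(num_segments):
        # The first `extra` segments absorb one leftover sample each.
        end = start + base + (1 if i < extra else 0)
        segments.append(list(samples[start:end]))
        start = end
    return segments


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity (an assumed metric; the source only says 'similarity metric')."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def predict_pairings(visual: List[Vector], audio: List[Vector]) -> List[int]:
    """For each video segment, return the index of the most similar audio segment."""
    pairings = []
    for v in visual:
        sims = [cosine_similarity(v, a) for a in audio]
        # Rank audio segments by similarity, highest first, and keep the top one.
        ranked = sorted(range(len(audio)), key=lambda j: sims[j], reverse=True)
        pairings.append(ranked[0])
    return pairings
```

For example, `predict_pairings([[1, 0], [0, 1]], [[0, 2], [3, 0]])` pairs the first video segment with the second audio segment and the second video segment with the first.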
Opening claim text (preview).
We claim:
1. A computer-implemented method comprising: receiving a training input including a media sequence, the media sequence including a video sequence paired with an audio sequence; segmenting the media sequence into a set of video sequence segments and a set of audio sequence segments; extracting visual features for each video sequence segment of the set of video sequence segments and audio features for each audio sequence segment of the set of audio sequence segments; generating, by transformer networks, contextualized visual features from the extracted visual features and contextualized audio features from the extracted audio features, the transformer networks including a visual transformer and an audio transformer; generating, using a similarity metric, predicted video sequence segment and audio sequence segment pairings based on the contextualized visual features and the contextualized audio features; ranking the predicted video sequence segment and audio sequence segment pairings based on their corresponding similarity values to identify mismatched pairings; and training the visual transformer to generate the contextualized visual features and the audio transformer to generate the contextualized audio features based on calculating a loss using ground truth pairings and the mismatched pairings.
2. The computer-implemented method of claim 1, wherein generating the contextualized visual features from the extracted visual features and the contextualized audio features from the extracted audio features comprises: generating, by the visual transformer of the transformer networks, the contextualized visual features from the extracted visual features, wherein contextualized visual features for a first video sequence segment of the set of video sequence segments are based on first visual features for the first video sequence segment and second visual features for one or more other video sequence segments in the set of video sequence segments; and generating, by the audio transformer of the transformer networks, the contextualized audio features from the extracted audio features, wherein contextualized audio features for a first audio sequence segment of the set of audio sequence segments are based on first audio features for the first audio sequence segment and second audio features for one or more other audio sequence segments in the set of audio sequence segments.
3. The computer-implemented method of claim 2, wherein generating the predicted video sequence segment and audio sequence segment pairings based on the contextualized visual features and the contextualized audio features comprises: for each video sequence segment of the set of video sequence segments: calculating a similarity value between the contextualized visual features of the video sequence segment and the contextualized audio features of each audio sequence segment of the set of audio sequence segments; ranking audio sequence segments of the set of audio sequence segments based on the calculated similarity value; and pairing the video sequence segment with an audio sequence segment having a largest similarity value.
4. The computer-implemented method of claim 2, wherein generating the predicted video sequence segment and audio sequence segment pairings based on the contextualized visual features and the contextualized audio features comprises: for each audio sequence segment of the set of audio sequence segments: calculating a similarity value between the contextualized audio features of the audio sequence segment and the contextualized visual features of each video sequence segment of the set of video sequence segments; ranking video sequence segments of the set of video sequence segments based on the calculated similarity value; and pairing the audio sequence segment with a video sequence segment having a largest similarity value.
5. The computer-implemented method of claim 1, wherein generating the contextualized visual features from the extracted visual features and the contextualized audio features from the extracted audio features comprises: providing an index of sequence segments to one of the visual transformer and the audio transformer, the index indicating an order of the sequence segments.
6. The computer-implemented method of claim 1, wherein segmenting the media sequence into the set of video sequence segments and the set of audio sequence segments comprises: segmenting the video sequence into a number of video sequence segments, wherein the number of video sequence segments is equal to a number of audio sequence segments segmented from the audio sequence.
7. A non-transitory computer-readable storage medium including instructions stored thereon which, when executed by at least one processor, cause the at least one processor to: receive a training input including a media sequence, the media sequence including a video sequence paired with an audio sequence; segment the media sequence into a set of video sequence segments and a set of audio sequence segments; extract visual features for each video sequence segment of the set of video sequence segments and audio features for each audio sequence segment of the set of audio sequence segments; generate, by transformer networks, contextualized visual features from the extracted visual features and contextualized audio features from the extracted audio features, the transformer networks including a visual transformer and an audio transformer; generate, using a similarity metric, predicted video sequence segment and audio sequence segment pairings based on the contextualized visual features and the contextualized audio features; rank the predicted video sequence segment and audio sequence segment pairings based on their corresponding similarity values to identify mismatched pairings; and train the visual transformer to generate the contextualized visual features and the audio transformer to generate the contextualized audio features based on calculating a loss using ground truth pairings and the mismatched pairings.
8. The non-transitory computer-readable storage medium of claim 7, wherein to generate the contextualized visual features from the extracted visual features and the contextualized audio features from the extracted audio features the at least one processor is further caused to: generate, by the visual transformer of the transformer networks, the contextualized visual features from the extracted visual features, wherein contextualized visual features for a first video sequence segment of the set of video sequence segments are based on first visual features for the first video sequence segment and second visual features for one or more other video sequence segments in the set of video sequence segments; and generate, by the audio transformer of the transformer networks, the contextualized audio features from the extracted audio features, wherein contextualized audio features for a first audio sequence segment of the set of audio sequence segments are based on first audio features for the first audio sequence segment and second audio features for one or more other audio sequence segments in the set of audio sequence segments.
9. The non-transitory computer-readable storage medium of claim 8, wherein to generate the predicted video sequence segment and audio sequence segment pairings based on the contextualized visual features and the contextualized audio features the at least one processor is further caused to: for each video sequence segment of the set of video sequence segments: calculate a similarity value between the contextualized visual features of the video sequence segment and the contextualized audio features of each audio sequence segment of the set of aud
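The ranking and loss steps recited in claims 1 and 7 can be sketched as follows. This is an illustration under stated assumptions, not the claimed method: the claims say only that a loss is calculated "using ground truth pairings and the mismatched pairings", so the margin-based hinge loss, the `margin` value, and the helper names (`rank_pairings`, `find_mismatches`, `margin_loss`) are all hypothetical choices. The input `sim[i][j]` is assumed to hold the similarity between video segment `i` and audio segment `j`.

```python
from typing import List, Sequence, Tuple


def rank_pairings(sim: Sequence[Sequence[float]]) -> List[Tuple[int, int, float]]:
    """Pair each video segment with its most similar audio segment,
    then rank the predicted pairings by similarity, highest first."""
    pairs = []
    for i, row in enumerate(sim):
        j = max(range(len(row)), key=lambda k: row[k])
        pairs.append((i, j, row[j]))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


def find_mismatches(
    ranked: Sequence[Tuple[int, int, float]], ground_truth: Sequence[int]
) -> List[Tuple[int, int]]:
    """Keep predicted pairings that disagree with the ground truth pairing."""
    return [(i, j) for i, j, _ in ranked if ground_truth[i] != j]


def margin_loss(
    sim: Sequence[Sequence[float]],
    ground_truth: Sequence[int],
    mismatches: Sequence[Tuple[int, int]],
    margin: float = 0.2,  # assumed hyperparameter, not from the source
) -> float:
    """Hinge loss (an assumption): penalize a mismatched pairing whose similarity
    is not at least `margin` below the ground truth pairing's similarity."""
    if not mismatches:
        return 0.0
    total = 0.0
    for i, j in mismatches:
        positive = sim[i][ground_truth[i]]  # similarity of the true pairing
        negative = sim[i][j]                # similarity of the mismatched pairing
        total += max(0.0, margin + negative - positive)
    return total / len(mismatches)
```

In a training loop, a loss of this kind would be backpropagated through both transformers so that true pairings score higher than mismatched ones; the sketch above stops at computing the scalar value.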
Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames · CPC title
for processing of video signals · CPC title
Proximity, similarity or dissimilarity measures · CPC title
characterised by the type of extracted parameters · CPC title
Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes · CPC title