Self-supervised audio-visual learning for correlating music and video

US12340563B2 · US · B2

Patent metadata
Publication number: US-12340563-B2
Application number: US-202217742322-A
Country: US
Kind code: B2
Filing date: May 11, 2022
Priority date: May 11, 2022
Publication date: Jun 24, 2025
Grant date: Jun 24, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments are disclosed for correlating video sequences and audio sequences by a media recommendation system using a trained encoder network. In particular, in one or more embodiments, the disclosed systems and methods comprise receiving a training input including a media sequence, including a video sequence paired with an audio sequence, segmenting the media sequence into a set of video sequence segments and a set of audio sequence segments, extracting visual features for each video sequence segment and audio features for each audio sequence segment, generating, by transformer networks, contextualized visual features from the extracted visual features and contextualized audio features from the extracted audio features, the transformer networks including a visual transformer and an audio transformer, generating predicted video and audio sequence segment pairings based on the contextualized visual and audio features, and training the visual transformer and the audio transformer to generate the contextualized visual and audio features.
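The segmentation step named in the abstract can be illustrated with a minimal sketch. It assumes the video is available as a list of frames and the audio as a list of samples, and it splits both into the same number of contiguous, time-aligned segments. The function names `split_into_segments` and `segment_media` and the equal-length chunking rule are illustrative assumptions, not taken from the patent text.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_into_segments(items: Sequence[T], num_segments: int) -> List[List[T]]:
    """Split a sequence into contiguous chunks of near-equal length.

    Hypothetical helper: the patent excerpt does not say how segment
    boundaries are chosen, so equal-length chunks are an assumption.
    """
    n = len(items)
    if num_segments <= 0 or num_segments > n:
        raise ValueError("num_segments must be between 1 and len(items)")
    # Integer boundaries; every chunk is non-empty because num_segments <= n.
    bounds = [i * n // num_segments for i in range(num_segments + 1)]
    return [list(items[bounds[i]:bounds[i + 1]]) for i in range(num_segments)]


def segment_media(
    video_frames: Sequence[T],
    audio_samples: Sequence[T],
    num_segments: int,
) -> Tuple[List[List[T]], List[List[T]]]:
    """Cut the video and the audio into the same number of segments.

    Segment i of the video and segment i of the audio cover the same
    fraction of the timeline, so index i is the ground-truth pairing.
    """
    video_segments = split_into_segments(video_frames, num_segments)
    audio_segments = split_into_segments(audio_samples, num_segments)
    return video_segments, audio_segments
```

For example, calling `segment_media(frames, samples, 4)` on 12 frames and 48 samples would yield four video segments of 3 frames and four audio segments of 12 samples each. Feature extraction and the transformer networks would then operate on these per-segment chunks.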

First claim

Opening claim text (preview).

We claim:

1. A computer-implemented method comprising: receiving a training input including a media sequence, the media sequence including a video sequence paired with an audio sequence; segmenting the media sequence into a set of video sequence segments and a set of audio sequence segments; extracting visual features for each video sequence segment of the set of video sequence segments and audio features for each audio sequence segment of the set of audio sequence segments; generating, by transformer networks, contextualized visual features from the extracted visual features and contextualized audio features from the extracted audio features, the transformer networks including a visual transformer and an audio transformer; generating, using a similarity metric, predicted video sequence segment and audio sequence segment pairings based on the contextualized visual features and the contextualized audio features; ranking the predicted video sequence segment and audio sequence segment pairings based on their corresponding similarity values to identify mismatched pairings; and training the visual transformer to generate the contextualized visual features and the audio transformer to generate the contextualized audio features based on calculating a loss using ground truth pairings and the mismatched pairings.

2. The computer-implemented method of claim 1, wherein generating the contextualized visual features from the extracted visual features and the contextualized audio features from the extracted audio features comprises: generating, by the visual transformer of the transformer networks, the contextualized visual features from the extracted visual features, wherein contextualized visual features for a first video sequence segment of the set of video sequence segments are based on first visual features for the first video sequence segment and second visual features for one or more other video sequence segments in the set of video sequence segments; and generating, by the audio transformer of the transformer networks, the contextualized audio features from the extracted audio features, wherein contextualized audio features for a first audio sequence segment of the set of audio sequence segments are based on first audio features for the first audio sequence segment and second audio features for one or more other audio sequence segments in the set of audio sequence segments.

3. The computer-implemented method of claim 2, wherein generating the predicted video sequence segment and audio sequence segment pairings based on the contextualized visual features and the contextualized audio features comprises: for each video sequence segment of the set of video sequence segments: calculating a similarity value between the contextualized visual features of the video sequence segment and the contextualized audio features of each audio sequence segment of the set of audio sequence segments; ranking audio sequence segments of the set of audio sequence segments based on the calculated similarity value; and pairing the video sequence segment with an audio sequence segment having a largest similarity value.

4. The computer-implemented method of claim 2, wherein generating the predicted video sequence segment and audio sequence segment pairings based on the contextualized visual features and the contextualized audio features comprises: for each audio sequence segment of the set of audio sequence segments: calculating a similarity value between the contextualized audio features of the audio sequence segment and the contextualized visual features of each video sequence segment of the set of video sequence segments; ranking video sequence segments of the set of video sequence segments based on the calculated similarity value; and pairing the audio sequence segment with a video sequence segment having a largest similarity value.

5. The computer-implemented method of claim 1, wherein generating the contextualized visual features from the extracted visual features and the contextualized audio features from the extracted audio features comprises: providing an index of sequence segments to one of the visual transformer and the audio transformer, the index indicating an order of the sequence segments.

6. The computer-implemented method of claim 1, wherein segmenting the media sequence into the set of video sequence segments and the set of audio sequence segments comprises: segmenting the video sequence into a number of video sequence segments, wherein the number of video sequence segments is equal to a number of audio sequence segments segmented from the audio sequence.

7. A non-transitory computer-readable storage medium including instructions stored thereon which, when executed by at least one processor, cause the at least one processor to: receive a training input including a media sequence, the media sequence including a video sequence paired with an audio sequence; segment the media sequence into a set of video sequence segments and a set of audio sequence segments; extract visual features for each video sequence segment of the set of video sequence segments and audio features for each audio sequence segment of the set of audio sequence segments; generate, by transformer networks, contextualized visual features from the extracted visual features and contextualized audio features from the extracted audio features, the transformer networks including a visual transformer and an audio transformer; generate, using a similarity metric, predicted video sequence segment and audio sequence segment pairings based on the contextualized visual features and the contextualized audio features; rank the predicted video sequence segment and audio sequence segment pairings based on their corresponding similarity values to identify mismatched pairings; and train the visual transformer to generate the contextualized visual features and the audio transformer to generate the contextualized audio features based on calculating a loss using ground truth pairings and the mismatched pairings.

8. The non-transitory computer-readable storage medium of claim 7, wherein to generate the contextualized visual features from the extracted visual features and the contextualized audio features from the extracted audio features the at least one processor is further caused to: generate, by the visual transformer of the transformer networks, the contextualized visual features from the extracted visual features, wherein contextualized visual features for a first video sequence segment of the set of video sequence segments are based on first visual features for the first video sequence segment and second visual features for one or more other video sequence segments in the set of video sequence segments; and generate, by the audio transformer of the transformer networks, the contextualized audio features from the extracted audio features, wherein contextualized audio features for a first audio sequence segment of the set of audio sequence segments are based on first audio features for the first audio sequence segment and second audio features for one or more other audio sequence segments in the set of audio sequence segments.

9. The non-transitory computer-readable storage medium of claim 8, wherein to generate the predicted video sequence segment and audio sequence segment pairings based on the contextualized visual features and the contextualized audio features the at least one processor is further caused to: for each video sequence segment of the set of video sequence segments: calculate a similarity value between the contextualized visual features of the video sequence segment and the contextualized audio features of each audio sequence segment of the set of aud
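The pairing and loss steps recited in claims 1 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: cosine similarity stands in for the unspecified "similarity metric", the contextualized features are plain lists of floats rather than transformer outputs, ground truth pairs video segment i with audio segment i, and the hinge-style margin loss (with a hypothetical `margin` of 0.2) is one common choice that the claim text does not name. All function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine similarity; an assumed stand-in for the claimed similarity metric."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def similarity_matrix(visual: Sequence[Vector], audio: Sequence[Vector]) -> List[List[float]]:
    """sim[i][j] is the similarity of video segment i and audio segment j."""
    return [[cosine_similarity(v, a) for a in audio] for v in visual]


def rank_audio_for_video(sim_row: Sequence[float]) -> List[int]:
    """Audio segment indices sorted from most to least similar."""
    return sorted(range(len(sim_row)), key=lambda j: sim_row[j], reverse=True)


def predict_pairings(sim: Sequence[Sequence[float]]) -> List[int]:
    """Pair each video segment with the audio segment of largest similarity."""
    return [rank_audio_for_video(row)[0] for row in sim]


def hardest_mismatch(sim_row: Sequence[float], true_index: int) -> int:
    """Highest-ranked audio segment that is not the ground-truth partner."""
    for j in rank_audio_for_video(sim_row):
        if j != true_index:
            return j
    raise ValueError("need at least two audio segments")


def margin_loss(sim: Sequence[Sequence[float]], margin: float = 0.2) -> float:
    """Mean hinge loss over video segments (an assumed loss form).

    Penalizes a row when the best mismatched pairing scores within
    `margin` of, or above, the ground-truth pairing sim[i][i].
    """
    total = 0.0
    for i, row in enumerate(sim):
        neg = hardest_mismatch(row, i)
        total += max(0.0, margin + row[neg] - row[i])
    return total / len(sim)
```

In a training loop, the loss value would be backpropagated through the visual and audio transformers so that ground-truth pairings score higher than the mismatched pairings surfaced by the ranking. Claim 4 describes the mirrored direction, which under the same assumptions amounts to applying the same functions to the transposed similarity matrix.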

Assignees

Adobe Inc

Inventors

Classifications

  • Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames · CPC title

  • for processing of video signals · CPC title

  • Proximity, similarity or dissimilarity measures · CPC title

  • characterised by the type of extracted parameters · CPC title

  • Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12340563B2 cover?
Embodiments are disclosed for correlating video sequences and audio sequences by a media recommendation system using a trained encoder network. In particular, in one or more embodiments, the disclosed systems and methods comprise receiving a training input including a media sequence, including a video sequence paired with an audio sequence, segmenting the media sequence into a set of video sequ…
Who is the assignee on this patent?
Adobe Inc
What technology area does this patent fall under?
Primary CPC classification G06V10/774. Mapped technology areas include Physics.
When was this patent published?
Publication date Jun 24, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).