What technology area does this patent fall under?

Primary CPC classification G06V10/774. Mapped technology areas include Physics.

When was this patent published?

Publication date: Jun 24, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Self-supervised audio-visual learning for correlating music and video

US12340563B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12340563-B2
Application number	US-202217742322-A
Country	US
Kind code	B2
Filing date	May 11, 2022
Priority date	May 11, 2022
Publication date	Jun 24, 2025
Grant date	Jun 24, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments are disclosed for correlating video sequences and audio sequences by a media recommendation system using a trained encoder network. In particular, in one or more embodiments, the disclosed systems and methods comprise receiving a training input including a media sequence, including a video sequence paired with an audio sequence, segmenting the media sequence into a set of video sequence segments and a set of audio sequence segments, extracting visual features for each video sequence segment and audio features for each audio sequence segment, generating, by transformer networks, contextualized visual features from the extracted visual features and contextualized audio features from the extracted audio features, the transformer networks including a visual transformer and an audio transformer, generating predicted video and audio sequence segment pairings based on the contextualized visual and audio features, and training the visual transformer and the audio transformer to generate the contextualized visual and audio features.
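
The segmentation step in the abstract can be illustrated with a minimal sketch. It assumes the video is available as a list of frames and the audio as a list of samples, and that both are cut into the same number of contiguous segments so that segment i of the video lines up with segment i of the audio. The helper names `split_evenly` and `segment_media` and the fixed segment count are hypothetical; the patent text shown here does not specify how segment boundaries are chosen.

```python
from typing import List, Sequence, Tuple


def split_evenly(items: Sequence, num_segments: int) -> List[list]:
    """Split a sequence into num_segments contiguous, non-empty chunks."""
    if num_segments <= 0 or num_segments > len(items):
        raise ValueError("num_segments must be between 1 and len(items)")
    # Integer boundaries; floor division keeps every chunk non-empty.
    bounds = [i * len(items) // num_segments for i in range(num_segments + 1)]
    return [list(items[bounds[i]:bounds[i + 1]]) for i in range(num_segments)]


def segment_media(
    frames: Sequence, samples: Sequence, num_segments: int
) -> List[Tuple[list, list]]:
    """Return aligned (video_segment, audio_segment) pairs (illustrative only)."""
    video_segments = split_evenly(frames, num_segments)
    audio_segments = split_evenly(samples, num_segments)
    # Equal segment counts mean index i pairs video segment i with audio segment i.
    return list(zip(video_segments, audio_segments))


# Toy input: 8 video frames and 16 audio samples, cut into 4 aligned segments.
pairs = segment_media(list(range(8)), list(range(16)), num_segments=4)
```

In this toy input each video segment holds two frames and each audio segment holds four samples; the aligned pairs are what would serve as ground truth pairings during training.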

First claim

Opening claim text (preview).

We claim:

1. A computer-implemented method comprising: receiving a training input including a media sequence, the media sequence including a video sequence paired with an audio sequence; segmenting the media sequence into a set of video sequence segments and a set of audio sequence segments; extracting visual features for each video sequence segment of the set of video sequence segments and audio features for each audio sequence segment of the set of audio sequence segments; generating, by transformer networks, contextualized visual features from the extracted visual features and contextualized audio features from the extracted audio features, the transformer networks including a visual transformer and an audio transformer; generating, using a similarity metric, predicted video sequence segment and audio sequence segment pairings based on the contextualized visual features and the contextualized audio features; ranking the predicted video sequence segment and audio sequence segment pairings based on their corresponding similarity values to identify mismatched pairings; and training the visual transformer to generate the contextualized visual features and the audio transformer to generate the contextualized audio features based on calculating a loss using ground truth pairings and the mismatched pairings.

2. The computer-implemented method of claim 1, wherein generating the contextualized visual features from the extracted visual features and the contextualized audio features from the extracted audio features comprises: generating, by the visual transformer of the transformer networks, the contextualized visual features from the extracted visual features, wherein contextualized visual features for a first video sequence segment of the set of video sequence segments are based on first visual features for the first video sequence segment and second visual features for one or more other video sequence segments in the set of video sequence segments; and generating, by the audio transformer of the transformer networks, the contextualized audio features from the extracted audio features, wherein contextualized audio features for a first audio sequence segment of the set of audio sequence segments are based on first audio features for the first audio sequence segment and second audio features for one or more other audio sequence segments in the set of audio sequence segments.

3. The computer-implemented method of claim 2, wherein generating the predicted video sequence segment and audio sequence segment pairings based on the contextualized visual features and the contextualized audio features comprises: for each video sequence segment of the set of video sequence segments: calculating a similarity value between the contextualized visual features of the video sequence segment and the contextualized audio features of each audio sequence segment of the set of audio sequence segments; ranking audio sequence segments of the set of audio sequence segments based on the calculated similarity value; and pairing the video sequence segment with an audio sequence segment having a largest similarity value.

4. The computer-implemented method of claim 2, wherein generating the predicted video sequence segment and audio sequence segment pairings based on the contextualized visual features and the contextualized audio features comprises: for each audio sequence segment of the set of audio sequence segments: calculating a similarity value between the contextualized audio features of the audio sequence segment and the contextualized visual features of each video sequence segment of the set of video sequence segments; ranking video sequence segments of the set of video sequence segments based on the calculated similarity value; and pairing the audio sequence segment with a video sequence segment having a largest similarity value.

5. The computer-implemented method of claim 1, wherein generating the contextualized visual features from the extracted visual features and the contextualized audio features from the extracted audio features comprises: providing an index of sequence segments to one of the visual transformer and the audio transformer, the index indicating an order of the sequence segments.

6. The computer-implemented method of claim 1, wherein segmenting the media sequence into the set of video sequence segments and the set of audio sequence segments comprises: segmenting the video sequence into a number of video sequence segments, wherein the number of video sequence segments is equal to a number of audio sequence segments segmented from the audio sequence.

7. A non-transitory computer-readable storage medium including instructions stored thereon which, when executed by at least one processor, cause the at least one processor to: receive a training input including a media sequence, the media sequence including a video sequence paired with an audio sequence; segment the media sequence into a set of video sequence segments and a set of audio sequence segments; extract visual features for each video sequence segment of the set of video sequence segments and audio features for each audio sequence segment of the set of audio sequence segments; generate, by transformer networks, contextualized visual features from the extracted visual features and contextualized audio features from the extracted audio features, the transformer networks including a visual transformer and an audio transformer; generate, using a similarity metric, predicted video sequence segment and audio sequence segment pairings based on the contextualized visual features and the contextualized audio features; rank the predicted video sequence segment and audio sequence segment pairings based on their corresponding similarity values to identify mismatched pairings; and train the visual transformer to generate the contextualized visual features and the audio transformer to generate the contextualized audio features based on calculating a loss using ground truth pairings and the mismatched pairings.

8. The non-transitory computer-readable storage medium of claim 7, wherein to generate the contextualized visual features from the extracted visual features and the contextualized audio features from the extracted audio features the at least one processor is further caused to: generate, by the visual transformer of the transformer networks, the contextualized visual features from the extracted visual features, wherein contextualized visual features for a first video sequence segment of the set of video sequence segments are based on first visual features for the first video sequence segment and second visual features for one or more other video sequence segments in the set of video sequence segments; and generate, by the audio transformer of the transformer networks, the contextualized audio features from the extracted audio features, wherein contextualized audio features for a first audio sequence segment of the set of audio sequence segments are based on first audio features for the first audio sequence segment and second audio features for one or more other audio sequence segments in the set of audio sequence segments.

9. The non-transitory computer-readable storage medium of claim 8, wherein to generate the predicted video sequence segment and audio sequence segment pairings based on the contextualized visual features and the contextualized audio features the at least one processor is further caused to: for each video sequence segment of the set of video sequence segments: calculate a similarity value between the contextualized visual features of the video sequence segment and the contextualized audio features of each audio sequence segment of the set of aud
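
Claims 1 and 3 describe scoring every video segment against every audio segment, ranking the candidates, pairing each video segment with its highest-scoring audio segment, and computing a loss from ground truth pairings and mismatched pairings. The sketch below illustrates that flow on plain Python lists. It is not the patented implementation: cosine similarity as the similarity metric, the hinge-style margin loss, the margin value, the assumption that video segment i truly belongs with audio segment i, and all function names are assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity (assumed metric; the claims only say 'a similarity metric')."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_matrix(visual: List[Vector], audio: List[Vector]) -> List[List[float]]:
    """sim[i][j] = similarity of video segment i and audio segment j."""
    return [[cosine(v, a) for a in audio] for v in visual]


def rank_audio(sim_row: List[float]) -> List[int]:
    """Audio segment indices ordered from most to least similar."""
    return sorted(range(len(sim_row)), key=lambda j: sim_row[j], reverse=True)


def hardest_mismatch(sim_row: List[float], true_index: int) -> int:
    """Highest-ranked audio segment that is not the ground truth partner."""
    for j in rank_audio(sim_row):
        if j != true_index:
            return j
    raise ValueError("need at least two audio segments")


def margin_loss(sim: List[List[float]], margin: float = 0.2) -> float:
    """Hinge-style loss (assumed form): ground truth pair i-i vs. its hardest mismatch."""
    total = 0.0
    for i, row in enumerate(sim):
        neg = hardest_mismatch(row, i)
        total += max(0.0, margin + row[neg] - row[i])
    return total / len(sim)


# Toy contextualized features: audio segments 0 and 1 are swapped on purpose.
visual = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
audio = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

sim = similarity_matrix(visual, audio)
predicted = [rank_audio(row)[0] for row in sim]  # top-ranked audio per video segment
loss = margin_loss(sim)
```

With the swapped toy features, the top-ranked audio segment for video segments 0 and 1 differs from the assumed ground truth, so those two rows contribute to the loss while the correctly matched third row does not. In training, a loss of this kind would be backpropagated through the visual and audio transformers; that part is omitted here.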

Assignees

Adobe Inc

Inventors

Classifications

G06V20/46
Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames · CPC title
G10L25/57
for processing of video signals · CPC title
G06V10/761
Proximity, similarity or dissimilarity measures · CPC title
G10L25/03
characterised by the type of extracted parameters · CPC title
G06V20/49
Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes · CPC title

Patent family

Related publications grouped by family.

View patent family 88699310

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12340563B2 cover?: Embodiments are disclosed for correlating video sequences and audio sequences by a media recommendation system using a trained encoder network. In particular, in one or more embodiments, the disclosed systems and methods comprise receiving a training input including a media sequence, including a video sequence paired with an audio sequence, segmenting the media sequence into a set of video sequ…
Who is the assignee on this patent?: Adobe Inc

Related patents

Multi-task and multi-lingual emotion mismatch detection for automated dubbing

Artificial intelligence models for composing audio scores

Dual-modality relation networks for audio-visual event localization

Ai-assisted sound effect generation for silent video

Method, system and electronic device for processing audio-visual data
