What technology area does this patent fall under?

Primary CPC classification H04N21/8456. Mapped technology areas include Electricity.

When was this patent published?

Publication date Sep 3, 2024 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).

Determining video provenance utilizing deep learning

US12081827B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12081827-B2
Application number	US-202217822573-A
Country	US
Kind code	B2
Filing date	Aug 26, 2022
Priority date	Aug 26, 2022
Publication date	Sep 3, 2024
Grant date	Sep 3, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present disclosure relates to systems, methods, and non-transitory computer readable media that utilize deep learning to map query videos to known videos so as to identify a provenance of the query video or identify editorial manipulations of the query video relative to a known video. For example, the video comparison system includes a deep video comparator model that generates and compares visual and audio descriptors utilizing codewords and an inverse index. The deep video comparator model is robust and ignores discrepancies due to benign transformations that commonly occur during electronic video distribution.
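The codeword and inverse-index lookup described in the abstract can be illustrated with a minimal sketch. It assumes descriptors are plain numeric vectors, that a descriptor maps to its nearest codebook vector by squared Euclidean distance, and that the index stores (video ID, segment index) pairs. None of these details are stated on this page; the function names, the toy codebook, and the sample data are hypothetical.

```python
from collections import defaultdict


def nearest_codeword(descriptor, codebook):
    """Return the index of the codebook vector closest to `descriptor`.

    Assumption: closeness is squared Euclidean distance.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(descriptor, codebook[i]))


def build_inverse_index(known_videos, codebook):
    """Map each codeword index to the (video_id, segment_index) pairs that use it."""
    index = defaultdict(list)
    for video_id, descriptors in known_videos.items():
        for seg_idx, desc in enumerate(descriptors):
            index[nearest_codeword(desc, codebook)].append((video_id, seg_idx))
    return index


def lookup(query_descriptors, index, codebook):
    """Collect known-video segments that share a codeword with any query segment."""
    hits = []
    for desc in query_descriptors:
        hits.extend(index.get(nearest_codeword(desc, codebook), []))
    return hits


# Hypothetical toy data: 2-D descriptors, three codewords, two known videos.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
known_videos = {
    "A": [(0.1, 0.0), (0.9, 1.1)],
    "B": [(0.0, 0.9)],
}
index = build_inverse_index(known_videos, codebook)

# A slightly perturbed copy of video A's second segment still maps to the same codeword.
hits = lookup([(0.95, 0.9)], index, codebook)
print(hits)
```

Quantizing to codewords is one way a lookup can tolerate small descriptor changes: a mildly perturbed descriptor usually lands on the same codeword, so the index returns the same candidate segments.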

First claim

Opening claim text (preview).

What is claimed is:

1. A non-transitory computer readable medium comprising instructions that, when executed by at least one processor, cause the at least one processor to perform operations comprising: sub-dividing a query video into visual segments and audio segments; generating visual descriptors for the visual segments of the query video utilizing a visual neural network encoder; generating audio descriptors for the audio segments of the query video utilizing an audio neural network encoder; determining video segments from a plurality of known videos that are similar to the query video based on the visual descriptors and audio descriptors utilizing an inverse index by: mapping the visual descriptors and the audio descriptors to codewords; and identifying the video segments from the plurality of known videos based on the mapped codewords; and identifying a known video of the plurality of known videos that corresponds to the query video from the determined video segments.

2. The non-transitory computer readable medium of claim 1, wherein the operations further comprise generating one or more visual indicators identifying locations of editorial modifications in the query video relative to the known video.

3. The non-transitory computer readable medium of claim 1, wherein sub-dividing the query video into visual segments and audio segments comprises subdividing the query video into equal-length segments.

4. The non-transitory computer readable medium of claim 1, wherein determining video segments from the plurality of known videos that are similar to the query video based on the visual descriptors and the audio descriptors utilizing an inverse index comprises: identifying one or more known videos that include the codewords, and ranking the one or more known videos.

5. The non-transitory computer readable medium of claim 1, wherein the operations further comprise fusing the visual descriptors and audio descriptors prior to mapping the visual descriptors and audio descriptors to the codewords.

6. The non-transitory computer readable medium of claim 1, wherein mapping the visual descriptors and the audio descriptors to the codewords comprises: mapping the visual descriptors to visual codewords; and mapping the audio descriptors to audio codewords.

7. The non-transitory computer readable medium of claim 1, wherein: the operations further comprise generating unified audio-visual embeddings from corresponding visual and audio descriptors utilizing a fully connected neural network layer; and mapping the visual descriptors and audio descriptors to the codewords comprises mapping unified audio-visual embeddings to a codebook.

8. The non-transitory computer readable medium of claim 1, wherein determining video segments from a plurality of known videos that are similar to the query video based on the visual descriptors and audio descriptors comprises determining a segment relevance score between a video segment of the known video and a codeword mapped to a segment of the query video by: determining a codeword frequency indicating a number of times the codeword appears in the video segment of the known video; and determining an inverse video frequency that measures how common the codeword is across all video segments in the inverse index.

9. The non-transitory computer readable medium of claim 8, wherein the operations further comprise: determining a video relevance score by summing segment relevance scores between the video segments of the known video and the mapped codewords; and ranking a subset of known videos from the plurality of known videos corresponding to the determined video segments based on video relevance scores.

10. The non-transitory computer readable medium of claim 9, wherein identifying the known video of the plurality of known videos that corresponds to the query video from the determined video segments comprises performing edit distance re-ranking of the subset of known videos.

11. The non-transitory computer readable medium of claim 1, wherein generating visual descriptors for the visual segments of the query video utilizing the visual neural network encoder comprises generating a visual segment embedding for a combination of frames of a visual segment of the query video utilizing the visual neural network encoder.

12. The non-transitory computer readable medium of claim 1, wherein generating visual descriptors for the visual segments of the query video utilizing the visual neural network encoder comprises: generating frame embeddings for each frame of a visual segment of the query video utilizing the visual neural network encoder; and averaging the frame embeddings for the visual segment to generate a visual descriptor for the visual segment.

13. A system comprising: one or more memory devices comprising a set of known digital videos; and one or more processors that are configured to cause the system to: sub-divide known videos into visual segments and audio segments; generate visual descriptors for the visual segments utilizing a visual neural network encoder; generate audio descriptors for the audio segments utilizing an audio neural network encoder; generate codewords from the audio descriptors and the visual descriptors; generate an inverse index for identifying known videos corresponding to query videos by mapping video segments from the known videos to the codewords; map query video visual descriptors and query video audio descriptors from a query video to the codewords; determine one or more video segments from the known videos that correspond to the query video based on the codewords; and identify a known video of the set of known digital videos that corresponds to the query video from the determined one or more video segments.

14. The system of claim 13, wherein the one or more processors are further configured to cause the system to generate visual descriptors and audio descriptors that are robust to benign visual and audio perturbations.

15. The system of claim 14, wherein the one or more processors are further configured to cause the system to learn parameters of the visual neural network encoder utilizing video frames with frame-level augmentations including one or more of random noise, blur, horizontal flip, pixelation, rotation, text overlay, emoji overlay, padding, or color jitter.

16. The system of claim 14, wherein the one or more processors are further configured to cause the system to learn parameters of the audio neural network encoder utilizing audio segments with augmentations including one or more of audio lengthening, audio shortening, addition of audio components, removal of audio components, or alteration of audio components.

17. The system of claim 13, wherein the one or more processors are further configured to cause the system to learn parameters of the visual neural network encoder and the audio neural network encoder utilizing a contrastive loss.

18. A computer-implemented method comprising: sub-dividing a query video into visual segments and audio segments; generating visual descriptors for the visual segments of the query video utilizing a visual neural network encoder that is robust to benign visual perturbations; generating audio descriptors for the audio segments of the query video utilizing an audio neural network encoder that is robust to benign audio perturbations; determining video segments from a plurality of known videos that are similar to the query video based on the visual descriptors and audio descriptors utilizing an inverse index by: mapping
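The relevance scoring recited in claims 8 and 9 resembles TF-IDF weighting, and a rough sketch may help in reading them. The claims name a codeword frequency and an inverse video frequency but give no formula, so the product `tf * log(N / n)` used here is an assumption borrowed from standard TF-IDF. The function names and sample data are hypothetical, and each segment is represented as a list of codeword IDs.

```python
import math
from collections import Counter


def inverse_video_frequency(codeword, all_segments):
    """Assumed form: log(N / n), with N total segments and n segments containing the codeword."""
    n = sum(1 for seg in all_segments if codeword in seg)
    return math.log(len(all_segments) / n) if n else 0.0


def segment_relevance(codeword, segment, all_segments):
    """Codeword frequency in the segment times the inverse video frequency (claim 8, assumed product)."""
    codeword_frequency = Counter(segment)[codeword]
    return codeword_frequency * inverse_video_frequency(codeword, all_segments)


def rank_videos(query_codewords, known_videos):
    """Sum segment relevance scores per known video and rank by the total (claim 9)."""
    all_segments = [seg for segs in known_videos.values() for seg in segs]
    scores = {}
    for video_id, segs in known_videos.items():
        scores[video_id] = sum(
            segment_relevance(cw, seg, all_segments)
            for seg in segs
            for cw in query_codewords
        )
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: each known video is a list of segments,
# and each segment is a list of codeword IDs.
known_videos = {
    "A": [[3, 3, 7], [5]],
    "B": [[7], [9]],
}
ranking = rank_videos([3, 5], known_videos)
print(ranking)
```

In this toy data, video A contains both query codewords and video B contains neither, so A ranks first. Claim 10 adds an edit distance re-ranking step on the ranked subset, which this sketch does not cover.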

Assignees

Adobe Inc, Univ Surrey

Inventors

Classifications

G06F16/7867
using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings · CPC title
H04N21/8456 · Primary
by decomposing the content in the time domain, e.g. in time segments · CPC title
H04N21/84
Generation or processing of descriptive data, e.g. content descriptors {(systems specially adapted for using meta-information in broadcast systems H04H60/73)} · CPC title
G06F16/732
Query formulation · CPC title
G06F16/783
using metadata automatically derived from the content · CPC title

Patent family

Related publications grouped by family.

View patent family 89995864

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12081827B2 cover?: The present disclosure relates to systems, methods, and non-transitory computer readable media that utilize deep learning to map query videos to known videos so as to identify a provenance of the query video or identify editorial manipulations of the query video relative to a known video. For example, the video comparison system includes a deep video comparator model that generates and compares…
Who is the assignee on this patent?: Adobe Inc, Univ Surrey