Audio fingerprint extraction and audio recognition using said fingerprints

US10657175B2 · US · B2

Patent metadata
Field: Value
Publication number: US-10657175-B2
Application number: US-201816172610-A
Country: US
Kind code: B2
Filing date: Oct 26, 2018
Priority date: Oct 31, 2017
Publication date: May 19, 2020
Grant date: May 19, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods and a computer-readable storage device are disclosed for generating a frequency representation of a query audio file. The frequency representation represents information about at least a number of frequencies within a time range containing a number of time frames of the audio content information and a level associated with each of said frequencies. At least one area of data points in the frequency representation is selected. A fingerprint for each selected area of data points is generated by applying a trained neural network onto said selected area of data points thereby generating a vector in a metric space. A distance between at least one of the generated query fingerprints and at least one reference fingerprint is calculated using a specified distance metric. A reference audio file having associated reference fingerprints which have produced at least one associated distance satisfying a predetermined threshold is identified.
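
The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes NumPy, and every function name and parameter value below is hypothetical. In particular, `embed` is a fixed random projection standing in for the trained neural network, Euclidean distance is assumed as the "specified distance metric", and "satisfying a predetermined threshold" is read as "distance less than or equal to the threshold".

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Level in dB per frequency bin and time frame: shape (freq_bins, n_frames)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1)).T
    return 20.0 * np.log10(magnitude + 1e-10)

def select_areas(spec, height=32, width=16, freq_step=32, time_step=16):
    """Cut fixed-size areas: a sub-range of frequencies by a sub-range of frames."""
    areas = []
    for f0 in range(0, spec.shape[0] - height + 1, freq_step):
        for t0 in range(0, spec.shape[1] - width + 1, time_step):
            areas.append(spec[f0:f0 + height, t0:t0 + width])
    return np.stack(areas)

def embed(areas, projection):
    """ASSUMPTION: stand-in for the trained network; maps each area to a unit vector."""
    flat = areas.reshape(len(areas), -1)
    flat = flat - flat.mean(axis=1, keepdims=True)
    vectors = flat @ projection
    return vectors / (np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-12)

def best_match(query_fps, reference_db, threshold):
    """Return (name, distance) for the closest reference within threshold, else None."""
    best = None
    for name, ref_fps in reference_db.items():
        # Pairwise Euclidean distances between query and reference fingerprints.
        dists = np.linalg.norm(query_fps[:, None, :] - ref_fps[None, :, :], axis=2)
        dist = float(dists.min())
        if dist <= threshold and (best is None or dist < best[1]):
            best = (name, dist)
    return best

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tones = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1320 * t)
noise = rng.standard_normal(sample_rate)
projection = rng.standard_normal((32 * 16, 64))

def fingerprints(signal):
    return embed(select_areas(spectrogram(signal)), projection)

reference_db = {"tones": fingerprints(tones), "noise": fingerprints(noise)}
match = best_match(fingerprints(tones), reference_db, threshold=0.1)
print(match)
```

Here the query is identical to one of the references, so its fingerprints coincide with the stored ones and the lookup returns that reference. A real system would replace `embed` with a network trained so that areas containing the same audio content land close together in the metric space.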

First claim

Opening claim text (preview).

The invention claimed is:

1. A method for generating fingerprints for audio content information, the method comprising: generating a frequency representation of the audio content information, wherein the frequency representation represents information about at least a number of frequencies within a time range containing a number of time frames of the audio content information and a level associated with each of said frequencies; selecting a plurality of areas from said frequency representation, wherein each of the plurality of areas comprises information about a sub-range of frequencies and a sub-range of time frames of the audio content information; generating a first fingerprint for a first selected area of the plurality of areas by applying a trained neural network that converts the plurality of areas into respective fingerprints in a metric space, wherein the neural network is trained such that the neural network produces the same or similar fingerprints for at least two areas of the plurality of areas when the at least two areas of the plurality of areas contain the same frequency representation of audio content information, regardless of the position of the at least two areas in said frequency representation; comparing the first fingerprint to at least one reference fingerprint accessible by a database and generated from a set of reference audio files, by calculating a distance between the first fingerprint and at least one reference fingerprint, using a specified distance metric; and identifying a reference audio file, having at least one associated reference fingerprint with an associated distance from the first fingerprint satisfying a predetermined threshold.

2. The method as defined in claim 1, further comprising outputting information about the identified reference audio file to a user.

3. The method as defined in claim 1, wherein the first selected area is pre-sized.

4. The method as defined in claim 1, wherein the trained neural network is trained using a Siamese network architecture.

5. The method as defined in claim 4, wherein the trained neural network is trained such that the trained neural network produces a different fingerprint for the first selected area and at least one additional area of the plurality of areas when the first selected area and the at least one additional area contain different audio content information.

6. The method as defined in claim 1, wherein the trained neural network comprises a number of convolutional layers and/or filters and/or non-linear functions to be applied to the plurality of areas in sequence.

7. The method as defined in claim 1, wherein the trained neural network is trained such that the trained neural network produces a different fingerprint for the first selected area and at least one additional area of the plurality of areas when the first selected area and the at least one additional area contain different audio content information.

8. The method as defined in claim 1, further comprising storing a number of generated fingerprints associated with an audio file containing audio content information in a memory together with some corresponding frequency information.

9. The method as defined in claim 1, wherein the calculated distance relates to a Hamming distance.

10. The method of claim 1, wherein: the generated fingerprints are independent of time of the audio content information; and the reference audio file is a derivative work of the audio content information identified by comparing, independently of time, the first fingerprint of the audio content information with at least one reference fingerprint of the reference audio file.

11. The method of claim 1, further comprising: generating a second fingerprint for a second selected area of the plurality of areas; determining that the first fingerprint and the second fingerprints match; and storing the first fingerprint in the database without storing the second fingerprint that matches the first fingerprint.

12. A non-transitory computer readable storage medium having instructions stored thereon that, when executed by a computing device, cause the computing device to: generate a frequency representation of the audio content information; wherein the frequency representation represents information about at least a number of frequencies within a time range containing a number of time frames of the audio content information and a level associated with each of said frequencies; select a plurality of areas from said frequency representation, wherein each of the plurality of areas comprises information about a sub-range of frequencies and a sub-range of time frames of the audio content information; generate a first fingerprint for a first selected area of the plurality of areas by applying a trained neural network that converts the plurality of areas into respective fingerprints in a metric space, wherein the neural network is trained such that the neural network produces the same or similar fingerprints for at least two areas of the plurality of areas when the at least two areas of the plurality of areas contain the same frequency representation of audio content information, regardless of the position of the at least two areas in said frequency representation; compare the first fingerprint to at least one reference fingerprint accessible by a database and generated from a set of reference audio files, by calculating a distance between the first fingerprint and at least one reference fingerprint, using a specified distance metric; and identify a reference audio file, having at least one associated reference fingerprint with an associated distance from the first fingerprint satisfying a predetermined threshold.

13. A server system comprising one or more processors and memory storing one or more programs executable by the one or more processors, the one or more programs including instructions for: generating a frequency representation of the audio content information, wherein the frequency representation represents information about at least a number of frequencies within a time range containing a number of time frames of the audio content information and a level associated with each of said frequencies; selecting a plurality of areas from said frequency representation, wherein each of the plurality of areas comprises information about a sub-range of frequencies and a sub-range of time frames of the audio content information; generating a first fingerprint for a first selected area of the plurality of areas by applying a trained neural network that converts the plurality of areas into respective fingerprints in a metric space, wherein the neural network is trained such that the neural network produces the same or similar fingerprints for at least two areas of the plurality of areas when the at least two areas of the plurality of areas contain the same frequency representation of audio content information, regardless of the position of the at least two areas in said frequency representation; comparing the first fingerprint to at least one reference fingerprint accessible by a database and generated from a set of reference audio files, by calculating a distance between the first fingerprint and at least one reference fingerprint, using a specified distance metric; and identifying a reference audio file, having at least one associated reference fingerprint with an associated distance from the first fingerprint satisfying a predetermined threshold.
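
Three details from the dependent claims lend themselves to a short Python sketch: training on pairs so that matching areas map close together and differing areas map apart (claims 4, 5 and 7), a Hamming distance between fingerprints (claim 9), and storing only one of two matching fingerprints (claim 11). The code below is illustrative only. The claims do not name a loss function, so the contrastive loss shown is an assumption (it is a common choice for Siamese training), and all function names, the margin, and the `max_distance` parameter are hypothetical.

```python
import numpy as np

def contrastive_loss(fp_a, fp_b, same_content, margin=1.0):
    """ASSUMED pairwise loss: pull matching pairs together, push others past a margin."""
    distance = float(np.linalg.norm(np.asarray(fp_a) - np.asarray(fp_b)))
    if same_content:
        return 0.5 * distance ** 2
    return 0.5 * max(0.0, margin - distance) ** 2

def hamming_distance(bits_a, bits_b):
    """Number of positions at which two equal-length binary fingerprints differ."""
    return int(np.count_nonzero(np.asarray(bits_a) != np.asarray(bits_b)))

def store_unique(fingerprints, max_distance=0):
    """Keep a fingerprint only if no already-stored fingerprint matches it.

    Two fingerprints "match" here when their Hamming distance is at most
    max_distance (a hypothetical matching rule).
    """
    stored = []
    for fp in fingerprints:
        if all(hamming_distance(fp, kept) > max_distance for kept in stored):
            stored.append(fp)
    return stored

# Hypothetical binary fingerprints for three selected areas.
area_1 = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=np.uint8)
area_2 = area_1.copy()  # same content as area_1, e.g. a repeated chorus
area_3 = np.array([0, 1, 1, 0, 0, 1, 1, 0], dtype=np.uint8)

database = store_unique([area_1, area_2, area_3])
print(len(database), hamming_distance(area_1, area_3))
```

With these inputs, the duplicate of the first fingerprint is skipped, so the database holds two entries instead of three. Raising `max_distance` would also treat near-identical fingerprints as matches, trading storage for discrimination.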

Assignees

Spotify AB

Inventors

Classifications

  • for retrieval · CPC title

  • using neural networks · CPC title

  • the extracted parameters being spectral information of each sub-band · CPC title

  • Query formulation · CPC title

  • G06F16/683 · Primary

    using metadata automatically derived from the content · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10657175B2 cover?
Methods and a computer-readable storage device are disclosed for generating a frequency representation of a query audio file. The frequency representation represents information about at least a number of frequencies within a time range containing a number of time frames of the audio content information and a level associated with each of said frequencies. At least one area of data points in…
Who is the assignee on this patent?
Spotify AB
What technology area does this patent fall under?
Primary CPC classification G06F16/683. Mapped technology areas include Physics.
When was this patent published?
Publication date May 19, 2020 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).