Unsupervised learning of semantic audio representations

US11335328B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11335328-B2
Application number: US-201816758564-A
Country: US
Kind code: B2
Filing date: Oct 26, 2018
Priority date: Oct 27, 2017
Publication date: May 17, 2022
Grant date: May 17, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods are provided for generating training triplets that can be used to train multidimensional embeddings to represent the semantic content of non-speech sounds present in a corpus of audio recordings. These training triplets can be used with a triplet loss function to train the multidimensional embeddings such that the embeddings can be used to cluster the contents of a corpus of audio recordings, to facilitate a query-by-example lookup from the corpus, to allow a small number of manually-labeled audio recordings to be generalized, or to facilitate some other audio classification task. The triplet sampling methods may be used individually or collectively, and each represent a respective heuristic about the semantic structure of audio recordings.
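The abstract does not spell out the triplet loss function. As a minimal sketch, the code below uses one common form, a hinge applied to the difference of squared Euclidean distances, which claim 7 of this patent also describes. The function names, the plain-list vector representation, and the margin value of 0.5 are illustrative assumptions, not details taken from the patent.

```python
from typing import Iterable, Sequence, Tuple

Vector = Sequence[float]
EmbeddedTriplet = Tuple[Vector, Vector, Vector]  # (anchor, positive, negative)


def squared_distance(u: Vector, v: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(triplets: Iterable[EmbeddedTriplet], margin: float = 0.5) -> float:
    """Sum of hinge terms, one per embedded triplet.

    A term is zero once the anchor is closer to the positive than to the
    negative by at least `margin` (in squared distance); otherwise it is
    positive and grows as the positive moves away relative to the negative.
    The margin of 0.5 is an arbitrary illustrative value.
    """
    total = 0.0
    for anchor, positive, negative in triplets:
        d_pos = squared_distance(anchor, positive)
        d_neg = squared_distance(anchor, negative)
        total += max(0.0, d_pos - d_neg + margin)
    return total
```

In a real training loop the embedding vectors would come from a trainable mapping, and that mapping would be updated to reduce this sum; the sketch only shows how the loss value itself is computed.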

First claim

Opening claim text (preview).

We claim:

1. A method comprising: obtaining training data, wherein the training data comprises a plurality of sound recordings; generating a plurality of training triplets, wherein each training triplet of the plurality of training triplets includes a respective anchor audio segment, a positive audio segment, and a negative audio segment from the plurality of sound recordings, wherein generating the plurality of training triplets comprises: (i) performing a first triplet sampling operation to generate a first subset of training triplets of the plurality of training triplets; and (ii) performing a second triplet sampling operation to generate a second subset of training triplets of the plurality of training triplets, wherein the second triplet sampling operation is a different triplet sampling operation from the first triplet sampling operation; applying a mapping to determine, for each audio segment of each training triplet of the plurality of training triplets, a respective feature vector in an n-dimensional feature space; and updating the mapping based on the determined feature vectors such that a loss function is reduced, wherein the loss function comprises a sum of a plurality of terms, wherein each term in the plurality of terms corresponds to a respective training triplet in the plurality of training triplets, and wherein a term of the loss function that corresponds to a particular training triplet is increased by increasing a first distance relative to a second distance when the first distance is not less than the second distance by at least a specified threshold amount, wherein the first distance is between the feature vector of the anchor audio segment of the particular training triplet and the feature vector of the positive audio segment of the particular training triplet, and wherein the second distance is between the feature vector of the anchor audio segment of the particular training triplet and the feature vector of the negative audio segment of the particular training triplet; and applying the updated mapping to at least one of: (i) determine an additional feature vector in the n-dimensional feature space for an additional segment of sound data, (ii) determine at least two cluster locations within the n-dimensional feature space that correspond to respective clusters of sound recordings within the training data, or (iii) train a classifier to classify an additional segment of sound data based on an additional feature vector that is generated by applying the updated mapping to the additional segment of sound data.

2. The method of claim 1, wherein applying the mapping to determine, for a given audio segment, a corresponding feature vector in the n-dimensional feature space comprises: determining a spectrogram based on the given audio segment; and applying the mapping to the determined spectrogram to determine the corresponding feature vector in the n-dimensional feature space.

3. The method of claim 1, wherein performing the first triplet sampling operation comprises: selecting, for a particular training triplet of the first subset of training triplets, an anchor audio segment from the plurality of sound recordings; determining a positive audio segment for the particular training triplet by adding noise to the anchor audio segment of the particular training triplet; and determining a negative audio segment for the particular training triplet by selecting, from the plurality of sound recordings, an audio segment that differs from the anchor audio segment of the particular training triplet.

4. The method of claim 1, wherein performing the first triplet sampling operation comprises: selecting, for a particular training triplet of the first subset of training triplets, an anchor audio segment from the plurality of sound recordings; determining a positive audio segment for the first training triplet by applying at least one of a frequency shift or a time shift to the anchor audio segment for the first training triplet; and determining a negative audio segment for the particular training triplet by selecting, from the plurality of sound recordings, an audio segment that differs from the anchor audio segment of the particular training triplet.

5. The method of claim 1, wherein performing the first triplet sampling operation comprises: selecting, for a particular training triplet of the first subset of training triplets, an anchor audio segment from the plurality of sound recordings; determining a negative audio segment for the particular training triplet by selecting, from the plurality of sound recordings, an audio segment that differs from the anchor audio segment of the particular training triplet; and determining a positive audio segment for the particular training triplet by determining a weighted combination of the anchor audio segment for the particular training triplet and the negative audio segment for the particular training triplet.

6. The method of claim 1, wherein performing the first triplet sampling operation comprises: selecting, for a particular training triplet of the first subset of training triplets, an anchor audio segment from the plurality of sound recordings; determining a positive audio segment for the particular training triplet by selecting, from the plurality of sound recordings, an audio segment that differs from the anchor audio segment of the particular training triplet, wherein the anchor audio segment for the particular training triplet and the positive audio segment for the particular training triplet correspond to respective segments of a first sound recording of the training data; and determining a negative audio segment for the particular training triplet by selecting, from the plurality of sound recordings, an audio segment that differs from both of the anchor audio segment for the particular training triplet and the positive audio segment for the particular training triplet, wherein the negative audio segment for the particular training triplet corresponds to a segment of a second sound recording of the training data, wherein the second sound recording differs from the first sound recording.

7. The method of claim 1, wherein the first distance is a Euclidean distance within the n-dimensional feature space between the feature vector of the anchor audio segment of the particular training triplet and the feature vector of the positive audio segment of the particular training triplet, wherein the second distance is a Euclidean distance within the n-dimensional feature space between the feature vector of the anchor audio segment of the particular training triplet and the feature vector of the negative audio segment of the particular training triplet, and wherein the term of the loss function that corresponds to the particular training triplet comprises a hinge loss function applied to a difference between the square of the first distance and the square of the second distance.

8. The method of claim 1, further comprising: applying the updated mapping to determine, for each anchor audio segment of each training triplet of the plurality of training triplets, a respective updated feature vector in the n-dimensional feature space; obtaining an additional segment of sound data; applying the updated mapping to the additional segment of sound data to determine an additional feature vector in the n-dimensional feature space for the additional segment of sound data; selecting one of the updated feature vectors based on proximity, within the n-dimensional feature space, to the additional feature vector; and retrieving a segment of the training data that corresponds, via a respective one of the anchor audio segments, to the selected updated feature.

9. The method of claim 1
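The four triplet sampling operations recited in claims 3 to 6 (added noise, time shift, weighted mixing, and same-recording sampling) can be sketched as follows. This is a rough illustration under stated assumptions, not the patented implementation: audio segments are modelled as equal-length lists of floats, the time shift is a circular shift, and every function name and parameter value (noise_std, max_shift, weight, per_sampler) is hypothetical.

```python
import random
from typing import List, Tuple

Segment = List[float]
Triplet = Tuple[Segment, Segment, Segment]  # (anchor, positive, negative)


def _pick_other(segments: List[Segment], anchor_index: int, rng: random.Random) -> Segment:
    """Pick a segment stored at any index other than the anchor's."""
    candidates = [i for i in range(len(segments)) if i != anchor_index]
    return segments[rng.choice(candidates)]


def noise_triplet(segments: List[Segment], rng: random.Random, noise_std: float = 0.1) -> Triplet:
    """Claim 3 heuristic: the positive is the anchor plus Gaussian noise."""
    i = rng.randrange(len(segments))
    anchor = segments[i]
    positive = [x + rng.gauss(0.0, noise_std) for x in anchor]
    return anchor, positive, _pick_other(segments, i, rng)


def time_shift_triplet(segments: List[Segment], rng: random.Random, max_shift: int = 4) -> Triplet:
    """Claim 4 heuristic (time shift only): the positive is a shifted anchor.

    A circular shift is an assumption made here for simplicity.
    """
    i = rng.randrange(len(segments))
    anchor = segments[i]
    shift = rng.randint(1, max_shift) % len(anchor)
    positive = anchor[-shift:] + anchor[:-shift]
    return anchor, positive, _pick_other(segments, i, rng)


def mixed_triplet(segments: List[Segment], rng: random.Random, weight: float = 0.25) -> Triplet:
    """Claim 5 heuristic: the positive is a weighted mix of anchor and negative."""
    i = rng.randrange(len(segments))
    anchor = segments[i]
    negative = _pick_other(segments, i, rng)
    positive = [a + weight * n for a, n in zip(anchor, negative)]
    return anchor, positive, negative


def same_recording_triplet(recordings: List[List[Segment]], rng: random.Random) -> Triplet:
    """Claim 6 heuristic: anchor and positive share a recording, the negative does not."""
    eligible = [r for r in range(len(recordings)) if len(recordings[r]) >= 2]
    r = rng.choice(eligible)
    a_idx, p_idx = rng.sample(range(len(recordings[r])), 2)
    other = rng.choice([k for k in range(len(recordings)) if k != r])
    return recordings[r][a_idx], recordings[r][p_idx], rng.choice(recordings[other])


def build_triplets(recordings: List[List[Segment]], per_sampler: int = 2, seed: int = 0) -> List[Triplet]:
    """Combine several different sampling operations, as claim 1 requires at least two."""
    rng = random.Random(seed)
    segments = [seg for rec in recordings for seg in rec]
    triplets: List[Triplet] = []
    for _ in range(per_sampler):
        triplets.append(noise_triplet(segments, rng))
        triplets.append(time_shift_triplet(segments, rng))
        triplets.append(mixed_triplet(segments, rng))
        triplets.append(same_recording_triplet(recordings, rng))
    return triplets
```

Calling build_triplets on a list of recordings, each given as a list of segments, returns a mixed pool of triplets. Per claim 1, each segment would then be mapped to a feature vector, and the mapping would be updated to reduce the triplet loss.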

Assignees

Google LLC

Inventors

Classifications

  • Combinations of networks

  • Supervised learning

  • Weakly supervised learning, e.g. semi-supervised or self-supervised learning

  • Convolutional networks [CNN, ConvNet]

  • Architecture, e.g. interconnection topology

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11335328B2 cover?
Methods are provided for generating training triplets that can be used to train multidimensional embeddings to represent the semantic content of non-speech sounds present in a corpus of audio recordings. These training triplets can be used with a triplet loss function to train the multidimensional embeddings such that the embeddings can be used to cluster the contents of a corpus of audio recor…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/063. Mapped technology areas include Physics.
When was this patent published?
Publication date May 17, 2022 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).