End-to-end speaker recognition using deep neural network

US9824692B1 · US · B1

Patent metadata
Publication number: US-9824692-B1
Application number: US-201615262748-A
Country: US
Kind code: B1
Filing date: Sep 12, 2016
Priority date: Sep 12, 2016
Publication date: Nov 21, 2017
Grant date: Nov 21, 2017

Abstract

Official abstract text for this publication.

The present invention is directed to a deep neural network (DNN) having a triplet network architecture, which is suitable to perform speaker recognition. In particular, the DNN includes three feed-forward neural networks, which are trained according to a batch process utilizing a cohort set of negative training samples. After each batch of training samples is processed, the DNN may be trained according to a loss function, e.g., utilizing a cosine measure of similarity between respective samples, along with positive and negative margins, to provide a robust representation of voiceprints.
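As a rough illustration of the batch flow described in the abstract, the sketch below passes one speaker's two sample sets and a cohort set through a toy network and scores each pair with a cosine similarity mapped to [0, 1], as in the formulas of claim 5. It is a minimal sketch under stated assumptions, not the patented implementation: `embed`, the single tanh layer, and the use of one shared `weights` matrix for all three networks are hypothetical simplifications (the claims recite convolutional and fully connected layers and do not state here whether weights are shared).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def embed(sample, weights):
    """Hypothetical stand-in for one feed-forward network: a single tanh layer."""
    return [math.tanh(sum(w * x for w, x in zip(row, sample))) for row in weights]

def batch_similarities(anchors, positives, cohort, weights):
    """Return one (S_plus, S_minus) pair per anchor/positive pair in the batch.

    S_plus  scores the anchor against its same-speaker sample.
    S_minus scores the anchor against the MOST similar cohort sample.
    Both are cosine similarities rescaled from [-1, 1] to [0, 1].
    """
    # Assumption: the same weights are reused for all three networks.
    ev_cohort = [embed(x, weights) for x in cohort]      # third network
    scores = []
    for x, x_pos in zip(anchors, positives):
        ev_x = embed(x, weights)                         # first network
        ev_pos = embed(x_pos, weights)                   # second network
        s_plus = 0.5 * (1 + cosine(ev_x, ev_pos))
        s_minus = 0.5 * (1 + max(cosine(ev_x, ev_n) for ev_n in ev_cohort))
        scores.append((s_plus, s_minus))
    return scores

# Toy batch: one anchor/positive pair and a cohort of two other-speaker samples.
identity = [[1.0, 0.0], [0.0, 1.0]]
toy_scores = batch_similarities(
    anchors=[[1.0, 0.0]],
    positives=[[1.0, 0.0]],
    cohort=[[0.0, 1.0], [-1.0, 0.0]],
    weights=identity,
)
```

Taking the maximum over the cohort means only the hardest negative for each anchor contributes to the score, which matches the "most similar one of the third network outputs" wording in claim 3.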

First claim

Opening claim text (preview).

The invention claimed is:

1. A speaker recognition device comprising: a memory device storing speech samples including, dual sets of speech samples attributed to the same speaker, a cohort set of speech samples not attributed to the same speaker as the dual sets, and a set of speaker models; and a processor-based device configured to model a deep neural network with a triplet network architecture, wherein the processor-based device trains the deep neural network according to a batch process in which the dual sets of speech samples are fed through the deep neural network in combination with the cohort set of speech samples, and wherein the processor-based device feeds a recognition speech sample through the trained deep neural network, and verifies or identifies a user based on an output of the trained deep neural network responsive to the recognition speech sample and at least one of the speaker models.

2. The speaker recognition device of claim 1, wherein the deep neural network includes, a first feed-forward neural network which receives and processes a first input to produce a first network output, a second feed-forward neural network which receives and processes a second input to produce a second network output, and a third feed-forward neural network which receives and processes a third input to produce a third network output, for each of a plurality of speakers, the memory device includes a first set of P speech samples $(x_1, \ldots, x_P)$ attributed to the speaker and a second set of P speech samples $(x_1^+, \ldots, x_P^+)$ attributed to the speaker, with P being an integer greater than or equal to 2; and the deep neural network is trained by the processor-based device such that, for each of the plurality of speakers, the deep neural network performs a batch processing during which the corresponding first set of speech samples are fed through the first feed-forward neural network, the corresponding second set of speech samples are fed through the second feed-forward neural network, and the cohort set of speech samples are fed through the third feed-forward neural network; upon completion of the batch processing, a loss function is computed based on the first network outputs, the second network outputs, and the third network outputs obtained based respectively on the corresponding first set of speech samples, the corresponding second set of speech samples, and the cohort set of speech samples; and the computed loss function is used to modify connection weights in each of the first, second, and third feed-forward neural networks according to a back propagation technique.

3. The speaker recognition device of claim 2, wherein the loss function is based on: a positive distance $d_+$ corresponding to a degree of similarity $S_+$ between the first network output responsive to one of the first set of speech samples $x_i$ and the second network output responsive to a corresponding one of the second set of speech samples $x_i^+$, and a negative distance $d_-$ corresponding to a degree of similarity $S_-$ between the first network output responsive to the one of the first set of speech samples $x_i$ and a most similar one of the third network outputs responsive to respective speech samples of the cohort set.

4. The speaker recognition device of claim 3, wherein the positive distance $d_+$ and the negative distance $d_-$ are determined by applying different respective margins $M_+$, $M_-$ to the corresponding degrees of similarity $S_+$, $S_-$.

5. The speaker recognition device of claim 4, wherein the loss function is defined by:

$$\mathrm{Loss} = \sum_{i=1}^{P} L(x_i, x_i^+, X^-),$$

where

$$L(x_i, x_i^+, X^-) = K \, \frac{e^{d_+}}{e^{d_+} + e^{d_-}},$$

$$d_+ = 2\left(1 - \min\left(S_+ + M_+,\ 1\right)\right),$$

$$d_- = 2\left(1 - \max\left(S_- + M_- - 1,\ 0\right)\right),$$

$$S_+ = \tfrac{1}{2}\left(1 + \cos(EVx_i, EVx_i^+)\right),$$

$$S_- = \tfrac{1}{2}\left(1 + \max_{n=1:N} \cos(EVx_i, EVx_n^-)\right),$$

$x_n^-$ is the n-th one of the N negative speech samples fed during the N iterations, $EVx_i$ is the first network output responsive to one of the first set of speech samples, $EVx_i^+$ is the second network output responsive to one of the second set of speech samples, $EVx_n^-$ is the third network output responsive to the negative speech sample $x_n^-$, $M_+ = \frac{1 - \cos(\pi/4)}{2}$, $M_- = \frac{1 - \cos(3\pi/4)}{2}$, and K is a constant.

6. The speaker recognition device of claim 1, wherein each of the first, second, and third feed-forward neural networks includes at least one convolutional layer and a fully connected layer.

7. The speaker recognition device of claim 6, wherein each of the first, second, and third feed-forward neural networks further includes at least one max-pooling layer and a subsequent fully connected layer.

8. The speaker recognition device of claim 6, wherein each speech sample, which is inputted to a respective one of the first, second, and third feedforward neural networks, is preprocessed by: partitioning an underlying speech signal into a plurality of overlapping windows; and extracting a plurality of features from each of the overlapping windows.

9. The speaker recognition device of claim 8, wherein each of the first, second, and third feed-forward neural networks includes a first convolutional layer to receive the preprocessed speech sample, the first convolutional layer comprises a number $N_C$ of convolutional filters, each of the $N_C$ convolutional filters has $F \times w_f$ neurons, where $F$ corresponds to the height of the first convolutional layer, and $w_f$ corresponds to the width of the convolutional layer, and $F$ is equivalent to the number of the features extracted from each of the overlapping windows, and $w_f$ is no greater than 5.

10. The speaker recognition device of claim 1, wherein the device is configured to perform a speaker verification task in which the user inputs a self-identification, and the recognition speech sample is used to confirm that an identity of the user is the same as the self-identification.

11. The speaker recognition device of claim 1, wherein the device is configured to perform a speaker identification task in which the recognition speech sample is used to identify the user from a plurality of potential identities stored in the memory device in association with respective speech samples.

12. The speaker recognition device of claim 1, further comprising an input device which receives a speech sample from the user as the recognition speech sample.

13. A method comprising: training a computer-implemented model of a deep neural network with a triplet network architecture based on a plurality of speech samples stored in a memory device, the plurality of speech samples including: dual sets of speech samples attributed to the same speaker, a cohort set of speech samples not attributed to the same speaker as the dual sets, and a set of speaker models; feeding a recognition speech sample through the trained deep neural network, and verifying or identifying a user based on an output of the trained deep neural network responsive to the recognition speech sample and at least one of the speaker models, wherein the training of the deep neural network is performed according to a batch process in which the dual sets of speech samples are fed through the deep neural network in combination with the cohort set of speech samples.

14. The method of claim 13, wherein the deep neural network comprises, a first feed-forward neural network each iteration of which receives and processes a first input in order to produce a first network output, a second feed-forward neural network each iteration of whic
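The loss recited in claim 5 can be worked through numerically. The sketch below is one possible reading, not an authoritative implementation: it assumes the margins are the fractions (1 − cos(π/4))/2 and (1 − cos(3π/4))/2, that the negative distance is computed from the negative similarity (as claim 3 describes), and that K defaults to 1. The function names `pair_loss` and `batch_loss` are hypothetical.

```python
import math

# Margins from claim 5, read as (1 - cos(angle)) / 2 (an assumption about the
# flattened fraction in the claim text).
M_PLUS = (1 - math.cos(math.pi / 4)) / 2        # about 0.146
M_MINUS = (1 - math.cos(3 * math.pi / 4)) / 2   # about 0.854

def pair_loss(s_plus, s_minus, k=1.0):
    """Loss term L for one anchor, given similarities already scaled to [0, 1].

    s_plus  : similarity to the same-speaker sample.
    s_minus : similarity to the most similar cohort (other-speaker) sample.
    k       : the constant K of claim 5 (value assumed here).
    """
    # Positive distance: zero once s_plus is within the positive margin of 1.
    d_plus = 2 * (1 - min(s_plus + M_PLUS, 1.0))
    # Negative distance: stays at its maximum of 2 while s_minus is small.
    d_minus = 2 * (1 - max(s_minus + M_MINUS - 1, 0.0))
    return k * math.exp(d_plus) / (math.exp(d_plus) + math.exp(d_minus))

def batch_loss(similarities, k=1.0):
    """Sum the per-pair loss over the P (s_plus, s_minus) pairs of one batch."""
    return sum(pair_loss(sp, sm, k) for sp, sm in similarities)

# A well-separated pair versus a badly confused pair.
good = pair_loss(s_plus=1.0, s_minus=0.0)
bad = pair_loss(s_plus=0.0, s_minus=1.0)
```

Under this reading, the positive distance is already 0 for any positive similarity of about 0.854 or more (an angle of 45 degrees or less), and the negative distance stays at 2 for any negative similarity of about 0.146 or less (an angle of 135 degrees or more), so pairs that are already well discriminated add no extra cost. For the two toy pairs above, the well-separated pair gives 1/(1 + e²) and the confused pair gives a larger value.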

Assignees

Pindrop Security Inc

Inventors

Classifications

  • Learning methods · CPC title

  • Architecture, e.g. interconnection topology · CPC title

  • G10L17/08 · Primary

    Use of distortion metrics or a particular distance between probe pattern and reference templates · CPC title

  • G10L17/04 · Primary

    Training, enrolment or model building · CPC title

  • Interactive procedures; Man-machine interfaces · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9824692B1 cover?
The present invention is directed to a deep neural network (DNN) having a triplet network architecture, which is suitable to perform speaker recognition. In particular, the DNN includes three feed-forward neural networks, which are trained according to a batch process utilizing a cohort set of negative training samples. After each batch of training samples is processed, the DNN may be trained a…
Who is the assignee on this patent?
Pindrop Security Inc
What technology area does this patent fall under?
Primary CPC classification G10L17/08. Mapped technology areas include Physics.
When was this patent published?
Publication date: Nov 21, 2017 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).