Neural Networks For Speaker Verification
US-2017069327-A1 · Mar 9, 2017 · US
US9824692B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-9824692-B1 |
| Application number | US-201615262748-A |
| Country | US |
| Kind code | B1 |
| Filing date | Sep 12, 2016 |
| Priority date | Sep 12, 2016 |
| Publication date | Nov 21, 2017 |
| Grant date | Nov 21, 2017 |
Official abstract text for this publication.
The present invention is directed to a deep neural network (DNN) having a triplet network architecture, which is suitable to perform speaker recognition. In particular, the DNN includes three feed-forward neural networks, which are trained according to a batch process utilizing a cohort set of negative training samples. After each batch of training samples is processed, the DNN may be trained according to a loss function, e.g., utilizing a cosine measure of similarity between respective samples, along with positive and negative margins, to provide a robust representation of voiceprints.
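The loss described in the abstract (cosine similarity plus positive and negative margins, spelled out in claim 5) can be sketched in plain Python. This is a minimal illustration, not the patentee's implementation. It assumes that network outputs are plain lists of floats, that the negative distance is computed from the negative similarity S− (consistent with claim 3), and that the margins are grouped as (1 − cos θ)/2. The names `cosine`, `triplet_term`, and `batch_loss` are hypothetical.

```python
import math

# Margins as recited in claim 5, assuming the grouping (1 - cos(theta)) / 2.
M_POS = (1 - math.cos(math.pi / 4)) / 2
M_NEG = (1 - math.cos(3 * math.pi / 4)) / 2


def cosine(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_term(anchor, positive, cohort, k=1.0):
    """Per-sample term L(x_i, x_i+, X-) for one anchor/positive pair and a cohort."""
    s_pos = 0.5 * (1 + cosine(anchor, positive))
    # Only the most similar (hardest) cohort embedding contributes to S-.
    s_neg = 0.5 * (1 + max(cosine(anchor, neg) for neg in cohort))
    d_pos = 2 * (1 - min(s_pos + M_POS, 1))
    # Assumption: d- uses S-, as claim 3 describes.
    d_neg = 2 * (1 - max(s_neg + M_NEG - 1, 0))
    return k * math.exp(d_pos) / (math.exp(d_pos) + math.exp(d_neg))


def batch_loss(anchors, positives, cohort, k=1.0):
    """Sum of the per-sample terms over the P anchor/positive pairs of one speaker."""
    return sum(triplet_term(a, p, cohort, k) for a, p in zip(anchors, positives))
```

With these margins, a positive pair within an angle of π/4 contributes a positive distance of 0, and a cohort sample more than 3π/4 away from the anchor yields the maximum negative distance of 2, so well-separated triplets give the smallest loss.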
Opening claim text (preview).
The invention claimed is:

1. A speaker recognition device comprising: a memory device storing speech samples including dual sets of speech samples attributed to the same speaker, a cohort set of speech samples not attributed to the same speaker as the dual sets, and a set of speaker models; and a processor-based device configured to model a deep neural network with a triplet network architecture, wherein the processor-based device trains the deep neural network according to a batch process in which the dual sets of speech samples are fed through the deep neural network in combination with the cohort set of speech samples, and wherein the processor-based device feeds a recognition speech sample through the trained deep neural network, and verifies or identifies a user based on an output of the trained deep neural network responsive to the recognition speech sample and at least one of the speaker models.

2. The speaker recognition device of claim 1, wherein the deep neural network includes a first feed-forward neural network which receives and processes a first input to produce a first network output, a second feed-forward neural network which receives and processes a second input to produce a second network output, and a third feed-forward neural network which receives and processes a third input to produce a third network output; for each of a plurality of speakers, the memory device includes a first set of P speech samples $(x_1, \ldots, x_P)$ attributed to the speaker and a second set of P speech samples $(x_1^+, \ldots, x_P^+)$ attributed to the speaker, with P being an integer greater than or equal to 2; and the deep neural network is trained by the processor-based device such that, for each of the plurality of speakers, the deep neural network performs a batch processing during which the corresponding first set of speech samples are fed through the first feed-forward neural network, the corresponding second set of speech samples are fed through the second feed-forward neural network, and the cohort set of speech samples are fed through the third feed-forward neural network; upon completion of the batch processing, a loss function is computed based on the first network outputs, the second network outputs, and the third network outputs obtained based respectively on the corresponding first set of speech samples, the corresponding second set of speech samples, and the cohort set of speech samples; and the computed loss function is used to modify connection weights in each of the first, second, and third feed-forward neural networks according to a back propagation technique.

3. The speaker recognition device of claim 2, wherein the loss function is based on: a positive distance $d_+$ corresponding to a degree of similarity $S_+$ between the first network output responsive to one of the first set of speech samples $x_i$ and the second network output responsive to a corresponding one of the second set of speech samples $x_i^+$, and a negative distance $d_-$ corresponding to a degree of similarity $S_-$ between the first network output responsive to the one of the first set of speech samples $x_i$ and a most similar one of the third network outputs responsive to respective speech samples of the cohort set.

4. The speaker recognition device of claim 3, wherein the positive distance $d_+$ and the negative distance $d_-$ are determined by applying different respective margins $M_+$, $M_-$ to the corresponding degrees of similarity $S_+$, $S_-$.

5. The speaker recognition device of claim 4, wherein the loss function is defined by:

$$\text{Loss} = \sum_{i=1}^{P} L(x_i, x_i^+, X^-),$$

where

$$L(x_i, x_i^+, X^-) = K \, \frac{e^{d_+}}{e^{d_+} + e^{d_-}},$$

$$d_+ = 2\left(1 - \min\left(S_+ + M_+,\ 1\right)\right),$$

$$d_- = 2\left(1 - \max\left(S_- + M_- - 1,\ 0\right)\right),$$

$$S_+ = \tfrac{1}{2}\left(1 + \cos\left(EVx_i, EVx_i^+\right)\right),$$

$$S_- = \tfrac{1}{2}\left(1 + \max_{n=1:N} \cos\left(EVx_i, EVx_n^-\right)\right),$$

$x_n^-$ is the n-th one of the N negative speech samples fed during the N iterations, $EVx_i$ is the first network output responsive to one of the first set of speech samples, $EVx_i^+$ is the second network output responsive to one of the second set of speech samples, $EVx_n^-$ is the third network output responsive to the negative speech sample $x_n^-$, $M_+ = \frac{1 - \cos(\pi/4)}{2}$, $M_- = \frac{1 - \cos(3\pi/4)}{2}$, and K is a constant.

6. The speaker recognition device of claim 1, wherein each of the first, second, and third feed-forward neural networks includes at least one convolutional layer and a fully connected layer.

7. The speaker recognition device of claim 6, wherein each of the first, second, and third feed-forward neural networks further includes at least one max-pooling layer and a subsequent fully connected layer.

8. The speaker recognition device of claim 6, wherein each speech sample, which is inputted to a respective one of the first, second, and third feed-forward neural networks, is preprocessed by: partitioning an underlying speech signal into a plurality of overlapping windows; and extracting a plurality of features from each of the overlapping windows.

9. The speaker recognition device of claim 8, wherein each of the first, second, and third feed-forward neural networks includes a first convolutional layer to receive the preprocessed speech sample, the first convolutional layer comprises a number $N_C$ of convolutional filters, each of the $N_C$ convolutional filters has $F \times w_f$ neurons, where F corresponds to the height of the first convolutional layer, and $w_f$ corresponds to the width of the convolutional layer, and F is equivalent to the number of the features extracted from each of the overlapping windows, and $w_f$ is no greater than 5.

10. The speaker recognition device of claim 1, wherein the device is configured to perform a speaker verification task in which the user inputs a self-identification, and the recognition speech sample is used to confirm that an identity of the user is the same as the self-identification.

11. The speaker recognition device of claim 1, wherein the device is configured to perform a speaker identification task in which the recognition speech sample is used to identify the user from a plurality of potential identities stored in the memory device in association with respective speech samples.

12. The speaker recognition device of claim 1, further comprising an input device which receives a speech sample from the user as the recognition speech sample.

13. A method comprising: training a computer-implemented model of a deep neural network with a triplet network architecture based on a plurality of speech samples stored in a memory device, the plurality of speech samples including: dual sets of speech samples attributed to the same speaker, a cohort set of speech samples not attributed to the same speaker as the dual sets, and a set of speaker models; feeding a recognition speech sample through the trained deep neural network, and verifying or identifying a user based on an output of the trained deep neural network responsive to the recognition speech sample and at least one of the speaker models, wherein the training of the deep neural network is performed according to a batch process in which the dual sets of speech samples are fed through the deep neural network in combination with the cohort set of speech samples.

14. The method of claim 13, wherein the deep neural network comprises, a first feed-forward neural network each iteration of which receives and processes a first input in order to produce a first network output, a second feed-forward neural network each iteration of whic
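
Claims 10 and 11 distinguish verification (confirming a self-declared identity) from identification (choosing among enrolled identities). The difference can be sketched as below. This is an illustration under stated assumptions: the trained network's output is taken to be a fixed-length embedding, each speaker model is taken to be an embedding of the same length, scoring uses cosine similarity, and the threshold of 0.7 is a hypothetical value. The claim text above does not specify any of these test-time details, and the names `verify` and `identify` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def verify(embedding, claimed_model, threshold=0.7):
    """Verification: accept only if the probe is close enough to the claimed speaker's model.

    The threshold is an assumed tuning parameter, not a value from the patent.
    """
    return cosine(embedding, claimed_model) >= threshold


def identify(embedding, speaker_models):
    """Identification: return the enrolled identity whose model scores highest."""
    return max(speaker_models, key=lambda name: cosine(embedding, speaker_models[name]))


# Toy enrolled models and a probe embedding (made-up numbers for illustration).
models = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
probe = [0.9, 0.1]
accepted_as_alice = verify(probe, models["alice"])
accepted_as_bob = verify(probe, models["bob"])
best_match = identify(probe, models)
```

Verification compares the probe against a single model and needs a decision threshold, whereas identification ranks all enrolled models and needs none unless unknown speakers must also be rejected.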
Learning methods · CPC title
Architecture, e.g. interconnection topology · CPC title
Use of distortion metrics or a particular distance between probe pattern and reference templates · CPC title
Training, enrolment or model building · CPC title
Interactive procedures; Man-machine interfaces · CPC title