Neural Networks For Speaker Verification
US-2017069327-A1 · Mar 9, 2017 · US
US10381009B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10381009-B2 |
| Application number | US-201715818231-A |
| Country | US |
| Kind code | B2 |
| Filing date | Nov 20, 2017 |
| Priority date | Sep 12, 2016 |
| Publication date | Aug 13, 2019 |
| Grant date | Aug 13, 2019 |
Official abstract text for this publication.
The present invention is directed to a deep neural network (DNN) having a triplet network architecture, which is suitable to perform speaker recognition. In particular, the DNN includes three feed-forward neural networks, which are trained according to a batch process utilizing a cohort set of negative training samples. After each batch of training samples is processed, the DNN may be trained according to a loss function, e.g., utilizing a cosine measure of similarity between respective samples, along with positive and negative margins, to provide a robust representation of voiceprints.
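The abstract names the ingredients of the loss (a cosine similarity measure, a cohort of negative samples, and positive and negative margins) but does not give a formula. The following is a minimal sketch of one hinge-style loss built from those ingredients, not the patent's own loss function. The function names, the margin values, and the choice to penalize only the most similar cohort sample are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def margin_loss(anchor, positive, cohort, pos_margin=0.9, neg_margin=0.1):
    """Hypothetical hinge-style loss over one (anchor, positive, cohort) group.

    anchor, positive: network outputs for two samples of the same speaker.
    cohort: network outputs for samples from other speakers.
    pos_margin, neg_margin: illustrative values, not taken from the patent.
    """
    s_pos = cosine(anchor, positive)
    # Assumption: only the cohort sample most similar to the anchor counts.
    s_neg = max(cosine(anchor, negative) for negative in cohort)
    # Penalize same-speaker pairs whose similarity falls below pos_margin.
    loss_pos = max(0.0, pos_margin - s_pos)
    # Penalize cohort samples whose similarity rises above neg_margin.
    loss_neg = max(0.0, s_neg - neg_margin)
    return loss_pos + loss_neg
```

With these assumed margins, a group contributes zero loss once the same-speaker similarity is at least 0.9 and every cohort similarity is at most 0.1.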
Opening claim text (preview).
The invention claimed is:
1. A speaker recognition device including a processor-based device having been configured to model a trained deep neural network with a triplet network architecture, the deep neural network having been trained according to a process in which dual sets of speech samples are fed through the deep neural network in combination with a cohort set of speech samples not attributed to the same speaker as the dual sets, comprising: a memory device storing speech samples including a set of speaker models; and the processor-based device feeding a recognition speech sample through the trained deep neural network, and verifying or identifying a user based on an output of the trained deep neural network responsive to the recognition speech sample and at least one of the speaker models.
2. The speaker recognition device of claim 1, wherein the deep neural network includes, a first feed-forward neural network which receives and processes a first input to produce a first network output, a second feed-forward neural network which receives and processes a second input to produce a second network output, and a third feed-forward neural network which receives and processes a third input to produce a third network output.
3. The speaker recognition device of claim 2, wherein each of the first, second, and third feed-forward neural networks includes at least one convolutional layer and a fully connected layer.
4. The speaker recognition device of claim 3, wherein each of the first, second, and third feed-forward neural networks further includes at least one max-pooling layer and a subsequent fully connected layer.
5. The speaker recognition device of claim 3, wherein each speech sample, which is inputted to a respective one of the first, second, and third feedforward neural networks, is preprocessed by: partitioning an underlying speech signal into a plurality of overlapping windows; and extracting a plurality of features from each of the overlapping windows.
6. The speaker recognition device of claim 5, wherein each of the first, second, and third feed-forward neural networks includes a first convolutional layer to receive the preprocessed speech sample, the first convolutional layer comprises a number $N_C$ of convolutional filters, each of the $N_C$ convolutional filters has $F \times w_f$ neurons, where $F$ corresponds to the height of the first convolutional layer, and $w_f$ corresponds to the width of the convolutional layer, and $F$ is equivalent to the number of the features extracted from each of the overlapping windows.
7. The speaker recognition device of claim 1, wherein the device is configured to perform a speaker verification task in which the user inputs a self-identification, and the recognition speech sample is used to confirm that an identity of the user is the same as the self-identification.
8. The speaker recognition device of claim 1, wherein the device is configured to perform a speaker identification task in which the recognition speech sample is used to identify the user from a plurality of potential identities stored in the memory device in association with respective speech samples.
9. The speaker recognition device of claim 1, further comprising an input device which receives a speech sample from the user as the recognition speech sample.
10. A method of using a speaker recognition device including a processor-based device having been configured to model a trained deep neural network with a triplet network architecture, the deep neural network having been trained according to a process in which dual sets of speech samples are fed through the deep neural network in combination with a cohort set of speech samples not attributed to the same speaker as the dual sets, the method comprising: storing speech samples including a set of speaker models; and feeding a recognition speech sample through the trained deep neural network, and verifying or identifying a user based on an output of the trained deep neural network responsive to the recognition speech sample and at least one of the speaker models.
11. The method of claim 10, further comprising preprocessing each speech sample by: partitioning an underlying speech signal into a plurality of overlapping windows; and extracting a plurality of features from each of the overlapping windows.
12. The method of claim 10, further comprising: performing a speaker verification task in which the user inputs a self-identification, and the recognition speech sample is used to confirm that an identity of the user is the same as the self-identification.
13. The method of claim 10, further comprising: performing a speaker identification task in which the recognition speech sample is used to identify the user from a plurality of stored potential identities in association with respective speech samples.
14. The method of claim 10, further comprising: receiving a speech sample from the user as the recognition speech sample.
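The preprocessing recited in claims 5 and 11 (overlapping windows, then per-window features) and the filter dimensions recited in claim 6 can be illustrated with a small sketch. The claims do not specify window length, overlap, or which features are extracted, so `window_len`, `hop_len`, and the three toy features below are hypothetical placeholders, as are all function names.

```python
def frame_signal(signal, window_len, hop_len):
    """Partition a signal into overlapping windows.

    Windows overlap because hop_len is smaller than window_len.
    A trailing remainder shorter than window_len is dropped.
    """
    if not 0 < hop_len < window_len:
        raise ValueError("hop_len must be positive and smaller than window_len")
    last_start = len(signal) - window_len
    return [signal[s:s + window_len] for s in range(0, last_start + 1, hop_len)]


def frame_features(frame):
    """Extract F = 3 placeholder features from one window.

    These are stand-ins for illustration only; the claims do not name
    the features that are actually used.
    """
    mean = sum(frame) / len(frame)
    energy = sum(x * x for x in frame) / len(frame)
    sign_changes = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [mean, energy, float(sign_changes)]


def first_layer_neuron_count(num_filters, num_features, filter_width):
    """Total neurons in the first convolutional layer: N_C * F * w_f.

    Follows claim 6: each of the N_C filters has F x w_f neurons, where F
    equals the number of features extracted per window.
    """
    return num_filters * num_features * filter_width


# Example: a 10-sample toy signal, windows of 4 samples advancing by 2.
frames = frame_signal(list(range(10)), window_len=4, hop_len=2)
features = [frame_features(f) for f in frames]  # one F-long row per window
```

Stacking the per-window feature rows gives an input whose height is F, which is why claim 6 ties the filter height to the number of extracted features.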
Learning methods · CPC title
Architecture, e.g. interconnection topology · CPC title
Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction · CPC title
Artificial neural networks; Connectionist approaches · CPC title
Interactive procedures; Man-machine interfaces · CPC title