Speaker verification using neural networks
US-2015127336-A1 · May 7, 2015 · US
US9978374B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9978374-B2 |
| Application number | US-201514846187-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 4, 2015 |
| Priority date | Sep 4, 2015 |
| Publication date | May 22, 2018 |
| Grant date | May 22, 2018 |
Official abstract text for this publication.
This document generally describes systems, methods, devices, and other techniques related to speaker verification, including (i) training a neural network for a speaker verification model, (ii) enrolling users at a client device, and (iii) verifying identities of users based on characteristics of the users' voices. Some implementations include a computer-implemented method. The method can include receiving, at a computing device, data that characterizes an utterance of a user of the computing device. A speaker representation can be generated, at the computing device, for the utterance using a neural network on the computing device. The neural network can be trained based on a plurality of training samples that each: (i) include data that characterizes a first utterance and data that characterizes one or more second utterances, and (ii) are labeled as a matching speakers sample or a non-matching speakers sample.
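The training-sample structure described in the abstract can be sketched as follows. This is a minimal illustration, not the patent's implementation: the names `TrainingSample` and `make_samples`, the use of string identifiers in place of utterance feature data, the alternating matching/non-matching schedule, and the default of two second utterances are all assumptions made for the example.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class TrainingSample:
    first: str          # identifier standing in for data that characterizes the first utterance
    seconds: List[str]  # identifiers for the one or more second utterances
    matching: bool      # True = matching speakers sample, False = non-matching


def make_samples(
    groups: Dict[str, Sequence[str]],  # speaker -> utterances of that speaker only
    n_samples: int,
    n_seconds: int = 2,                # hypothetical choice for this sketch
    seed: int = 0,
) -> List[TrainingSample]:
    """Draw labeled samples from per-speaker utterance groups."""
    rng = random.Random(seed)
    speakers = sorted(groups)
    samples = []
    for i in range(n_samples):
        matching = i % 2 == 0  # alternate labels (an assumption, for balance)
        spk_a = rng.choice(speakers)
        if matching:
            # First and second utterances all come from the same speaker's group.
            picked = rng.sample(list(groups[spk_a]), n_seconds + 1)
            first, seconds = picked[0], picked[1:]
        else:
            # Second utterances come from a different speaker's group.
            spk_b = rng.choice([s for s in speakers if s != spk_a])
            first = rng.choice(list(groups[spk_a]))
            seconds = rng.sample(list(groups[spk_b]), n_seconds)
        samples.append(TrainingSample(first, seconds, matching))
    return samples


groups = {
    name: [f"{name}_{k}" for k in range(4)]
    for name in ("alice", "bob", "carol")
}
samples = make_samples(groups, n_samples=6)
```

Each group must hold at least `n_seconds + 1` utterances so that a matching sample can be drawn without reusing the first utterance.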
Opening claim text (preview).
What is claimed is:

1. A computer-implemented method comprising: receiving, at a computing device, data that characterizes an utterance of a user of the computing device; generating, at the computing device, a speaker representation for the utterance using a neural network on the computing device that was trained based on a plurality of training samples in a training procedure, wherein each training sample of the plurality of training samples includes (i) a first training input component that characterizes a first utterance, (ii) a second training input component that characterizes one or more second utterances, and (iii) a first classification for the training sample that indicates whether a speaker of the first utterance is the same or different from a speaker of the one or more second utterances, wherein the training procedure includes, for each training sample of the plurality of training samples: (i) generating a second classification for the training sample that indicates whether the speaker of the first utterance is the same or different from a speaker of the one or more second utterances, the second classification based on an output that results from processing the first training input component and the second training input component with the neural network, and (ii) adjusting parameters of the neural network based on comparison of the first classification for the training sample and the second classification for the training sample; accessing, at the computing device, a speaker model for an authorized user of the computing device; evaluating, at the computing device, the speaker representation for the utterance with respect to the speaker model to determine whether the utterance was likely spoken by the authorized user of the computing device; and performing, at the computing device, an operation that is selected based on whether the utterance is determined to have been likely spoken by the authorized user of the computing device.

2. The computer-implemented method of claim 1, wherein each of the plurality of training samples was generated by selecting the first utterance and the one or more second utterances from groups of utterances that correspond to different speakers, such that each group of utterances consists only of utterances of the corresponding speaker for the respective group of utterances.

3. The computer-implemented method of claim 1, further comprising: obtaining a set of utterances of the authorized user of the computing device; inputting each utterance from the set of utterances into the neural network to generate a respective speaker representation for the utterance; and generating the speaker model for the authorized user of the computing device based on an average of the respective speaker representations for the utterances in the set of utterances of the authorized user.

4. The computer-implemented method of claim 1, wherein none of the plurality of training samples on which the neural network has been trained includes data that characterizes the utterance of the user of the computing device.

5. The computer-implemented method of claim 1, wherein generating, at the computing device, the speaker representation for the utterance comprises processing data that characterizes an entirety of the utterance with the neural network in a single pass through the neural network.

6. The computer-implemented method of claim 1, further comprising determining that the utterance was likely spoken by the authorized user of the computing device, wherein performing the operation that is selected based on whether the utterance is determined to have been likely spoken by the authorized user of the computing device comprises authenticating an identity of the user that submitted the utterance.

7. The computer-implemented method of claim 1, further comprising determining that the utterance was likely spoken by the authorized user of the computing device, wherein performing the operation that is selected based on whether the utterance is determined to have been likely spoken by the authorized user of the computing device comprises transitioning the computing device from a locked state to an unlocked state.

8. One or more non-transitory computer-readable media having instructions stored thereon that, when executed by one or more processors of a computing device, cause performance of operations comprising: receiving, at the computing device, data that characterizes an utterance of a user of the computing device; generating, at the computing device, a speaker representation for the utterance using a neural network on the computing device that was trained based on a plurality of training samples in a training procedure, wherein each training sample of the plurality of training samples includes (i) a first training input component that characterizes a first utterance, (ii) a second training input component that characterizes one or more second utterances, and (iii) a first classification for the training sample that indicates whether a speaker of the first utterance is the same or different from a speaker of the one or more second utterances, wherein the training procedure includes, for each training sample of the plurality of training samples: (i) generating a second classification for the training sample that indicates whether the speaker of the first utterance is the same or different from a speaker of the one or more second utterances, the second classification based on an output that results from processing the first training input component and the second training input component with the neural network, and (ii) adjusting parameters of the neural network based on comparison of the first classification for the training sample and the second classification for the training sample; accessing, at the computing device, a speaker model for an authorized user of the computing device; evaluating, at the computing device, the speaker representation for the utterance with respect to the speaker model to determine whether the utterance was likely spoken by the authorized user of the computing device; and performing, at the computing device, an operation that is selected based on whether the utterance is determined to have been likely spoken by the authorized user of the computing device.

9. The non-transitory computer-readable media of claim 8, wherein each of the plurality of training samples was generated by selecting the first utterance and the one or more second utterances from groups of utterances that correspond to different speakers, such that each group of utterances consists only of utterances of the corresponding speaker for the respective group of utterances.

10. The non-transitory computer-readable media of claim 8, wherein the operations further comprise: obtaining a set of utterances of the authorized user of the computing device; inputting each utterance from the set of utterances into the neural network to generate a respective speaker representation for the utterance; and generating the speaker model for the authorized user of the computing device based on an average of the respective speaker representations for the utterances in the set of utterances of the authorized user.

11. The non-transitory computer-readable media of claim 8, wherein none of the plurality of training samples on which the neural network has been trained includes data that characterizes the utterance of the user of the computing device.

12. The non-transitory computer-readable media of claim 8, wherein generating, at the computing device, the speaker representation for the utterance comprises processing data that characterizes an entirety of the utteranc
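
The enrollment and verification steps recited in the claims (a speaker model built from an average of speaker representations, evaluation of a new representation against that model, and selection of an operation such as unlocking the device) can be sketched as below. This is an illustrative sketch under stated assumptions: the speaker representations are taken as already-computed vectors rather than produced by a neural network, and the cosine-similarity score, the `0.8` threshold, and the function names `enroll`, `verify`, and `select_operation` are hypothetical choices that the claim text above does not specify.

```python
import math
from typing import List, Sequence


def enroll(representations: Sequence[Sequence[float]]) -> List[float]:
    """Build a speaker model as the element-wise average of representations."""
    n = len(representations)
    dim = len(representations[0])
    return [sum(r[d] for r in representations) / n for d in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (assumed scoring function)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify(
    representation: Sequence[float],
    speaker_model: Sequence[float],
    threshold: float = 0.8,  # hypothetical value; would be tuned in practice
) -> bool:
    """Decide whether the utterance was likely spoken by the enrolled user."""
    return cosine_similarity(representation, speaker_model) >= threshold


def select_operation(accepted: bool) -> str:
    """Pick an operation based on the verification outcome."""
    return "unlock" if accepted else "stay_locked"


# Toy two-dimensional representations of the authorized user's enrollment utterances.
speaker_model = enroll([[1.0, 0.0], [0.8, 0.2]])
print(select_operation(verify([1.0, 0.1], speaker_model)))
print(select_operation(verify([0.0, 1.0], speaker_model)))
```

Averaging keeps the stored speaker model the same size regardless of how many enrollment utterances are collected, so verification costs one vector comparison per attempt.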
Artificial neural networks; Connectionist approaches · CPC title
Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction · CPC title
Training, enrolment or model building · CPC title
using biometric data, e.g. fingerprints, iris scans or voice recognition · CPC title