Speaker-invariant training via adversarial learning
US-10347241-B1 · Jul 9, 2019 · US
US11031018B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11031018-B2 |
| Application number | US-202017131182-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 22, 2020 |
| Priority date | Oct 31, 2019 |
| Publication date | Jun 8, 2021 |
| Grant date | Jun 8, 2021 |
Official abstract text for this publication.
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for personalized speaker verification are provided. One of the methods includes: obtaining first speech data of a speaker as a positive sample and second speech data of an entity different from the speaker as a negative sample; feeding the positive sample and the negative sample to a first model for determining voice characteristics to correspondingly output a positive voice characteristic and a negative voice characteristic of the speaker; obtaining a gradient based at least on the positive voice characteristic and the negative voice characteristic; and feeding the gradient to the first model to update one or more parameters of the first model to obtain a second model for personalized speaker verification.
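The update loop in the abstract can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the first model is a plain linear map `W`, the classifier is a fixed logistic layer `v` trained with binary cross-entropy, and the names `embed`, `sample_gradient`, `personalize`, and `loss` are hypothetical. The per-segment gradients are averaged and applied in one descent step scaled by a learning rate `lr`, which yields the "second model".

```python
import math


def embed(W, x):
    """Toy stand-in for the first model: a linear map from features to a voice vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sample_gradient(W, v, x, label):
    """Gradient of binary cross-entropy w.r.t. W for one speech segment.

    Backpropagates through a fixed logistic classifier v (an assumption).
    """
    p = sigmoid(sum(vi * ei for vi, ei in zip(v, embed(W, x))))
    err = p - label  # derivative of the loss w.r.t. the classifier logit
    return [[err * vi * xi for xi in x] for vi in v]


def personalize(W, v, positives, negatives, lr=0.5):
    """Average the per-segment gradients and take one descent step on W."""
    samples = [(x, 1.0) for x in positives] + [(x, 0.0) for x in negatives]
    grads = [sample_gradient(W, v, x, y) for x, y in samples]
    n = len(grads)
    avg = [[sum(g[i][j] for g in grads) / n for j in range(len(W[0]))]
           for i in range(len(W))]
    # The returned matrix plays the role of the "second model".
    return [[w - lr * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(W, avg)]


def loss(W, v, positives, negatives):
    """Mean binary cross-entropy over positive (label 1) and negative (label 0) segments."""
    samples = [(x, 1.0) for x in positives] + [(x, 0.0) for x in negatives]
    total = 0.0
    for x, y in samples:
        p = sigmoid(sum(vi * ei for vi, ei in zip(v, embed(W, x))))
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(samples)


if __name__ == "__main__":
    W = [[0.1, 0.0], [0.0, 0.1]]          # made-up initial parameters
    v = [1.0, -1.0]                        # made-up classifier weights
    pos = [[1.0, 0.2], [0.9, 0.1]]         # segments of the enrolled speaker
    neg = [[0.1, 1.0], [0.2, 0.8]]         # segments of other people
    W2 = personalize(W, v, pos, neg)
    print(loss(W, v, pos, neg), loss(W2, v, pos, neg))
```

A real system would replace the linear map with a neural speaker-embedding network and compute gradients with an autodiff framework; the sketch only shows the data flow from samples to gradient to updated parameters.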
Opening claim text (preview).
The invention claimed is:
1. A computer-implemented method for personalized speaker verification, comprising: obtaining first speech data of a speaker as a positive sample and second speech data of an entity different from the speaker as a negative sample, wherein the first speech data comprises one or more speech segments of the speaker, and wherein the second speech data comprises one or more speech segments of one or more people other than the speaker; feeding the positive sample and the negative sample to a first model for determining voice characteristics to correspondingly output a positive voice characteristic and a negative voice characteristic of the speaker by at least feeding the one or more speech segments of the speaker into the first model to correspondingly output one or more positive sample vectors, and feeding the one or more speech segments of the one or more people other than the speaker into the first model to correspondingly output one or more negative sample vectors; obtaining a gradient based at least on the positive voice characteristic and the negative voice characteristic; and feeding the gradient to the first model to update one or more parameters of the first model to obtain a second model for personalized speaker verification.
2. The method of claim 1, further comprising: averaging the one or more positive sample vectors to obtain a template vector of the speaker.
3. The method of claim 2, further comprising: obtaining speech data of a user; feeding the obtained speech data to the second model to obtain an input vector of the user; comparing the input vector of the user with the template vector of the speaker; and verifying if the user is the speaker based at least on the comparison.
4. The method of claim 1, wherein obtaining the gradient based at least on the positive voice characteristic and the negative voice characteristic comprises: feeding the one or more positive sample vectors and the one or more negative sample vectors into a neural network classifier to obtain one or more gradient vectors.
5. The method of claim 4, wherein obtaining the gradient based at least on the positive voice characteristic and the negative voice characteristic further comprises: averaging the one or more gradient vectors to obtain an average gradient vector of the speaker as the gradient.
6. The method of claim 5, wherein: feeding the gradient to the first model to update the one or more parameters of the first model comprises: feeding the average gradient vector of the speaker to the first model to update the one or more parameters of the first model; and the one or more parameters associate different neural layers of the first model.
7. The method of claim 4, wherein feeding the one or more positive sample vectors and the one or more negative sample vectors into the neural network classifier to obtain one or more gradient vectors comprises: obtaining the gradient based at least on backward propagation through a cross-entropy loss function of the neural network classifier.
8. The method of claim 1, wherein feeding the gradient to the first model to update the one or more parameters of the first model comprises: feeding the gradient to the first model to update the one or more parameters of the first model based at least on the gradient and an online machine learning rate.
9. The method of claim 8, feeding the gradient to the first model to update the one or more parameters of the first model based at least on the gradient and the online machine learning rate comprises: updating the one or more parameters in a direction in which the gradient descents at a fastest online machine learning rate.
10. The method of claim 1, wherein: before feeding the positive sample and the negative sample to the first model for determining voice characteristics, the first model has been trained at least by jointly minimizing a first loss function that optimizes speaker classification and a second loss function that optimizes speaker clustering.
11. The method of claim 10, wherein: the first loss function comprises a non-sampling-based loss function; and the second loss function comprises a Gaussian mixture loss function with non-unit multi-variant covariance matrix.
12. One or more non-transitory computer-readable storage media storing instructions executable by one or more processors, wherein execution of the instructions causes the one or more processors to perform operations comprising: obtaining first speech data of a speaker as a positive sample and second speech data of an entity different from the speaker as a negative sample, wherein the first speech data comprises one or more speech segments of the speaker, and wherein the second speech data comprises one or more speech segments of one or more people other than the speaker; feeding the positive sample and the negative sample to a first model for determining voice characteristics to correspondingly output a positive voice characteristic and a negative voice characteristic of the speaker by at least feeding the one or more speech segments of the speaker into the first model to correspondingly output one or more positive sample vectors, and feeding the one or more speech segments of the one or more people other than the speaker into the first model to correspondingly output one or more negative sample vectors; obtaining a gradient based at least on the positive voice characteristic and the negative voice characteristic; and feeding the gradient to the first model to update one or more parameters of the first model to obtain a second model for personalized speaker verification.
13. The one or more non-transitory computer-readable storage media of claim 12, wherein the operations further comprise: averaging the one or more positive sample vectors to obtain a template vector of the speaker.
14. The one or more non-transitory computer-readable storage media of claim 13, wherein the operations further comprise: obtaining speech data of a user; feeding the obtained speech data to the second model to obtain an input vector of the user; comparing the input vector of the user with the template vector of the speaker; and verifying if the user is the speaker based at least on the comparison.
15. The one or more non-transitory computer-readable storage media of claim 12, wherein obtaining the gradient based at least on the positive voice characteristic and the negative voice characteristic comprises: feeding the one or more positive sample vectors and the one or more negative sample vectors into a neural network classifier to obtain one or more gradient vectors.
16. The one or more non-transitory computer-readable storage media of claim 15, wherein obtaining the gradient based at least on the positive voice characteristic and the negative voice characteristic further comprises: averaging the one or more gradient vectors to obtain an average gradient vector of the speaker as the gradient.
17. The one or more non-transitory computer-readable storage media of claim 16, wherein: feeding the gradient to the first model to update the one or more parameters of the first model comprises: feeding the average gradient vector of the speaker to the first model to update the one or more parameters of the first model; and the one or more parameters associate different neural layers of the first model.
18. A system comprising one or more processors and one or more non-transitory computer-readable memories coupled to the one or more processors and configured with instruc
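The enrollment and verification steps of claims 2 and 3 (average the positive sample vectors into a template, then compare a new input vector against it) might look roughly like the sketch below. The claims do not say how the comparison is made; cosine similarity and the `threshold` value of 0.7 are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def template_vector(positive_vectors):
    """Element-wise mean of the speaker's positive sample vectors."""
    n = len(positive_vectors)
    dim = len(positive_vectors[0])
    return [sum(vec[i] for vec in positive_vectors) / n for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (assumed comparison metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify(input_vector, template, threshold=0.7):
    """Accept the user as the enrolled speaker if similarity reaches the threshold.

    The threshold is a made-up value; a deployed system would tune it on held-out data.
    """
    return cosine_similarity(input_vector, template) >= threshold


if __name__ == "__main__":
    # Vectors here are invented; in the claimed method they come from the second model.
    enrolled = template_vector([[1.0, 0.0], [0.8, 0.2]])
    print(verify([1.0, 0.1], enrolled))
    print(verify([0.0, 1.0], enrolled))
```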
Training, enrolment or model building · CPC title
Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction · CPC title
Artificial neural networks; Connectionist approaches · CPC title
Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions · CPC title
Decision making techniques; Pattern matching strategies · CPC title