Speaker-invariant training via adversarial learning
US-10347241-B1 · Jul 9, 2019 · US
US10997980B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10997980-B2 |
| Application number | US-202017081956-A |
| Country | US |
| Kind code | B2 |
| Filing date | Oct 27, 2020 |
| Priority date | Oct 31, 2019 |
| Publication date | May 4, 2021 |
| Grant date | May 4, 2021 |
Official abstract text for this publication.
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for determining voice characteristics are provided. One of the methods includes: obtaining speech data of a speaker; inputting the speech data into a model trained at least by jointly minimizing a first loss function and a second loss function, wherein the first loss function comprises a non-sampling-based loss function and the second function comprises a Gaussian mixture loss function with non-unit multi-variant covariance matrix; and obtaining from the trained model one or more voice characteristics of the speaker.
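The joint objective described in the abstract can be illustrated with a minimal sketch for a single embedding. This is not the patent's own implementation: it assumes an additive margin softmax for the non-sampling-based loss and a common large-margin Gaussian mixture formulation with a diagonal (non-unit) covariance for the second loss. All function names, the unweighted sum of the two losses, and the hyperparameter defaults (`scale`, `margin`, `alpha`, `lam`) are hypothetical choices made for illustration.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def am_softmax_loss(embedding, class_weights, label, scale=30.0, margin=0.2):
    """Additive margin softmax: subtract a margin from the target cosine only."""
    logits = []
    for k, weight in enumerate(class_weights):
        cos_k = cosine(embedding, weight)
        logits.append(scale * (cos_k - margin if k == label else cos_k))
    return log_sum_exp(logits) - logits[label]


def gaussian_mixture_loss(embedding, means, variances, label, alpha=0.01, lam=0.01):
    """Large-margin Gaussian mixture loss with per-dimension variances.

    Assumption: the covariance of each speaker's Gaussian is diagonal, so
    `variances[k]` holds one variance per embedding dimension.
    """
    logits, distances, half_log_dets = [], [], []
    for k, (mean, var) in enumerate(zip(means, variances)):
        # Half the squared Mahalanobis distance to speaker k's mean.
        d_k = 0.5 * sum((x - m) ** 2 / v for x, m, v in zip(embedding, mean, var))
        half_log_det = 0.5 * sum(math.log(v) for v in var)
        # The margin enlarges the distance to the true speaker's mean only.
        widened = d_k * (1.0 + alpha) if k == label else d_k
        logits.append(-widened - half_log_det)
        distances.append(d_k)
        half_log_dets.append(half_log_det)
    classification = log_sum_exp(logits) - logits[label]
    likelihood = distances[label] + half_log_dets[label]
    return classification + lam * likelihood


def joint_loss(embedding, class_weights, means, variances, label):
    """Sum of the two losses; in training both would be minimized together."""
    return (am_softmax_loss(embedding, class_weights, label)
            + gaussian_mixture_loss(embedding, means, variances, label))
```

In this sketch the first term pushes embeddings of different speakers apart (classification), while the second pulls each embedding toward its speaker's Gaussian mean (clustering). A real system would compute both over mini-batches with automatic differentiation rather than per sample.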
Opening claim text (preview).
The invention claimed is:
1. A computer-implemented method for determining voice characteristics, comprising: obtaining speech data of a speaker; inputting the speech data into a model trained at least by jointly minimizing a first loss function and a second loss function, wherein the first loss function comprises a non-sampling-based loss function and the second loss function comprises a Gaussian mixture loss function with non-unit multi-variant covariance matrix; and obtaining from the trained model one or more voice characteristics of the speaker.
2. The method of claim 1, wherein: training the model by minimizing the first loss function optimizes speaker classification; and training the model by minimizing the second loss function optimizes speaker clustering.
3. The method of claim 1, wherein: the non-unit multi-variant covariance matrix comprises a standard deviation diagonal matrix.
4. The method of claim 1, wherein: the Gaussian mixture loss function with non-unit multi-variant covariance matrix comprises a large margin Gaussian mixture loss function.
5. The method of claim 1, wherein: the non-sampling-based loss function comprises an additive margin softmax loss function.
6. The method of claim 1, wherein: the first loss function acts as a regularizer to the second loss function; and the second loss function acts as a regularizer to the first loss function.
7. The method of claim 1, further comprising: obtaining the one or more voice characteristics for each of one or more speakers; obtaining the one or more voice characteristics for a candidate user; comparing the one or more voice characteristics of the candidate user with the one or more voice characteristics of the each of the one or more speakers; and identifying whether the candidate user is any of the one or more speakers based at least on the comparison.
8. The method of claim 1, further comprising: obtaining the one or more voice characteristics for a candidate user; comparing the one or more voice characteristics of the candidate user with the one or more voice characteristics of the speaker; and verifying whether the candidate user is the speaker based at least on the comparison.
9. The method of claim 7, wherein: comparing the one or more voice characteristics of the candidate user with the one or more voice characteristics of the each of the one or more speakers comprises: comparing, with a threshold, a distance between a vector representing the one or more voice characteristics of the candidate user and a different vector representing the one or more voice characteristics of the each of the one or more speakers.
10. The method of claim 1, wherein: obtaining the speech data of the speaker comprises obtaining a spectrogram corresponding to the speech data, and obtaining a plurality of feature vectors corresponding to the spectrogram; and inputting the speech data into the trained model comprises inputting the plurality of feature vectors into the trained model.
11. The method of claim 10, wherein the trained model comprises: a first convolution layer configured to receive the plurality of feature vectors as an input of the first convolution layer; a first pooling layer configured to receive an output of the first convolution layer as an input of the first pooling layer; a plurality of residual network layers configured to receive an output of the first pooling layer as an input of the plurality of residual network layers; a second convolution layer configured to receive an output of the plurality of residual network layers as an input of the second convolution layer; a second pooling layer configured to receive an output of the second convolution layer as an input of the second pooling layer; and an embedding layer configured to receive an output of the second pooling layer as an input of the embedding layer and output a vector representing the one or more voice characteristics of the speaker.
12. The method of claim 11, wherein: minimizing the first loss function comprises, for at least the embedding layer, minimizing a non-sampling-based loss function to optimize between-class classification error; and minimizing the second loss function comprises, for at least the embedding layer, minimizing a Gaussian mixture loss function with non-unit multi-variant covariance matrix to reduce intra-class variation.
13. The method of claim 1, wherein: minimizing the first loss function comprises increasing a margin linearly from zero to a target margin value for annealing.
14. A non-transitory computer-readable storage medium storing instructions executable by one or more processors, wherein execution of the instructions causes the one or more processors to perform operations comprising: obtaining speech data of a speaker; inputting the speech data into a model trained at least by jointly minimizing a first loss function and a second loss function, wherein the first loss function comprises a non-sampling-based loss function and the second loss function comprises a Gaussian mixture loss function with non-unit multi-variant covariance matrix; and obtaining from the trained model one or more voice characteristics of the speaker.
15. The non-transitory computer-readable storage medium of claim 14, wherein: training the model by minimizing the first loss function optimizes speaker classification; and training the model by minimizing the second loss function optimizes speaker clustering.
16. The non-transitory computer-readable storage medium of claim 14, wherein: the non-unit multi-variant covariance matrix comprises a standard deviation diagonal matrix.
17. The non-transitory computer-readable storage medium of claim 14, wherein: the Gaussian mixture loss function with non-unit multi-variant covariance matrix comprises a large margin Gaussian mixture loss function.
18. The non-transitory computer-readable storage medium of claim 14, wherein: the non-sampling-based loss function comprises an additive margin softmax loss function.
19. The non-transitory computer-readable storage medium of claim 14, wherein: the first loss function acts as a regularizer to the second loss function; and the second loss function acts as a regularizer to the first loss function.
20. A system comprising one or more processors and one or more non-transitory computer-readable memories coupled to the one or more processors and configured with instructions executable by the one or more processors to cause the system to perform operations comprising: obtaining speech data of a speaker; inputting the speech data into a model trained at least by jointly minimizing a first loss function and a second loss function, wherein the first loss function comprises a non-sampling-based loss function and the second loss function comprises a Gaussian mixture loss function with non-unit multi-variant covariance matrix; and obtaining from the trained model one or more voice characteristics of the speaker.
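The comparison steps in claims 7 to 9 and the margin annealing in claim 13 can be sketched as follows. The claims do not name a distance metric, a threshold value, or a schedule length, so the cosine distance, the default `threshold`, the convention that a smaller distance means a match, and the `warmup_steps` parameter are all assumptions made for illustration, as are the function names.

```python
import math


def cosine_distance(u, v):
    """1 minus cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def verify(candidate, enrolled, threshold=0.4):
    """Verification: accept the candidate if its distance to the
    enrolled speaker's vector is within the threshold."""
    return cosine_distance(candidate, enrolled) <= threshold


def identify(candidate, enrolled_by_speaker, threshold=0.4):
    """Identification: return the closest enrolled speaker within the
    threshold, or None if no enrolled speaker is close enough."""
    best_name, best_distance = None, threshold
    for name, vector in enrolled_by_speaker.items():
        distance = cosine_distance(candidate, vector)
        if distance <= best_distance:
            best_name, best_distance = name, distance
    return best_name


def annealed_margin(step, warmup_steps, target_margin):
    """Increase the margin linearly from zero to the target value,
    then hold it constant once warmup_steps is reached."""
    if warmup_steps <= 0:
        return target_margin
    return target_margin * min(1.0, step / warmup_steps)
```

A hypothetical usage: `identify([0.9, 0.1], {"alice": [1.0, 0.0], "bob": [0.0, 1.0]})` compares the candidate vector against each enrolled vector and returns the closest speaker that passes the threshold. During training, `annealed_margin(step, warmup_steps, target_margin)` would supply the margin passed to the first loss function at each step.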
Artificial neural networks; Connectionist approaches · CPC title
Training, enrolment or model building · CPC title
Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction · CPC title
Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions · CPC title
Decision making techniques; Pattern matching strategies · CPC title