US2016019883A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2016019883-A1 |
| Application number | US-201414331230-A |
| Country | US |
| Kind code | A1 |
| Filing date | Jul 15, 2014 |
| Priority date | Jul 15, 2014 |
| Publication date | Jan 21, 2016 |
| Grant date | — |
Official abstract text for this publication.
A method for inter-dataset variability compensation, the method comprising using at least one hardware processor for: receiving a heterogeneous development dataset comprising multiple samples and metadata associated with at least some of the multiple samples; dividing the multiple samples into multiple homogenous subsets, based on the metadata; averaging high-level features of each of the multiple homogenous subsets, to produce multiple central high-level features for the multiple homogenous subsets, respectively; computing an inter-dataset variability subspace spanned by the multiple central high-level features; removing the inter-dataset variability subspace from the high-level features of the multiple homogenous subsets, to produce denoised samples; and training a machine learning system using the denoised speech samples.
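The steps in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the applicant's implementation: the function names are hypothetical, the high-level features are assumed to be rows of a 2-D array, and the "subspace spanned by the central features" is read here as the top-k PCA directions of the centered subset means (claim 7 mentions PCA as one option).

```python
import numpy as np

def split_by_metadata(features, labels):
    """Group feature rows into homogenous subsets keyed by a metadata label."""
    groups = {}
    for idx, label in enumerate(labels):
        groups.setdefault(label, []).append(idx)
    return {label: features[idx] for label, idx in groups.items()}

def subset_mean_subspace(subsets, k):
    """Return an orthonormal (dim x k) basis for the top-k PCA directions
    of the subset means (assumption: PCA on the centered means)."""
    means = np.stack([x.mean(axis=0) for x in subsets.values()])
    centered = means - means.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T

def remove_subspace(features, basis):
    """Project features onto the orthogonal complement of span(basis)."""
    return features - features @ basis @ basis.T
```

With n subsets the centered means have rank at most n - 1, so choosing k = n - 1 removes all between-subset mean differences; a smaller k removes only the dominant directions. The cleaned features would then be passed to whatever back-end is being trained.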
Opening claim text (preview).
What is claimed is:
1. A method for inter-dataset variability compensation, the method comprising using at least one hardware processor for: receiving a heterogeneous development dataset comprising multiple samples and metadata associated with at least some of the multiple samples; dividing the multiple samples into multiple homogenous subsets, based on the metadata; computing a statistical measure of high-level features of each of the multiple homogenous subsets, to produce multiple central high-level features for the multiple homogenous subsets, respectively; computing an inter-dataset variability subspace spanned by the multiple central high-level features; removing the inter-dataset variability subspace from the high-level features of the multiple homogenous subsets, to produce denoised samples; and training a machine learning system using the denoised speech samples.
2. The method according to claim 1, wherein the high-level features are selected from the group consisting of: i-vectors, GMM (Gaussian Mixture Model) supervectors, HMM (Hidden Markov Model) supervectors, d-vectors, JFA (Joint Factor Analysis) supervectors, LBP (Local Binary Patterns), HOG (Histograms of Oriented Gradients), and EBIF (Early Biologically-Inspired Features).
3. The method according to claim 2, wherein the machine learning system is selected from the group consisting of: a PLDA (Probabilistic Linear Discriminant Analysis)-based system, an SVM (Support Vector Machine)-based system, a neural network-based system, a NAP (Nuisance Attribute Projection)-based system, a WCCN (Within-Speaker Covariance Matrix)-based system, and an LDA (Linear Discriminant Analysis)-based system.
4. The method according to claim 3, wherein the multiple samples are speech samples.
5. The method according to claim 4, wherein the heterogeneous development dataset is devoid of speech samples from a target domain of the speaker recognition.
6. The method according to claim 4, wherein the metadata comprises at least one parameter selected from the group consisting of: speaker gender, spoken language and recordation setting.
7. The method according to claim 4, wherein the computing of the inter-dataset variability subspace comprises PCA (Principal Component Analysis).
8. A method for inter-dataset variability compensation for speaker recognition, the method comprising using at least one hardware processor for: receiving a heterogeneous development dataset comprising multiple speech samples; dividing the multiple speech samples into multiple homogenous subsets; for each subset i of the multiple homogenous subsets: (a) estimating PLDA (Probabilistic Linear Discriminant Analysis) hyper-parameters {μ_i, B_i, W_i}, wherein μ denotes a center of an i-vector space, B denotes a between-speaker covariance matrix and W denotes a within-speaker covariance matrix, and (b) computing an i-vector subspace S_μ corresponding to {μ_i}, an i-vector subspace S_W corresponding to {W_i}, and an i-vector subspace S_B corresponding to {B_i}; joining i-vector subspaces S_μ, S_W and S_B into a single subspace S; removing subspace S from i-vectors of the multiple speech samples, to produce denoised speech samples; and training a PLDA speaker recognition system using the denoised speech samples.
9. The method according to claim 8, further comprising smoothing B by linear interpolation using an estimated diagonal of B.
10. The method according to claim 8, wherein: the heterogeneous development dataset further comprises metadata associated with at least some of the multiple speech samples; and the dividing is based on the metadata.
11. The method according to claim 8, wherein the computing of each of the i-vector subspaces S_μ, S_W and S_B comprises PCA (Principal Component Analysis).
12. The method according to claim 8, further comprising computing an average of squared {W_i}, and finding a k number of largest eigenvalues of the squared {W_i}, wherein the k largest eigenvalues span the i-vector subspace S_W.
13. The method according to claim 12, further comprising whitening the i-vector subspace S_W with respect to W.
14. The method according to claim 8, further comprising computing an average of squared {B_i}, and finding an m number of largest eigenvalues of the squared {B_i}, wherein the k largest eigenvalues span the i-vector subspace S_B.
15. The method according to claim 14, further comprising whitening the i-vector subspace S_B with respect to B.
16. A computer program product for inter-dataset variability compensation for speaker recognition, the computer program product comprising a non-transitory computer-readable storage medium having program code embodied therewith, the program code executable by at least one hardware processor to: receive a heterogeneous development dataset comprising multiple speech samples; divide the multiple speech samples into multiple homogenous subsets; for each subset i of the multiple homogenous subsets: (a) estimate PLDA (Probabilistic Linear Discriminant Analysis) hyper-parameters {μ_i, B_i, W_i}, wherein μ denotes a center of an i-vector space, B denotes a between-speaker covariance matrix and W denotes a within-speaker covariance matrix, and (b) compute an i-vector subspace S_μ corresponding to {μ_i}, an i-vector subspace S_W corresponding to {W_i}, and an i-vector subspace S_B corresponding to {B_i}; join i-vector subspaces S_μ, S_W and S_B into a single subspace S; remove subspace S from i-vectors of the multiple speech samples, to produce denoised speech samples; and train a PLDA speaker recognition system using the denoised speech samples.
17. The computer program product according to claim 16, wherein the program code is further executable by the at least one hardware processor to smooth B by linear interpolation using an estimated diagonal of B.
18. The computer program product according to claim 16, wherein: the heterogeneous development dataset further comprises metadata associated with at least some of the multiple speech samples; and the dividing is based on the metadata.
19. The computer program product according to claim 16, wherein the computing of each of the i-vector subspaces S_μ, S_W and S_B comprises PCA (Principal Component Analysis).
20. The computer program product according to claim 16, wherein the program code is further executable by the at least one hardware processor to: compute an average of squared {W_i}; find a k number of largest eigenvalues of the squared {W_i}, wherein the k largest eigenvalues span the i-vector subspace S_W; whiten the i-vector subspace S_W with respect to W; compute an average of squared {B_i}; find an m number of largest eigenvalues of the squared {B_i}, wherein the m largest eigenvalues span the i-vector subspace S_B; and whiten the i-vector subspace S_B with respect to B.
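The subspace construction in the PLDA-based claims can be illustrated with a short NumPy sketch. Several things here are assumptions rather than anything stated in the claims: the per-subset hyper-parameters are taken as already estimated, all function names and the interpolation weight `alpha` are hypothetical, "squared" is read as the matrix product of a covariance with itself, the subspace is taken to be spanned by the eigenvectors belonging to the largest eigenvalues, and the whitening steps are omitted.

```python
import numpy as np

def smooth_with_diagonal(cov, alpha=0.5):
    """Linearly interpolate a covariance with its own diagonal.
    `alpha` is a hypothetical weight; the claims do not give a value."""
    return alpha * cov + (1.0 - alpha) * np.diag(np.diag(cov))

def top_eigvecs(sym, k):
    """Eigenvectors of a symmetric matrix for its k largest eigenvalues."""
    _, vecs = np.linalg.eigh(sym)      # eigenvalues come back ascending
    return vecs[:, ::-1][:, :k]

def mean_subspace(mus, k):
    """Top-k PCA directions of the per-subset centers mu_i (one per row)."""
    centered = mus - mus.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T

def squared_cov_subspace(covs, k):
    """Top-k eigenvectors of the average of the squared covariances."""
    avg_sq = np.mean([c @ c for c in covs], axis=0)
    return top_eigvecs(avg_sq, k)

def join_subspaces(*bases, tol=1e-10):
    """Orthonormal basis for the union of several subspaces."""
    stacked = np.hstack(bases)
    u, s, _ = np.linalg.svd(stacked, full_matrices=False)
    return u[:, s > tol * s.max()]     # drop numerically redundant directions

def remove_subspace(ivectors, basis):
    """Project i-vectors (rows) onto the orthogonal complement of span(basis)."""
    return ivectors - ivectors @ basis @ basis.T
```

In use, one would build `S = join_subspaces(mean_subspace(mus, k_mu), squared_cov_subspace(Ws, k), squared_cov_subspace(Bs, m))`, apply `remove_subspace` to every development i-vector, and train the PLDA model on the result. Joining through an SVD keeps the combined basis orthonormal even when the three subspaces overlap.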
Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices · CPC title
Noise filtering · CPC title
Training, enrolment or model building · CPC title
Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech (G10L21/02 takes precedence) · CPC title
Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions · CPC title