Apparatus and method for recognizing continuous speech
US-2015006175-A1 · Jan 1, 2015 · US
US9805716B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9805716-B2 |
| Application number | US-201615042309-A |
| Country | US |
| Kind code | B2 |
| Filing date | Feb 12, 2016 |
| Priority date | Feb 12, 2015 |
| Publication date | Oct 31, 2017 |
| Grant date | Oct 31, 2017 |
Official abstract text for this publication.
Provided is an apparatus for large vocabulary continuous speech recognition (LVCSR) based on a context-dependent deep neural network hidden Markov model (CD-DNN-HMM) algorithm. The apparatus may include an extractor configured to extract acoustic model-state level information corresponding to an input speech signal from a training data model set using at least one of a first feature vector based on a gammatone filterbank signal analysis algorithm and a second feature vector based on a bottleneck algorithm, and a speech recognizer configured to provide a result of recognizing the input speech signal based on the extracted acoustic model-state level information.
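
The abstract names a gammatone filterbank as the basis of the first feature vector but, in this excerpt, gives no filter parameters. As a rough illustration only, the sketch below builds the centre frequencies and impulse responses of a conventional gammatone filterbank. The Glasberg and Moore ERB formula, the 1.019 bandwidth factor, the filter order of 4, and all function names are assumptions taken from common practice, not from the patent.

```python
import math


def erb_bandwidth(fc_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg & Moore approximation)."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)


def hz_to_erb_rate(f_hz):
    """Map a frequency in Hz onto the ERB-rate scale."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)


def erb_rate_to_hz(erb_rate):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (erb_rate / 21.4) - 1.0) * 1000.0 / 4.37


def centre_frequencies(low_hz, high_hz, num_channels):
    """Centre frequencies equally spaced on the ERB-rate scale (num_channels >= 2)."""
    lo, hi = hz_to_erb_rate(low_hz), hz_to_erb_rate(high_hz)
    step = (hi - lo) / (num_channels - 1)
    return [erb_rate_to_hz(lo + i * step) for i in range(num_channels)]


def gammatone_impulse_response(fc_hz, sample_rate, duration_s=0.025, order=4):
    """Sampled g(t) = t^(order-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t), unnormalised."""
    b = 1.019 * erb_bandwidth(fc_hz)  # assumed bandwidth scaling
    num_samples = round(duration_s * sample_rate)
    response = []
    for k in range(num_samples):
        t = k / sample_rate
        envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * b * t)
        response.append(envelope * math.cos(2.0 * math.pi * fc_hz * t))
    return response
```

Convolving a speech frame with each channel's impulse response and taking the per-channel energy would give one static feature value per channel; how the patent derives its intra-frame and inter-frame values from such outputs is not specified in this excerpt.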
Opening claim text (preview).
What is claimed is:

1. A speech recognition apparatus, comprising: an extractor configured to extract acoustic model-state level information corresponding to an input speech signal from a training data model set using at least one of a first feature vector based on a gammatone filterbank signal analysis algorithm and a second feature vector based on a bottleneck algorithm; and a speech recognizer configured to provide a result of recognizing the input speech signal based on the extracted acoustic model-state level information, wherein the extractor is configured to obtain the acoustic model-state level information by applying at least one of the first feature vector and the second feature vector to a Gaussian mixture model-hidden Markov model (GMM-HMM) technology based training data model set, and wherein the first feature vector is determined based on an intra-frame feature value associated with a feature of a rapid change in a spectrum of the input speech signal, a static feature value associated with a static feature of the input speech signal, and an inter-frame feature value associated with a dynamic feature of the input speech signal based on a lapse of a period of time.
2. The apparatus of claim 1, wherein the second feature vector is obtained by directly learning an equation needed for extracting a feature from the training data model set based on a deep neural network (DNN) algorithm.
3. The apparatus of claim 1, further comprising: a preprocessor configured to eliminate background noise from at least one set of training data comprised in the training data model set.
4. The apparatus of claim 3, wherein the preprocessor comprises: a measurer configured to divide the training data into preset frame units and measure an energy value of each of the frame units; and a determiner configured to determine the training data to be clean data in response to a mean of measured energy values being less than a first threshold value, and determine the training data to be noisy data in response to the mean of the measured energy values being greater than or equal to the first threshold value.
5. The apparatus of claim 4, wherein the determiner is configured to calculate a deviation between energy values of the training data determined to be the noisy data, determine the training data to be stationary noisy data in response to the calculated deviation being less than a second threshold value, and determine the training data to be non-stationary noisy data in response to the calculated deviation being greater than or equal to the second threshold value.
6. The apparatus of claim 5, wherein the preprocessor is configured to eliminate the stationary noisy data using single channel speech enhancement technology in response to the training data determined to be the stationary noisy data, and eliminate the non-stationary noisy data using signal channel speech separation technology in response to the training data determined to be the non-stationary noisy data.
7. The apparatus of claim 1, wherein the extractor is configured to extract the acoustic model-state level information additionally using a third feature vector comprising at least one of a spectral entropy based additional feature, a harmonic component ratio based additional feature, and a pitch information based additional feature.
8. A speech recognition apparatus, comprising: a preprocessor configured to eliminate background noise from at least one set of training data comprised in a training data model set; an extractor configured to extract acoustic model-state level information corresponding to an input speech signal from the training data model set using at least one of a first feature vector based on a gammatone filterbank signal analysis algorithm and a second feature vector based on a bottleneck algorithm; and a speech recognizer configured to provide a result of recognizing the input speech signal based on the extracted acoustic model-state level information, wherein the extractor is configured to obtain the acoustic model-state level information by applying at least one of the first feature vector and the second feature vector to a Gaussian mixture model-hidden Markov model (GMM-HMM) technology based training data model set, and wherein the first feature vector is determined based on an intra-frame feature value associated with a feature of a rapid change in a spectrum of the input speech signal, a static feature value associated with a static feature of the input speech signal, and an inter-frame feature value associated with a dynamic feature of the input speech signal based on a lapse of a period of time.
9. The apparatus of claim 8, wherein the preprocessor is configured to determine the training data to be one of clean data, stationary noisy data, and non-stationary noisy data using an energy value measured for each of preset frame units of the training data.
10. The apparatus of claim 9, wherein the preprocessor is configured to eliminate the stationary noisy data using single channel speech enhancement technology in response to the training data determined to be the stationary noisy data, and eliminate the non-stationary noisy data using single channel speech separation technology in response to the training data determined to be the non-stationary noisy data.
11. The apparatus of claim 8, wherein the extractor is configured to extract the acoustic model-state level information additionally using a third feature vector comprising at least one of a spectral entropy based additional feature, a harmonic component ratio based additional feature, and a pitch information based additional feature.
12. A speech recognition method, comprising: extracting acoustic model-state level information corresponding to an input speech signal from a training data model set using at least one of a first feature vector based on a gammatone filterbank signal analysis algorithm and a second feature vector based on a bottleneck algorithm; and providing a result of recognizing the input speech signal based on the extracted acoustic model-state level information, wherein the extracting of the acoustic model-state level information includes obtaining the acoustic model-state level information by applying at least one of the first feature vector and the second feature vector to a Gaussian mixture model-hidden Markov model (GMM-HMM) technology based training data model set, wherein the first feature vector is determined based on an intra-frame feature value associated with a feature of a rapid change in a spectrum of the input speech signal, a static feature value associated with a static feature of the input speech signal, and an inter-frame feature value associated with a dynamic feature of the input speech signal based on a lapse of a period of time, and the second feature vector is obtained by directly learning an equation needed for extracting a feature from the training data model set based on a deep neural network (DNN) algorithm.
13. The method of claim 12, further comprising: eliminating background noise from at least one set of training data comprised in the training data model set.
14. The method of claim 13, wherein the eliminating of the background noise comprises: determining the training data to be one of clean data, stationary noisy data, and non-stationary noisy data using an energy value measured for each of preset frame units of the training data; and eliminating the stationary noisy data using single channel speech enhancement technology in response to the training data determined to be the stationary noisy data, and eliminating the non-stationary noisy data usin
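
Claims 4 and 5 describe a two-threshold decision on per-frame energy: a mean-energy test separates clean from noisy data, and a deviation test separates stationary from non-stationary noise. The sketch below is one possible reading of that procedure, not the patented implementation. It assumes non-overlapping frames, mean squared amplitude as the energy value, and population standard deviation as the "deviation"; the function names, labels, and threshold values are hypothetical.

```python
from statistics import mean, pstdev


def frame_energies(samples, frame_len):
    """Split samples into non-overlapping frames; return mean squared amplitude per frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def classify_training_data(samples, frame_len, energy_threshold, deviation_threshold):
    """Label one training utterance as clean, stationary noisy, or non-stationary noisy."""
    energies = frame_energies(samples, frame_len)
    if not energies:
        raise ValueError("signal is shorter than one frame")
    # First threshold (claim 4): mean frame energy below it means clean data.
    if mean(energies) < energy_threshold:
        return "clean"
    # Second threshold (claim 5): small deviation between frame energies
    # means the noise is stationary. Standard deviation is an assumption here.
    if pstdev(energies) < deviation_threshold:
        return "stationary_noisy"
    return "non_stationary_noisy"


# Usage with synthetic signals and made-up thresholds.
quiet = [0.01] * 400
steady = [0.5, -0.5] * 200
bursty = [0.0] * 200 + [1.0] * 200
labels = [classify_training_data(s, 100, 0.01, 0.05) for s in (quiet, steady, bursty)]
```

Under claims 6 and 10, the resulting label would then select the noise-removal technique (speech enhancement for stationary noise, speech separation for non-stationary noise); those techniques are not sketched here.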
Hidden Markov Models [HMMs] · CPC title
using artificial neural networks · CPC title
Speech enhancement, e.g. noise reduction or echo cancellation (reducing echo effects in line transmission systems H04B3/20; echo suppression in hands-free telephones H04M9/08) · CPC title
Training · CPC title