Apparatus and method for large vocabulary continuous speech recognition

US9805716B2 · US · B2

Patent metadata
Publication number: US-9805716-B2
Application number: US-201615042309-A
Country: US
Kind code: B2
Filing date: Feb 12, 2016
Priority date: Feb 12, 2015
Publication date: Oct 31, 2017
Grant date: Oct 31, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Provided is an apparatus for large vocabulary continuous speech recognition (LVCSR) based on a context-dependent deep neural network hidden Markov model (CD-DNN-HMM) algorithm. The apparatus may include an extractor configured to extract acoustic model-state level information corresponding to an input speech signal from a training data model set using at least one of a first feature vector based on a gammatone filterbank signal analysis algorithm and a second feature vector based on a bottleneck algorithm, and a speech recognizer configured to provide a result of recognizing the input speech signal based on the extracted acoustic model-state level information.
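
The abstract names a gammatone filterbank as the basis of the first feature vector but does not give its parameters. As a rough illustration only, the sketch below builds a conventional gammatone filterbank with ERB-spaced center frequencies and computes one log energy per channel. The ERB formulas, the 4th-order filter, the channel count, the frequency range, and all function names are assumptions chosen for illustration; they are not taken from the patent.

```python
import numpy as np

def erb_bandwidth(fc_hz):
    """Equivalent rectangular bandwidth in Hz (common Glasberg-Moore form; assumed here)."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

def erb_space(f_low, f_high, n_channels):
    """Center frequencies equally spaced on the ERB-rate scale."""
    e_low = 21.4 * np.log10(1.0 + 0.00437 * f_low)
    e_high = 21.4 * np.log10(1.0 + 0.00437 * f_high)
    e = np.linspace(e_low, e_high, n_channels)
    return (10.0 ** (e / 21.4) - 1.0) / 0.00437

def gammatone_ir(fc_hz, sample_rate, duration_s=0.025, order=4):
    """Peak-normalized gammatone impulse response: t^(n-1) exp(-2 pi b t) cos(2 pi fc t)."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    b = 1.019 * erb_bandwidth(fc_hz)
    ir = t ** (order - 1) * np.exp(-2.0 * np.pi * b * t) * np.cos(2.0 * np.pi * fc_hz * t)
    return ir / np.max(np.abs(ir))

def gammatone_log_energies(signal, sample_rate, centers_hz, eps=1e-10):
    """Filter the signal with each channel and return one log energy per channel."""
    signal = np.asarray(signal, dtype=float)
    feats = np.empty(len(centers_hz))
    for k, fc in enumerate(centers_hz):
        # Causal FIR filtering, truncated to the input length.
        y = np.convolve(signal, gammatone_ir(fc, sample_rate))[: len(signal)]
        feats[k] = np.log(np.sum(y ** 2) + eps)
    return feats

# Hypothetical settings: 16 kHz audio, 40 channels between 50 Hz and 7 kHz.
SAMPLE_RATE = 16000
centers = erb_space(50.0, 7000.0, 40)
t = np.arange(int(0.1 * SAMPLE_RATE)) / SAMPLE_RATE
tone = np.sin(2.0 * np.pi * centers[20] * t)  # pure tone at one channel's center
features = gammatone_log_energies(tone, SAMPLE_RATE, centers)
```

In a full front end, such per-channel energies would be computed frame by frame; here a single 100 ms tone is used only to show that the channel centered on the tone responds most strongly.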

First claim

Opening claim text (preview).

What is claimed is:

1. A speech recognition apparatus, comprising: an extractor configured to extract acoustic model-state level information corresponding to an input speech signal from a training data model set using at least one of a first feature vector based on a gammatone filterbank signal analysis algorithm and a second feature vector based on a bottleneck algorithm; and a speech recognizer configured to provide a result of recognizing the input speech signal based on the extracted acoustic model-state level information, wherein the extractor is configured to obtain the acoustic model-state level information by applying at least one of the first feature vector and the second feature vector to a Gaussian mixture model-hidden Markov model (GMM-HMM) technology based training data model set, and wherein the first feature vector is determined based on an intra-frame feature value associated with a feature of a rapid change in a spectrum of the input speech signal, a static feature value associated with a static feature of the input speech signal, and an inter-frame feature value associated with a dynamic feature of the input speech signal based on a lapse of a period of time.

2. The apparatus of claim 1, wherein the second feature vector is obtained by directly learning an equation needed for extracting a feature from the training data model set based on a deep neural network (DNN) algorithm.

3. The apparatus of claim 1, further comprising: a preprocessor configured to eliminate background noise from at least one set of training data comprised in the training data model set.

4. The apparatus of claim 3, wherein the preprocessor comprises: a measurer configured to divide the training data into preset frame units and measure an energy value of each of the frame units; and a determiner configured to determine the training data to be clean data in response to a mean of measured energy values being less than a first threshold value, and determine the training data to be noisy data in response to the mean of the measured energy values being greater than or equal to the first threshold value.

5. The apparatus of claim 4, wherein the determiner is configured to calculate a deviation between energy values of the training data determined to be the noisy data, determine the training data to be stationary noisy data in response to the calculated deviation being less than a second threshold value, and determine the training data to be non-stationary noisy data in response to the calculated deviation being greater than or equal to the second threshold value.

6. The apparatus of claim 5, wherein the preprocessor is configured to eliminate the stationary noisy data using single channel speech enhancement technology in response to the training data determined to be the stationary noisy data, and eliminate the non-stationary noisy data using signal channel speech separation technology in response to the training data determined to be the non-stationary noisy data.

7. The apparatus of claim 1, wherein the extractor is configured to extract the acoustic model-state level information additionally using a third feature vector comprising at least one of a spectral entropy based additional feature, a harmonic component ratio based additional feature, and a pitch information based additional feature.

8. A speech recognition apparatus, comprising: a preprocessor configured to eliminate background noise from at least one set of training data comprised in a training data model set; an extractor configured to extract acoustic model-state level information corresponding to an input speech signal from the training data model set using at least one of a first feature vector based on a gammatone filterbank signal analysis algorithm and a second feature vector based on a bottleneck algorithm; and a speech recognizer configured to provide a result of recognizing the input speech signal based on the extracted acoustic model-state level information, wherein the extractor is configured to obtain the acoustic model-state level information by applying at least one of the first feature vector and the second feature vector to a Gaussian mixture model-hidden Markov model (GMM-HMM) technology based training data model set, and wherein the first feature vector is determined based on an intra-frame feature value associated with a feature of a rapid change in a spectrum of the input speech signal, a static feature value associated with a static feature of the input speech signal, and an inter-frame feature value associated with a dynamic feature of the input speech signal based on a lapse of a period of time.

9. The apparatus of claim 8, wherein the preprocessor is configured to determine the training data to be one of clean data, stationary noisy data, and non-stationary noisy data using an energy value measured for each of preset frame units of the training data.

10. The apparatus of claim 9, wherein the preprocessor is configured to eliminate the stationary noisy data using single channel speech enhancement technology in response to the training data determined to be the stationary noisy data, and eliminate the non-stationary noisy data using single channel speech separation technology in response to the training data determined to be the non-stationary noisy data.

11. The apparatus of claim 8, wherein the extractor is configured to extract the acoustic model-state level information additionally using a third feature vector comprising at least one of a spectral entropy based additional feature, a harmonic component ratio based additional feature, and a pitch information based additional feature.

12. A speech recognition method, comprising: extracting acoustic model-state level information corresponding to an input speech signal from a training data model set using at least one of a first feature vector based on a gammatone filterbank signal analysis algorithm and a second feature vector based on a bottleneck algorithm; and providing a result of recognizing the input speech signal based on the extracted acoustic model-state level information, wherein the extracting of the acoustic model-state level information includes obtaining the acoustic model-state level information by applying at least one of the first feature vector and the second feature vector to a Gaussian mixture model-hidden Markov model (GMM-HMM) technology based training data model set, wherein the first feature vector is determined based on an intra-frame feature value associated with a feature of a rapid change in a spectrum of the input speech signal, a static feature value associated with a static feature of the input speech signal, and an inter-frame feature value associated with a dynamic feature of the input speech signal based on a lapse of a period of time, and the second feature vector is obtained by directly learning an equation needed for extracting a feature from the training data model set based on a deep neural network (DNN) algorithm.

13. The method of claim 12, further comprising: eliminating background noise from at least one set of training data comprised in the training data model set.

14. The method of claim 13, wherein the eliminating of the background noise comprises: determining the training data to be one of clean data, stationary noisy data, and non-stationary noisy data using an energy value measured for each of preset frame units of the training data; and eliminating the stationary noisy data using single channel speech enhancement technology in response to the training data determined to be the stationary noisy data, and eliminating the non-stationary noisy data usin
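
The two-threshold decision described in claims 4 and 5 can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the claims do not define how the frame energy or the "deviation" is computed, so mean squared amplitude and standard deviation are assumed here, and the function names, label strings, and threshold values are hypothetical.

```python
import numpy as np

def frame_energies(signal, frame_len):
    """Split the signal into fixed-length frames and return one energy per frame.

    Energy is taken as mean squared amplitude (an assumption; the claims only
    say "an energy value of each of the frame units").
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_len
    if n_frames == 0:
        raise ValueError("signal is shorter than one frame")
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.mean(frames ** 2, axis=1)

def classify_training_data(signal, frame_len, energy_threshold, deviation_threshold):
    """Label training data as clean, stationary noisy, or non-stationary noisy."""
    energies = frame_energies(signal, frame_len)
    # Claim 4: mean energy below the first threshold means clean data.
    if energies.mean() < energy_threshold:
        return "clean"
    # Claim 5: small deviation between frame energies means stationary noise.
    # Standard deviation is used as the deviation measure (an assumption).
    if energies.std() < deviation_threshold:
        return "stationary_noisy"
    return "non_stationary_noisy"

# Hypothetical inputs: 4 frames of 100 samples each.
silent = np.zeros(400)
steady = np.ones(400)
bursty = np.concatenate([np.zeros(100), 2 * np.ones(100), np.zeros(100), 2 * np.ones(100)])

label_silent = classify_training_data(silent, 100, energy_threshold=0.5, deviation_threshold=0.1)
label_steady = classify_training_data(steady, 100, energy_threshold=0.5, deviation_threshold=0.1)
label_bursty = classify_training_data(bursty, 100, energy_threshold=0.5, deviation_threshold=0.1)
```

Under claims 6 and 10, the resulting label would then select the noise-removal technique: speech enhancement for stationary noise and speech separation for non-stationary noise.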

Assignees

Electronics & Telecommunications Res Inst

Classifications

  • G10L15/142 · Primary

    Hidden Markov Models [HMMs] · CPC title

  • using artificial neural networks · CPC title

  • Speech enhancement, e.g. noise reduction or echo cancellation (reducing echo effects in line transmission systems H04B3/20; echo suppression in hands-free telephones H04M9/08) · CPC title

  • Training · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Electronics & Telecommunications Res Inst
What technology area does this patent fall under?
Primary CPC classification G10L15/142. Mapped technology areas include Physics.
When was this patent published?
Publication date Oct 31, 2017 (kind code B2). Legal status and post-grant events are not shown on this page.