Building conversational understanding systems using a toolset
US-2016203125-A1 · Jul 14, 2016 · US
US9520127B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9520127-B2 |
| Application number | US-201414265110-A |
| Country | US |
| Kind code | B2 |
| Filing date | Apr 29, 2014 |
| Priority date | Apr 29, 2014 |
| Publication date | Dec 13, 2016 |
| Grant date | Dec 13, 2016 |
Official abstract text for this publication.
Providing a framework for merging automatic speech recognition (ASR) systems having a shared deep neural network (DNN) feature transformation is provided. A received utterance may be evaluated to generate a DNN-derived feature from the top hidden layer of a DNN. The top hidden layer output may then be utilized to generate a network including a bottleneck layer and an output layer. Weights representing a feature dimension reduction may then be extracted between the top hidden layer and the bottleneck layer. Scores may then be generated and combined to merge the ASR systems which share the DNN feature transformation.
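The feature dimension reduction described in the abstract can be pictured as a matrix of weights that maps the top hidden layer output to a smaller bottleneck vector. The following is a minimal sketch under stated assumptions, not the patented implementation: the layer sizes, the weight values, and the helper name `reduce_features` are all hypothetical, and the projection is assumed to be a plain matrix-vector product with no bias or non-linearity.

```python
# Hypothetical output of the top hidden layer for one speech frame (4 units).
top_hidden = [0.2, 0.9, 0.1, 0.7]

# Hypothetical weights between the top hidden layer and a 2-unit bottleneck
# layer. In the abstract's terms, these weights are what gets "extracted"
# to represent the feature dimension reduction.
bottleneck_weights = [
    [0.5, -0.25, 0.0, 1.0],
    [0.0, 1.0, -1.0, 0.5],
]


def reduce_features(weights, hidden_output):
    """Project a hidden-layer vector to a lower dimension (W times h)."""
    for row in weights:
        if len(row) != len(hidden_output):
            raise ValueError("weight row length must match hidden layer size")
    return [sum(w * h for w, h in zip(row, hidden_output)) for row in weights]


# The reduced vector is the shared feature that both ASR models would consume.
reduced = reduce_features(bottleneck_weights, top_hidden)
```

Here `reduced` has two components instead of four, which is the sense in which the extracted weights act as a dimension reduction shared by the systems being merged.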
Opening claim text (preview).
What is claimed is:
1. A method of providing a framework for merging two or more automatic speech recognition (ASR) systems having a shared deep neural network (DNN) feature transformation, comprising: receiving, by a computing device, at least one utterance; training, by the computing device, the at least one utterance using a DNN feature transformation with a criterion, wherein the DNN feature transformation comprises a plurality of hidden layers; generating, by the computing device, an output from a top hidden layer in the plurality of hidden layers for the at least one utterance; utilizing, by the computing device, the top hidden layer output to generate a network comprising a bottleneck layer and an output layer; extracting, by the computing device, one or more weights between the top hidden layer and the bottleneck layer, the one or more weights representing a feature dimension reduction; generating, by the computing device, a first score from a first ASR system based on application of the feature dimension reduction to a model of the first ASR system and generating a second score from a second ASR system based on application of the feature dimension reduction to a model of the second ASR system; combining, by the computing device, the first score and the second score to merge the first ASR system and the second ASR system to create a merged system; and training, for the merged system, senone coefficient data for evaluation of spoken utterances.
2. The method of claim 1, further comprising receiving a spoken utterance, and executing ASR recognition for the spoken utterance using the merged system.
3. The method of claim 2, wherein the senone coefficient data is used to evaluate the spoken utterance to determine ASR results.
4. The method of claim 1, wherein receiving, by a computing device, at least one utterance comprises receiving a plurality of training utterances for speech recognition.
5. The method of claim 1, wherein the training of the at least one utterance comprises: training the first ASR system with a cross entropy criterion, the first ASR system comprising a DNN system; and deriving the DNN feature transformation from a top hidden layer of the DNN system.
6. The method of claim 1, wherein the training of the at least one utterance comprises: training the first ASR system with sequential training criterion, the first ASR system comprising a DNN system; and deriving the DNN feature transformation from a top hidden layer of the DNN system.
7. The method of claim 1, wherein utilizing, by the computing device, the top hidden layer output to generate a network comprising a bottleneck layer and an output layer comprises generating a network comprising a low dimension bottleneck hidden layer and a plurality of senones.
8. The method of claim 1, wherein generating, by the computing device, the first score and the second score comprises generating log likelihood scores from a Context Dependent Deep Neural Network-Hidden Markov Model (CD-DNN-HMM) system and a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) system.
9. The method of claim 1, wherein combining, by the computing device, the first score and the second score comprises performing a linear combination of the first score from the first ASR system and the second score from the second ASR system.
10. The method of claim 1, wherein combining, by the computing device, the first score and the second score comprises performing a non-linear combination of the first score from the first ASR system and the second score from the second ASR system.
11. A system comprising: at least one processor; and a memory operatively connected with the at least one processor, wherein the memory stores computer-executable instructions that, when executed by the at least one processor, causes the at least one processor to execute a method that comprises: receiving, by a computing device, at least one utterance; training, by the computing device, the at least one utterance using a DNN feature transformation with a criterion, wherein the DNN feature transformation comprises a plurality of hidden layers; generating, by the computing device, an output from a top hidden layer in the plurality of hidden layers for the at least one utterance; utilizing, by the computing device, the top hidden layer output to generate a network comprising a bottleneck layer and an output layer; extracting, by the computing device, one or more weights between the top hidden layer and the bottleneck layer, the one or more weights representing a feature dimension reduction; generating, by the computing device, a first score from a first ASR system based on application of the feature dimension reduction to a model of the first ASR system and generating a second score from a second ASR system based on application of the feature dimension reduction to a model of the second ASR system; combining, by the computing device, the first score and the second score to merge the first ASR system and the second ASR system to create a merged system; and training, for the merged system, senone coefficient data for evaluation of spoken utterances.
12. The system according to claim 11, wherein the method, executed by the at least one processor, further comprises receiving a spoken utterance, and executing ASR recognition for the spoken utterance using the merged system.
13. The system according to claim 11, wherein the training of the at least one utterance comprises: training the first ASR system with at least one of a cross entropy criterion and a sequential training criterion, and deriving the DNN feature transformation from a top hidden layer of a DNN system.
14. The system according to claim 11, wherein the generating of the first score and the second score further comprises generating log likelihood scores from a Context Dependent Deep Neural Network-Hidden Markov Model (CD-DNN-HMM) system and a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) system.
15. The system according to claim 11, wherein the combining of the first score and the second score occurs by executing at least one selected from a group consisting of: performing a non-linear combination of the first score and the second score, and performing a linear combination of the first score and the second score.
16. A computer-readable storage device storing computer executable instructions which, when executed by a computer, cause the computer to perform a method of providing a framework for merging systems having a shared deep neural network (DNN) feature transformation, the method comprising: receiving a plurality of training utterances for speech recognition; training a first system with one or more of a cross entropy criterion and a sequential training criterion utilizing the plurality of training utterances, the DNN feature transformation comprising a plurality of hidden layers; generating an output from a top hidden layer in the plurality of hidden layers for the plurality of training utterances; utilizing the top hidden layer output to generate a network comprising a low dimension bottleneck hidden layer and a plurality of senones; extracting one or more weights between the top hidden layer and the low dimension hidden bottleneck layer, the one or more weights representing a feature dimension reduction; utilizing the feature dimension reduction to train a model for a second system following the extraction of the one or more weights between the top hidden layer and the low dimension bottleneck hidden layer; generating a first log likelihood score from the first system based on applicat
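Claims 8 through 10 and 15 refer to combining log likelihood scores from two systems either linearly or non-linearly. The sketch below illustrates both ideas under stated assumptions: the function names, the interpolation weight `dnn_weight`, and the specific non-linear form (a weighted sum of likelihoods taken back into the log domain) are hypothetical choices for illustration, since the claims do not specify a formula.

```python
import math


def combine_linear(dnn_scores, gmm_scores, dnn_weight=0.5):
    """Linear combination: weighted average of per-senone log likelihoods."""
    if len(dnn_scores) != len(gmm_scores):
        raise ValueError("both systems must score the same set of senones")
    return [
        dnn_weight * a + (1.0 - dnn_weight) * b
        for a, b in zip(dnn_scores, gmm_scores)
    ]


def combine_log_sum(dnn_scores, gmm_scores, dnn_weight=0.5):
    """One possible non-linear combination (an assumption, not from the claims):
    log(w * exp(a) + (1 - w) * exp(b)), computed stably by factoring out the max.
    """
    if len(dnn_scores) != len(gmm_scores):
        raise ValueError("both systems must score the same set of senones")
    combined = []
    for a, b in zip(dnn_scores, gmm_scores):
        m = max(a, b)
        mixed = dnn_weight * math.exp(a - m) + (1.0 - dnn_weight) * math.exp(b - m)
        combined.append(m + math.log(mixed))
    return combined


# Hypothetical log likelihoods for two senones from each system.
dnn_scores = [-1.0, -2.0]  # stands in for the CD-DNN-HMM system
gmm_scores = [-3.0, -4.0]  # stands in for the GMM-HMM system

linear = combine_linear(dnn_scores, gmm_scores, dnn_weight=0.5)
nonlinear = combine_log_sum(dnn_scores, gmm_scores, dnn_weight=0.5)
```

In this sketch the linear form averages in the log domain, while the non-linear form averages likelihoods and therefore leans toward whichever system is more confident for a given senone.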
using neural networks · CPC title
using artificial neural networks · CPC title
Feature extraction for speech recognition; Selection of recognition unit · CPC title
Phonemes, fenemes or fenones being the recognition units · CPC title
Training · CPC title