Shared hidden layer combination for speech recognition systems

US9520127B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9520127-B2
Application number  US-201414265110-A
Country             US
Kind code           B2
Filing date         Apr 29, 2014
Priority date       Apr 29, 2014
Publication date    Dec 13, 2016
Grant date          Dec 13, 2016

Abstract

Official abstract text for this publication.

Providing a framework for merging automatic speech recognition (ASR) systems having a shared deep neural network (DNN) feature transformation is provided. A received utterance may be evaluated to generate a DNN-derived feature from the top hidden layer of a DNN. The top hidden layer output may then be utilized to generate a network including a bottleneck layer and an output layer. Weights representing a feature dimension reduction may then be extracted between the top hidden layer and the bottleneck layer. Scores may then be generated and combined to merge the ASR systems which share the DNN feature transformation.
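The feature path described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes sigmoid hidden units and a purely affine bottleneck projection, uses random placeholder weights where a real system would use trained ones, and all names (`top_hidden_output`, `reduce_dimension`, the toy layer sizes) are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(weights, bias, vec):
    """Return weights * vec + bias for a row-major weight matrix."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def top_hidden_output(layers, frame):
    """Propagate one acoustic frame through every hidden layer and
    return the activations of the last (top) hidden layer."""
    h = frame
    for weights, bias in layers:
        h = [sigmoid(z) for z in affine(weights, bias, h)]
    return h


def reduce_dimension(bottleneck_weights, bottleneck_bias, hidden):
    """Apply the weights between the top hidden layer and the bottleneck
    layer as a feature dimension reduction (assumed affine here)."""
    return affine(bottleneck_weights, bottleneck_bias, hidden)


def random_matrix(rows, cols, rng):
    """Placeholder for trained weights; values are random, not learned."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
INPUT_DIM, HIDDEN_DIM, BOTTLENECK_DIM = 8, 16, 4  # toy sizes (assumption)

layers = [
    (random_matrix(HIDDEN_DIM, INPUT_DIM, rng), [0.0] * HIDDEN_DIM),
    (random_matrix(HIDDEN_DIM, HIDDEN_DIM, rng), [0.0] * HIDDEN_DIM),
]
bottleneck_w = random_matrix(BOTTLENECK_DIM, HIDDEN_DIM, rng)
bottleneck_b = [0.0] * BOTTLENECK_DIM

frame = [rng.uniform(-1.0, 1.0) for _ in range(INPUT_DIM)]
hidden = top_hidden_output(layers, frame)                    # DNN-derived feature
feature = reduce_dimension(bottleneck_w, bottleneck_b, hidden)  # reduced feature
```

In this sketch the same low-dimensional `feature` vector could be handed to each downstream model, which is what lets the systems share one DNN feature transformation.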

First claim

Opening claim text (preview).

What is claimed is:

1. A method of providing a framework for merging two or more automatic speech recognition (ASR) systems having a shared deep neural network (DNN) feature transformation, comprising: receiving, by a computing device, at least one utterance; training, by the computing device, the at least one utterance using a DNN feature transformation with a criterion, wherein the DNN feature transformation comprises a plurality of hidden layers; generating, by the computing device, an output from a top hidden layer in the plurality of hidden layers for the at least one utterance; utilizing, by the computing device, the top hidden layer output to generate a network comprising a bottleneck layer and an output layer; extracting, by the computing device, one or more weights between the top hidden layer and the bottleneck layer, the one or more weights representing a feature dimension reduction; generating, by the computing device, a first score from a first ASR system based on application of the feature dimension reduction to a model of the first ASR system and generating a second score from a second ASR system based on application of the feature dimension reduction to a model of the second ASR system; combining, by the computing device, the first score and the second score to merge the first ASR system and the second ASR system to create a merged system; and training, for the merged system, senone coefficient data for evaluation of spoken utterances.

2. The method of claim 1, further comprising receiving a spoken utterance, and executing ASR recognition for the spoken utterance using the merged system.

3. The method of claim 2, wherein the senone coefficient data is used to evaluate the spoken utterance to determine ASR results.

4. The method of claim 1, wherein receiving, by a computing device, at least one utterance comprises receiving a plurality of training utterances for speech recognition.

5. The method of claim 1, wherein the training of the at least one utterance comprises: training the first ASR system with a cross entropy criterion, the first ASR system comprising a DNN system; and deriving the DNN feature transformation from a top hidden layer of the DNN system.

6. The method of claim 1, wherein the training of the at least one utterance comprises: training the first ASR system with a sequential training criterion, the first ASR system comprising a DNN system; and deriving the DNN feature transformation from a top hidden layer of the DNN system.

7. The method of claim 1, wherein utilizing, by the computing device, the top hidden layer output to generate a network comprising a bottleneck layer and an output layer comprises generating a network comprising a low dimension bottleneck hidden layer and a plurality of senones.

8. The method of claim 1, wherein generating, by the computing device, the first score and the second score comprises generating log likelihood scores from a Context Dependent Deep Neural Network-Hidden Markov Model (CD-DNN-HMM) system and a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) system.

9. The method of claim 1, wherein combining, by the computing device, the first score and the second score comprises performing a linear combination of the first score from the first ASR system and the second score from the second ASR system.

10. The method of claim 1, wherein combining, by the computing device, the first score and the second score comprises performing a non-linear combination of the first score from the first ASR system and the second score from the second ASR system.

11. A system comprising: at least one processor; and a memory operatively connected with the at least one processor, wherein the memory stores computer-executable instructions that, when executed by the at least one processor, cause the at least one processor to execute a method that comprises: receiving, by a computing device, at least one utterance; training, by the computing device, the at least one utterance using a DNN feature transformation with a criterion, wherein the DNN feature transformation comprises a plurality of hidden layers; generating, by the computing device, an output from a top hidden layer in the plurality of hidden layers for the at least one utterance; utilizing, by the computing device, the top hidden layer output to generate a network comprising a bottleneck layer and an output layer; extracting, by the computing device, one or more weights between the top hidden layer and the bottleneck layer, the one or more weights representing a feature dimension reduction; generating, by the computing device, a first score from a first ASR system based on application of the feature dimension reduction to a model of the first ASR system and generating a second score from a second ASR system based on application of the feature dimension reduction to a model of the second ASR system; combining, by the computing device, the first score and the second score to merge the first ASR system and the second ASR system to create a merged system; and training, for the merged system, senone coefficient data for evaluation of spoken utterances.

12. The system according to claim 11, wherein the method, executed by the at least one processor, further comprises receiving a spoken utterance, and executing ASR recognition for the spoken utterance using the merged system.

13. The system according to claim 11, wherein the training of the at least one utterance comprises: training the first ASR system with at least one of a cross entropy criterion and a sequential training criterion, and deriving the DNN feature transformation from a top hidden layer of a DNN system.

14. The system according to claim 11, wherein the generating of the first score and the second score further comprises generating log likelihood scores from a Context Dependent Deep Neural Network-Hidden Markov Model (CD-DNN-HMM) system and a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) system.

15. The system according to claim 11, wherein the combining of the first score and the second score occurs by executing at least one selected from a group consisting of: performing a non-linear combination of the first score and the second score, and performing a linear combination of the first score and the second score.

16. A computer-readable storage device storing computer executable instructions which, when executed by a computer, cause the computer to perform a method of providing a framework for merging systems having a shared deep neural network (DNN) feature transformation, the method comprising: receiving a plurality of training utterances for speech recognition; training a first system with one or more of a cross entropy criterion and a sequential training criterion utilizing the plurality of training utterances, the DNN feature transformation comprising a plurality of hidden layers; generating an output from a top hidden layer in the plurality of hidden layers for the plurality of training utterances; utilizing the top hidden layer output to generate a network comprising a low dimension bottleneck hidden layer and a plurality of senones; extracting one or more weights between the top hidden layer and the low dimension hidden bottleneck layer, the one or more weights representing a feature dimension reduction; utilizing the feature dimension reduction to train a model for a second system following the extraction of the one or more weights between the top hidden layer and the low dimension bottleneck hidden layer; generating a first log likelihood score from the first system based on applicat
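The score-combination step named in the claims (linear or non-linear combination of log likelihood scores from two systems) might look like the sketch below. The claims do not specify the formulas, so both are assumptions: the linear variant is a weighted sum of per-senone log likelihoods, and the non-linear variant is one possible choice (a weighted mixture in the likelihood domain). The function names, the interpolation weight, and the example scores are all hypothetical.

```python
import math


def combine_linear(scores_a, scores_b, weight=0.5):
    """Weighted sum of two systems' log likelihood scores, per senone."""
    return [weight * a + (1.0 - weight) * b
            for a, b in zip(scores_a, scores_b)]


def combine_nonlinear(scores_a, scores_b, weight=0.5):
    """One illustrative non-linear combination:
    log(w * exp(a) + (1 - w) * exp(b)), computed in a numerically stable way."""
    combined = []
    for a, b in zip(scores_a, scores_b):
        m = max(a, b)
        mix = weight * math.exp(a - m) + (1.0 - weight) * math.exp(b - m)
        combined.append(m + math.log(mix))
    return combined


# Made-up per-senone log likelihoods for a single frame (not real data).
dnn_scores = [-2.0, -0.5, -3.0]   # e.g. from a CD-DNN-HMM system
gmm_scores = [-1.0, -1.5, -2.5]   # e.g. from a GMM-HMM system

linear = combine_linear(dnn_scores, gmm_scores, weight=0.5)
nonlinear = combine_nonlinear(dnn_scores, gmm_scores, weight=0.5)

# Index of the best-scoring senone under the linear combination.
best = max(range(len(linear)), key=linear.__getitem__)  # best == 1
```

A single interpolation weight keeps the example small; in practice the weight would presumably be tuned on held-out data.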

Assignees

Microsoft Technology Licensing LLC

Inventors

Classifications

  • using neural networks · CPC title

  • G10L15/16 · Primary

    using artificial neural networks · CPC title

  • Feature extraction for speech recognition; Selection of recognition unit · CPC title

  • Phonemes, fenemes or fenones being the recognition units · CPC title

  • Training · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9520127B2 cover?
Providing a framework for merging automatic speech recognition (ASR) systems having a shared deep neural network (DNN) feature transformation is provided. A received utterance may be evaluated to generate a DNN-derived feature from the top hidden layer of a DNN. The top hidden layer output may then be utilized to generate a network including a bottleneck layer and an output layer. Weights repre…
Who is the assignee on this patent?
Microsoft Technology Licensing LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/16. Mapped technology areas include Physics.
When was this patent published?
Publication date Dec 13, 2016 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 10 related publications on this page (citations in our corpus or others sharing the same primary CPC).