System and method for speech recognition using deep recurrent neural networks
US-9263036-B1 · Feb 16, 2016 · US
US10176799B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10176799-B2 |
| Application number | US-201615013239-A |
| Country | US |
| Kind code | B2 |
| Filing date | Feb 2, 2016 |
| Priority date | Feb 2, 2016 |
| Publication date | Jan 8, 2019 |
| Grant date | Jan 8, 2019 |
Official abstract text for this publication.
A method and system for training a language model to reduce recognition errors, wherein the language model is a recurrent neural network language model (RNNLM), by first acquiring training samples. An automatic speech recognition system (ASR) is applied to the training samples to produce recognized words and probabilities of the recognized words, and an N-best list is selected from the recognized words based on the probabilities. Word errors are determined using reference data for hypotheses in the N-best list. The hypotheses are rescored using the RNNLM. Then, we determine gradients for the hypotheses using the word errors and gradients for words in the hypotheses. Lastly, parameters of the RNNLM are updated using a sum of the gradients.
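The hypothesis-level step in the abstract (word errors from reference data, then gradients for the hypotheses) can be sketched as follows. This is a minimal illustration, not the patent's implementation. It assumes that the hypothesis posterior is a softmax over the rescored log-scores, normalized within the N-best list; the excerpt on this page does not spell that out. The function names `edit_distance`, `nbest_posteriors`, and `expected_error_and_grads` are hypothetical.

```python
import math

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def nbest_posteriors(scores):
    """Assumption: posterior = softmax of log-scores over the N-best list."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def expected_error_and_grads(reference, hypotheses, scores):
    """Return the expected word error and its gradient w.r.t. each score."""
    errors = [edit_distance(reference, h) for h in hypotheses]
    post = nbest_posteriors(scores)
    expected = sum(p * e for p, e in zip(post, errors))
    # d(expected)/d(score_n) = P_n * (E_n - expected) for a softmax posterior
    grads = [p * (e - expected) for p, e in zip(post, errors)]
    return expected, grads

reference = ["the", "cat", "sat"]
hypotheses = [["the", "cat", "sat"], ["the", "cat", "sad"], ["a", "cat", "sad"]]
expected, grads = expected_error_and_grads(reference, hypotheses, [0.0, 0.0, 0.0])
```

Under this assumption, a hypothesis with more errors than the list average receives a positive gradient, so a descent step lowers its score, and a hypothesis with fewer errors receives a negative one. In a full system these per-hypothesis gradients would be propagated to the word-level outputs of the RNNLM, which is outside this sketch.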
Opening claim text (preview).
We claim:

1. A method for speech recognition to reduce recognition errors using a language model, wherein the language model is a recurrent neural network language model (RNNLM) that is in communication with a Long Short-Term Memory (LSTM), comprising the steps of: acquiring training samples during a training stage for training the RNNLM to perform applying an automatic speech recognition system (ASR) to the training samples to produce recognized words and probabilities of the recognized words; selecting an N-best list from the recognized words based on the probabilities; determining word errors using reference data for hypotheses in the N-best list; rescoring the hypotheses using the RNNLM in communication with the LSTM; determining gradients for the hypotheses using the word errors, wherein the determined gradients for the hypotheses corresponds to differences with respect to the N-best hypothesis scores; determining gradients for recognized words in the hypotheses; back-propagating the gradients; updating parameters of the RNNLM using a sum of the gradients as an error signal for the RNNLM, so as to reduce the recognition errors of the ASR; acquiring spoken utterances as an input to the RNNLM to produce the recognized words; producing the N-best list from the recognized words; and applying the RNNLM to the N-best list to obtain recognition results, wherein the steps are performed in a processor.

2. The method of claim 1, wherein a stochastic gradient descent method is applied on an utterance-by-utterance basis so that the gradients are accumulated over the N-best list.

3. The method of claim 1, wherein an output vector $y_t \in [0,1]^{|V|+|C|}$, where $|C|$ is a number of classes, includes word (w) and class (c) outputs

$$y_t = \begin{bmatrix} y_t^{(w)} \\ y_t^{(c)} \end{bmatrix},$$

obtained as

$$y_{t,m}^{(w)} = \zeta\left(W_{ho,m}^{(w)} h_t\right), \quad \text{and} \quad y_t^{(c)} = \zeta\left(W_{ho}^{(c)} h_t\right),$$

where $y_{t,m}^{(w)}$ and $W_{ho,m}^{(w)}$ are sub-vector of $y_t^{(w)}$ and sub-matrix of $W_{ho}$ corresponding to the words in an m-th class, respectively, and $W_{ho}^{(c)}$ is a sub-matrix of $W_{ho}$ for the class output, where $W_{ho}$ is a matrix placed between a hidden layer and the output layer of the RNNLM, $h_t$ is a D dimensional activation vector $h_t \in [0,1]^D$ in a hidden layer, and $\zeta(\cdot)$ denotes a softmax function that determines a softmax for elements of the vectors.

4. The method of claim 3, wherein a word occurrence probability is

$$P(w_t \mid h_t) \equiv y_{t,C(w_t)}^{(w)}[w_t] \times y_t^{(c)}[C(w_t)],$$

where $C(w)$ denotes an index of the class to which the word $w$ belongs.

5. The method of claim 4, wherein a loss function of minimum word error training is

$$L(\Lambda) = \sum_{k=1}^{K} \sum_{W \in V^*} E\left(W_k^{(R)}, W\right) P_\Lambda(W \mid O_k),$$

where $\Lambda$ is a set of model parameters, $K$ is the number of utterances in training data, $O_k$ is a k-th acoustic observation sequence, and $W_k^{(R)} = \{w_{k,1}^{(R)}, \ldots, w_{k,T_k}^{(R)}\}$ is a k-th reference word sequence, $E(W', W)$ represents an edit distance between two word sequences $W'$ and $W$, and $P_\Lambda(W \mid O)$ is a posterior probability of $W$ determined with the set of model parameters $\Lambda$.

6. The method of claim 5, further comprising: obtaining the N-best lists and obtaining a loss function L(Λ) = ∑_{k=1}^{K} ∑_{n=1}^{N} E(W_k ( …
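The class-factored output of claims 3 and 4 can be illustrated with a small sketch: a softmax over classes, a softmax over only the words of the target word's class, and the product of the two. The toy vocabulary, the weight values, and the names `softmax`, `matvec`, and `word_probability` are all hypothetical and chosen only for illustration; the per-word rows of `W_word` stand in for the sub-matrices $W_{ho,m}^{(w)}$, and `W_class` stands in for $W_{ho}^{(c)}$.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def matvec(rows, h):
    """Multiply a matrix (list of rows) by the hidden vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in rows]

def word_probability(word, h, word_class, class_members, W_word, W_class):
    """P(w | h) = y_{C(w)}^{(w)}[w] * y^{(c)}[C(w)], as in claim 4."""
    c = word_class[word]
    members = class_members[c]
    class_probs = softmax(matvec(W_class, h))          # y^{(c)}
    in_class_rows = [W_word[w] for w in members]       # rows for class c only
    in_class_probs = softmax(matvec(in_class_rows, h)) # y_{c}^{(w)}
    return in_class_probs[members.index(word)] * class_probs[c]

# Hypothetical toy setup: 4 words, 2 classes, 2-dimensional hidden state.
class_members = {0: ["cat", "dog"], 1: ["sat", "ran"]}
word_class = {w: c for c, ws in class_members.items() for w in ws}
W_word = {"cat": [0.5, -0.2], "dog": [0.1, 0.3],
          "sat": [-0.4, 0.8], "ran": [0.9, 0.0]}
W_class = [[0.3, 0.1], [-0.2, 0.6]]
h = [0.2, 0.7]

probs = {w: word_probability(w, h, word_class, class_members, W_word, W_class)
         for w in word_class}
```

Because each in-class softmax sums to one and the class softmax sums to one, the word probabilities in `probs` sum to one over the whole vocabulary, while each lookup only normalizes over the classes and the words of a single class.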
Combinations of networks · CPC title
Probabilistic graphical models, e.g. probabilistic networks · CPC title
Recurrent networks, e.g. Hopfield networks · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Speech classification or search · CPC title