US2017148431A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2017148431-A1 |
| Application number | US-201615358102-A |
| Country | US |
| Kind code | A1 |
| Filing date | Nov 21, 2016 |
| Priority date | Nov 25, 2015 |
| Publication date | May 25, 2017 |
| Grant date | — |
Official abstract text for this publication.
Embodiments of end-to-end deep learning systems and methods are disclosed to recognize speech of vastly different languages, such as English or Mandarin Chinese. In embodiments, the entire pipelines of hand-engineered components are replaced with neural networks, and the end-to-end learning allows handling a diverse variety of speech including noisy environments, accents, and different languages. Using a trained embodiment and an embodiment of a batch dispatch technique with GPUs in a data center, an end-to-end deep learning system can be inexpensively deployed in an online setting, delivering low latency when serving users at scale.
Opening claim text (preview).
What is claimed is:

1. A computer-implemented method for training a transcription model for speech transcription, the method comprising: for each of a set of utterances: obtaining a set of spectrogram frames from each utterance, the utterance having an associated ground-truth label, the utterance and the associated ground-truth label being sampled from a training set comprising a plurality of minibatches; outputting from the transcription model a predicted character or character probabilities for the utterance, the transcription model comprising one or more convolution layers and one or more recurrent layers, a batch normalization being applied for one or more minibatches within the plurality of minibatches to normalize pre-activations in at least one of the one or more recurrent layers; computing a loss to measure an error in prediction of a character for the utterance given the associated ground-truth label; computing a derivative of the loss with respect to parameters of the transcription model; and updating the transcription model using the derivative through back-propagation.

2. The computer-implemented method of claim 1 wherein the batch normalization is also implemented in one or more convolution layers.

3. The computer-implemented method of claim 2 wherein the normalization comprises computing mean and variance over the length of an utterance sequence in a minibatch for each hidden unit for each layer to be batch normalized.

4. The computer-implemented method of claim 1 wherein subsampling the utterance is implemented in obtaining the set of spectrogram frames by taking strides of a step size of predetermined number of time slices.

5. The computer-implemented method of claim 4 wherein the predicted character from the transcription model comprises alternate labellings enriched from the English alphabet.

6. The computer-implemented method of claim 5 wherein the alternate labellings are selected from whole words, syllables, and non-overlapping n-grams.

7. The computer-implemented method of claim 6 wherein the non-overlapping n-grams are non-overlapping bigrams at word level.

8. The computer-implemented method of claim 7 wherein any unigram labels in the output predicted character are transformed into bigram labels through an isomorphism.

9. The computer-implemented method of claim 1 further comprising: in a first training epoch, iterate through the training set in an increasing order of the length of the longest utterance in each minibatch; and after the first training epoch, revert the plurality of minibatches back to a random order for additional transcription output training.

10. The computer-implemented method of claim 1 wherein the training set is generated from raw audio clips and raw transcriptions through a data acquisition pipeline.

11. The computer-implemented method of claim 10 wherein generating the training set comprises the following steps: aligning the raw audio clips and raw transcriptions; segmenting the aligned audio clips and the corresponding transcriptions whenever the audio encounters a series of consecutive blank labels occurs; and filtering the segmented audio clips and corresponding transcriptions by removing erroneous examples.

12. A computer-implemented method for training a recurrent neural network (RNN) model for speech transcription, the method comprising: receiving, at a first layer of the RNN model, a set of spectrogram frames for each of a plurality of utterances, the plurality of utterance and associated labels being sampled from a training set; applying convolutions in at least one of frequency and time domains, in one or more convolution layers of the RNN model, to the set of spectrogram frames; predicting one or more characters through one or more recurrent layers of the RNN model, a batch normalization being implemented to normalize pre-activations in at least one of the one or more recurrent layers; obtaining a probability distribution over the predicted characters in an output layer of the RNN model; and implementing a Connectionist Temporal Classification (CTC) loss function to measure an error in prediction of a character for the utterance given the associated ground-truth label, the CTC loss function implementation involving element-wise addition of a forward matrix and a backward matrix generated during a forward pass and a backward pass of the CTC loss function respectively, all elements in each column of the forward matrix being calculated for the CTC loss function implementation; computing a derivative of the loss with respect to parameters of the RNN model; and updating the RNN model using the derivative through back-propagation.

13. The computer-implemented method of claim 12 wherein the normalization comprises computing mean and variance over the length of each utterance for the one or more recurrent layers.

14. The computer-implemented method of claim 12 wherein the CTC loss function is implemented in log probability space.

15. The computer-implemented method of claim 12 wherein the CTC loss function is implemented is a graphics processing unit (GPU) based implementation.

16. The computer-implemented method of claim 15 wherein the CTC loss function algorithm comprises one or more of the following approaches: (a) for gradient computation, taking each column of the matrix generated from element-wise addition of the forward and backward matrices and doing a key-value reduction using the predicted character as key; (b) mapping the forward and backward passes to corresponding compute kernels; and (c) performing a key-value sort, the key being the characters in the utterance label, and the value being the indices of each character in the utterance.

17. A non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by one or more microprocessors, causes the steps to be performed comprising: receiving a plurality of batches of utterance sequences, each utterance sequence and associated label being sampled from a training set; outputting a probability distribution over predicted characters corresponding to the utterance sequences to a Connectionist Temporal Classification (CTC) layer; and implementing a CTC loss function algorithm for speech transcription training, the implementation involving element-wise addition in log probability space of a forward matrix and a backward matrix generated during a forward pass and a backward pass of the CTC loss function respectively, all elements in each column of the forward matrix being calculated for the CTC loss function implementation.

18. The non-transitory computer-readable medium or media of claim 17 wherein the steps further comprising mapping each utterance sequence in the plurality of batches to a compute thread block.

19. The non-transitory computer-readable medium or media of claim 18 wherein rows of the forward matrix and the backward matrix are processed in parallel by the compute thread block, columns of the forward matrix and the backward matrix are processed sequentially by the compute thread block.

20. The non-transitory computer-readable medium or media of claim 17 wherein the steps further comprising mapping the forward pass and backward pass to a forward compute kernel and a backward compute kernel respectively.
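Claims 12, 14, and 17 describe a CTC loss in which a forward matrix and a backward matrix are added element-wise in log probability space, with every element of each forward column calculated. The following is a minimal CPU sketch of that idea in plain Python, not the claimed GPU implementation. The names `ctc_forward_backward` and `log_add`, the blank index 0, and the matrix layout (rows are positions in the blank-padded label, columns are time steps) are assumptions made for illustration, and the recurrences follow the standard textbook CTC formulation rather than anything stated on this page.

```python
import math

NEG_INF = float("-inf")


def log_add(*xs):
    """Numerically stable log(sum(exp(x))) over a handful of log values."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_forward_backward(log_probs, label, blank=0):
    """log_probs[t][k] is the log-probability of symbol k at time step t.

    Returns (alpha, beta, combined, log_likelihood). All matrices are
    indexed [s][t]: row s is a position in the blank-padded label,
    column t is a time step. Layout and names are illustrative assumptions.
    """
    T = len(log_probs)
    ext = [blank]                      # blank-padded label: _ l1 _ l2 _ ...
    for c in label:
        ext += [c, blank]
    S = len(ext)

    alpha = [[NEG_INF] * T for _ in range(S)]
    beta = [[NEG_INF] * T for _ in range(S)]

    # Forward pass: columns in time order, every row of each column computed.
    alpha[0][0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1][0] = log_probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            terms = [alpha[s][t - 1]]
            if s >= 1:
                terms.append(alpha[s - 1][t - 1])
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2][t - 1])
            alpha[s][t] = log_add(*terms) + log_probs[t][ext[s]]

    # Backward pass: columns in reverse time order.
    beta[S - 1][T - 1] = log_probs[T - 1][ext[S - 1]]
    if S > 1:
        beta[S - 2][T - 1] = log_probs[T - 1][ext[S - 2]]
    for t in range(T - 2, -1, -1):
        for s in range(S):
            terms = [beta[s][t + 1]]
            if s + 1 < S:
                terms.append(beta[s + 1][t + 1])
            if s + 2 < S and ext[s + 2] != blank and ext[s + 2] != ext[s]:
                terms.append(beta[s + 2][t + 1])
            beta[s][t] = log_add(*terms) + log_probs[t][ext[s]]

    # Element-wise addition in log space (a product in probability space).
    # In this convention the emission at (s, t) is counted in both alpha
    # and beta, so callers subtract log_probs[t][ext[s]] once when needed.
    combined = [[alpha[s][t] + beta[s][t] for t in range(T)] for s in range(S)]

    if S > 1:
        log_likelihood = log_add(alpha[S - 1][T - 1], alpha[S - 2][T - 1])
    else:
        log_likelihood = alpha[0][T - 1]
    return alpha, beta, combined, log_likelihood


if __name__ == "__main__":
    # Toy input: 3 time steps, 3 symbols (0 = blank), label "1 2".
    probs = [[0.2, 0.5, 0.3], [0.6, 0.1, 0.3], [0.1, 0.3, 0.6]]
    lp = [[math.log(p) for p in row] for row in probs]
    _, _, combined, ll = ctc_forward_backward(lp, [1, 2])
    print("log-likelihood:", ll)
```

A useful sanity check on a sketch like this: for any column `t`, subtracting the emission log-probability from each entry of `combined` and reducing over rows with `log_add` should reproduce the same log-likelihood. The inner loop over rows `s` has no dependencies within a column, which is consistent with claim 19's description of rows handled in parallel and columns handled sequentially.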
Recurrent networks, e.g. Hopfield networks · CPC title
Probabilistic grammars, e.g. word n-grams · CPC title
Training · CPC title
Backpropagation, e.g. using gradient descent · CPC title
the extracted parameters being power information · CPC title