Deployed end-to-end speech recognition

US10319374B2 · US · B2

Patent metadata
Publication number: US-10319374-B2
Application number: US-201615358083-A
Country: US
Kind code: B2
Filing date: Nov 21, 2016
Priority date: Nov 25, 2015
Publication date: Jun 11, 2019
Grant date: Jun 11, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments of end-to-end deep learning systems and methods are disclosed to recognize speech of vastly different languages, such as English or Mandarin Chinese. In embodiments, the entire pipelines of hand-engineered components are replaced with neural networks, and the end-to-end learning allows handling a diverse variety of speech including noisy environments, accents, and different languages. Using a trained embodiment and an embodiment of a batch dispatch technique with GPUs in a data center, an end-to-end deep learning system can be inexpensively deployed in an online setting, delivering low latency when serving users at scale.
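The abstract names a batch dispatch technique but this page does not describe how it works. The following is a minimal sketch, assuming it refers to eager batching: whenever the worker becomes free, it immediately runs every request that has already arrived as one batch. The function name `eager_batches`, its parameters, and the fixed per-batch cost are hypothetical and are not taken from the patent text shown here.

```python
from collections import deque


def eager_batches(arrival_ms, batch_cost_ms, max_batch):
    """Simulate a single worker that batches requests eagerly.

    Assumptions (not from the patent page): one worker, arrival times
    sorted in milliseconds, and a forward pass that costs batch_cost_ms
    regardless of batch size.
    Returns a list of (start_time_ms, [request indices]) per batch.
    """
    pending = deque(enumerate(arrival_ms))
    batches = []
    free_at = 0
    while pending:
        # Start as soon as the worker is free and one request is waiting.
        start = max(free_at, pending[0][1])
        batch = []
        # Take everything that has already arrived, up to the cap.
        while pending and pending[0][1] <= start and len(batch) < max_batch:
            batch.append(pending.popleft()[0])
        batches.append((start, batch))
        free_at = start + batch_cost_ms
    return batches


if __name__ == "__main__":
    for start, batch in eager_batches([0, 10, 20, 50, 300], 40, 10):
        print(start, batch)
```

Under these assumptions, batches grow when requests arrive faster than the worker can serve them and shrink to single requests when load is light, which is one way to trade throughput against latency.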

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for transcribing speech comprising: receiving an input audio from a user, the input audio comprising an utterance; generating a set of spectrogram frames for the utterance; inputting the set of spectrogram frames into a deep neural network (DNN) model, the DNN model comprising one or more convolution layers, one or more recurrent layers, and a row convolution layer in which an activation at a current time step of the row convolution layer is obtained using information from at least one of the one or more recurrent layers at the current time step and at least one future time step, the DNN model having been trained using a plurality of minibatches of training utterance sequences and batch normalization to normalize pre-activations in at least one of the one or more recurrent layers during training; obtaining probabilities for one or more predicted characters from the DNN model; and performing a search to obtain a transcription of the utterance using at least some of the probabilities of the one or more predicted characters and a language model.

2. The computer-implemented method of claim 1 wherein the batch normalization comprises computing a mean and variance over the length of each training utterance sequence in each minibatch for each hidden unit in one or more of the one or more recurrent layers.

3. The computer-implemented method of claim 1 wherein the row convolution layer is functionally positioned above the one or more recurrent layers.

4. The computer-implemented method of claim 3 wherein the one or more recurrent layers are unidirectional and forward-only layers.

5. The computer-implemented method of claim 4 wherein the activation of the row convolution layer is used for character prediction corresponding to the current time step.

6. The computer-implemented method of claim 1 wherein the predicted characters are English characters or Chinese characters.

7. The computer-implemented method of claim 1 wherein the input audio is normalized to make a total power of the input audio consistent with a set of training samples used to train the DNN model.

8. The computer-implemented method of claim 1 wherein a beam search in the language model is implemented to consider characters with a cumulative probability having at least a threshold probability.

9. The computer-implemented method of claim 1 wherein in generating the set of spectrogram frames, subsampling the utterance is implemented in obtaining the set of spectrogram frames by taking strides of a step size of predetermined number of time slices.

10. The computer-implemented method of claim 1 wherein the predicted characters from the DNN model comprise non-overlapping n-grams formed from words.

11. A non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by one or more processors, causes the steps to be performed comprising: receiving an input audio, the input audio comprising an utterance; obtaining a set of spectrogram frames for the utterance; inputting the set of spectrogram frames into a trained neural network, the trained neural network comprising one or more convolution layers, one or more recurrent layers, and a row convolution layer in which an activation at a current time step of the row convolution layer is obtained using information from at least one of the one or more recurrent layers at the current time step and at least one future time; obtaining probabilities of one or more predicted characters from the trained neural network; and obtaining a transcription of the utterance using probabilities constrained by a language model.

12. The non-transitory computer-readable medium or media of claim 11 wherein the step of inputting the set of spectrogram frames into a trained neural network comprises taking strides of a step size of predetermined number of time slices.

13. The non-transitory computer-readable medium or media of claim 11 wherein the one or more predicted character are selected from a model alphabet comprising the English alphabet and symbols representing alternate labellings comprising non-overlapping n-grams.

14. The non-transitory computer-readable medium or media of claim 11 wherein the steps further comprising normalizing the input audio by using statistics from the training data set.

15. The non-transitory computer-readable medium or media of claim 11 wherein the one or more recurrent layers are unidirectional and forward-only.

16. A computer-implemented method for speech transcription, the method comprising: receiving a set of spectrogram frames which correspond to an utterance; using the set of spectrogram frames as input to a trained neural network, which comprises one or more recurrent layers, whose activation at a current time step is determined using a hidden state of a recurrent layer from the one or more recurrent layers at the current time step and a future hidden states context comprising hidden states of the recurrent layer for one or more future time steps, to obtain one or more predicted characters corresponding to the current time step.

17. The computer-implemented method of claim 16 wherein the one or more recurrent layers are forward-only layers.

18. The computer-implemented method of claim 16 wherein the one or more recurrent layers are trained using a plurality of minibatches of training utterance sequences from a training data set, the plurality of minibatches being normalized during training to normalize pre-activations in at least one of the one or more recurrent layers.

19. The computer-implemented method of claim 16 wherein the activation at the current time step of the row convolution layer is obtained from a weighted sum of a feature matrix comprising the hidden state of the recurrent layer and the future hidden states context of the recurrent layer, in which values of the feature matrix are weighted by a parameter matrix.

20. The computer-implemented method of claim 16 further comprising the step of obtaining a transcription of the utterance using at least some of the predicted characters of the time steps and a language model.
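The row convolution described in the claims (an activation at the current time step computed as a weighted sum of the recurrent layer's hidden state at that step and at a few future steps) can be sketched as follows. This is an illustrative reading, not the patented implementation: the function name `row_convolution`, the per-unit layout of the parameter matrix (one row of weights per hidden unit), and the zero padding past the end of the utterance are all assumptions.

```python
def row_convolution(hidden, weights):
    """Combine each hidden unit with its own future values.

    hidden:  list of T time steps, each a list of d hidden-unit values
             from a forward-only recurrent layer.
    weights: d rows, each with (future_steps + 1) weights; weights[i][j]
             scales unit i at time t + j (assumed layout).
    Steps beyond the end of the utterance are treated as zero (assumed).
    """
    num_steps = len(hidden)
    num_units = len(weights)
    context = len(weights[0])  # current step plus future steps
    output = []
    for t in range(num_steps):
        row = []
        for i in range(num_units):
            total = 0.0
            for j in range(context):
                if t + j < num_steps:
                    total += weights[i][j] * hidden[t + j][i]
            row.append(total)
        output.append(row)
    return output


if __name__ == "__main__":
    hidden = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
    weights = [[1.0, 0.5], [0.0, 1.0]]  # one future step of context
    print(row_convolution(hidden, weights))
    # prints [[2.0, 20.0], [3.5, 30.0], [3.0, 0.0]]
```

Because only a fixed number of future steps is needed, a layer like this lets a forward-only recurrent network use a small amount of lookahead without waiting for the whole utterance, which is consistent with the low-latency deployment goal stated in the abstract.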

Assignees

Baidu USA LLC

Inventors

Classifications

  • Recurrent networks, e.g. Hopfield networks · CPC title

  • Backpropagation, e.g. using gradient descent · CPC title

  • G10L15/16 · Primary

    using artificial neural networks · CPC title

  • the extracted parameters being spectral information of each sub-band · CPC title

  • Feature extraction for speech recognition; Selection of recognition unit · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10319374B2 cover?
Embodiments of end-to-end deep learning systems and methods are disclosed to recognize speech of vastly different languages, such as English or Mandarin Chinese. In embodiments, the entire pipelines of hand-engineered components are replaced with neural networks, and the end-to-end learning allows handling a diverse variety of speech including noisy environments, accents, and different language…
Who is the assignee on this patent?
Baidu USA LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/16. Mapped technology areas include Physics.
When was this patent published?
Publication date Jun 11, 2019 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).