Systems and methods for speech transcription

US10540957B2 · US · B2

Patent metadata
Publication number: US-10540957-B2
Application number: US-201514735002-A
Country: US
Kind code: B2
Filing date: Jun 9, 2015
Priority date: Dec 15, 2014
Publication date: Jan 21, 2020
Grant date: Jan 21, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Presented herein are embodiments of state-of-the-art speech recognition systems developed using end-to-end deep learning. In embodiments, the model architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, embodiments of the system do not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learn a function that is robust to such effects. Neither a phoneme dictionary nor even the concept of a “phoneme” is needed. Embodiments include a well-optimized recurrent neural network (RNN) training system that can use multiple GPUs, as well as a set of novel data synthesis techniques that allows for a large amount of varied data for training to be efficiently obtained. Embodiments of the system can also handle challenging noisy environments better than widely used, state-of-the-art commercial speech systems.

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for transcribing speech comprising: receiving an input audio from a user; normalizing the input audio to make a total power of the input audio consistent with a set of training samples used to train a trained neural network; generating a jitter set of audio files from the normalized input audio by translating the normalized input audio by one or more time values; for each audio file from the jitter set of audio files, which includes the normalized input audio: generating a set of spectrogram frames for each audio file; inputting the set of spectrogram frames into a trained neural network; obtaining predicted character probabilities outputs from the trained neural network; and decoding a transcription of the input audio using the predicted character probabilities outputs from the trained neural network constrained by a language model that interprets a string of characters from the predicted character probabilities outputs as a word or words.

2. The computer-implemented method of claim 1 wherein the step of generating a set of spectrogram frames for each audio file comprises: generating spectrogram frames wherein a spectrogram frame comprises a set of linearly spaced log filter banks computed over windows of a first value of milliseconds strided by a second value of milliseconds.

3. The computer-implemented method of claim 1 wherein: the step of inputting the set of spectrogram frames into a trained neural network comprises: inputting the set of spectrogram frames into a set of trained neural networks; and the step of obtaining predicted character probabilities outputs from the trained neural network comprises: ensembling predicted character probabilities outputs from the set of trained neural networks to obtain the predicted character probabilities.

4. The computer-implemented method of claim 3 wherein the step of ensembling predicted character probabilities outputs from the set of trained neural networks to obtain the predicted character probabilities comprises: addressing time shifts between trained neural networks by using one or more of the following comprising: using trained neural networks that exhibit the same temporal shift; shifting one or more of the outputs of the trained neural networks to align the outputs; and shifting one or more of the inputs into one or more of the trained neural networks to have aligned outputs.

5. The computer-implemented method of claim 1 wherein the step of decoding a transcription of the input audio using the predicted character probabilities outputs from the trained neural network constrained by a language model that interprets the string of characters as words comprises: given the predicted character probabilities outputs from the trained neural network, performing a search to find a sequence of characters that is most probable according to both the predicted character probabilities outputs and a trained N-gram language model output that interprets a string of characters from the predicted character probabilities outputs as a word or words.

6. The computer-implemented method of claim 1 wherein the trained neural network comprises a five-layer model comprising: a first set of three layers that are non-recurrent; a fourth layer that is a bi-directional recurrent network, which includes two sets of hidden units comprising a set with forward recurrence and a set with backward recurrence; and a fifth layer that is a non-recurrent layer, which takes forward and backward units from the fourth layer as inputs and outputs the predicted character probabilities.

7. The computer-implemented method of claim 3 wherein the step of inputting the set of spectrogram frames into a trained neural network comprises: inputting the set of spectrogram frames into the trained neural network in which at least one layer of the trained neural network operates on a context of spectrogram frames from the set of spectrogram frames.

8. A non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by one or more processors, causes the steps to be performed comprising: receiving an input audio from a user; generating a set of spectrogram frames from the input audio; inputting the set of spectrogram frames into a set of trained neural networks; obtaining predicted character probabilities outputs from the set of trained neural networks; and decoding a transcription of the input audio using the predicted character probabilities outputs from the set of trained neural networks constrained by a language model that interprets a string of characters from the predicted character probabilities outputs as a word or words.

9. The non-transitory computer-readable medium or media of claim 7 wherein the step of generating a set of spectrogram frames comprises: generating spectrogram frames wherein a spectrogram frame comprises a set of linearly spaced log filter banks computed over windows of a first value of milliseconds strided by a second value of milliseconds.

10. The non-transitory computer-readable medium or media of claim 8 wherein the step of obtaining predicted character probabilities outputs from the set of trained neural networks comprises: ensembling predicted character probabilities outputs from the set of trained neural networks to obtain the predicted character probabilities.

11. The non-transitory computer-readable medium or media of claim 10 wherein the step of ensembling predicted character probabilities outputs from the set of trained neural networks to obtain the predicted character probabilities comprises: addressing time shifts between trained neural networks by using one or more of the following comprising: using trained neural networks that exhibit the same temporal shift; shifting one or more of the outputs of the trained neural networks to align the outputs; and shifting one or more of the inputs into one or more of the trained neural networks to have aligned outputs.

12. The non-transitory computer-readable medium or media of claim 8 wherein the step of generating a set of spectrogram frames from the input audio comprises: generating a set of spectrogram frames from a normalized version of the input audio or from a normalized and jitter version of the input audio.

13. A computer-implemented method for transcribing speech comprising: receiving an input audio from a user; generating a set of spectrogram frames for the input audio; inputting the set of spectrogram frames into a trained neural network; obtaining predicted character probabilities outputs from the trained neural network; and decoding a transcription of the input audio using the predicted character probabilities outputs from the trained neural network constrained by a language model that interprets a string of characters from the predicted character probabilities outputs as a word or words.

14. The computer-implemented method of claim 13 wherein the step of generating a set of spectrogram frames from the input audio comprises: generating a set of spectrogram frames from a normalized version of the input audio or from a normalized and jitter version of the input audio.

15. The computer-implemented method of claim 14 wherein a normalized version of the input audio is obtained by performing the step comprising: normalizing the input audio to make a total power of the input audio consistent with a set of training samples used to train the trained neural network.

16. The computer-implemented method of claim 13 wherein the step of in …
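
The audio preprocessing recited in claims 1 and 2 (power normalization, a jitter set of time-shifted copies, and log filter bank spectrogram frames) can be illustrated with a minimal NumPy sketch. This is not the patented implementation. The function names, the reading of "total power" as mean squared amplitude, the zero-filling of shifted samples, the Hann window, and every numeric value are assumptions chosen for illustration; the claims leave the window length, stride, and number of filter banks unspecified.

```python
import numpy as np

def normalize_power(audio, target_power):
    """Scale audio so its mean power matches target_power.

    target_power stands in for a statistic of the training samples
    (an assumption; the claims only say "consistent with" them).
    """
    power = np.mean(audio ** 2)
    if power == 0.0:
        return audio.copy()  # silent clip: nothing to scale
    return audio * np.sqrt(target_power / power)

def jitter_set(audio, shifts):
    """Return the audio plus copies translated in time.

    shifts are offsets in samples with |shift| < len(audio); positive
    values delay the signal. Vacated samples are zero-filled (assumption).
    """
    clips = [audio.copy()]
    for s in shifts:
        shifted = np.zeros_like(audio)
        if s > 0:
            shifted[s:] = audio[:-s]
        elif s < 0:
            shifted[:s] = audio[-s:]
        else:
            shifted[:] = audio
        clips.append(shifted)
    return clips

def log_filterbank_frames(audio, sample_rate, window_ms, stride_ms, n_banks):
    """Compute frames of linearly spaced log filter bank energies."""
    win = int(sample_rate * window_ms / 1000)
    hop = int(sample_rate * stride_ms / 1000)
    if len(audio) < win:
        raise ValueError("audio is shorter than one analysis window")
    n_frames = 1 + (len(audio) - win) // hop
    window = np.hanning(win)
    n_bins = win // 2 + 1
    # Band edges spaced linearly across the FFT bins (not mel-spaced).
    edges = np.linspace(0, n_bins, n_banks + 1).astype(int)
    frames = np.empty((n_frames, n_banks))
    for t in range(n_frames):
        segment = audio[t * hop : t * hop + win] * window
        spectrum = np.abs(np.fft.rfft(segment)) ** 2
        for b in range(n_banks):
            energy = spectrum[edges[b] : edges[b + 1]].sum()
            frames[t, b] = np.log(energy + 1e-10)  # floor avoids log(0)
    return frames
```

In this sketch, each clip returned by `jitter_set` would be passed through `log_filterbank_frames` and then to the trained network, with the per-clip character probabilities combined before decoding. How the patent combines them is not stated in the claim text shown here.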

Assignees

Baidu USA LLC

Inventors

Classifications

  • Combinations of networks

  • Recurrent networks, e.g. Hopfield networks

  • Speech to text systems (G10L15/08 takes precedence)

  • Backpropagation, e.g. using gradient descent

  • using artificial neural networks

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10540957B2 cover?
Presented herein are embodiments of state-of-the-art speech recognition systems developed using end-to-end deep learning. In embodiments, the model architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, embodiments of the …
Who is the assignee on this patent?
Baidu USA LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/063. Mapped technology areas include Physics.
When was this patent published?
Publication date Jan 21, 2020 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).