End-to-end speech recognition

US2017148431A1 · US · A1

Patent metadata
Field	Value
Publication number	US-2017148431-A1
Application number	US-201615358102-A
Country	US
Kind code	A1
Filing date	Nov 21, 2016
Priority date	Nov 25, 2015
Publication date	May 25, 2017
Grant date	—

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments of end-to-end deep learning systems and methods are disclosed to recognize speech of vastly different languages, such as English or Mandarin Chinese. In embodiments, the entire pipelines of hand-engineered components are replaced with neural networks, and the end-to-end learning allows handling a diverse variety of speech including noisy environments, accents, and different languages. Using a trained embodiment and an embodiment of a batch dispatch technique with GPUs in a data center, an end-to-end deep learning system can be inexpensively deployed in an online setting, delivering low latency when serving users at scale.

First claim

Opening claim text (preview).

What is claimed is:
1. A computer-implemented method for training a transcription model for speech transcription, the method comprising: for each of a set of utterances: obtaining a set of spectrogram frames from each utterance, the utterance having an associated ground-truth label, the utterance and the associated ground-truth label being sampled from a training set comprising a plurality of minibatches; outputting from the transcription model a predicted character or character probabilities for the utterance, the transcription model comprising one or more convolution layers and one or more recurrent layers, a batch normalization being applied for one or more minibatches within the plurality of minibatches to normalize pre-activations in at least one of the one or more recurrent layers; computing a loss to measure an error in prediction of a character for the utterance given the associated ground-truth label; computing a derivative of the loss with respect to parameters of the transcription model; and updating the transcription model using the derivative through back-propagation.
2. The computer-implemented method of claim 1 wherein the batch normalization is also implemented in one or more convolution layers.
3. The computer-implemented method of claim 2 wherein the normalization comprises computing mean and variance over the length of an utterance sequence in a minibatch for each hidden unit for each layer to be batch normalized.
4. The computer-implemented method of claim 1 wherein subsampling the utterance is implemented in obtaining the set of spectrogram frames by taking strides of a step size of predetermined number of time slices.
5. The computer-implemented method of claim 4 wherein the predicted character from the transcription model comprises alternate labellings enriched from the English alphabet.
6. The computer-implemented method of claim 5 wherein the alternate labellings are selected from whole words, syllables, and non-overlapping n-grams.
7. The computer-implemented method of claim 6 wherein the non-overlapping n-grams are non-overlapping bigrams at word level.
8. The computer-implemented method of claim 7 wherein any unigram labels in the output predicted character are transformed into bigram labels through an isomorphism.
9. The computer-implemented method of claim 1 further comprising: in a first training epoch, iterate through the training set in an increasing order of the length of the longest utterance in each minibatch; and after the first training epoch, revert the plurality of minibatches back to a random order for additional transcription output training.
10. The computer-implemented method of claim 1 wherein the training set is generated from raw audio clips and raw transcriptions through a data acquisition pipeline.
11. The computer-implemented method of claim 10 wherein generating the training set comprises the following steps: aligning the raw audio clips and raw transcriptions; segmenting the aligned audio clips and the corresponding transcriptions whenever the audio encounters a series of consecutive blank labels occurs; and filtering the segmented audio clips and corresponding transcriptions by removing erroneous examples.
12. A computer-implemented method for training a recurrent neural network (RNN) model for speech transcription, the method comprising: receiving, at a first layer of the RNN model, a set of spectrogram frames for each of a plurality of utterances, the plurality of utterance and associated labels being sampled from a training set; applying convolutions in at least one of frequency and time domains, in one or more convolution layers of the RNN model, to the set of spectrogram frames; predicting one or more characters through one or more recurrent layers of the RNN model, a batch normalization being implemented to normalize pre-activations in at least one of the one or more recurrent layers; obtaining a probability distribution over the predicted characters in an output layer of the RNN model; and implementing a Connectionist Temporal Classification (CTC) loss function to measure an error in prediction of a character for the utterance given the associated ground-truth label, the CTC loss function implementation involving element-wise addition of a forward matrix and a backward matrix generated during a forward pass and a backward pass of the CTC loss function respectively, all elements in each column of the forward matrix being calculated for the CTC loss function implementation; computing a derivative of the loss with respect to parameters of the RNN model; and updating the RNN model using the derivative through back-propagation.
13. The computer-implemented method of claim 12 wherein the normalization comprises computing mean and variance over the length of each utterance for the one or more recurrent layers.
14. The computer-implemented method of claim 12 wherein the CTC loss function is implemented in log probability space.
15. The computer-implemented method of claim 12 wherein the CTC loss function is implemented is a graphics processing unit (GPU) based implementation.
16. The computer-implemented method of claim 15 wherein the CTC loss function algorithm comprises one or more of the following approaches: (a) for gradient computation, taking each column of the matrix generated from element-wise addition of the forward and backward matrices and doing a key-value reduction using the predicted character as key; (b) mapping the forward and backward passes to corresponding compute kernels; and (c) performing a key-value sort, the key being the characters in the utterance label, and the value being the indices of each character in the utterance.
17. A non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by one or more microprocessors, causes the steps to be performed comprising: receiving a plurality of batches of utterance sequences, each utterance sequence and associated label being sampled from a training set; outputting a probability distribution over predicted characters corresponding to the utterance sequences to a Connectionist Temporal Classification (CTC) layer; and implementing a CTC loss function algorithm for speech transcription training, the implementation involving element-wise addition in log probability space of a forward matrix and a backward matrix generated during a forward pass and a backward pass of the CTC loss function respectively, all elements in each column of the forward matrix being calculated for the CTC loss function implementation.
18. The non-transitory computer-readable medium or media of claim 17 wherein the steps further comprising mapping each utterance sequence in the plurality of batches to a compute thread block.
19. The non-transitory computer-readable medium or media of claim 18 wherein rows of the forward matrix and the backward matrix are processed in parallel by the compute thread block, columns of the forward matrix and the backward matrix are processed sequentially by the compute thread block.
20. The non-transitory computer-readable medium or media of claim 17 wherein the steps further comprising mapping the forward pass and backward pass to a forward compute kernel and a backward compute kernel respectively.
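
Claims 12, 14 and 17 describe a CTC loss computed in log probability space, in which the forward and backward matrices are combined by element-wise addition. The following is a minimal NumPy sketch of that idea, not the patent's GPU implementation: it follows the standard CTC forward-backward recursion, and the function names (logsumexp, ctc_log_forward_backward), the blank index 0, and the toy inputs are assumptions made for illustration.

```python
import numpy as np


def logsumexp(values):
    """Numerically stable log(sum(exp(values))) for a 1-D sequence."""
    values = np.asarray(values, dtype=float)
    m = np.max(values)
    if np.isneginf(m):  # every term has zero probability
        return -np.inf
    return m + np.log(np.sum(np.exp(values - m)))


def ctc_log_forward_backward(log_probs, label, blank=0):
    """Return (log_alpha, log_beta, ext, log_likelihood).

    log_probs: array of shape (T, V) holding per-frame log probabilities.
    label: list of symbol indices without blanks (hypothetical encoding).
    """
    T = log_probs.shape[0]
    # Interleave blanks around the label symbols: blank, c1, blank, c2, ...
    ext = [blank]
    for c in label:
        ext += [c, blank]
    S = len(ext)
    log_alpha = np.full((T, S), -np.inf)
    log_beta = np.full((T, S), -np.inf)

    # Forward pass: every element of each column is computed (no pruning).
    log_alpha[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        log_alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            terms = [log_alpha[t - 1, s]]
            if s >= 1:
                terms.append(log_alpha[t - 1, s - 1])
            # Skipping a blank is allowed only between two different symbols.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(log_alpha[t - 1, s - 2])
            log_alpha[t, s] = logsumexp(terms) + log_probs[t, ext[s]]

    # Backward pass: the mirror image of the forward recursion.
    log_beta[T - 1, S - 1] = log_probs[T - 1, ext[S - 1]]
    if S > 1:
        log_beta[T - 1, S - 2] = log_probs[T - 1, ext[S - 2]]
    for t in range(T - 2, -1, -1):
        for s in range(S):
            terms = [log_beta[t + 1, s]]
            if s + 1 < S:
                terms.append(log_beta[t + 1, s + 1])
            if s + 2 < S and ext[s] != blank and ext[s] != ext[s + 2]:
                terms.append(log_beta[t + 1, s + 2])
            log_beta[t, s] = logsumexp(terms) + log_probs[t, ext[s]]

    # Valid paths end in the last blank or the last label symbol.
    log_likelihood = logsumexp(log_alpha[T - 1, max(S - 2, 0):])
    return log_alpha, log_beta, np.array(ext), log_likelihood


# Toy input: 2 frames, alphabet {blank, 1}, uniform outputs, label [1].
toy = np.log(np.full((2, 2), 0.5))
log_alpha, log_beta, ext, log_lik = ctc_log_forward_backward(toy, [1])

# Element-wise addition in log space replaces the product alpha * beta.
joint = log_alpha + log_beta
for t in range(toy.shape[0]):
    # Both matrices include frame t's output once, so subtract one copy.
    print(t, np.exp(logsumexp(joint[t] - toy[t, ext])), np.exp(log_lik))
```

In this toy case three of the four possible two-frame paths collapse to the label, so the likelihood should come out near 0.75. Adding log_alpha and log_beta and subtracting the frame's own log probability recovers the same log-likelihood at every time step, which is the per-column quantity a gradient computation would build on.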

Assignees

Baidu USA LLC

Inventors

Classifications

G06N3/044
Recurrent networks, e.g. Hopfield networks · CPC title
G10L15/197
Probabilistic grammars, e.g. word n-grams · CPC title
G10L15/063 (Primary)
Training · CPC title
G06N3/084
Backpropagation, e.g. using gradient descent · CPC title
G10L25/21
the extracted parameters being power information · CPC title

Patent family

Related publications grouped by family.

View patent family 58721011

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2017148431A1 cover?: Embodiments of end-to-end deep learning systems and methods are disclosed to recognize speech of vastly different languages, such as English or Mandarin Chinese. In embodiments, the entire pipelines of hand-engineered components are replaced with neural networks, and the end-to-end learning allows handling a diverse variety of speech including noisy environments, accents, and different language…
Who is the assignee on this patent?: Baidu USA LLC
What technology area does this patent fall under?: Primary CPC classification G10L15/063. Mapped technology areas include Physics.
When was this patent published?: Publication date May 25, 2017 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).