US9799327B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-9799327-B1 |
| Application number | US-201615055476-A |
| Country | US |
| Kind code | B1 |
| Filing date | Feb 26, 2016 |
| Priority date | Feb 26, 2016 |
| Publication date | Oct 24, 2017 |
| Grant date | Oct 24, 2017 |
Official abstract text for this publication.
Methods, systems, and apparatus, including computer programs encoded on computer storage media for speech recognition. One method includes obtaining an input acoustic sequence, the input acoustic sequence representing an utterance, and the input acoustic sequence comprising a respective acoustic feature representation at each of a first number of time steps; processing the input acoustic sequence using a first neural network to convert the input acoustic sequence into an alternative representation for the input acoustic sequence; processing the alternative representation for the input acoustic sequence using an attention-based Recurrent Neural Network (RNN) to generate, for each position in an output sequence order, a set of substring scores that includes a respective substring score for each substring in a set of substrings; and generating a sequence of substrings that represent a transcription of the utterance.
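The conversion into an alternative representation with fewer time steps can be illustrated with the concatenation step that the claims attribute to the pyramid BLSTM layers, where consecutive layer outputs are concatenated before reaching the next layer. The following is a minimal sketch only: the function name `concat_consecutive`, the grouping factor of 2, the dropping of leftover frames, and the feature values are all assumptions, and the recurrent (BLSTM) processing between concatenations is omitted.

```python
from typing import List

Frame = List[float]


def concat_consecutive(frames: List[Frame], factor: int = 2) -> List[Frame]:
    """Concatenate each run of `factor` consecutive frames into one longer frame.

    Trailing frames that do not fill a complete group are dropped. This is an
    assumption of the sketch; the source does not say how leftovers are handled.
    """
    usable = len(frames) - len(frames) % factor
    return [
        [x for frame in frames[i:i + factor] for x in frame]
        for i in range(0, usable, factor)
    ]


# Eight time steps of 3-dimensional acoustic features (made-up values).
inputs = [[float(t), t + 0.5, t + 0.25] for t in range(8)]

# Two hypothetical pyramid levels; a real layer would run a BLSTM in between.
level1 = concat_consecutive(inputs)   # 4 time steps, 6 features each
level2 = concat_consecutive(level1)   # 2 time steps, 12 features each

print(len(inputs), len(level1), len(level2))  # prints: 8 4 2
```

Each level halves the number of time steps while doubling the feature width, which is one way the "second number" of time steps can end up smaller than the "first number".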
Opening claim text (preview).
What is claimed is:
1. A computer implemented method comprising: obtaining an input acoustic sequence, the input acoustic sequence representing an utterance, and the input acoustic sequence comprising a respective acoustic feature representation at each of a first number of time steps; processing the input acoustic sequence using a first neural network to convert the input acoustic sequence into an alternative representation for the input acoustic sequence, wherein the alternative representation for the input acoustic sequence comprises a respective alternative acoustic feature representation for each of a second number of time steps, and wherein the second number is smaller than the first number; processing the alternative representation for the input acoustic sequence using an attention-based Recurrent Neural Network (RNN) to generate, for each position in an output sequence order, a set of substring scores that includes a respective substring score for each substring in a set of substrings; and generating a sequence of substrings that represent a transcription of the utterance.
2. The method of claim 1, wherein a substring comprises one or more characters.
3. The method of claim 2, wherein the set of substrings comprises a set of alphabetic letters which is used to write one or more natural languages.
4. The method of claim 3, wherein the substrings in the set of substrings further comprise a space character, a comma character, a period character, an apostrophe character, and an unknown character.
5. The method of claim 1, wherein the generated sequence of substrings begins with a start of sequence token <sos> and ends with an end of sequence token <eos>.
6. The method of claim 1, wherein the first neural network is a pyramid Bidirectional Long Short Term Memory (BLSTM) RNN.
7. The method of claim 6, wherein processing the input acoustic sequence using a first neural network to convert the input acoustic sequence into an alternative representation for the input acoustic sequence comprises: processing the input acoustic sequence through a bottom BLSTM layer to generate a BLSTM layer output; and processing the BLSTM layer output through each a plurality of pyramid BLSTM layers, wherein consecutive outputs of each pyramid BLSTM layer are concatenated before being provided to the next pyramid BLSTM layer.
8. The method of claim 1, wherein processing the alternative representation for the input acoustic sequence using an attention-based RNN comprises, for an initial position in the output sequence order: processing a placeholder start of sequence token and a placeholder initial attention context vector using the attention-based RNN to update a hidden state of the attention-based RNN from an initial hidden state to a hidden state for the initial position in the output sequence order; generating an attention context vector for the initial position from the alternative representation and the RNN hidden state for the initial position in the output sequence order; and generating the set of substring scores for the initial position using the attention context vector for the initial position and the RNN hidden state for the initial position.
9. The method of claim 8, further comprising selecting a highest scoring substring from the set of substring scores as the substring at the initial position in the output sequence of substrings.
10. The method of claim 1, wherein processing the alternative representation for the input acoustic sequence using an attention-based Recurrent Neural Network (RNN) comprises, for each position after an initial position in the output sequence order: processing a substring at the preceding position in the output sequence order and an attention context vector for the preceding position in the order using the attention-based RNN to update the hidden state of the attention-based RNN from the hidden state for the preceding position to a hidden state for the position; generating an attention context vector for the position from the alternative representation and the RNN hidden state for the position in the output sequence order; and generating the set of substring scores for the position using the attention context vector for the position and the RNN hidden state for the position.
11. The method of claim 10, further comprising selecting a highest scoring substring from the set of substring scores for the position as the substring at the position in the output sequence of substrings.
12. The method of claim 10, wherein generating an attention context vector for the position from the alternative representation and the RNN hidden state for the position in the output sequence order comprises: computing a scalar energy for the position using the alternative representation and the hidden state of the attention-based RNN for the position; converting the computed scalar energy into a probability distribution using a softmax function; and using the probability distribution to create a context vector by combining the alternative representation at different positions.
13. The method of claim 10, wherein generating the set of substring scores for the position using the attention context vector for the position and the RNN hidden state for the position comprises: providing the hidden state of the attention-based RNN for the position and generated attention context vector for the position as input to a multi-layer perceptron (MLP) with a softmax output layer; and processing the hidden state of the attention-based RNN for the position and generated attention context vector for the position using the MLP to generate a set of substring scores for each substring in the set of substrings for the position.
14. The method of claim 1, wherein the first neural network and the attention-based recurrent neural network are trained jointly.
15. The method of claim 1, wherein processing the alternative representation for the input sequence using an attention-based Recurrent Neural Network (RNN) comprises processing the alternative representation using an attention-based RNN using a left to right beam search decoding.
16. A system comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising: obtaining an input acoustic sequence, the input acoustic sequence representing an utterance, and the input acoustic sequence comprising a respective acoustic feature representation at each of a first number of time steps; processing the input acoustic sequence using a first neural network to convert the input acoustic sequence into an alternative representation for the input acoustic sequence, wherein the alternative representation for the input acoustic sequence comprises a respective alternative acoustic feature representation for each of a second number of time steps, and wherein the second number is smaller than the first number; processing the alternative representation for the input acoustic sequence using an attention-based Recurrent Neural Network (RNN) to generate, for each position in an output sequence order, a set of substring scores that includes a respective substring score for each substring in a set of substrings; and generating a sequence of substrings that represent a transcription of the utterance.
17. The system of claim 16, wherein a substring comprises one or more characters.
18. The system of claim 17, wherein the set of substrings comprises a set of alphabetic letters whic
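The three steps that claim 12 lists for building an attention context vector (a scalar energy per time step, a softmax over the energies, and a weighted combination of the alternative representation) can be sketched roughly as follows. This is an illustration under stated assumptions, not the claimed implementation: the energy is taken to be a plain dot product between the decoder hidden state and each time step's features, which the claim text does not specify, and the names `attention_context`, `hidden`, and `representation` as well as all numeric values are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(values: List[float]) -> List[float]:
    """Turn scalar energies into a probability distribution."""
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(
    hidden: Vector, representation: List[Vector]
) -> Tuple[List[float], Vector]:
    """Return (attention weights, context vector) for one output position.

    Assumption: the scalar energy is a dot product of the hidden state with
    each time step of the alternative representation.
    """
    energies = [sum(h * r for h, r in zip(hidden, step)) for step in representation]
    weights = softmax(energies)
    dim = len(representation[0])
    # Weighted sum of the representation across time steps.
    context = [
        sum(w * step[d] for w, step in zip(weights, representation))
        for d in range(dim)
    ]
    return weights, context


# Three time steps of a 2-dimensional alternative representation (made up).
representation = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [2.0, 0.0]  # hypothetical decoder hidden state

weights, context = attention_context(hidden, representation)
print(weights)
print(context)
```

With these inputs the first and third time steps receive equal energy, so they share most of the attention weight and dominate the context vector. In the claimed method, the context vector and the hidden state would then feed the scoring step described in claim 13.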
Combinations of networks · CPC title
Recurrent networks, e.g. Hopfield networks · CPC title
using neural networks · CPC title
Version control (for software G06F8/71) · CPC title
Use of codes for handling textual entities · CPC title