Speech recognition apparatus and method
US-2017154033-A1 · Jun 1, 2017 · US
US9984683B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9984683-B2 |
| Application number | US-201615217457-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jul 22, 2016 |
| Priority date | Jul 22, 2016 |
| Publication date | May 29, 2018 |
| Grant date | May 29, 2018 |
Official abstract text for this publication.
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for automatic speech recognition using multi-dimensional models. In some implementations, audio data that describes an utterance is received. A transcription for the utterance is determined using an acoustic model that includes a neural network having first memory blocks for time information and second memory blocks for frequency information. The transcription for the utterance is provided as output of an automated speech recognizer.
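The two kinds of memory blocks named in the abstract can be made concrete with the dependency pattern recited in claims 1 and 5 on this page. The following is a minimal sketch under stated assumptions, not the patented implementation: every state is a single scalar, the split into a hidden state and a cell state follows the usual LSTM convention rather than anything in the claim text, and the names `lstm_step`, `run_grid`, `w_time`, and `w_freq` and all weight values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_own, h_other, c_prev, w):
    """One scalar LSTM update.

    h_own   -- this block's own previous state (its recurrence)
    h_other -- the state borrowed from the other dimension's block
    w       -- hypothetical weights: gate name -> (w_x, w_own, w_other, bias)
    """
    def pre(gate):
        w_x, w_own, w_other, b = w[gate]
        return w_x * x + w_own * h_own + w_other * h_other + b

    i, f, o = sigmoid(pre("i")), sigmoid(pre("f")), sigmoid(pre("o"))
    c = f * c_prev + i * math.tanh(pre("g"))
    return o * math.tanh(c), c

def run_grid(x, w_time, w_freq):
    """x[t][k] is the input at time step t, frequency step k."""
    T, K = len(x), len(x[0])
    zeros = lambda: [[0.0] * (K + 1) for _ in range(T + 1)]
    # Row 0 and column 0 hold the all-zero initial states (an assumption).
    h_t, c_t, h_f, c_f = zeros(), zeros(), zeros(), zeros()
    for t in range(1, T + 1):
        for k in range(1, K + 1):
            x_tk = x[t - 1][k - 1]
            # Time block: own state from the previous time step, plus the
            # frequency block's state at the same time step and the
            # previous frequency step.
            h_t[t][k], c_t[t][k] = lstm_step(
                x_tk, h_t[t - 1][k], h_f[t][k - 1], c_t[t - 1][k], w_time)
            # Frequency block: own state from the previous frequency step,
            # plus the time block's state at the same frequency step and
            # the previous time step.
            h_f[t][k], c_f[t][k] = lstm_step(
                x_tk, h_f[t][k - 1], h_t[t - 1][k], c_f[t][k - 1], w_freq)
    return h_t, h_f

# Hypothetical weights and a 2 x 3 (time x frequency) input grid.
w_time = {g: (0.5, 0.3, 0.2, 0.0) for g in "ifog"}
w_freq = {g: (0.4, 0.1, 0.6, 0.0) for g in "ifog"}
x = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
h_t, h_f = run_grid(x, w_time, w_freq)
```

Passing two separate weight sets, as here, loosely corresponds to the independent weights of claim 6. Passing the same dictionary for both arguments would be one way to illustrate the weight sharing of claim 7.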
Opening claim text (preview).
What is claimed is:
1. A method performed by one or more computers, the method comprising: receiving, by the one or more computers, audio data that describes an utterance; processing the audio data using a neural network that has been trained as an acoustic model, wherein the processing comprises: providing, as input to the neural network, input vectors having values describing the utterance, the values including values representing audio waveform features, wherein the audio waveform features are determined using a filterbank having parameters trained jointly with weights of the neural network, wherein the neural network has first memory blocks for time information and second memory blocks for frequency information, the first memory blocks being different from the second memory blocks; wherein the first memory blocks are time-LSTM blocks that each have a state, and wherein the second memory blocks are frequency-LSTM blocks that each have a state and a corresponding frequency step in a sequence of multiple frequency steps, wherein the states are determined for each of a sequence of multiple time steps; wherein, for each of at least some of the frequency-LSTM blocks, the frequency-LSTM block determines its state using the state of the time-LSTM block corresponding to the same frequency step at the previous time step; and wherein, for each of at least some of the time-LSTM blocks, the time-LSTM block determines its state using the state of the frequency-LSTM block corresponding to the same time step and the previous frequency step; receiving, as output of the neural network, one or more scores that each indicate a likelihood that a respective phonetic unit represents a portion of the utterance; determining, by the one or more computers, a transcription for the utterance based on the one or more scores; and providing, by the one or more computers, the determined transcription as output of an automated speech recognizer.
2. The method of claim 1, wherein receiving the audio data comprises receiving, over a network, audio data generated by a client device; and wherein providing the transcription comprises providing, over the network, the transcription to the client device.
3. The method of claim 1, wherein the neural network comprises a grid-LSTM module, a linear projection layer, one or more LSTM layers, and a deep neural network (DNN); wherein the grid-LSTM module includes the first memory blocks and the second memory blocks, and the grid-LSTM module provides output to the linear projection layer; wherein the linear projection layer provides output to the one or more LSTM layers; and wherein the one or more LSTM layers provide output to the DNN.
4. The method of claim 1, wherein the neural network is configured to share information from each of the first memory blocks with a respective proper subset of the second memory blocks, and the neural network is configured to share information from each of the second memory blocks with a respective proper subset of the first memory blocks.
5. The method of claim 1, wherein, for at least some of the time-LSTM blocks, the time-LSTM block determines its state based on (i) input received for a current time step, (ii) the state of the time-LSTM block at a previous time step, and (iii) a state of exactly one of the frequency-LSTM blocks; and wherein, for at least some of the frequency-LSTM blocks, the frequency-LSTM block determines its state based on (i) input received for a current time step, (ii) the state of the frequency-LSTM block for a previous frequency step at the current time step, and (iii) a state of exactly one of the time-LSTM blocks.
6. The method of claim 1, wherein each of the time-LSTM blocks and the frequency-LSTM blocks has one or more weights, and wherein the weights for the time-LSTM blocks are independent of the weights for the frequency-LSTM blocks.
7. The method of claim 1, wherein each of the time-LSTM blocks and the frequency-LSTM blocks has one or more weights, and wherein at least some of the weights are shared between the time-LSTM blocks and the frequency-LSTM blocks.
8. The method of claim 1, wherein receiving the one or more scores comprises receiving, as output of the neural network, multiple outputs corresponding to different context-dependent states of phones, wherein each of the multiple outputs indicates a likelihood of occurrence for the corresponding context-dependent state.
9. The method of claim 1, wherein providing the input vectors comprises providing input vectors comprising values for log-mel features.
10. The method of claim 1, wherein providing the input vectors comprises providing values representing characteristics of multiple channels of audio describing the utterance.
11. The method of claim 1, wherein processing the audio data using the neural network comprises processing audio data representing characteristics of multiple channels of audio describing the utterance.
12. A system comprising: one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising: receiving, by the one or more computers, audio data that describes an utterance; processing the audio data using a neural network that has been trained as an acoustic model, wherein the processing comprises: providing, as input to the neural network, input vectors having values describing the utterance, the values including values representing audio waveform features, wherein the audio waveform features are determined using a filterbank having parameters trained jointly with weights of the neural network, wherein the neural network has first memory blocks for time information and second memory blocks for frequency information, the first memory blocks being different from the second memory blocks; wherein the first memory blocks are time-LSTM blocks that each have a state, and wherein the second memory blocks are frequency-LSTM blocks that each have a state and a corresponding frequency step in a sequence of multiple frequency steps, wherein the states are determined for each of a sequence of multiple time steps; wherein, for each of at least some of the frequency-LSTM blocks, the frequency-LSTM block determines its state using the state of the time-LSTM block corresponding to the same frequency step at the previous time step; and wherein, for each of at least some of the time-LSTM blocks, the time-LSTM block determines its state using the state of the frequency-LSTM block corresponding to the same time step and the previous frequency step; receiving, as output of the neural network, one or more scores that each indicate a likelihood that a respective phonetic unit represents a portion of the utterance; determining, by the one or more computers, a transcription for the utterance based on the one or more scores; and providing, by the one or more computers, the determined transcription as output of an automated speech recognizer.
13. The system of claim 12, wherein the neural network comprises a grid-LSTM module, a linear projection layer, one or more LSTM layers, and a deep neural network (DNN); wherein the grid-LSTM module includes the first memory blocks and the second memory blocks, and the grid-LSTM module provides output to the linear projection layer; wherein the linear projection layer provides output to the one or more LSTM layers; and wherein the one or more LSTM layers provide output to the DNN.
14. The system of claim 12, wherein processing the audio data using the neural network comprises processing audio data repres
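Claim 9 refers to log-mel features. As background, the sketch below shows one common convention for the mel scale and for log-compressing filter energies. It is an illustration under assumptions, not the method of this patent: the conversion formula is one of several in use, the filterbank here is fixed (whereas claim 1 recites filterbank parameters trained jointly with the network weights), and the function names, filter count, and frequency range are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (one common convention)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_center_frequencies(n_filters, f_min_hz, f_max_hz):
    """Center frequencies (Hz) of n_filters filters spaced evenly in mels."""
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]

def log_energies(filter_energies, floor=1e-10):
    """Log-compress per-filter energies; the floor avoids log(0)."""
    return [math.log(max(e, floor)) for e in filter_energies]

# Hypothetical settings: 40 filters between 125 Hz and 7500 Hz.
centers = mel_center_frequencies(40, 125.0, 7500.0)
```

The centers crowd together at low frequencies and spread out at high frequencies, which is the point of the mel spacing. A learned filterbank would start from or replace such fixed filters and adjust their parameters during training.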
Recurrent networks, e.g. Hopfield networks · CPC title
Combinations of networks · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Supervised learning · CPC title
using artificial neural networks · CPC title