Robust speech recognition in the presence of echo and noise using multiple signals for discrimination
US-2016358602-A1 · Dec 8, 2016 · US
US9697826B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9697826-B2 |
| Application number | US-201615205321-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jul 8, 2016 |
| Priority date | Mar 27, 2015 |
| Publication date | Jul 4, 2017 |
| Grant date | Jul 4, 2017 |
Official abstract text for this publication.
Methods, including computer programs encoded on a computer storage medium, for enhancing the processing of audio waveforms for speech recognition using various neural network processing techniques. In one aspect, a method includes: receiving multiple channels of audio data corresponding to an utterance; convolving each of multiple filters, in a time domain, with each of the multiple channels of audio waveform data to generate convolution outputs, wherein the multiple filters have parameters that have been learned during a training process that jointly trains the multiple filters and trains a deep neural network as an acoustic model; combining, for each of the multiple filters, the convolution outputs for the filter for the multiple channels of audio waveform data; inputting the combined convolution outputs to the deep neural network trained jointly with the multiple filters; and providing a transcription for the utterance that is determined.
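The convolve-and-combine steps in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration, not the patented implementation: the helper names (`convolve_valid`, `filter_and_sum`), the `filters[p][c]` layout (separate taps per filter and per channel), and the choice to combine by summing across channels are assumptions made for the example. The filter values here are hand-picked, whereas the abstract describes parameters learned jointly with the acoustic model.

```python
from typing import List

Signal = List[float]


def convolve_valid(signal: Signal, taps: Signal) -> Signal:
    """Time-domain convolution, keeping only fully overlapping positions."""
    n, k = len(signal), len(taps)
    flipped = taps[::-1]  # convolution reverses the filter taps
    return [
        sum(flipped[j] * signal[t + j] for j in range(k))
        for t in range(n - k + 1)
    ]


def filter_and_sum(channels: List[Signal],
                   filters: List[List[Signal]]) -> List[Signal]:
    """Convolve every filter with every channel, then combine per filter.

    filters[p][c] holds the taps of filter p for channel c
    (a hypothetical layout chosen for this sketch).
    Returns one combined output sequence per filter.
    """
    combined = []
    for per_channel_taps in filters:
        outputs = [
            convolve_valid(x, h) for x, h in zip(channels, per_channel_taps)
        ]
        # Assumed combination rule: sum the per-channel outputs sample by sample.
        combined.append([sum(vals) for vals in zip(*outputs)])
    return combined


# Two microphone channels, one filter with two taps per channel.
channels = [[1, 2, 3, 4], [0, 1, 0, 1]]
filters = [[[1, 0], [0, 1]]]
combined = filter_and_sum(channels, filters)
```

With these toy inputs, the filter passes channel 0 through unchanged and delays channel 1 by one sample before the two are added, which is the basic mechanism by which per-channel time-domain filters can act on inter-microphone delay.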
Opening claim text (preview).
What is claimed is:
1. A system comprising: one or more computers and one or more data storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising: receiving multiple channels of audio data corresponding to an utterance; convolving each of multiple filters, in a time domain, with each of the multiple channels of audio waveform data to generate convolution outputs, wherein the multiple filters have parameters that have been learned during a training process that jointly trains the multiple filters and trains a deep neural network as an acoustic model; combining, for each of the multiple filters, the convolution outputs for the filter for the multiple channels of audio waveform data; inputting the combined convolution outputs to the deep neural network trained jointly with the multiple filters; and providing a transcription for the utterance that is determined based at least on output that the deep neural network provides in response to receiving the combined convolution outputs.
2. The system of claim 1, wherein the multiple channels of audio data are multiple channels of audio waveform data for the utterance, wherein the multiple channels of audio waveform are recordings of the utterance by different microphones that are spaced apart from each other.
3. The system of claim 1, wherein the deep neural network is a deep neural network comprising a convolutional layer, one or more long-short term memory (LSTM) layers, and multiple hidden layers.
4. The system of claim 1, wherein the convolutional layer of the deep neural network is configured to perform a frequency domain convolution.
5. The system of claim 3, wherein the deep neural network is configured such that output of convolutional layer is input to at least one of the one or more LSTM layers, and output of the one or more LSTM layers is input to at least one of the multiple hidden layers.
6. The system of claim 1, wherein combining the convolution outputs comprises: summing, for each of the multiple filters, the convolution outputs obtained for different channels using the filter to generate summed outputs corresponding to different time periods; and pooling, for each of the multiple filters, the summed outputs across the different time periods to generated a set of pooled values for the filter.
7. The system of claim 6, wherein pooling the summed outputs across the different time periods comprises max pooling the summed outputs across the different time periods to identify maximum values among the summed outputs for the different time periods.
8. The system of claim 6, wherein combining the convolution outputs comprises applying a rectified non-linearity to the sets of pooled values for each of the multiple filters to obtain rectified values; wherein inputting the combined convolution outputs to the deep neural network comprises inputting the rectified values to the deep neural network.
9. The system of claim 8, wherein the rectified non-linearity comprises a logarithm compression.
10. The system of claim 1, wherein the filters are configured to perform both spatial and spectral filtering.
11. The system of claim 1, wherein the training process that jointly trains the multiple filters and trains the deep neural network as an acoustic model comprises training the multiple filters and the deep neural network using a single module of an automated speech recognizer.
12. The system of claim 1, wherein the training process that jointly trains the multiple filters and trains the deep neural network as an acoustic model is performed using training data that includes audio data from a plurality of different microphone spacing configurations.
13. A computer-implemented method comprising: receiving multiple channels of audio data corresponding to an utterance; convolving each of multiple filters, in a time domain, with each of the multiple channels of audio waveform data to generate convolution outputs, wherein the multiple filters have parameters that have been learned during a training process that jointly trains the multiple filters and trains a deep neural network as an acoustic model; combining, for each of the multiple filters, the convolution outputs for the filter for the multiple channels of audio waveform data; inputting the combined convolution outputs to the deep neural network trained jointly with the multiple filters; and providing a transcription for the utterance that is determined based at least on output that the deep neural network provides in response to receiving the combined convolution outputs.
14. The method of claim 13, wherein the multiple channels of audio data are multiple channels of audio waveform data for the utterance, wherein the multiple channels of audio waveform are recordings of the utterance by different microphones that are spaced apart from each other.
15. The method of claim 13, wherein the deep neural network is a deep neural network comprising a convolutional layer, one or more long-short term memory (LSTM) layers, and multiple hidden layers.
16. The method of claim 13, wherein the convolutional layer of the deep neural network is configured to perform a frequency domain convolution.
17. The method of claim 15, wherein the deep neural network is configured such that output of convolutional layer is input to at least one of the one or more LSTM layers, and output of the one or more LSTM layers is input to at least one of the multiple hidden layers.
18. A non-transitory computer-readable medium storing software comprising instructions executable by one or more computers which, upon such execution, cause the one or more computers to perform operations comprising: receiving multiple channels of audio data corresponding to an utterance; convolving each of multiple filters, in a time domain, with each of the multiple channels of audio waveform data to generate convolution outputs, wherein the multiple filters have parameters that have been learned during a training process that jointly trains the multiple filters and trains a deep neural network as an acoustic model; combining, for each of the multiple filters, the convolution outputs for the filter for the multiple channels of audio waveform data; inputting the combined convolution outputs to the deep neural network trained jointly with the multiple filters; and providing a transcription for the utterance that is determined based at least on output that the deep neural network provides in response to receiving the combined convolution outputs.
19. The non-transitory computer-readable medium of claim 18, wherein the multiple channels of audio data are multiple channels of audio waveform data for the utterance, wherein the multiple channels of audio waveform are recordings of the utterance by different microphones that are spaced apart from each other.
20. The non-transitory computer-readable medium of claim 18, wherein the deep neural network is a deep neural network comprising a convolutional layer, one or more long-short term memory (LSTM) layers, and multiple hidden layers.
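Claims 6 to 9 describe pooling the summed filter outputs across time and applying a rectified non-linearity that includes logarithm compression. One possible reading is sketched below in plain Python. The function names, the choice to pool a whole window down to a single value per filter, and the small offset `eps` that keeps the logarithm defined at zero are all assumptions made for illustration; the claims do not specify them.

```python
import math
from typing import List


def max_pool_over_time(summed: List[List[float]]) -> List[float]:
    """Return one value per filter: the maximum summed output across time."""
    return [max(sequence) for sequence in summed]


def rectify_and_compress(pooled: List[float], eps: float = 0.01) -> List[float]:
    """Clamp negative values to zero, then apply log compression.

    eps is a hypothetical stabilizing offset so that log(0) is never taken.
    """
    return [math.log(max(value, 0.0) + eps) for value in pooled]


# Summed outputs for two filters over three time steps (toy values).
summed_outputs = [[-1.0, 2.0, 0.5], [-3.0, -2.0, -5.0]]
pooled_values = max_pool_over_time(summed_outputs)      # one maximum per filter
features = rectify_and_compress(pooled_values)          # values fed to the network
```

In this sketch the second filter never produces a positive response, so rectification clamps its pooled value to zero and the feature falls to the floor set by `eps`. The resulting per-filter values play the role of the rectified values that claim 8 describes as the input to the deep neural network.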
Combinations of networks · CPC title
Recurrent networks, e.g. Hopfield networks · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Supervised learning · CPC title