Processing multi-channel audio waveforms
US-9697826-B2 · Jul 4, 2017 · US
US9886949B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9886949-B2 |
| Application number | US-201615392122-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 28, 2016 |
| Priority date | Mar 23, 2016 |
| Publication date | Feb 6, 2018 |
| Grant date | Feb 6, 2018 |
Official abstract text for this publication.
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for neural network adaptive beamforming for multichannel speech recognition are disclosed. In one aspect, a method includes the actions of receiving a first channel of audio data corresponding to an utterance and a second channel of audio data corresponding to the utterance. The actions further include generating a first set of filter parameters for a first filter based on the first channel of audio data and the second channel of audio data and a second set of filter parameters for a second filter based on the first channel of audio data and the second channel of audio data. The actions further include generating a single combined channel of audio data. The actions further include inputting the audio data to a neural network. The actions further include providing a transcription for the utterance.
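The combining step in the abstract (filter each channel, then merge into a single channel) can be illustrated with a minimal filter-and-sum sketch. The names `fir_filter` and `filter_and_sum` are hypothetical, and the filter taps here are hand-picked for illustration; in the disclosure the filter parameters are generated from both channels, which this sketch does not attempt to model.

```python
from typing import List, Sequence


def fir_filter(signal: Sequence[float], taps: Sequence[float]) -> List[float]:
    """Causal FIR filtering: y[t] = sum over n of taps[n] * signal[t - n].

    Samples before the start of the signal are treated as zero, and the
    output has the same length as the input.
    """
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for n, h in enumerate(taps):
            if t - n >= 0:
                acc += h * signal[t - n]
        out.append(acc)
    return out


def filter_and_sum(
    ch1: Sequence[float],
    ch2: Sequence[float],
    taps1: Sequence[float],
    taps2: Sequence[float],
) -> List[float]:
    """Filter each channel with its own FIR filter, then add sample by sample."""
    y1 = fir_filter(ch1, taps1)
    y2 = fir_filter(ch2, taps2)
    return [a + b for a, b in zip(y1, y2)]


if __name__ == "__main__":
    # Assumed toy input: the same impulse reaches microphone 2 one sample late.
    ch1 = [1.0, 0.0, 0.0, 0.0]
    ch2 = [0.0, 1.0, 0.0, 0.0]
    # Hand-picked taps (an assumption): delay channel 1 by one sample so the
    # two copies line up, and scale each channel by 0.5.
    combined = filter_and_sum(ch1, ch2, taps1=[0.0, 0.5], taps2=[0.5])
    print(combined)
```

With these toy inputs the two filtered channels align on the second sample and add coherently, which is the basic effect a beamformer relies on.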
Opening claim text (preview).
What is claimed is:

1. A computer-implemented method comprising: receiving a first channel of audio data corresponding to an utterance and a second channel of audio data corresponding to the utterance; generating, using a trained recurrent neural network, (i) a first set of filter parameters for a first filter based on the first channel of audio data and the second channel of audio data and (ii) a second set of filter parameters for a second filter based on the first channel of audio data and the second channel of audio data; generating a single combined channel of audio data by combining (i) audio data of the first channel that has been filtered using the first filter and (ii) audio data of the second channel that has been filtered using the second filter; inputting the audio data for the single combined channel to a neural network trained as an acoustic model; and providing a transcription for the utterance that is determined based at least on output that the neural network provides in response to receiving the audio data for the single combined channel.
2. The method of claim 1, wherein the recurrent neural network comprises one or more long short-term memory layers.
3. The method of claim 1, wherein the recurrent neural network comprises: a first long short-term memory layer that receives both first and second channels of audio; and a second long short-term memory layer corresponding to the first channel and a third long short-term memory layer corresponding to the second channel, the second long short-term memory layer and the third long short-term memory layer each receiving the output of the first long short-term memory layer and providing a set of filter parameters for the corresponding channel.
4. The method of claim 3, wherein the long short-term memory layer layers have parameters that have been learned during a training process that jointly trains the long short-term memory layers and the neural network that is trained as an acoustic model.
5. The method of claim 1, comprising: changing, or generating, new filter parameters for each input frame of audio data.
6. The method of claim 1, comprising: for each audio frame in a sequence of audio frames of the utterance, generating and a new set of filter parameters and convolving audio data for the frame with a filter with the new set of filter parameters.
7. The method of claim 1, wherein the first filter and the second filter are finite impulse response filters.
8. The method of claim 1, wherein the first filter and the second filter have different parameters.
9. The method of claim 1, wherein different microphone outputs are convolved with different filters.
10. The method of claim 1, wherein the first and second channels of audio data are first and second channels of audio waveform data for the utterance, wherein the first and second channels of audio waveform are recordings of the utterance by different microphones that are spaced apart from each other.
11. The method of claim 1, wherein the neural network trained as an acoustic model comprises a convolutional layer, one or more long-short term memory layers, and multiple hidden layers.
12. The method of claim 11, wherein the convolutional layer of the neural network trained as an acoustic model is configured to perform a time domain convolution.
13. The method of claim 11, wherein the neural network trained as an acoustic model is configured such that output of the convolutional layer is pooled to generate a set of pooled values.
14. The method of claim 13, wherein the neural network trained as an acoustic model is configured to input the pooled values to one or more long-short term memory layers within the neural network trained as an acoustic model.
15. The method of claim 1, wherein the first and second filters are configured to perform both spatial and spectral filtering.
16. The method of claim 1, comprising: convolving the audio data for the first channel with a first filter having the first set of filter parameters to generate first convolution outputs; convolving the audio data for the second channel with a second filter having the second set of filter parameters to generate second convolution outputs; and combining the first convolution outputs and the second convolution outputs.
17. A system comprising: one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising: receiving a first channel of audio data corresponding to an utterance and a second channel of audio data corresponding to the utterance; generating, using a trained recurrent neural network, (i) a first set of filter parameters for a first filter based on the first channel of audio data and the second channel of audio data and (ii) a second set of filter parameters for a second filter based on the first channel of audio data and the second channel of audio data; generating a single combined channel of audio data by combining (i) audio data of the first channel that has been filtered using the first filter and (ii) audio data of the second channel that has been filtered using the second filter; inputting the audio data for the single combined channel to a neural network trained as an acoustic model; and providing a transcription for the utterance that is determined based at least on output that the neural network provides in response to receiving the audio data for the single combined channel.
18. The system of claim 17, wherein the recurrent neural network comprises: a first long short-term memory layer that receives both first and second channels of audio; and a second long short-term memory layer corresponding to the first channel and a third long short-term memory layer corresponding to the second channel, the second long short-term memory layer and the third long short-term memory layer each receiving the output of the first long short-term memory layer and providing a set of filter parameters for the corresponding channel.
19. The system of claim 17, wherein the operations further comprise: convolving the audio data for the first channel with a first filter having the first set of filter parameters to generate first convolution outputs; convolving the audio data for the second channel with a second filter having the second set of filter parameters to generate second convolution outputs; combining the first convolution outputs and the second convolution outputs.
20. A non-transitory computer-readable medium storing software comprising instructions executable by one or more computers which, upon such execution, cause the one or more computers to perform operations comprising: receiving a first channel of audio data corresponding to an utterance and a second channel of audio data corresponding to the utterance; generating, using a trained recurrent neural network, (i) a first set of filter parameters for a first filter based on the first channel of audio data and the second channel of audio data and (ii) a second set of filter parameters for a second filter based on the first channel of audio data and the second channel of audio data; generating a single combined channel of audio data by combining (i) audio data of the first channel that has been filtered using the first filter and (ii) audio data of the second channel that has been filtered using the second filter; inputting the audio data for the single combined channel to a neural netwo
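Claims 5 and 6 describe generating a new set of filter parameters for each audio frame and convolving that frame with the resulting filter. The following is a rough sketch of that per-frame loop only. Everything named here is hypothetical: `adaptive_filter_and_sum`, `convolve_frame`, and especially `energy_weighted_taps`, which is a trivial stand-in for the trained recurrent neural network recited in the claims and is not derived from the patent. The sketch also convolves each frame in isolation, ignoring samples from the previous frame, which is a simplification.

```python
from typing import Callable, List, Sequence, Tuple

Taps = List[float]
# A predictor sees the current frame of both channels and returns one
# set of filter taps per channel.
Predictor = Callable[[Sequence[float], Sequence[float]], Tuple[Taps, Taps]]


def convolve_frame(frame: Sequence[float], taps: Sequence[float]) -> List[float]:
    """Causal FIR convolution within one frame (no history across frames)."""
    out = []
    for t in range(len(frame)):
        acc = 0.0
        for n, h in enumerate(taps):
            if t - n >= 0:
                acc += h * frame[t - n]
        out.append(acc)
    return out


def energy_weighted_taps(
    frame1: Sequence[float], frame2: Sequence[float]
) -> Tuple[Taps, Taps]:
    """Placeholder predictor (NOT the trained network from the claims).

    Returns single-tap filters that weight each channel by its share of
    the frame energy, falling back to equal weights for silent frames.
    """
    e1 = sum(x * x for x in frame1)
    e2 = sum(x * x for x in frame2)
    total = e1 + e2
    if total == 0.0:
        return [0.5], [0.5]
    return [e1 / total], [e2 / total]


def adaptive_filter_and_sum(
    ch1: Sequence[float],
    ch2: Sequence[float],
    frame_len: int,
    predict_taps: Predictor,
) -> List[float]:
    """For each frame: predict new taps, filter both channels, and sum."""
    combined: List[float] = []
    n_samples = min(len(ch1), len(ch2))
    for start in range(0, n_samples, frame_len):
        f1 = ch1[start:min(start + frame_len, n_samples)]
        f2 = ch2[start:min(start + frame_len, n_samples)]
        taps1, taps2 = predict_taps(f1, f2)  # new parameters every frame
        y1 = convolve_frame(f1, taps1)
        y2 = convolve_frame(f2, taps2)
        combined.extend(a + b for a, b in zip(y1, y2))
    return combined


if __name__ == "__main__":
    ch1 = [1.0, 1.0, 0.0, 0.0]
    ch2 = [0.0, 0.0, 2.0, 2.0]
    print(adaptive_filter_and_sum(ch1, ch2, frame_len=2,
                                  predict_taps=energy_weighted_taps))
```

Passing the predictor in as a callable keeps the per-frame loop separate from whatever model supplies the filter parameters, so a learned predictor could replace the placeholder without changing the loop.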
Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech (G10L21/02 takes precedence) · CPC title
characterised by the method used for estimating noise · CPC title
Speech to text systems (G10L15/08 takes precedence) · CPC title
Processing in the time domain · CPC title
Microphone arrays; Beamforming · CPC title