Processing multi-channel audio waveforms
US-2016322055-A1 · Nov 3, 2016 · US
US11062725B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11062725-B2 |
| Application number | US-201916278830-A |
| Country | US |
| Kind code | B2 |
| Filing date | Feb 19, 2019 |
| Priority date | Sep 7, 2016 |
| Publication date | Jul 13, 2021 |
| Grant date | Jul 13, 2021 |
Official abstract text for this publication.
This specification describes computer-implemented methods and systems. One method includes receiving, by a neural network of a speech recognition system, first data representing a first raw audio signal and second data representing a second raw audio signal. The first raw audio signal and the second raw audio signal describe audio occurring at a same period of time. The method further includes generating, by a spatial filtering layer of the neural network, a spatial filtered output using the first data and the second data, and generating, by a spectral filtering layer of the neural network, a spectral filtered output using the spatial filtered output. Generating the spectral filtered output comprises processing frequency-domain data representing the spatial filtered output. The method still further includes processing, by one or more additional layers of the neural network, the spectral filtered output to predict sub-word units encoded in both the first raw audio signal and the second raw audio signal.
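As an illustration of the spatial filtering step only, the sketch below applies a separate short FIR filter to each of two time-aligned channels and sums the results, which is the classic filter-and-sum form of beamforming. This is a minimal sketch and not the patented implementation: the function name `filter_and_sum`, the two-tap filters, and the test signal are all assumptions chosen for illustration, and in the described system the filter weights would be learned by the network rather than set by hand. It uses only the Python standard library.

```python
def filter_and_sum(ch1, ch2, h1, h2):
    """Filter each channel with its own FIR taps and sum into one channel.

    Uses the cross-correlation convention common in neural network
    convolution layers: out[t] = sum_k h1[k]*ch1[t+k] + h2[k]*ch2[t+k].
    """
    n_taps = len(h1)
    n_out = len(ch1) - n_taps + 1
    out = []
    for t in range(n_out):
        acc = 0.0
        for k in range(n_taps):
            acc += h1[k] * ch1[t + k] + h2[k] * ch2[t + k]
        out.append(acc)
    return out


# Hypothetical test tone with a period of four samples.
source = [1, 0, -1, 0, 1, 0, -1, 0]
ch1 = source                      # microphone 1
ch2 = [0, 1, 0, -1, 0, 1, 0, -1]  # microphone 2: same tone, one sample later

# Hand-picked taps (assumption): this pair undoes the one-sample delay.
aligned = filter_and_sum(ch1, ch2, [0.5, 0.0], [0.0, 0.5])

# Hand-picked taps (assumption): this pair is steered the opposite way.
misaligned = filter_and_sum(ch1, ch2, [0.0, 0.5], [0.5, 0.0])

print(aligned)     # [1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
print(misaligned)  # all zeros for this particular tone
```

With these hand-picked taps the first filter pair realigns the one-sample delay between microphones and passes the signal, while the second pair sums misaligned copies that cancel for this particular test tone. Each pair plays the role of one spatial "look direction" with a single-channel output.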
Opening claim text (preview).
What is claimed is:
1. A method comprising: receiving, at data processing hardware, a multi-channel audio input comprising a first audio signal and a second audio signal occurring during a same period of time; obtaining, by the data processing hardware, a first time-domain representation of the first audio signal and a second time-domain representation of the second audio signal; generating, by the data processing hardware, using a spatial filtering convolutional layer of a neural network configured to perform spatial filtering, a corresponding spatial filtered output for each of multiple spatial directions by processing the first time-domain representation of the first audio signal and the second time-domain representation of the second audio signal, the spatial filtering convolutional layer using a stride parameter being set to an integer greater than one; converting, by the data processing hardware, the corresponding spatial filtered output generated for each of the multiple spatial directions into corresponding frequency-domain data; and processing, by the data processing hardware, using one or more additional neural network layers of the neural network, the corresponding frequency-domain data converted from the spatial filtered output generated for each of the multiple spatial directions to predict speech content encoded in the first audio signal and the second audio signal.
2. The method of claim 1, wherein converting the corresponding spatial filtered output generated for each of the multiple spatial directions into corresponding frequency-domain data comprises computing a discrete Fourier transform for the corresponding spatial filtered output generated for each of the multiple spatial directions.
3. The method of claim 2, wherein computing the discrete Fourier transform for the corresponding spatial filtered output generated for each of the multiple spatial directions comprises computing a fast Fourier transform for the corresponding spatial filtered output generated for each of the multiple spatial directions.
4. The method of claim 1, wherein the neural network is part of a speech recognition model.
5. The method of claim 1, wherein the neural network is part of an acoustic model configured to indicate probabilities of sub-word units.
6. The method of claim 1, wherein the one or more additional neural network layers comprise one or more deep neural network layers that provide output to one or more long short-term memory layers.
7. The method of claim 1, wherein the corresponding spatial filtered output generated for each of the multiple spatial directions comprises a single channel of time-domain data.
8. The method of claim 1, wherein at least one additional neural network layer of the one or more additional neural network layers is configured to perform feature extraction.
9. The method of claim 8, wherein the at least one additional neural network layer of the one or more additional neural network layers that is configured to perform feature extraction is also configured to apply a transformation to the corresponding frequency-domain data converted from the spatial filtered output generated for each of the multiple spatial directions.
10. The method of claim 9, wherein the transformation is a linear transformation.
11. The method of claim 9, wherein the transformation is a projection.
12. The method of claim 9, wherein the transformation is a complex linear projection.
13. The method of claim 9, wherein the transformation is a linear projection of energy.
14. The method of claim 1, wherein the neural network comprises: the spatial filtering convolutional layer; at least one feature extraction neural network layer configured to determine frequency-based characteristics of the corresponding frequency-domain data converted from the spatially filtered output generated for each of the multiple spatial directions; and one or more neural network layers configured to receive output of the at least one feature extraction neural network layer and determine speech content using one or more recurrent neural network layers and one or more deep neural network layers.
15. The method of claim 1, further comprising: detecting, by the data processing hardware, the first audio signal and the second audio signal using multiple microphones of a computing device, wherein the data processing hardware resides on the computing device.
16. The method of claim 1, further comprising: detecting, by the data processing hardware, the first audio signal and the second audio signal using multiple microphones of a computing device; wherein the neural network is stored or implemented on the computing device.
17. The method of claim 1, wherein processing the corresponding frequency-domain data converted from the spatial filtered output generated for each of the multiple spatial directions comprises identifying a voice command indicated by the first audio signal and the second audio signal.
18. The method of claim 1, wherein the spatial filtering convolutional layer and the one or more additional layers have been jointly trained during training of the neural network.
19. A system comprising: one or more computing devices; and one or more computer-readable media storing instructions that, when executed by the one or more computing devices, cause the one or more computing devices to perform operations comprising: receiving a multi-channel audio input comprising a first audio signal and a second audio signal occurring during a same period of time; obtaining a first time-domain representation of the first audio signal and a second time-domain representation of the second audio signal; generating, using a spatial filtering convolutional layer of a neural network configured to perform spatial filtering, a corresponding spatial filtered output for each of multiple spatial directions by processing the first time-domain representation of the first audio signal and the second time-domain representation of the second audio signal, the spatial filtering convolutional layer using a stride parameter being set to an integer greater than one; converting the corresponding spatial filtered output generated for each of the multiple spatial directions into corresponding frequency-domain data; and processing, using one or more additional neural network layers of the neural network, the corresponding frequency-domain data converted from the spatial filtered output generated for each of the multiple spatial directions to predict speech content encoded in the first audio signal and the second audio signal.
20. One or more non-transitory computer-readable media storing instructions that, when executed by one or more computing devices, cause the one or more computing devices to perform operations comprising: receiving a multi-channel audio input comprising a first audio signal and a second audio signal occurring during a same period of time; obtaining a first time-domain representation of the first audio signal and a second time-domain representation of the second audio signal; generating, using a spatial filtering convolutional layer of a neural network configured to perform spatial filtering, a corresponding spatial filtered output for each of multiple spatial directions by processing the first time-domain representation of the first audio signal and the second time-domain representation of the second audio signal, the spatial filtering convolutional layer using a stride parameter being set to an integer greater than one; co
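The steps recited in claim 1 can be sketched end to end as follows: a strided multi-channel convolution that yields one single-channel output per look direction, a discrete Fourier transform of each output (claims 2 and 3), and a linear projection of spectral energy (claim 13). This is a hedged illustration under stated assumptions, not the claimed system: the names `strided_spatial_conv`, `dft`, and `energy_projection`, the filter taps, the stride of 2, and the projection weights are hypothetical, and a real network would learn the weights jointly with the later layers (claim 18). A naive O(n^2) DFT stands in for an FFT so that the example needs only the standard library.

```python
import cmath
import math


def strided_spatial_conv(channels, filters, stride):
    """Filter-and-sum across channels, advancing `stride` samples per output.

    channels: list of C equal-length sample lists (time domain).
    filters:  list of P look directions, each holding C tap lists of length N.
    Returns P single-channel time-domain outputs.
    """
    n_taps = len(filters[0][0])
    n_samples = len(channels[0])
    outputs = []
    for direction in filters:
        out = []
        for start in range(0, n_samples - n_taps + 1, stride):
            acc = 0.0
            for taps, samples in zip(direction, channels):
                for k in range(n_taps):
                    acc += taps[k] * samples[start + k]
            out.append(acc)
        outputs.append(out)
    return outputs


def dft(x):
    """Naive discrete Fourier transform (an FFT computes the same values)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]


def energy_projection(spectrum, weights):
    """Linear projection of per-bin energy |X[f]|^2 onto output features."""
    energy = [abs(v) ** 2 for v in spectrum]
    return [sum(w * e for w, e in zip(row, energy)) for row in weights]


# Hypothetical two-microphone input: channel 2 lags channel 1 by one sample.
n = 16
source = [math.sin(2 * math.pi * t / 8) for t in range(n + 1)]
ch1 = source[1:]
ch2 = source[:-1]

# Two look directions, two channels, two taps each (illustrative values).
filters = [
    [[0.5, 0.0], [0.0, 0.5]],
    [[0.0, 0.5], [0.5, 0.0]],
]

outputs = strided_spatial_conv([ch1, ch2], filters, stride=2)
spectra = [dft(o) for o in outputs]

# Hypothetical 2 x 8 projection: sum energy over the lower and upper bins.
weights = [[1.0] * 4 + [0.0] * 4, [0.0] * 4 + [1.0] * 4]
features = [energy_projection(s, weights) for s in spectra]
```

A stride of 2 halves the number of output samples per direction (8 from 16 here), which reduces the cost of every later stage. The resulting `features` list holds one small feature vector per look direction, the kind of input the additional layers in claim 1 would consume.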
using properties of sound source · CPC title
Details of processing therefor · CPC title
Microphone arrays; Beamforming · CPC title
the noise being separate speech, e.g. cocktail party · CPC title
Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing · CPC title