Processing multi-channel audio waveforms

US9697826B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9697826-B2
Application number  US-201615205321-A
Country             US
Kind code           B2
Filing date         Jul 8, 2016
Priority date       Mar 27, 2015
Publication date    Jul 4, 2017
Grant date          Jul 4, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, including computer programs encoded on a computer storage medium, for enhancing the processing of audio waveforms for speech recognition using various neural network processing techniques. In one aspect, a method includes: receiving multiple channels of audio data corresponding to an utterance; convolving each of multiple filters, in a time domain, with each of the multiple channels of audio waveform data to generate convolution outputs, wherein the multiple filters have parameters that have been learned during a training process that jointly trains the multiple filters and trains a deep neural network as an acoustic model; combining, for each of the multiple filters, the convolution outputs for the filter for the multiple channels of audio waveform data; inputting the combined convolution outputs to the deep neural network trained jointly with the multiple filters; and providing a transcription for the utterance that is determined.
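
The convolve-and-combine step described in the abstract can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names (`convolve_valid`, `filter_and_sum`), the hand-written filter taps, and the two-channel toy input are all hypothetical. In the patent the filter parameters are learned jointly with the acoustic model rather than written by hand. The sketch also assumes each filter carries a separate set of taps per channel, which is one reading of "convolving each of multiple filters ... with each of the multiple channels".

```python
from typing import List

Signal = List[float]


def convolve_valid(signal: Signal, taps: Signal) -> Signal:
    """Time-domain convolution, keeping only fully overlapping positions."""
    n, k = len(signal), len(taps)
    return [
        sum(taps[j] * signal[t + k - 1 - j] for j in range(k))
        for t in range(n - k + 1)
    ]


def filter_and_sum(channels: List[Signal], filters: List[List[Signal]]) -> List[Signal]:
    """For each filter, convolve its per-channel taps with the matching
    channel, then sum the convolution outputs across channels.

    filters[p][c] holds the taps of filter p for channel c
    (a hypothetical layout chosen for this sketch).
    Returns one summed output sequence per filter.
    """
    outputs = []
    for per_channel_taps in filters:
        convolved = [convolve_valid(x, h) for x, h in zip(channels, per_channel_taps)]
        # Sum across channels, sample by sample.
        outputs.append([sum(samples) for samples in zip(*convolved)])
    return outputs


# Toy input: two microphone channels; the second is the first delayed by one sample.
channels = [
    [0.0, 1.0, 0.0, -1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, -1.0, 0.0],
]
# Two hand-written filters with two taps per channel (illustrative values only).
filters = [
    [[0.0, 1.0], [1.0, 0.0]],  # delays channel 0 by one sample so it lines up with channel 1
    [[1.0, 0.0], [1.0, 0.0]],  # adds the channels without aligning them
]
summed = filter_and_sum(channels, filters)
print(summed)
```

The first filter compensates for the inter-microphone delay, so the two channels add constructively; the second does not. A filter whose per-channel taps encode different delays is one simple way a time-domain filter can act spatially as well as spectrally.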

First claim

Opening claim text (preview).

What is claimed is:

1. A system comprising: one or more computers and one or more data storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising: receiving multiple channels of audio data corresponding to an utterance; convolving each of multiple filters, in a time domain, with each of the multiple channels of audio waveform data to generate convolution outputs, wherein the multiple filters have parameters that have been learned during a training process that jointly trains the multiple filters and trains a deep neural network as an acoustic model; combining, for each of the multiple filters, the convolution outputs for the filter for the multiple channels of audio waveform data; inputting the combined convolution outputs to the deep neural network trained jointly with the multiple filters; and providing a transcription for the utterance that is determined based at least on output that the deep neural network provides in response to receiving the combined convolution outputs.

2. The system of claim 1, wherein the multiple channels of audio data are multiple channels of audio waveform data for the utterance, wherein the multiple channels of audio waveform are recordings of the utterance by different microphones that are spaced apart from each other.

3. The system of claim 1, wherein the deep neural network is a deep neural network comprising a convolutional layer, one or more long-short term memory (LSTM) layers, and multiple hidden layers.

4. The system of claim 1, wherein the convolutional layer of the deep neural network is configured to perform a frequency domain convolution.

5. The system of claim 3, wherein the deep neural network is configured such that output of convolutional layer is input to at least one of the one or more LSTM layers, and output of the one or more LSTM layers is input to at least one of the multiple hidden layers.

6. The system of claim 1, wherein combining the convolution outputs comprises: summing, for each of the multiple filters, the convolution outputs obtained for different channels using the filter to generate summed outputs corresponding to different time periods; and pooling, for each of the multiple filters, the summed outputs across the different time periods to generated a set of pooled values for the filter.

7. The system of claim 6, wherein pooling the summed outputs across the different time periods comprises max pooling the summed outputs across the different time periods to identify maximum values among the summed outputs for the different time periods.

8. The system of claim 6, wherein combining the convolution outputs comprises applying a rectified non-linearity to the sets of pooled values for each of the multiple filters to obtain rectified values; wherein inputting the combined convolution outputs to the deep neural network comprises inputting the rectified values to the deep neural network.

9. The system of claim 8, wherein the rectified non-linearity comprises a logarithm compression.

10. The system of claim 1, wherein the filters are configured to perform both spatial and spectral filtering.

11. The system of claim 1, wherein the training process that jointly trains the multiple filters and trains the deep neural network as an acoustic model comprises training the multiple filters and the deep neural network using a single module of an automated speech recognizer.

12. The system of claim 1, wherein the training process that jointly trains the multiple filters and trains the deep neural network as an acoustic model is performed using training data that includes audio data from a plurality of different microphone spacing configurations.

13. A computer-implemented method comprising: receiving multiple channels of audio data corresponding to an utterance; convolving each of multiple filters, in a time domain, with each of the multiple channels of audio waveform data to generate convolution outputs, wherein the multiple filters have parameters that have been learned during a training process that jointly trains the multiple filters and trains a deep neural network as an acoustic model; combining, for each of the multiple filters, the convolution outputs for the filter for the multiple channels of audio waveform data; inputting the combined convolution outputs to the deep neural network trained jointly with the multiple filters; and providing a transcription for the utterance that is determined based at least on output that the deep neural network provides in response to receiving the combined convolution outputs.

14. The method of claim 13, wherein the multiple channels of audio data are multiple channels of audio waveform data for the utterance, wherein the multiple channels of audio waveform are recordings of the utterance by different microphones that are spaced apart from each other.

15. The method of claim 13, wherein the deep neural network is a deep neural network comprising a convolutional layer, one or more long-short term memory (LSTM) layers, and multiple hidden layers.

16. The method of claim 13, wherein the convolutional layer of the deep neural network is configured to perform a frequency domain convolution.

17. The method of claim 15, wherein the deep neural network is configured such that output of convolutional layer is input to at least one of the one or more LSTM layers, and output of the one or more LSTM layers is input to at least one of the multiple hidden layers.

18. A non-transitory computer-readable medium storing software comprising instructions executable by one or more computers which, upon such execution, cause the one or more computers to perform operations comprising: receiving multiple channels of audio data corresponding to an utterance; convolving each of multiple filters, in a time domain, with each of the multiple channels of audio waveform data to generate convolution outputs, wherein the multiple filters have parameters that have been learned during a training process that jointly trains the multiple filters and trains a deep neural network as an acoustic model; combining, for each of the multiple filters, the convolution outputs for the filter for the multiple channels of audio waveform data; inputting the combined convolution outputs to the deep neural network trained jointly with the multiple filters; and providing a transcription for the utterance that is determined based at least on output that the deep neural network provides in response to receiving the combined convolution outputs.

19. The non-transitory computer-readable medium of claim 18, wherein the multiple channels of audio data are multiple channels of audio waveform data for the utterance, wherein the multiple channels of audio waveform are recordings of the utterance by different microphones that are spaced apart from each other.

20. The non-transitory computer-readable medium of claim 18, wherein the deep neural network is a deep neural network comprising a convolutional layer, one or more long-short term memory (LSTM) layers, and multiple hidden layers.
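
Claims 6 through 9 describe how the summed filter outputs are reduced before they reach the deep neural network: pooling across time, then a rectified non-linearity that includes logarithm compression. The following is a minimal sketch under stated assumptions, not the patent's implementation. The function name `pool_and_compress`, the offset `eps` that keeps the logarithm finite, the choice of one pooled value per filter, and the sample values are all hypothetical; the claims do not fix a pooling window or the exact form of the compression.

```python
import math
from typing import List


def pool_and_compress(summed: List[List[float]], eps: float = 0.01) -> List[float]:
    """Reduce each filter's summed outputs to a single feature value.

    summed[p] holds the channel-summed convolution outputs of filter p
    over the time steps of one analysis window (assumed layout).
    """
    features = []
    for outputs in summed:
        pooled = max(outputs)          # max pooling across time (claim 7)
        rectified = max(pooled, 0.0)   # rectification: clamp negative values to zero
        # Logarithm compression (claim 9); eps is an assumed stabilising offset.
        features.append(math.log(rectified + eps))
    return features


# Illustrative summed outputs for two filters.
summed = [
    [0.0, 2.0, 0.0, -2.0, 0.0],
    [-1.0, -0.5, -3.0],
]
features = pool_and_compress(summed)
print(features)
```

Per claim 8, values like `features` (one per filter, per window) are what would be passed to the deep neural network as its input.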

Assignees

Google Inc

Inventors

Classifications

  • Combinations of networks

  • Recurrent networks, e.g. Hopfield networks

  • Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]

  • Convolutional networks [CNN, ConvNet]

  • Supervised learning

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Google Inc
What technology area does this patent fall under?
Primary CPC classification G10L15/16. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 4, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).