What technology area does this patent fall under?

Primary CPC classification G10L15/16. Mapped technology areas include Physics.

When was this patent published?

Publication date Jul 4, 2017 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Processing multi-channel audio waveforms

US9697826B2 · US · B2

Patent metadata
Field	Value
Publication number	US-9697826-B2
Application number	US-201615205321-A
Country	US
Kind code	B2
Filing date	Jul 8, 2016
Priority date	Mar 27, 2015
Publication date	Jul 4, 2017
Grant date	Jul 4, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection. Read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, including computer programs encoded on a computer storage medium, for enhancing the processing of audio waveforms for speech recognition using various neural network processing techniques. In one aspect, a method includes: receiving multiple channels of audio data corresponding to an utterance; convolving each of multiple filters, in a time domain, with each of the multiple channels of audio waveform data to generate convolution outputs, wherein the multiple filters have parameters that have been learned during a training process that jointly trains the multiple filters and trains a deep neural network as an acoustic model; combining, for each of the multiple filters, the convolution outputs for the filter for the multiple channels of audio waveform data; inputting the combined convolution outputs to the deep neural network trained jointly with the multiple filters; and providing a transcription for the utterance that is determined.
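The first two steps of the abstract (convolving each filter with each channel in the time domain, then combining the outputs per filter) can be illustrated with a minimal NumPy sketch. This is not the patented implementation. The function name `convolve_and_sum`, the array shapes, and the choice to give each filter one tap vector per channel are assumptions made for illustration, and random values stand in for the filter parameters that the patent says are learned jointly with the acoustic model.

```python
import numpy as np

def convolve_and_sum(channels, filters):
    """Illustrative only (hypothetical helper, not from the patent).

    channels: array of shape (C, T), one raw waveform per microphone.
    filters:  array of shape (P, C, N), assumed here to hold one tap
              vector per filter and per channel.
    Returns an array of shape (P, T - N + 1): for each filter, the
    time-domain convolutions of all channels summed together.
    """
    num_filters, num_channels, _ = filters.shape
    assert channels.shape[0] == num_channels
    combined = []
    for p in range(num_filters):
        # Convolve filter p with every channel in the time domain.
        per_channel = [
            np.convolve(channels[c], filters[p, c], mode="valid")
            for c in range(num_channels)
        ]
        # Combine the outputs for this filter across channels by summing.
        combined.append(np.sum(per_channel, axis=0))
    return np.stack(combined)

rng = np.random.default_rng(0)
waveforms = rng.standard_normal((2, 400))  # 2 channels, 400 samples each
taps = rng.standard_normal((4, 2, 25))     # 4 filters; random stand-ins for learned taps
combined = convolve_and_sum(waveforms, taps)
print(combined.shape)  # (4, 376)
```

Summing is only one way to combine the per-channel outputs. It is used here because the dependent claims name it, while the abstract itself says only "combining".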

First claim

Opening claim text (preview).

What is claimed is:
1. A system comprising: one or more computers and one or more data storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising: receiving multiple channels of audio data corresponding to an utterance; convolving each of multiple filters, in a time domain, with each of the multiple channels of audio waveform data to generate convolution outputs, wherein the multiple filters have parameters that have been learned during a training process that jointly trains the multiple filters and trains a deep neural network as an acoustic model; combining, for each of the multiple filters, the convolution outputs for the filter for the multiple channels of audio waveform data; inputting the combined convolution outputs to the deep neural network trained jointly with the multiple filters; and providing a transcription for the utterance that is determined based at least on output that the deep neural network provides in response to receiving the combined convolution outputs.
2. The system of claim 1, wherein the multiple channels of audio data are multiple channels of audio waveform data for the utterance, wherein the multiple channels of audio waveform are recordings of the utterance by different microphones that are spaced apart from each other.
3. The system of claim 1, wherein the deep neural network is a deep neural network comprising a convolutional layer, one or more long-short term memory (LSTM) layers, and multiple hidden layers.
4. The system of claim 1, wherein the convolutional layer of the deep neural network is configured to perform a frequency domain convolution.
5. The system of claim 3, wherein the deep neural network is configured such that output of convolutional layer is input to at least one of the one or more LSTM layers, and output of the one or more LSTM layers is input to at least one of the multiple hidden layers.
6. The system of claim 1, wherein combining the convolution outputs comprises: summing, for each of the multiple filters, the convolution outputs obtained for different channels using the filter to generate summed outputs corresponding to different time periods; and pooling, for each of the multiple filters, the summed outputs across the different time periods to generated a set of pooled values for the filter.
7. The system of claim 6, wherein pooling the summed outputs across the different time periods comprises max pooling the summed outputs across the different time periods to identify maximum values among the summed outputs for the different time periods.
8. The system of claim 6, wherein combining the convolution outputs comprises applying a rectified non-linearity to the sets of pooled values for each of the multiple filters to obtain rectified values; wherein inputting the combined convolution outputs to the deep neural network comprises inputting the rectified values to the deep neural network.
9. The system of claim 8, wherein the rectified non-linearity comprises a logarithm compression.
10. The system of claim 1, wherein the filters are configured to perform both spatial and spectral filtering.
11. The system of claim 1, wherein the training process that jointly trains the multiple filters and trains the deep neural network as an acoustic model comprises training the multiple filters and the deep neural network using a single module of an automated speech recognizer.
12. The system of claim 1, wherein the training process that jointly trains the multiple filters and trains the deep neural network as an acoustic model is performed using training data that includes audio data from a plurality of different microphone spacing configurations.
13. A computer-implemented method comprising: receiving multiple channels of audio data corresponding to an utterance; convolving each of multiple filters, in a time domain, with each of the multiple channels of audio waveform data to generate convolution outputs, wherein the multiple filters have parameters that have been learned during a training process that jointly trains the multiple filters and trains a deep neural network as an acoustic model; combining, for each of the multiple filters, the convolution outputs for the filter for the multiple channels of audio waveform data; inputting the combined convolution outputs to the deep neural network trained jointly with the multiple filters; and providing a transcription for the utterance that is determined based at least on output that the deep neural network provides in response to receiving the combined convolution outputs.
14. The method of claim 13, wherein the multiple channels of audio data are multiple channels of audio waveform data for the utterance, wherein the multiple channels of audio waveform are recordings of the utterance by different microphones that are spaced apart from each other.
15. The method of claim 13, wherein the deep neural network is a deep neural network comprising a convolutional layer, one or more long-short term memory (LSTM) layers, and multiple hidden layers.
16. The method of claim 13, wherein the convolutional layer of the deep neural network is configured to perform a frequency domain convolution.
17. The method of claim 15, wherein the deep neural network is configured such that output of convolutional layer is input to at least one of the one or more LSTM layers, and output of the one or more LSTM layers is input to at least one of the multiple hidden layers.
18. A non-transitory computer-readable medium storing software comprising instructions executable by one or more computers which, upon such execution, cause the one or more computers to perform operations comprising: receiving multiple channels of audio data corresponding to an utterance; convolving each of multiple filters, in a time domain, with each of the multiple channels of audio waveform data to generate convolution outputs, wherein the multiple filters have parameters that have been learned during a training process that jointly trains the multiple filters and trains a deep neural network as an acoustic model; combining, for each of the multiple filters, the convolution outputs for the filter for the multiple channels of audio waveform data; inputting the combined convolution outputs to the deep neural network trained jointly with the multiple filters; and providing a transcription for the utterance that is determined based at least on output that the deep neural network provides in response to receiving the combined convolution outputs.
19. The non-transitory computer-readable medium of claim 18, wherein the multiple channels of audio data are multiple channels of audio waveform data for the utterance, wherein the multiple channels of audio waveform are recordings of the utterance by different microphones that are spaced apart from each other.
20. The non-transitory computer-readable medium of claim 18, wherein the deep neural network is a deep neural network comprising a convolutional layer, one or more long-short term memory (LSTM) layers, and multiple hidden layers.
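Claims 6 to 9 narrow the combining step to pooling the summed outputs across time, rectifying the pooled values, and applying a logarithm compression. The sketch below shows one possible reading of that sequence for a single window of summed outputs. The function name `pool_and_compress`, the input shape, and the small `offset` added before the logarithm (so that a rectified value of zero stays finite) are assumptions for illustration and do not appear in the claim text.

```python
import numpy as np

def pool_and_compress(summed, offset=0.01):
    """Illustrative only (hypothetical helper, not from the patent).

    summed: array of shape (P, T), the per-filter outputs already summed
            across channels, with one column per time step in a window.
    Returns one value per filter, shape (P,).
    """
    # Max pooling across time: keep the largest value for each filter.
    pooled = summed.max(axis=1)
    # Rectification: negative pooled values are clipped to zero.
    rectified = np.maximum(pooled, 0.0)
    # Logarithm compression; `offset` is an assumed constant that avoids log(0).
    return np.log(rectified + offset)

summed = np.array([[-1.0, 3.0, 2.0],
                   [-5.0, -2.0, -3.0]])
features = pool_and_compress(summed)
print(features.shape)  # (2,)
```

Under this reading, each window of audio yields one value per filter, and those values are what would be passed on to the deep neural network.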

Assignees

Google Inc

Inventors

Classifications

G06N3/045
Combinations of networks · CPC title
G06N3/044
Recurrent networks, e.g. Hopfield networks · CPC title
G06N3/0442
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
G06N3/0464
Convolutional networks [CNN, ConvNet] · CPC title
G06N3/09
Supervised learning · CPC title

Patent family

Related publications grouped by family.

View patent family 57205501

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9697826B2 cover?: Methods, including computer programs encoded on a computer storage medium, for enhancing the processing of audio waveforms for speech recognition using various neural network processing techniques. In one aspect, a method includes: receiving multiple channels of audio data corresponding to an utterance; convolving each of multiple filters, in a time domain, with each of the multiple channels of…
Who is the assignee on this patent?: Google Inc

Related patents

Robust speech recognition in the presence of echo and noise using multiple signals for discrimination

Multi-Microphone Speech Recognition Systems and Related Techniques

Deep neural net based filter prediction for audio event classification and extraction

Mixed speech recognition

Learning front-end speech recognition parameters within neural network training

Speech recognizer with multi-directional decoding

Method and Apparatus for Using Convolutional Neural Networks in Speech Recognition
