Adaptive audio enhancement for multichannel speech recognition

US9886949B2 · US · B2

Patent metadata
Field	Value
Publication number	US-9886949-B2
Application number	US-201615392122-A
Country	US
Kind code	B2
Filing date	Dec 28, 2016
Priority date	Mar 23, 2016
Publication date	Feb 6, 2018
Grant date	Feb 6, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for neural network adaptive beamforming for multichannel speech recognition are disclosed. In one aspect, a method includes the actions of receiving a first channel of audio data corresponding to an utterance and a second channel of audio data corresponding to the utterance. The actions further include generating a first set of filter parameters for a first filter based on the first channel of audio data and the second channel of audio data and a second set of filter parameters for a second filter based on the first channel of audio data and the second channel of audio data. The actions further include generating a single combined channel of audio data. The actions further include inputting the audio data to a neural network. The actions further include providing a transcription for the utterance.
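The combining step described in the abstract (filter each channel with its own filter, then merge the results into one channel) can be illustrated with a minimal sketch. This is not the patented implementation: the function names `fir_filter` and `filter_and_sum`, the toy signals, and the hand-picked filter taps are all assumptions made for illustration. In the patent the filter parameters are generated by a neural network rather than fixed by hand.

```python
from typing import List, Sequence


def fir_filter(signal: Sequence[float], taps: Sequence[float]) -> List[float]:
    """Causal FIR filter: y[t] = sum_k taps[k] * signal[t - k].

    Samples before t = 0 are treated as zero.
    """
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if t - k >= 0:
                acc += h * signal[t - k]
        out.append(acc)
    return out


def filter_and_sum(channels: Sequence[Sequence[float]],
                   filters: Sequence[Sequence[float]]) -> List[float]:
    """Filter each channel with its own FIR filter and sum sample by sample."""
    if len(channels) != len(filters):
        raise ValueError("need exactly one filter per channel")
    filtered = [fir_filter(c, h) for c, h in zip(channels, filters)]
    return [sum(samples) for samples in zip(*filtered)]


# Hypothetical toy input: the same impulse reaches microphone 2 one sample late.
ch1 = [1.0, 0.0, 0.0, 0.0]
ch2 = [0.0, 1.0, 0.0, 0.0]

# Hand-picked taps (an assumption): delay channel 1 by one sample so it lines
# up with channel 2, and give each channel a gain of 0.5.
h1 = [0.0, 0.5]
h2 = [0.5, 0.0]

combined = filter_and_sum([ch1, ch2], [h1, h2])
print(combined)  # [0.0, 1.0, 0.0, 0.0]
```

With these taps the two copies of the impulse are aligned before summation, so they add up in a single sample of the combined channel. That alignment is the basic idea behind filter-and-sum beamforming.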

First claim

Opening claim text (preview).

What is claimed is:
1. A computer-implemented method comprising: receiving a first channel of audio data corresponding to an utterance and a second channel of audio data corresponding to the utterance; generating, using a trained recurrent neural network, (i) a first set of filter parameters for a first filter based on the first channel of audio data and the second channel of audio data and (ii) a second set of filter parameters for a second filter based on the first channel of audio data and the second channel of audio data; generating a single combined channel of audio data by combining (i) audio data of the first channel that has been filtered using the first filter and (ii) audio data of the second channel that has been filtered using the second filter; inputting the audio data for the single combined channel to a neural network trained as an acoustic model; and providing a transcription for the utterance that is determined based at least on output that the neural network provides in response to receiving the audio data for the single combined channel.
2. The method of claim 1, wherein the recurrent neural network comprises one or more long short-term memory layers.
3. The method of claim 1, wherein the recurrent neural network comprises: a first long short-term memory layer that receives both first and second channels of audio; and a second long short-term memory layer corresponding to the first channel and a third long short-term memory layer corresponding to the second channel, the second long short-term memory layer and the third long short-term memory layer each receiving the output of the first long short-term memory layer and providing a set of filter parameters for the corresponding channel.
4. The method of claim 3, wherein the long short-term memory layer layers have parameters that have been learned during a training process that jointly trains the long short-term memory layers and the neural network that is trained as an acoustic model.
5. The method of claim 1, comprising: changing, or generating, new filter parameters for each input frame of audio data.
6. The method of claim 1, comprising: for each audio frame in a sequence of audio frames of the utterance, generating and a new set of filter parameters and convolving audio data for the frame with a filter with the new set of filter parameters.
7. The method of claim 1, wherein the first filter and the second filter are finite impulse response filters.
8. The method of claim 1, wherein the first filter and the second filter have different parameters.
9. The method of claim 1, wherein different microphone outputs are convolved with different filters.
10. The method of claim 1, wherein the first and second channels of audio data are first and second channels of audio waveform data for the utterance, wherein the first and second channels of audio waveform are recordings of the utterance by different microphones that are spaced apart from each other.
11. The method of claim 1, wherein the neural network trained as an acoustic model comprises a convolutional layer, one or more long-short term memory layers, and multiple hidden layers.
12. The method of claim 11, wherein the convolutional layer of the neural network trained as an acoustic model is configured to perform a time domain convolution.
13. The method of claim 11, wherein the neural network trained as an acoustic model is configured such that output of the convolutional layer is pooled to generate a set of pooled values.
14. The method of claim 13, wherein the neural network trained as an acoustic model is configured to input the pooled values to one or more long-short term memory layers within the neural network trained as an acoustic model.
15. The method of claim 1, wherein the first and second filters are configured to perform both spatial and spectral filtering.
16. The method of claim 1, comprising: convolving the audio data for the first channel with a first filter having the first set of filter parameters to generate first convolution outputs; convolving the audio data for the second channel with a second filter having the second set of filter parameters to generate second convolution outputs; and combining the first convolution outputs and the second convolution outputs.
17. A system comprising: one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising: receiving a first channel of audio data corresponding to an utterance and a second channel of audio data corresponding to the utterance; generating, using a trained recurrent neural network, (i) a first set of filter parameters for a first filter based on the first channel of audio data and the second channel of audio data and (ii) a second set of filter parameters for a second filter based on the first channel of audio data and the second channel of audio data; generating a single combined channel of audio data by combining (i) audio data of the first channel that has been filtered using the first filter and (ii) audio data of the second channel that has been filtered using the second filter; inputting the audio data for the single combined channel to a neural network trained as an acoustic model; and providing a transcription for the utterance that is determined based at least on output that the neural network provides in response to receiving the audio data for the single combined channel.
18. The system of claim 17, wherein the recurrent neural network comprises: a first long short-term memory layer that receives both first and second channels of audio; and a second long short-term memory layer corresponding to the first channel and a third long short-term memory layer corresponding to the second channel, the second long short-term memory layer and the third long short-term memory layer each receiving the output of the first long short-term memory layer and providing a set of filter parameters for the corresponding channel.
19. The system of claim 17, wherein the operations further comprise: convolving the audio data for the first channel with a first filter having the first set of filter parameters to generate first convolution outputs; convolving the audio data for the second channel with a second filter having the second set of filter parameters to generate second convolution outputs; combining the first convolution outputs and the second convolution outputs.
20. A non-transitory computer-readable medium storing software comprising instructions executable by one or more computers which, upon such execution, cause the one or more computers to perform operations comprising: receiving a first channel of audio data corresponding to an utterance and a second channel of audio data corresponding to the utterance; generating, using a trained recurrent neural network, (i) a first set of filter parameters for a first filter based on the first channel of audio data and the second channel of audio data and (ii) a second set of filter parameters for a second filter based on the first channel of audio data and the second channel of audio data; generating a single combined channel of audio data by combining (i) audio data of the first channel that has been filtered using the first filter and (ii) audio data of the second channel that has been filtered using the second filter; inputting the audio data for the single combined channel to a neural netwo
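Claims 5 and 6 describe generating a new set of filter parameters for every audio frame and convolving that frame with the resulting filter. The sketch below illustrates only that control flow. The function `predict_taps` is a hypothetical stand-in for the trained recurrent network: it weights each channel by its share of the frame energy, which is not the method claimed in the patent. The names `convolve_frame` and `adaptive_filter_and_sum`, the frame length, and the toy signals are likewise assumptions. For simplicity each frame is filtered independently, with no samples carried across frame boundaries.

```python
from typing import List, Sequence, Tuple


def predict_taps(frame_a: Sequence[float],
                 frame_b: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Hypothetical stand-in for the filter-prediction network.

    Returns one single-tap filter per channel, weighting each channel by its
    share of the total frame energy. This is an illustrative placeholder only.
    """
    energy_a = sum(x * x for x in frame_a)
    energy_b = sum(x * x for x in frame_b)
    total = energy_a + energy_b
    if total == 0.0:
        return [0.5], [0.5]
    return [energy_a / total], [energy_b / total]


def convolve_frame(frame: Sequence[float], taps: Sequence[float]) -> List[float]:
    """Causal time-domain convolution of one frame with an FIR filter."""
    return [
        sum(taps[k] * frame[t - k] for k in range(len(taps)) if t - k >= 0)
        for t in range(len(frame))
    ]


def adaptive_filter_and_sum(ch_a: Sequence[float], ch_b: Sequence[float],
                            frame_len: int) -> List[float]:
    """For each frame: predict new filters, filter both channels, then sum."""
    if len(ch_a) != len(ch_b):
        raise ValueError("channels must have the same length")
    combined: List[float] = []
    for start in range(0, len(ch_a), frame_len):
        frame_a = ch_a[start:start + frame_len]
        frame_b = ch_b[start:start + frame_len]
        taps_a, taps_b = predict_taps(frame_a, frame_b)  # new filters per frame
        out_a = convolve_frame(frame_a, taps_a)
        out_b = convolve_frame(frame_b, taps_b)
        combined.extend(a + b for a, b in zip(out_a, out_b))
    return combined


# Hypothetical toy input: channel A is active in frame 1, channel B in frame 2.
ch_a = [1.0, 1.0, 0.0, 0.0]
ch_b = [0.0, 0.0, 2.0, 2.0]
out = adaptive_filter_and_sum(ch_a, ch_b, frame_len=2)
print(out)  # [1.0, 1.0, 2.0, 2.0]
```

In this toy run the filters change between the two frames: the first frame passes channel A unchanged and suppresses channel B, and the second frame does the opposite. In the claimed method the single combined channel would then be passed to the acoustic-model network, a step this sketch leaves out.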

Assignees

Google Inc

Classifications

G10L15/20
Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech (G10L21/02 takes precedence) · CPC title
G10L21/0216
characterised by the method used for estimating noise · CPC title
G10L15/26
Speech to text systems (G10L15/08 takes precedence) · CPC title
G10L21/0224
Processing in the time domain · CPC title
G10L2021/02166
Microphone arrays; Beamforming · CPC title

Patent family

Related publications grouped by family.

View patent family 57799910

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9886949B2 cover?: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for neural network adaptive beamforming for multichannel speech recognition are disclosed. In one aspect, a method includes the actions of receiving a first channel of audio data corresponding to an utterance and a second channel of audio data corresponding to the utterance. The actions further in…
Who is the assignee on this patent?: Google Inc
What technology area does this patent fall under?: Primary CPC classification G10L15/16. Mapped technology areas include Physics.
When was this patent published?: Publication date Feb 6, 2018 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).