Method for Enhancing Noisy Speech using Features from an Automatic Speech Recognition System
US-2016111107-A1 · Apr 21, 2016 · US
US11756534B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11756534-B2 |
| Application number | US-202217649058-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 26, 2022 |
| Priority date | Mar 23, 2016 |
| Publication date | Sep 12, 2023 |
| Grant date | Sep 12, 2023 |
Official abstract text for this publication.
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for neural network adaptive beamforming for multichannel speech recognition are disclosed. In one aspect, a method includes the actions of receiving a first channel of audio data corresponding to an utterance and a second channel of audio data corresponding to the utterance. The actions further include generating a first set of filter parameters for a first filter based on the first channel of audio data and the second channel of audio data and a second set of filter parameters for a second filter based on the first channel of audio data and the second channel of audio data. The actions further include generating a single combined channel of audio data. The actions further include inputting the audio data to a neural network. The actions further include providing a transcription for the utterance.
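The filter-and-sum step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the patented implementation: `predict_filters` is a hypothetical stand-in that returns random taps where the patent uses a trained neural network, and the frame length (400 samples) and tap count (25) are assumed values chosen only for illustration.

```python
import numpy as np


def predict_filters(frames, num_taps, rng):
    """Hypothetical stand-in for the filter prediction step.

    The abstract generates filter parameters from both channels of audio;
    here we simply draw small random taps, one filter per channel.
    """
    num_channels = frames.shape[0]
    return 0.1 * rng.standard_normal((num_channels, num_taps))


def filter_and_sum(frames, filters):
    """Apply one FIR filter per channel and sum the results.

    frames:  array of shape (num_channels, num_samples)
    filters: array of shape (num_channels, num_taps)
    Returns a single combined channel of length num_samples + num_taps - 1.
    """
    filtered = [np.convolve(x, h) for x, h in zip(frames, filters)]
    return np.sum(filtered, axis=0)


rng = np.random.default_rng(0)
frames = rng.standard_normal((2, 400))       # two channels of one utterance frame (assumed size)
filters = predict_filters(frames, num_taps=25, rng=rng)
combined = filter_and_sum(frames, filters)   # single combined channel
```

In the method described above, the combined channel would then be passed to a neural network that produces the transcription; that stage is omitted here.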
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method when executed by data processing hardware causes the data processing hardware to perform operations comprising: receiving multiple channels of audio data each corresponding to an utterance; for each channel of the audio data among the multiple channels of the audio data: generating, using a filter prediction neural network, a respective set of filter parameters; and generating, using a respective finite impulse response filter applying the respective set of filter parameters generated for the channel of the audio data, a respective filtered output associated with the channel of the audio data; summing the filtered outputs associated with the multiple channels of the audio data into a summator output; and generating, using an acoustic model neural network configured to receive the summator output, an acoustic model output, the acoustic model output representing probability scores for each of a plurality of possible acoustic states, wherein the filter prediction neural network and the acoustic model neural network are jointly trained on training utterances using backpropagation through time (BPTT), each training utterance paired with a corresponding acoustic model output target.
2. The computer-implemented method of claim 1, wherein the respective filtered output associated with the channel of the audio data is in a frequency domain.
3. The computer-implemented method of claim 1, wherein a number of the multiple channels of the audio data is greater than two.
4. The computer-implemented method of claim 1, wherein the respective set of filter parameters generated for the channel of the audio data is different than respective set of filter parameters generated for each other channel of the multiple channels of the audio data.
5. The computer-implemented method of claim 1, wherein the multiple channels of the audio data comprise recordings of the utterance by different microphones that are spaced apart from each other.
6. The computer-implemented method of claim 1, wherein the acoustic model neural network comprises one or more long-short term memory layers.
7. The computer-implemented method of claim 1, wherein the acoustic model neural network further comprises a convolutional layer, one or more long-short term memory layers, and multiple hidden layers.
8. The computer-implemented method of claim 7, wherein the convolutional layer of the acoustic model neural network is configured to perform a frequency domain convolution.
9. The computer-implemented method of claim 1, wherein the filter prediction network comprises a plurality of long-short term memory layers.
10. The computer-implemented method of claim 1, wherein the operations further comprise changing, or generating, new filter parameters for each input frame of audio data.
11. A system comprising: data processing hardware; and memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising: receiving multiple channels of audio data each corresponding to an utterance; for each channel of the audio data among the multiple channels of the audio data: generating, using a filter prediction neural network, a respective set of filter parameters; and generating, using a respective finite impulse response filter applying the respective set of filter parameters generated for the channel of the audio data, a respective filtered output associated with the channel of the audio data; summing the filtered outputs associated with the multiple channels of the audio data into a summator output; and generating, using an acoustic model neural network configured to receive the summator output, an acoustic model output, the acoustic model output representing probability scores for each of a plurality of possible acoustic states, wherein the filter prediction neural network and the acoustic model neural network are jointly trained on training utterances using backpropagation through time (BPTT), each training utterance paired with a corresponding acoustic model output target.
12. The system of claim 11, wherein the respective filtered output associated with the channel of the audio data is in a frequency domain.
13. The system of claim 11, wherein a number of the multiple channels of the audio data is greater than two.
14. The system of claim 11, wherein the respective set of filter parameters generated for the channel of the audio data is different than respective set of filter parameters generated for each other channel of the multiple channels of the audio data.
15. The system of claim 11, wherein the multiple channels of the audio data comprise recordings of the utterance by different microphones that are spaced apart from each other.
16. The system of claim 11, wherein the acoustic model neural network comprises one or more long-short term memory layers.
17. The system of claim 11, wherein the acoustic model neural network further comprises a convolutional layer, one or more long-short term memory layers, and multiple hidden layers.
18. The system of claim 17, wherein the convolutional layer of the acoustic model neural network is configured to perform a frequency domain convolution.
19. The system of claim 11, wherein the filter prediction network comprises a plurality of long-short term memory layers.
20. The system of claim 11, wherein the operations further comprise changing, or generating, new filter parameters for each input frame of audio data.
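As a rough illustration of the frequency-domain variant in claims 2 and 12, the sketch below performs filter-and-sum as a per-bin multiplication and checks it against time-domain convolution. Everything here is an assumption made for illustration: the three-channel input, the frame and tap sizes, the random filters standing in for the filter prediction neural network, and the single random linear layer with softmax standing in for the acoustic model neural network. No training (and so no BPTT) is shown.

```python
import numpy as np


def filter_and_sum_freq(frames, filters):
    """Filter each channel by per-bin multiplication, then sum across channels.

    Zero-padding both signals to num_samples + num_taps - 1 makes the
    product of spectra equal to linear (not circular) convolution.
    """
    fft_len = frames.shape[1] + filters.shape[1] - 1
    spectra = np.fft.rfft(frames, n=fft_len, axis=1)
    responses = np.fft.rfft(filters, n=fft_len, axis=1)
    return np.sum(spectra * responses, axis=0), fft_len


def softmax(logits):
    """Numerically stable softmax producing scores that sum to 1."""
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()


rng = np.random.default_rng(0)
frames = rng.standard_normal((3, 400))          # more than two channels, as in claims 3 and 13
filters = 0.1 * rng.standard_normal((3, 25))    # placeholder for predicted filter parameters

summed_spectrum, fft_len = filter_and_sum_freq(frames, filters)

# Time-domain reference: FIR-filter each channel, then sum.
time_domain = np.sum([np.convolve(x, h) for x, h in zip(frames, filters)], axis=0)

# Hypothetical acoustic model: one random linear layer over log-magnitude features.
num_states = 10                                  # assumed number of acoustic states
features = np.log(np.abs(summed_spectrum) + 1e-6)
weights = 0.01 * rng.standard_normal((num_states, features.shape[0]))
scores = softmax(weights @ features)             # probability scores per acoustic state
```

To follow the per-frame adaptation in claims 10 and 20, the same call would be repeated with a fresh set of filters for every input frame.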
using artificial neural networks · CPC title
Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech (G10L21/02 takes precedence) · CPC title
Microphone arrays; Beamforming · CPC title
Speech to text systems (G10L15/08 takes precedence) · CPC title
Processing in the time domain · CPC title