Adaptive audio enhancement for multichannel speech recognition

US11756534B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11756534-B2
Application number: US-202217649058-A
Country: US
Kind code: B2
Filing date: Jan 26, 2022
Priority date: Mar 23, 2016
Publication date: Sep 12, 2023
Grant date: Sep 12, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for neural network adaptive beamforming for multichannel speech recognition are disclosed. In one aspect, a method includes the actions of receiving a first channel of audio data corresponding to an utterance and a second channel of audio data corresponding to the utterance. The actions further include generating a first set of filter parameters for a first filter based on the first channel of audio data and the second channel of audio data and a second set of filter parameters for a second filter based on the first channel of audio data and the second channel of audio data. The actions further include generating a single combined channel of audio data. The actions further include inputting the audio data to a neural network. The actions further include providing a transcription for the utterance.
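The filter-and-sum step described in the abstract can be illustrated with a minimal sketch. It assumes two toy channels and hand-picked filter taps; in the patent the filter parameters are generated by a neural network from the audio itself, which this sketch does not attempt. The names `fir_filter` and `filter_and_sum` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def fir_filter(signal: Sequence[float], taps: Sequence[float]) -> List[float]:
    """Causal FIR filter: y[t] = sum over n of taps[n] * signal[t - n]."""
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for n, h in enumerate(taps):
            if t - n >= 0:  # samples before the start of the signal count as zero
                acc += h * signal[t - n]
        out.append(acc)
    return out


def filter_and_sum(channels: Sequence[Sequence[float]],
                   filters: Sequence[Sequence[float]]) -> List[float]:
    """Filter each channel with its own taps, then sum into one combined channel."""
    if len(channels) != len(filters):
        raise ValueError("need exactly one filter per channel")
    filtered = [fir_filter(x, h) for x, h in zip(channels, filters)]
    return [sum(samples) for samples in zip(*filtered)]


# Toy data (assumption): the second microphone hears the same ramp one sample later.
first_channel = [1.0, 2.0, 3.0, 4.0]
second_channel = [0.0, 1.0, 2.0, 3.0]

# Hand-picked taps (assumption): delay the first channel by one sample so both
# channels line up, and halve each so the sum keeps the original scale.
first_filter = [0.0, 0.5]
second_filter = [0.5, 0.0]

combined = filter_and_sum([first_channel, second_channel],
                          [first_filter, second_filter])
print(combined)  # [0.0, 1.0, 2.0, 3.0]
```

Here the two filters time-align the channels before summing, which is the basic idea behind beamforming: signals from the target direction add coherently in the combined channel.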

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method when executed by data processing hardware causes the data processing hardware to perform operations comprising:
    receiving multiple channels of audio data each corresponding to an utterance;
    for each channel of the audio data among the multiple channels of the audio data:
        generating, using a filter prediction neural network, a respective set of filter parameters; and
        generating, using a respective finite impulse response filter applying the respective set of filter parameters generated for the channel of the audio data, a respective filtered output associated with the channel of the audio data;
    summing the filtered outputs associated with the multiple channels of the audio data into a summator output; and
    generating, using an acoustic model neural network configured to receive the summator output, an acoustic model output, the acoustic model output representing probability scores for each of a plurality of possible acoustic states,
    wherein the filter prediction neural network and the acoustic model neural network are jointly trained on training utterances using backpropagation through time (BPTT), each training utterance paired with a corresponding acoustic model output target.

2. The computer-implemented method of claim 1, wherein the respective filtered output associated with the channel of the audio data is in a frequency domain.

3. The computer-implemented method of claim 1, wherein a number of the multiple channels of the audio data is greater than two.

4. The computer-implemented method of claim 1, wherein the respective set of filter parameters generated for the channel of the audio data is different than respective set of filter parameters generated for each other channel of the multiple channels of the audio data.

5. The computer-implementation method of claim 1, wherein the multiple channels of the audio data comprise recordings of the utterance by different microphones that are spaced apart from each other.

6. The computer-implemented method of claim 1, wherein the acoustic model neural network comprises one or more long-short term memory layers.

7. The computer-implemented method of claim 1, wherein the acoustic model neural network further comprises a convolutional layer, one or more long-short term memory layers, and multiple hidden layers.

8. The computer-implemented method of claim 7, wherein the convolutional layer of the acoustic model neural network is configured to perform a frequency domain convolution.

9. The computer-implemented method of claim 1, wherein the filter prediction network comprises a plurality of long-short term memory layers.

10. The computer-implemented method of claim 1, wherein the operations further comprise changing, or generating, new filter parameters for each input frame of audio data.

11. A system comprising:
    data processing hardware; and
    memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising:
        receiving multiple channels of audio data each corresponding to an utterance;
        for each channel of the audio data among the multiple channels of the audio data:
            generating, using a filter prediction neural network, a respective set of filter parameters; and
            generating, using a respective finite impulse response filter applying the respective set of filter parameters generated for the channel of the audio data, a respective filtered output associated with the channel of the audio data;
        summing the filtered outputs associated with the multiple channels of the audio data into a summator output; and
        generating, using an acoustic model neural network configured to receive the summator output, an acoustic model output, the acoustic model output representing probability scores for each of a plurality of possible acoustic states,
        wherein the filter prediction neural network and the acoustic model neural network are jointly trained on training utterances using backpropagation through time (BPTT), each training utterance paired with a corresponding acoustic model output target.

12. The system of claim 11, wherein the respective filtered output associated with the channel of the audio data is in a frequency domain.

13. The system of claim 11, wherein a number of the multiple channels of the audio data is greater than two.

14. The system of claim 11, wherein the respective set of filter parameters generated for the channel of the audio data is different than respective set of filter parameters generated for each other channel of the multiple channels of the audio data.

15. The system of claim 11, wherein the multiple channels of the audio data comprise recordings of the utterance by different microphones that are spaced apart from each other.

16. The system of claim 11, wherein the acoustic model neural network comprises one or more long-short term memory layers.

17. The system of claim 11, wherein the acoustic model neural network further comprises a convolutional layer, one or more long-short term memory layers, and multiple hidden layers.

18. The system of claim 17, wherein the convolutional layer of the acoustic model neural network is configured to perform a frequency domain convolution.

19. The system of claim 11, wherein the filter prediction network comprises a plurality of long-short term memory layers.

20. The system of claim 11, wherein the operations further comprise changing, or generating, new filter parameters for each input frame of audio data.
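The data flow of the independent claims, together with the per-frame filter updates of claims 10 and 20, can be sketched roughly as follows. Everything here is an assumption made for illustration: `predict_filters` is a simple energy-based stand-in for the filter prediction neural network, `state_weights` is a fixed linear layer standing in for the acoustic model neural network, filter state is not carried across frames, and joint BPTT training is omitted entirely.

```python
import math
from typing import List, Sequence


def fir_filter(signal: Sequence[float], taps: Sequence[float]) -> List[float]:
    """Causal FIR filter applied within one frame (no history across frames)."""
    return [sum(h * signal[t - n] for n, h in enumerate(taps) if t - n >= 0)
            for t in range(len(signal))]


def predict_filters(frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """Hypothetical stand-in for the filter prediction network.

    Weights each channel by its share of the frame energy and returns one
    two-tap filter per channel. A real system would use a trained network.
    """
    energies = [sum(x * x for x in frame) for frame in frames]
    total = sum(energies)
    if total == 0.0:  # silent frame: fall back to equal weights
        return [[1.0 / len(frames), 0.0] for _ in frames]
    return [[e / total, 0.0] for e in energies]


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def process_utterance(channels: Sequence[Sequence[float]], frame_len: int,
                      state_weights: Sequence[Sequence[float]]) -> List[List[float]]:
    """Per frame: predict filters, filter each channel, sum, score acoustic states."""
    outputs = []
    n_samples = len(channels[0])
    for start in range(0, n_samples - frame_len + 1, frame_len):
        frames = [ch[start:start + frame_len] for ch in channels]
        filters = predict_filters(frames)  # new filter parameters every frame
        filtered = [fir_filter(f, h) for f, h in zip(frames, filters)]
        summed = [sum(samples) for samples in zip(*filtered)]  # summator output
        scores = [sum(w * x for w, x in zip(row, summed)) for row in state_weights]
        outputs.append(softmax(scores))  # probability score per acoustic state
    return outputs


# Toy inputs (assumptions): two channels, frames of two samples, three states.
channels = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 2.0, 3.0]]
state_weights = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.0]]
posteriors = process_utterance(channels, frame_len=2, state_weights=state_weights)
print(len(posteriors), [round(sum(p), 6) for p in posteriors])
```

The point of the sketch is the ordering of the operations: filter parameters are regenerated for every frame from the multichannel input, each channel is filtered with its own parameters, and only the summed single channel reaches the model that scores acoustic states.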

Assignees

Google LLC

Inventors

Classifications

  • G10L15/16 (Primary)

    using artificial neural networks · CPC title

  • Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech (G10L21/02 takes precedence) · CPC title

  • Microphone arrays; Beamforming · CPC title

  • Speech to text systems (G10L15/08 takes precedence) · CPC title

  • Processing in the time domain · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11756534B2 cover?
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for neural network adaptive beamforming for multichannel speech recognition are disclosed. In one aspect, a method includes the actions of receiving a first channel of audio data corresponding to an utterance and a second channel of audio data corresponding to the utterance. The actions further in…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/16. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 12, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).