US12119014B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12119014-B2 |
| Application number | US-202117644108-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 14, 2021 |
| Priority date | Aug 9, 2021 |
| Publication date | Oct 15, 2024 |
| Grant date | Oct 15, 2024 |
Official abstract text for this publication.
A method for automatic speech recognition using joint acoustic echo cancellation, speech enhancement, and voice separation includes receiving, at a contextual frontend processing model, input speech features corresponding to a target utterance. The method also includes receiving, at the contextual frontend processing model, at least one of a reference audio signal, a contextual noise signal including noise prior to the target utterance, or a speaker embedding including voice characteristics of a target speaker that spoke the target utterance. The method further includes processing, using the contextual frontend processing model, the input speech features and the at least one of the reference audio signal, the contextual noise signal, or the speaker embedding vector to generate enhanced speech features.
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations comprising: receiving, at a contextual frontend processing model, input speech features corresponding a target utterance and at least one of: a reference audio signal; a contextual noise signal comprising noise prior to the target utterance; or a speaker embedding vector comprising voice characteristics of a target speaker that spoke the target utterance; and processing, using the contextual frontend processing model, the input speech features and the at least one of the reference audio signal, the contextual noise signal, or the speaker embedding vector to generate enhanced speech features by: processing, using a primary encoder, the input speech features to generate a main input encoding; processing, using a noise context encoder, the contextual noise signal to generate a contextual noise encoding; processing, using a cross-attention encoder, the main input encoding and the contextual noise encoding to generate a cross-attention embedding; and decoding the cross-attention embedding into the enhanced speech features corresponding to the target utterance.
2. The computer-implemented method of claim 1, wherein the contextual frontend processing model comprises a conformer neural network architecture that combines convolution and self-attention to model short-range and long-range interactions.
3. The computer-implemented method of claim 1, wherein processing the input speech features to generate the main input encoding further comprises processing the input speech features stacked with reference features corresponding to the reference audio signal to generate the main input encoding.
4. The computer-implemented method of claim 3, wherein the input speech features and the reference features each comprise a respective sequence of log Mel-filterbank energy (LFBE) features.
5. The computer-implemented method of claim 1, wherein: processing the input speech features to generate the main input encoding comprises combining the input speech features with the speaker embedding vector using feature-wise linear modulation (FiLM) to generate the main input encoding; and processing the main input encoding and the contextual noise encoding to generate the cross-attention embedding comprises: combining the main input encoding with the speaker embedding vector using FiLM to generate a modulated main input encoding; and processing the modulated main input encoding and the contextual noise encoding to generate the cross-attention embedding.
6. The computer-implemented method of claim 1, wherein: the primary encoder comprises N modulated conformer blocks; the noise context encoder comprises N conformer blocks and executes in parallel with the primary encoder; and the cross-attention encoder comprises M modulated cross-attention conformer blocks.
7. The computer-implemented method of claim 1, wherein the data processing hardware executes the contextual frontend processing model and resides on a user device, the user device configured to: output the reference audio signal as playback audio via an audio speaker of the user device; and capture the target utterance, the reference audio signal, and the contextual noise signal via one or more microphones of the user device.
8. The computer-implemented method of claim 1, wherein contextual frontend processing model is trained jointly with a backend automatic speech recognition (ASR) model using a spectral loss and an ASR loss.
9. The computer-implemented method of claim 8, wherein the spectral loss is based on an L1 loss function and L2 loss function distance between an estimated ratio mask and an ideal ratio mask, the ideal ratio mask computed using reverberant speech and reverberant noise.
10. The computer-implemented method of claim 8, wherein the ASR loss is computed by: generating, using an ASR encoder of the ASR model configured to receive enhanced speech features predicted by the contextual frontend processing model for a training utterance as input, predicted outputs of the ASR encoder for the enhanced speech features; generating, using the ASR encoder configured to receive target speech features for the training utterance as input, target outputs of the ASR encoder for the target speech features; and computing the ASR loss based on the predicted outputs of the ASR encoder for the enhanced speech features and the target outputs of the ASR encoder for the target speech features.
11. The computer-implemented method of claim 1, wherein the operations further comprise processing, using a backend speech system, the enhanced speech features corresponding to the target utterance.
12. The computer-implemented method of claim 11, wherein the backend speech system comprises at least one of: an automatic speech recognition (ASR) model; a hotword detection model; or an audio or audio-video calling application.
13. A contextual frontend processing model comprising: a primary encoder configured to: receive, as input, input speech features corresponding to a target utterance; and generate, as output, a main input encoding; a noise context encoder configured to: receive, as input, a contextual noise signal comprising noise prior to the target utterance; and generate, as output, a contextual noise encoding; and a cross-attention encoder configured to: receive, as input, the main input encoding generated as output from the primary encoder and the contextual noise encoding generated as output from the noise context encoder; and generate, as output, a cross-attention embedding; and a decoder configured to decode the cross-attention embedding into enhanced speech features corresponding to the target utterance.
14. The contextual frontend processing model of claim 13, wherein the primary encoder is further configured to: receive, as input, reference features corresponding to a reference audio signal; and generate, as output, the main input encoding by processing the input speech features stacked with the reference features.
15. The contextual frontend processing model of claim 14, wherein the input speech features and the reference features each comprise a respective sequence of log Mel-filterbank energy (LFBE) features.
16. The contextual frontend processing model of claim 13, wherein the primary encoder is further configured to: receive, as input, a speaker embedding comprising voice characteristics of a target speaker that spoke the target utterance; and generate, as output, the main input encoding by combining the input speech features with the speaker embedding using feature-wise linear modulation (FiLM).
17. The contextual frontend processing model of claim 13, wherein the cross-attention encoder is further configured to: receive, as input, the main input encoding modulated by a speaker embedding using feature-wise linear modulation (FiLM), the speaker embedding comprising voice characteristics of a target speaker that spoke the target utterance; and process the main input encoding modulated by the speaker embedding and the contextual noise encoding to generate, as output, the cross-attention embedding.
18. The contextual frontend processing model of claim 13, wherein: the primary encoder comprises N modulated conformer blocks; the noise context encoder comprises N conformer blocks and executes in parallel with the primary encoder; and the cross-attention encoder comprises M modulated cross-attention confor
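Claims 5, 16, and 17 combine encodings with the speaker embedding through FiLM, and claims 1 and 13 pass the main input encoding and the contextual noise encoding through a cross-attention encoder. A minimal pure-Python sketch of both steps follows. It is an illustration under stated assumptions, not the patented implementation: the helper names (`affine`, `film`, `cross_attention`), the toy weights, and the choice to treat the main encoding as queries and the noise encoding as keys and values are all hypothetical, and the attention shown is a single head without learned projections rather than a conformer block.

```python
import math


def affine(vec, weights, bias):
    """Return weights @ vec + bias using plain Python lists."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def film(frames, speaker_embedding, w_scale, b_scale, w_shift, b_shift):
    """Feature-wise linear modulation (FiLM).

    A per-feature scale and shift are predicted from the conditioning
    vector (here the speaker embedding) and applied to every frame.
    """
    scale = affine(speaker_embedding, w_scale, b_scale)
    shift = affine(speaker_embedding, w_shift, b_shift)
    return [[s * x + t for x, s, t in zip(frame, scale, shift)]
            for frame in frames]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head scaled dot-product attention (assumed form).

    Each query frame attends over all context frames; the context
    frames act as both keys and values, with no learned projections.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim)
                  for c in context]
        weights = softmax(scores)
        attended.append([sum(w * c[i] for w, c in zip(weights, context))
                         for i in range(dim)])
    return attended


# Toy data (hypothetical): two frames of a 2-dimensional main input encoding.
main_encoding = [[1.0, 2.0], [3.0, 4.0]]
speaker_embedding = [0.5, -1.0]

# Hypothetical FiLM parameters; in a trained model these would be learned.
w_scale, b_scale = [[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0]
w_shift, b_shift = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.5]

modulated = film(main_encoding, speaker_embedding,
                 w_scale, b_scale, w_shift, b_shift)

# Three frames of contextual noise encoding (noise observed before the utterance).
noise_encoding = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
cross_attention_embedding = cross_attention(modulated, noise_encoding)
```

With these toy parameters the scale is `[1.5, 0.0]` and the shift is `[0.0, 0.5]`, so the second feature of every frame is replaced by a constant while the first is amplified. Each output row of `cross_attention` is a weighted average of the noise frames, which is why the result has one row per main-encoding frame.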
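Claim 9 describes a spectral loss built from L1 and L2 distances between an estimated ratio mask and an ideal ratio mask computed from reverberant speech and reverberant noise. The sketch below assumes the common magnitude-ratio form of the mask, speech / (speech + noise), and an unweighted sum of the mean L1 and mean L2 terms. The claim text fixes neither detail, and the function names and toy magnitudes are hypothetical.

```python
def ideal_ratio_mask(speech_mag, noise_mag, eps=1e-8):
    """Per-bin mask speech / (speech + noise) over time-frequency magnitudes.

    Assumed form: the claim only says the mask is computed from
    reverberant speech and reverberant noise. eps avoids division by zero.
    """
    return [[s / (s + n + eps) for s, n in zip(s_row, n_row)]
            for s_row, n_row in zip(speech_mag, noise_mag)]


def spectral_loss(estimated_mask, target_mask):
    """Mean absolute error plus mean squared error between two masks.

    Equal weighting of the L1 and L2 terms is an assumption.
    """
    diffs = [e - t
             for e_row, t_row in zip(estimated_mask, target_mask)
             for e, t in zip(e_row, t_row)]
    l1 = sum(abs(d) for d in diffs) / len(diffs)
    l2 = sum(d * d for d in diffs) / len(diffs)
    return l1 + l2


# Toy magnitudes (hypothetical): 2 frames x 2 frequency bins.
reverberant_speech = [[3.0, 1.0], [0.0, 2.0]]
reverberant_noise = [[1.0, 1.0], [2.0, 2.0]]

target = ideal_ratio_mask(reverberant_speech, reverberant_noise)
estimate = [[0.5, 0.5], [0.5, 0.5]]  # stand-in for a model's predicted mask
loss = spectral_loss(estimate, target)
```

The mask lies in [0, 1] for non-negative magnitudes, approaching 1 where speech dominates a bin and 0 where noise dominates, so a model that predicts it can be scored directly against the target with the two distance terms.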
for correcting frequency response · CPC title
the noise being echo, reverberation of the speech · CPC title
Training · CPC title
Architecture, e.g. interconnection topology · CPC title
the noise being separate speech, e.g. cocktail party · CPC title