Generalized automatic speech recognition for joint acoustic echo cancellation, speech enhancement, and voice separation

US12400672B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12400672-B2
Application number  US-202318171368-A
Country             US
Kind code           B2
Filing date         Feb 19, 2023
Priority date       Mar 20, 2022
Publication date    Aug 26, 2025
Grant date          Aug 26, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method for training a generalized automatic speech recognition model for joint acoustic echo cancellation, speech enhancement, and voice separation includes receiving a plurality of training utterances paired with corresponding training contextual signals. The training contextual signals include a training contextual noise signal including noise prior to the corresponding training utterance, a training reference audio signal, and a training speaker vector including voice characteristics of a target speaker that spoke the corresponding training utterance. The operations also include training, using a contextual signal dropout strategy, a contextual frontend processing model on the training utterances to learn how to predict enhanced speech features. Here, the contextual signal dropout strategy uses a predetermined probability to drop out each of the training contextual signals during training of the contextual frontend processing model.
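
The contextual signal dropout strategy described in the abstract, together with the all-zero replacements recited in claims 2 to 5, can be sketched in plain Python as follows. This is a minimal illustration, not the patented implementation: the function names, the `drop_prob` default of 0.5, and the `noise_context_frames` value of 4 are hypothetical, since the source says only that the probability and the noise-context length are predetermined. Drawing an independent random number per signal is also an assumption. Features are represented as lists of frames.

```python
import random


def zeros(num_frames, feat_dim):
    """Return an all-zero feature matrix as a list of frames."""
    return [[0.0] * feat_dim for _ in range(num_frames)]


def apply_contextual_dropout(utterance, noise_context, reference, speaker_vector,
                             drop_prob=0.5, noise_context_frames=4, rng=random):
    """Replace each contextual signal with zeros with probability drop_prob.

    drop_prob and noise_context_frames are hypothetical values; the source
    states only that both are predetermined.
    """
    num_frames = len(utterance)
    feat_dim = len(utterance[0])
    dropped = {"noise": False, "reference": False, "speaker": False}

    if rng.random() < drop_prob:
        # Claim 4: all-zero feature of predetermined length, same feature dimension.
        noise_context = zeros(noise_context_frames, feat_dim)
        dropped["noise"] = True
    if rng.random() < drop_prob:
        # Claim 3: all-zero feature with the utterance's length and feature dimension.
        reference = zeros(num_frames, feat_dim)
        dropped["reference"] = True
    if rng.random() < drop_prob:
        # Claim 5: all-zero vector in place of the speaker vector.
        speaker_vector = [0.0] * len(speaker_vector)
        dropped["speaker"] = True

    return noise_context, reference, speaker_vector, dropped
```

Under this sketch, a training loop would call `apply_contextual_dropout` once per utterance before passing the signals to the frontend model, so the model is exposed to examples in which any subset of the three contextual signals is missing.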

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method that when executed on data processing hardware causes the data processing hardware to perform operations comprising: receiving a plurality of training utterances paired with corresponding training contextual signals, the training contextual signals comprising: a training contextual noise signal comprising noise prior to the corresponding training utterance; a training reference audio signal; and a training speaker vector comprising voice characteristics of a target speaker that spoke the corresponding training utterance; and training, using a contextual signal dropout strategy, a contextual frontend processing model on the training utterances to learn how to predict enhanced speech features by, for each training utterance of the plurality of training utterances: generating, using a speech encoder configured to receive enhanced input speech features predicted by the contextual frontend processing model for the training utterance as input using the contextual signal dropout strategy, predicted outputs of the speech encoder for the enhanced input speech features; generating, using the speech encoder configured to receive target speech features for the training utterance as input, target outputs of the speech encoder for the target speech features; and computing a loss based on the predicted outputs of the speech encoder for the enhanced speech features and the target outputs of the speech encoder for the target speech features, wherein the contextual signal dropout strategy uses a predetermined probability to drop out each of the training contextual signals during training of the contextual frontend processing model.

2. The computer-implemented method of claim 1, wherein the signal dropout strategy drops out each training contextual signal by replacing the corresponding contextual signal with all-zeroes.

3. The computer-implemented method of claim 2, wherein replacing the training reference audio signal with all zeroes comprises replacing the training reference audio signal with an all-zero feature of a same length and feature dimension as the corresponding training utterance.

4. The computer-implemented method of claim 2, wherein replacing the training contextual noise signal comprises replacing the training contextual noise signal with an all-zero feature having a predetermined length and a same feature dimension as the corresponding training utterance.

5. The computer-implemented method of claim 2, wherein replacing the training speaker vector comprises replacing the training speaker vector with an all-zero feature with an all-zero vector.

6. The computer-implemented method of claim 1, wherein the signal dropout strategy drops out each training contextual signal by replacing the corresponding contextual signal with a frame-level learned representation.

7. The computer-implemented method of claim 1, wherein the trained contextual frontend processing model comprises: a primary encoder configured to: receive, as input, input speech features corresponding to a target utterance; and generate, as output, a main input encoding; a noise context encoder configured to: receive, as input, a contextual noise signal comprising noise prior to the target utterance; and generate, as output, a contextual noise encoding; and a cross-attention encoder configured to: receive, as input, the main input encoding generated as output from the primary encoder and the contextual noise encoding generated as output from the noise context encoder; and generate, as output, a cross-attention embedding; and a decoder configured to decode the cross-attention embedding into enhanced speech features corresponding to the target utterance.

8. The computer-implemented method of claim 7, wherein the primary encoder is further configured to: receive, as input, reference features corresponding to a reference audio signal; and generate, as output, the main input encoding by processing the input speech features stacked with the reference features.

9. The computer-implemented method of claim 7, wherein the primary encoder is further configured to: receive, as input, a speaker embedding comprising voice characteristics of a target speaker that spoke the target utterance; and generate, as output, the main input encoding by combining the input speech features with the speaker embedding using feature-wise linear modulation (FILM).

10. The computer-implemented method of claim 7, wherein the cross-attention encoder is further configured to: receive, as input, the main input encoding modulated by a speaker embedding using feature-wise linear modulation (FILM), the speaker embedding comprising voice characteristics of a target speaker that spoke the target utterance; and process the main input encoding modulated by the speaker embedding and the contextual noise encoding to generate, as output, the cross-attention embedding.

11. The computer-implemented method of claim 7, wherein: the primary encoder comprises N modulated conformer blocks; the noise context encoder comprises N conformer blocks and executes in parallel with the primary encoder; and the cross-attention encoder comprises M modulated cross-attention conformer blocks.

12. The computer-implemented method of claim 1, wherein the contextual frontend processing model is trained jointly with a backend automatic speech recognition (ASR) model using a spectral loss and an ASR loss.

13. The computer-implemented method of claim 12, wherein the spectral loss is based on an L1 loss function and L2 loss function distance between an estimated ratio mask and an ideal ratio mask, the ideal ratio mask computed using reverberant speech and reverberant noise.

14. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware and storing instructions that when executed on the data processing hardware causes the data processing hardware to perform operations comprising: receiving a plurality of training utterances paired with corresponding training contextual signals, the training contextual signals comprising: a training contextual noise signal comprising noise prior to the corresponding training utterance; a training reference audio signal; and a training speaker vector comprising voice characteristics of a target speaker that spoke the corresponding training utterance; and training, using a contextual signal dropout strategy, a contextual frontend processing model on the training utterances to learn how to predict enhanced speech features by, for each training utterance of the plurality of training utterances: generating, using a speech encoder configured to receive enhanced input speech features predicted by the contextual frontend processing model for the training utterance as input using the contextual signal dropout strategy, predicted outputs of the speech encoder for the enhanced input speech features; generating, using the speech encoder configured to receive target speech features for the training utterance as input, target outputs of the speech encoder for the target speech features; and computing a loss based on the predicted outputs of the speech encoder for the enhanced speech features and the target outputs of the speech encoder for the target speech features, wherein the contextual signal dropout strategy uses a predetermined probability to drop out each of the training contextual signals during training of the contextual frontend processing model.

15. The system of claim 14, wherein the signal dropout strategy drops out each trainin
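
Claims 9 and 10 combine features with a speaker embedding using feature-wise linear modulation (FiLM). In its generic form, FiLM scales and shifts each feature channel with values computed from a conditioning vector. The sketch below shows only that generic form: the linear projections, their weights, and the function names are illustrative assumptions, and the claims do not specify how the modulated conformer blocks compute the scale and shift.

```python
def linear(vec, weights, bias):
    """Compute weights @ vec + bias, with weights given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def film(features, speaker_embedding, w_gamma, b_gamma, w_beta, b_beta):
    """Apply generic FiLM: gamma * x + beta per channel, for every frame.

    gamma (scale) and beta (shift) are hypothetical linear projections of
    the speaker embedding; one value of each per feature channel.
    """
    gamma = linear(speaker_embedding, w_gamma, b_gamma)
    beta = linear(speaker_embedding, w_beta, b_beta)
    return [[g * x + b for g, x, b in zip(gamma, frame, beta)]
            for frame in features]
```

The same scale and shift are applied to every frame, which reflects the fact that the speaker embedding describes the target speaker as a whole rather than any single frame.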

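
Claim 13 bases the spectral loss on L1 and L2 distances between an estimated ratio mask and an ideal ratio mask computed from reverberant speech and reverberant noise. The claims give neither the mask formula nor the weighting of the two terms, so the sketch below assumes a common energy-ratio mask, speech / (speech + noise), and an unweighted sum of mean absolute error and mean squared error. All function names are hypothetical.

```python
def ideal_ratio_mask(speech, noise, eps=1e-8):
    """Assumed mask form: per-bin speech energy divided by total energy.

    speech and noise are lists of frames of non-negative spectral energies
    for the reverberant speech and reverberant noise.
    """
    return [[s / (s + n + eps) for s, n in zip(s_frame, n_frame)]
            for s_frame, n_frame in zip(speech, noise)]


def spectral_loss(estimated_mask, ideal_mask):
    """Mean absolute error plus mean squared error over all bins."""
    diffs = [e - i
             for e_frame, i_frame in zip(estimated_mask, ideal_mask)
             for e, i in zip(e_frame, i_frame)]
    l1 = sum(abs(d) for d in diffs) / len(diffs)
    l2 = sum(d * d for d in diffs) / len(diffs)
    return l1 + l2
```

In the joint training of claim 12, a value like this would be combined with the ASR loss; how the two are weighted is not stated in the claim text shown here.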
Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12400672B2 cover?
A method for training a generalized automatic speech recognition model for joint acoustic echo cancellation, speech enhancement, and voice separation includes receiving a plurality of training utterances paired with corresponding training contextual signals. The training contextual signals include a training contextual noise signal including noise prior to the corresponding training utterance, …
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/063. Mapped technology areas include Physics.
When was this patent published?
Publication date Aug 26, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).