Joint acoustic echo cancelation, speech enhancement, and voice separation for automatic speech recognition

US12119014B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12119014-B2
Application number	US-202117644108-A
Country	US
Kind code	B2
Filing date	Dec 14, 2021
Priority date	Aug 9, 2021
Publication date	Oct 15, 2024
Grant date	Oct 15, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection. Read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method for automatic speech recognition using joint acoustic echo cancellation, speech enhancement, and voice separation includes receiving, at a contextual frontend processing model, input speech features corresponding to a target utterance. The method also includes receiving, at the contextual frontend processing model, at least one of a reference audio signal, a contextual noise signal including noise prior to the target utterance, or a speaker embedding including voice characteristics of a target speaker that spoke the target utterance. The method further includes processing, using the contextual frontend processing model, the input speech features and the at least one of the reference audio signal, the contextual noise signal, or the speaker embedding vector to generate enhanced speech features.
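The abstract describes the frontend only in terms of its inputs and outputs. Claim 1 further down this page splits the processing into four stages: a primary encoder, a noise context encoder, a cross-attention encoder, and a decoder. The following is a minimal NumPy sketch of that data flow, not the patented implementation. Single dense layers with random weights stand in for the conformer blocks, and the feature size, hidden size, residual connection, and all function and parameter names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """Random projection matrix; a stand-in for trained weights (assumption)."""
    return rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(main_enc, noise_enc, w_q, w_k, w_v):
    """Queries come from the main encoding, keys and values from the noise context."""
    q = main_enc @ w_q                                  # (T_main, hidden)
    k = noise_enc @ w_k                                 # (T_noise, hidden)
    v = noise_enc @ w_v                                 # (T_noise, hidden)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (T_main, T_noise)
    return main_enc + weights @ v                       # residual add (assumption)

def contextual_frontend(speech_feats, noise_feats, params):
    """Hypothetical single-layer version of the four stages in claim 1."""
    main_enc = np.tanh(speech_feats @ params["primary"])   # main input encoding
    noise_enc = np.tanh(noise_feats @ params["noise"])     # contextual noise encoding
    attended = cross_attention(
        main_enc, noise_enc, params["w_q"], params["w_k"], params["w_v"]
    )                                                      # cross-attention embedding
    return attended @ params["decoder"]                    # enhanced speech features

FEAT_DIM, HIDDEN = 128, 64  # illustrative sizes, not taken from the patent
params = {
    "primary": linear(FEAT_DIM, HIDDEN),
    "noise": linear(FEAT_DIM, HIDDEN),
    "w_q": linear(HIDDEN, HIDDEN),
    "w_k": linear(HIDDEN, HIDDEN),
    "w_v": linear(HIDDEN, HIDDEN),
    "decoder": linear(HIDDEN, FEAT_DIM),
}

speech = rng.standard_normal((50, FEAT_DIM))  # frames of the target utterance
noise = rng.standard_normal((30, FEAT_DIM))   # frames of noise before the utterance
enhanced = contextual_frontend(speech, noise, params)
print(enhanced.shape)
```

The output has one enhanced feature vector per input speech frame. The noise context can have a different number of frames because it only enters through the attention keys and values.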

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations comprising: receiving, at a contextual frontend processing model, input speech features corresponding a target utterance and at least one of: a reference audio signal; a contextual noise signal comprising noise prior to the target utterance; or a speaker embedding vector comprising voice characteristics of a target speaker that spoke the target utterance; and processing, using the contextual frontend processing model, the input speech features and the at least one of the reference audio signal, the contextual noise signal, or the speaker embedding vector to generate enhanced speech features by: processing, using a primary encoder, the input speech features to generate a main input encoding; processing, using a noise context encoder, the contextual noise signal to generate a contextual noise encoding; processing, using a cross-attention encoder, the main input encoding and the contextual noise encoding to generate a cross-attention embedding; and decoding the cross-attention embedding into the enhanced speech features corresponding to the target utterance.

2. The computer-implemented method of claim 1, wherein the contextual frontend processing model comprises a conformer neural network architecture that combines convolution and self-attention to model short-range and long-range interactions.

3. The computer-implemented method of claim 1, wherein processing the input speech features to generate the main input encoding further comprises processing the input speech features stacked with reference features corresponding to the reference audio signal to generate the main input encoding.

4. The computer-implemented method of claim 3, wherein the input speech features and the reference features each comprise a respective sequence of log Mel-filterbank energy (LFBE) features.

5. The computer-implemented method of claim 1, wherein: processing the input speech features to generate the main input encoding comprises combining the input speech features with the speaker embedding vector using feature-wise linear modulation (FiLM) to generate the main input encoding; and processing the main input encoding and the contextual noise encoding to generate the cross-attention embedding comprises: combining the main input encoding with the speaker embedding vector using FiLM to generate a modulated main input encoding; and processing the modulated main input encoding and the contextual noise encoding to generate the cross-attention embedding.

6. The computer-implemented method of claim 1, wherein: the primary encoder comprises N modulated conformer blocks; the noise context encoder comprises N conformer blocks and executes in parallel with the primary encoder; and the cross-attention encoder comprises M modulated cross-attention conformer blocks.

7. The computer-implemented method of claim 1, wherein the data processing hardware executes the contextual frontend processing model and resides on a user device, the user device configured to: output the reference audio signal as playback audio via an audio speaker of the user device; and capture the target utterance, the reference audio signal, and the contextual noise signal via one or more microphones of the user device.

8. The computer-implemented method of claim 1, wherein contextual frontend processing model is trained jointly with a backend automatic speech recognition (ASR) model using a spectral loss and an ASR loss.

9. The computer-implemented method of claim 8, wherein the spectral loss is based on an L1 loss function and L2 loss function distance between an estimated ratio mask and an ideal ratio mask, the ideal ratio mask computed using reverberant speech and reverberant noise.

10. The computer-implemented method of claim 8, wherein the ASR loss is computed by: generating, using an ASR encoder of the ASR model configured to receive enhanced speech features predicted by the contextual frontend processing model for a training utterance as input, predicted outputs of the ASR encoder for the enhanced speech features; generating, using the ASR encoder configured to receive target speech features for the training utterance as input, target outputs of the ASR encoder for the target speech features; and computing the ASR loss based on the predicted outputs of the ASR encoder for the enhanced speech features and the target outputs of the ASR encoder for the target speech features.

11. The computer-implemented method of claim 1, wherein the operations further comprise processing, using a backend speech system, the enhanced speech features corresponding to the target utterance.

12. The computer-implemented method of claim 11, wherein the backend speech system comprises at least one of: an automatic speech recognition (ASR) model; a hotword detection model; or an audio or audio-video calling application.

13. A contextual frontend processing model comprising: a primary encoder configured to: receive, as input, input speech features corresponding to a target utterance; and generate, as output, a main input encoding; a noise context encoder configured to: receive, as input, a contextual noise signal comprising noise prior to the target utterance; and generate, as output, a contextual noise encoding; and a cross-attention encoder configured to: receive, as input, the main input encoding generated as output from the primary encoder and the contextual noise encoding generated as output from the noise context encoder; and generate, as output, a cross-attention embedding; and a decoder configured to decode the cross-attention embedding into enhanced speech features corresponding to the target utterance.

14. The contextual frontend processing model of claim 13, wherein the primary encoder is further configured to: receive, as input, reference features corresponding to a reference audio signal; and generate, as output, the main input encoding by processing the input speech features stacked with the reference features.

15. The contextual frontend processing model of claim 14, wherein the input speech features and the reference features each comprise a respective sequence of log Mel-filterbank energy (LFBE) features.

16. The contextual frontend processing model of claim 13, wherein the primary encoder is further configured to: receive, as input, a speaker embedding comprising voice characteristics of a target speaker that spoke the target utterance; and generate, as output, the main input encoding by combining the input speech features with the speaker embedding using feature-wise linear modulation (FiLM).

17. The contextual frontend processing model of claim 13, wherein the cross-attention encoder is further configured to: receive, as input, the main input encoding modulated by a speaker embedding using feature-wise linear modulation (FiLM), the speaker embedding comprising voice characteristics of a target speaker that spoke the target utterance; and process the main input encoding modulated by the speaker embedding and the contextual noise encoding to generate, as output, the cross-attention embedding.

18. The contextual frontend processing model of claim 13, wherein: the primary encoder comprises N modulated conformer blocks; the noise context encoder comprises N conformer blocks and executes in parallel with the primary encoder; and the cross-attention encoder comprises M modulated cross-attention confor…
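Two operations named in the claims are compact enough to sketch: feature-wise linear modulation (FiLM, claims 5, 16, and 17) and the L1 plus L2 distance between an estimated and an ideal ratio mask (claim 9). The Python sketch below is illustrative only. The claim text shown here does not give formulas, so the linear projections that produce the FiLM scale and shift, the energy-ratio definition of the ideal ratio mask, and the equal weighting of the L1 and L2 terms are all assumptions, as are the function names.

```python
import numpy as np

def film(features, speaker_embedding, w_gamma, w_beta):
    """FiLM: scale and shift each feature dimension using the speaker embedding.

    features:           (frames, feat_dim)
    speaker_embedding:  (emb_dim,)
    w_gamma, w_beta:    (emb_dim, feat_dim) hypothetical projection weights
    """
    gamma = speaker_embedding @ w_gamma   # per-feature scale, shape (feat_dim,)
    beta = speaker_embedding @ w_beta     # per-feature shift, shape (feat_dim,)
    return gamma * features + beta        # broadcast over all frames

def ideal_ratio_mask(speech_energy, noise_energy, eps=1e-8):
    """Assumed definition: share of each time-frequency bin's energy due to speech."""
    return speech_energy / (speech_energy + noise_energy + eps)

def spectral_loss(estimated_mask, ideal_mask):
    """Sum of mean absolute (L1) and mean squared (L2) mask error, equally weighted."""
    diff = estimated_mask - ideal_mask
    return np.mean(np.abs(diff)) + np.mean(diff ** 2)

# FiLM on a toy input: two frames, three features, two-dimensional embedding.
features = np.ones((2, 3))
embedding = np.array([1.0, 2.0])
w_gamma = np.full((2, 3), 0.5)
w_beta = np.zeros((2, 3))
modulated = film(features, embedding, w_gamma, w_beta)

# Ratio-mask loss on a single frame with two frequency bins.
speech_energy = np.array([[3.0, 1.0]])   # stands in for reverberant speech energy
noise_energy = np.array([[1.0, 1.0]])    # stands in for reverberant noise energy
irm = ideal_ratio_mask(speech_energy, noise_energy)
estimated = np.array([[0.5, 0.5]])
loss = spectral_loss(estimated, irm)

print(modulated)
print(irm, loss)
```

In the toy FiLM call the embedding yields a scale of 1.5 and a shift of 0 for every feature, so each input value of 1 becomes 1.5. Because the same scale and shift apply to every frame, the speaker embedding conditions the whole utterance rather than individual time steps.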

Assignees

Google LLC

Inventors

Classifications

H04R3/04 · for correcting frequency response
G10L2021/02082 · the noise being echo, reverberation of the speech
G10L15/063 · Training
G06N3/04 · Architecture, e.g. interconnection topology
G10L2021/02087 · the noise being separate speech, e.g. cocktail party

Patent family

Related publications grouped by family.

View patent family 79425569

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12119014B2 cover?: A method for automatic speech recognition using joint acoustic echo cancellation, speech enhancement, and voice separation includes receiving, at a contextual frontend processing model, input speech features corresponding to a target utterance. The method also includes receiving, at the contextual frontend processing model, at least one of a reference audio signal, a contextual noise signal inc…
Who is the assignee on this patent?: Google LLC
What technology area does this patent fall under?: Primary CPC classification G10L21/0216. Mapped technology areas include Physics.
When was this patent published?: Publication date Oct 15, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Related patents

Systems and methods for noise cancellation

Context-based speech enhancement

Speech Recognition Method and Apparatus, and Computer-Readable Storage Medium

System and method for acoustic echo cancelation using deep multitask recurrent neural networks

System and method for performing speech enhancement using a deep neural network-based signal

Acoustic echo cancellation and automatic speech recognition with random noise
