Speech denoising via discrete representation learning

US11875809B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11875809-B2
Application number	US-202017061317-A
Country	US
Kind code	B2
Filing date	Oct 1, 2020
Priority date	Oct 1, 2020
Publication date	Jan 16, 2024
Grant date	Jan 16, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Developed and presented herein are embodiments of a new end-to-end approach for audio denoising, from a synthesis perspective. Instead of explicitly modelling the noise component in the input signal, embodiments directly synthesize the denoised audio from a generative model (or vocoder), as in text-to-speech systems. In one or more embodiments, to generate the phonetic contents for the autoregressive generative model, it is learned via a variational autoencoder with discrete latent representations. Furthermore, in one or more embodiments, a new matching loss is presented for the denoising purpose, which is masked on when the corresponding latent codes differ. As compared against other method on test datasets, embodiments achieve competitive performance and can be trained from scratch.
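The masked matching loss mentioned in the abstract can be illustrated with a small sketch. It assumes a nearest-neighbour codebook lookup as the quantizer and a Euclidean distance summed over time steps; the function names, the toy codebook, and the choice to sum rather than average are illustrative assumptions, not details taken from the patent.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def quantize(z: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to z (assumed quantizer)."""
    return min(range(len(codebook)), key=lambda k: math.dist(z, codebook[k]))


def matching_loss(
    z_clean: Sequence[Vector],
    z_noisy: Sequence[Vector],
    codebook: Sequence[Vector],
) -> float:
    """Sum the clean/noisy latent distance over time steps whose codes differ.

    Time steps where both latents map to the same code contribute nothing,
    which is the "masked on when the latent codes differ" behaviour.
    """
    total = 0.0
    for zc, zn in zip(z_clean, z_noisy):
        if quantize(zc, codebook) != quantize(zn, codebook):
            total += math.dist(zc, zn)
    return total


# Toy example: two time steps, two codebook entries (all values hypothetical).
codebook = [(0.0, 0.0), (1.0, 1.0)]
z_clean = [(0.1, 0.0), (0.9, 1.0)]
z_noisy = [(0.2, 0.1), (0.1, 0.2)]
loss = matching_loss(z_clean, z_noisy, codebook)
```

In this toy input the first time step maps to the same code for both signals and is skipped, so only the second time step contributes to the loss.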

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for training a denoising system comprising: given a denoising system comprising a first encoder, a second encoder, a quantizer, and a decoder and given a set of one or more clean-noisy audio pairs, in which each clean-noisy audio pair comprises a clean audio of content by a speaker and a noisy audio of the content by the speaker: for each clean audio from the set of one or more clean-noisy audio pairs: generating one or more continuous latent representations for the clean audio using the first encoder; and for each continuous latent representation of the one or more continuous latent representations for the clean audio, generating a corresponding discrete clean audio representation using the quantizer; for each noisy audio from the set of one or more clean-noisy audio pairs: generating one or more continuous latent representations for the noisy audio using the second encoder; and for each continuous latent representation of the one or more continuous latent representations for the noisy audio, generating a corresponding discrete noisy audio representation using the quantizer; for each clean-noisy audio pair from the set of one or more clean-noisy audio pairs, inputting the discrete clean audio representation or representations, the clean audio, and a speaker embedding that represents the speaker of the clean-noisy audio pair into the decoder to generate an audio sequence prediction; computing a loss for the denoising system, in which the loss comprises a latent representation matching loss term that, for a time step in which the discrete clean audio representation and the discrete noisy audio representation for a clean-noisy audio pair have different values, is determined using a distance measure between the continuous latent representation of the clean audio and the continuous latent representation of the noisy audio for that time step; and updating the denoising system using the loss.

2. The computer-implemented method of claim 1 wherein the latent representation matching loss term further comprises: an annealing term that increases during training from zero or near zero to one or near one.

3. The computer-implemented method of claim 1 wherein the distance measure between the continuous latent representation of the clean audio and the continuous latent representation of the noisy audio comprises: an ℓ2 distance between the continuous latent representation of the clean audio and the continuous latent representation of the noisy audio.

4. The computer-implemented method of claim 1 wherein the loss comprises: a decoder term related to loss for the decoder; and a quantizer term related to loss for the quantizer.

5. The computer-implemented method of claim 1 wherein the quantizer comprises one or more vector-quantized variational autoencoders that convert the one or more continuous latent representations for clean audio to the corresponding one or more discrete clean audio representations and that convert the one or more continuous latent representations for noisy audio to the one or more corresponding discrete noisy audio representations.

6. The computer-implemented method of claim 1 further comprising: given one or more additional sets of one or more clean-noisy audio pairs: for each clean audio from the one or more additional sets of one or more clean-noisy audio pairs: generating one or more continuous latent representations for the clean audio using the first encoder; and for each continuous latent representation of the one or more continuous latent representations for the clean audio, generating a corresponding discrete clean audio representation using the quantizer; for each noisy audio from the one or more additional sets of one or more clean-noisy audio pairs: generating one or more continuous latent representations for the noisy audio using the second encoder; and for each continuous latent representation of the one or more continuous latent representations for the noisy audio, generating a corresponding discrete noisy audio representation using the quantizer; for each clean-noisy audio pair from the one or more additional sets of one or more clean-noisy audio pairs, inputting the discrete clean audio representation or representations, the clean audio, and a speaker embedding that represents the speaker of the clean-noisy audio pair into the decoder to generate an audio sequence prediction; computing a loss for the denoising system, in which the loss comprises a latent representation matching loss term that, for a time step in which the discrete clean audio representation and the discrete noisy audio representation for a clean-noisy audio pair have different values, is determined using a distance measure between the continuous latent representation of the clean audio and the continuous latent representation of the noisy audio for that time step; and updating the denoising system using the loss; and responsive to a stop condition being reached, outputting a trained denoising system comprising a trained second encoder, a trained quantizer, and a trained decoder.

7. The computer-implemented method of claim 6 further comprising: given an inference noisy audio for denoising and an inference speaker embedding for an inference speaker in the inference noisy audio: generating one or more continuous inference latent representations for the inference noisy audio using the trained second encoder; generating one or more discrete inference noisy audio representations using the one or more continuous inference latent representations for the inference noisy audio and the trained quantizer; and generating an inference denoised audio representation of the inference noisy audio by inputting at least some of the one or more discrete inference noisy audio representations and the inference speaker embedding that represents the inference speaker of the inference noisy audio into the trained decoder.

8. The computer-implemented method of claim 1 wherein the decoder is an autoregressive generative model that receives, a conditioner for the decoder, the discrete clean audio representations and the speaker embedding that represents the speaker of the clean audio but not the discrete noisy audio representations for the corresponding noisy audio.

9. A system comprising: one or more processors; and a non-transitory computer-readable medium or media comprising one or more sets of instructions which, when executed by at least one of the one or more processors, causes steps to be performed comprising: given a denoising system comprising a first encoder, a second encoder, a quantizer, and a decoder and given a set of one or more clean-noisy audio pairs, in which each clean-noisy audio pair comprises a clean audio of content by a speaker and a noisy audio of the content by the speaker: for each clean audio from the set of one or more clean-noisy audio pairs: generating one or more continuous latent representations for the clean audio using the first encoder; and for each continuous latent representation of the one or more continuous latent representations for the clean audio, generating a corresponding discrete clean audio representation using the quantizer; for each noisy audio from the set of one or more clean-noisy audio pairs: generating one or more continuous latent representations for the noisy audio using the second encoder; and for each continuous latent representation of the one or more continuous latent representations for the noisy audio, generating a corresponding discrete noisy audio representation using the quantizer; for each clean-noisy audio pair from the set of one or more clean-noisy audio pairs, inputting the discrete clean audi
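Claims 2 and 4 describe an annealing term that rises from near zero to near one during training, alongside separate decoder and quantizer loss terms. The sketch below shows one way such terms might be combined. The sigmoid shape of the ramp, the `steepness` parameter, the function names, and the simple additive combination are all assumptions made for illustration; the claims do not specify them.

```python
import math


def anneal(step: int, total_steps: int, steepness: float = 10.0) -> float:
    """Sigmoid ramp from near 0 at step 0 to near 1 at the final step.

    The claims only require an increase from (near) zero to (near) one;
    the sigmoid shape and steepness here are assumed.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = step / total_steps
    return 1.0 / (1.0 + math.exp(-steepness * (progress - 0.5)))


def total_loss(
    decoder_term: float,
    quantizer_term: float,
    matching_term: float,
    step: int,
    total_steps: int,
) -> float:
    """Combine the three loss terms, scaling the matching term by the ramp."""
    return decoder_term + quantizer_term + anneal(step, total_steps) * matching_term


# Hypothetical loss values at the start and end of a 1000-step run.
early = total_loss(2.0, 0.5, 1.0, step=0, total_steps=1000)
late = total_loss(2.0, 0.5, 1.0, step=1000, total_steps=1000)
```

With this schedule the matching term has little influence early in training and close to full weight by the end.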

Assignees

Baidu USA LLC

Inventors

Classifications

CPC code	CPC title
G06N3/0475	Generative networks
G06N3/0464	Convolutional networks [CNN, ConvNet]
G06N3/0455	Auto-encoder networks; Encoder-decoder networks
G06N3/09	Supervised learning
G06N3/0495	Quantised networks; Sparse networks; Compressed networks

Patent family

Related publications grouped by family.

Patent family ID: 80824782

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11875809B2 cover?
Developed and presented herein are embodiments of a new end-to-end approach for audio denoising, from a synthesis perspective. Instead of explicitly modelling the noise component in the input signal, embodiments directly synthesize the denoised audio from a generative model (or vocoder), as in text-to-speech systems. In one or more embodiments, to generate the phonetic contents for the autoregr…

Who is the assignee on this patent?
Baidu USA LLC

What technology area does this patent fall under?
Primary CPC classification G10L21/0208. Mapped technology areas include Physics.

When was this patent published?
Publication date Jan 16, 2024 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Related patents

Apparatus for noise canceling and method for the same

Systems and methods for robust speech recognition using generative adversarial networks

Generating music with deep neural networks
