Speech denoising via discrete representation learning

US11875809B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11875809-B2
Application number: US-202017061317-A
Country: US
Kind code: B2
Filing date: Oct 1, 2020
Priority date: Oct 1, 2020
Publication date: Jan 16, 2024
Grant date: Jan 16, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Developed and presented herein are embodiments of a new end-to-end approach for audio denoising, from a synthesis perspective. Instead of explicitly modelling the noise component in the input signal, embodiments directly synthesize the denoised audio from a generative model (or vocoder), as in text-to-speech systems. In one or more embodiments, to generate the phonetic contents for the autoregressive generative model, it is learned via a variational autoencoder with discrete latent representations. Furthermore, in one or more embodiments, a new matching loss is presented for the denoising purpose, which is masked on when the corresponding latent codes differ. As compared against other methods on test datasets, embodiments achieve competitive performance and can be trained from scratch.
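The "discrete latent representations" mentioned in the abstract can be illustrated with a nearest-codebook lookup, the quantization step commonly used in vector-quantized autoencoders. The following is a minimal sketch under stated assumptions, not the patented implementation: the function names `squared_l2` and `quantize`, the toy codebook, and the 2-dimensional latents are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_l2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: Sequence[Vector],
             codebook: Sequence[Vector]) -> Tuple[List[int], List[Vector]]:
    """Map each continuous latent to the index of its nearest codebook entry.

    Returns the discrete codes (one per time step) and the codebook
    vectors they select. Nearest-neighbour lookup is an assumption here.
    """
    codes = [
        min(range(len(codebook)), key=lambda k: squared_l2(z, codebook[k]))
        for z in latents
    ]
    return codes, [codebook[i] for i in codes]


# Hypothetical toy data: three codebook entries, three time steps.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
latents = [(0.1, -0.2), (0.9, 1.2), (-0.8, 0.7)]

codes, quantized = quantize(latents, codebook)
print(codes)  # [0, 1, 2]
```

In this sketch each time step of the encoder output is replaced by a single integer code, which is what allows the clean and noisy codes to be compared step by step.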

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for training a denoising system comprising: given a denoising system comprising a first encoder, a second encoder, a quantizer, and a decoder and given a set of one or more clean-noisy audio pairs, in which each clean-noisy audio pair comprises a clean audio of content by a speaker and a noisy audio of the content by the speaker: for each clean audio from the set of one or more clean-noisy audio pairs: generating one or more continuous latent representations for the clean audio using the first encoder; and for each continuous latent representation of the one or more continuous latent representations for the clean audio, generating a corresponding discrete clean audio representation using the quantizer; for each noisy audio from the set of one or more clean-noisy audio pairs: generating one or more continuous latent representations for the noisy audio using the second encoder; and for each continuous latent representation of the one or more continuous latent representations for the noisy audio, generating a corresponding discrete noisy audio representation using the quantizer; for each clean-noisy audio pair from the set of one or more clean-noisy audio pairs, inputting the discrete clean audio representation or representations, the clean audio, and a speaker embedding that represents the speaker of the clean-noisy audio pair into the decoder to generate an audio sequence prediction; computing a loss for the denoising system, in which the loss comprises a latent representation matching loss term that, for a time step in which the discrete clean audio representation and the discrete noisy audio representation for a clean-noisy audio pair have different values, is determined using a distance measure between the continuous latent representation of the clean audio and the continuous latent representation of the noisy audio for that time step; and updating the denoising system using the loss.

2. The computer-implemented method of claim 1 wherein the latent representation matching loss term further comprises: an annealing term that increases during training from zero or near zero to one or near one.

3. The computer-implemented method of claim 1 wherein the distance measure between the continuous latent representation of the clean audio and the continuous latent representation of the noisy audio comprises: an l2 distance between the continuous latent representation of the clean audio and the continuous latent representation of the noisy audio.

4. The computer-implemented method of claim 1 wherein the loss comprises: a decoder term related to loss for the decoder; and a quantizer term related to loss for the quantizer.

5. The computer-implemented method of claim 1 wherein the quantizer comprises one or more vector-quantized variational autoencoders that convert the one or more continuous latent representations for clean audio to the corresponding one or more discrete clean audio representations and that convert the one or more continuous latent representations for noisy audio to the one or more corresponding discrete noisy audio representations.

6. The computer-implemented method of claim 1 further comprising: given one or more additional sets of one or more clean-noisy audio pairs: for each clean audio from the one or more additional sets of one or more clean-noisy audio pairs: generating one or more continuous latent representations for the clean audio using the first encoder; and for each continuous latent representation of the one or more continuous latent representations for the clean audio, generating a corresponding discrete clean audio representation using the quantizer; for each noisy audio from the one or more additional sets of one or more clean-noisy audio pairs: generating one or more continuous latent representations for the noisy audio using the second encoder; and for each continuous latent representation of the one or more continuous latent representations for the noisy audio, generating a corresponding discrete noisy audio representation using the quantizer; for each clean-noisy audio pair from the one or more additional sets of one or more clean-noisy audio pairs, inputting the discrete clean audio representation or representations, the clean audio, and a speaker embedding that represents the speaker of the clean-noisy audio pair into the decoder to generate an audio sequence prediction; computing a loss for the denoising system, in which the loss comprises a latent representation matching loss term that, for a time step in which the discrete clean audio representation and the discrete noisy audio representation for a clean-noisy audio pair have different values, is determined using a distance measure between the continuous latent representation of the clean audio and the continuous latent representation of the noisy audio for that time step; and updating the denoising system using the loss; and responsive to a stop condition being reached, outputting a trained denoising system comprising a trained second encoder, a trained quantizer, and a trained decoder.

7. The computer-implemented method of claim 6 further comprising: given an inference noisy audio for denoising and an inference speaker embedding for an inference speaker in the inference noisy audio: generating one or more continuous inference latent representations for the inference noisy audio using the trained second encoder; generating one or more discrete inference noisy audio representations using the one or more continuous inference latent representations for the inference noisy audio and the trained quantizer; and generating an inference denoised audio representation of the inference noisy audio by inputting at least some of the one or more discrete inference noisy audio representations and the inference speaker embedding that represents the inference speaker of the inference noisy audio into the trained decoder.

8. The computer-implemented method of claim 1 wherein the decoder is an autoregressive generative model that receives, a conditioner for the decoder, the discrete clean audio representations and the speaker embedding that represents the speaker of the clean audio but not the discrete noisy audio representations for the corresponding noisy audio.

9. A system comprising: one or more processors; and a non-transitory computer-readable medium or media comprising one or more sets of instructions which, when executed by at least one of the one or more processors, causes steps to be performed comprising: given a denoising system comprising a first encoder, a second encoder, a quantizer, and a decoder and given a set of one or more clean-noisy audio pairs, in which each clean-noisy audio pair comprises a clean audio of content by a speaker and a noisy audio of the content by the speaker: for each clean audio from the set of one or more clean-noisy audio pairs: generating one or more continuous latent representations for the clean audio using the first encoder; and for each continuous latent representation of the one or more continuous latent representations for the clean audio, generating a corresponding discrete clean audio representation using the quantizer; for each noisy audio from the set of one or more clean-noisy audio pairs: generating one or more continuous latent representations for the noisy audio using the second encoder; and for each continuous latent representation of the one or more continuous latent representations for the noisy audio, generating a corresponding discrete noisy audio representation using the quantizer; for each clean-noisy audio pair from the set of one or more clean-noisy audio pairs, inputting the discrete clean audi…
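The latent representation matching loss term described in the claims (a distance applied only at time steps where the clean and noisy discrete codes differ, optionally scaled by an annealing term and using an l2 distance) could look roughly like the sketch below. This is an illustration under assumptions, not the claimed implementation: the names `annealing_weight` and `matching_loss`, the linear ramp, and the choice to sum over time steps without normalization are all hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def annealing_weight(step: int, total_steps: int) -> float:
    """Weight that rises from 0 to 1 over training.

    The claims only say the term increases from (near) zero to (near) one;
    the linear shape used here is an assumption.
    """
    if total_steps <= 0:
        return 1.0
    return min(1.0, max(0.0, step / total_steps))


def matching_loss(clean_latents: Sequence[Vector],
                  noisy_latents: Sequence[Vector],
                  clean_codes: Sequence[int],
                  noisy_codes: Sequence[int],
                  weight: float = 1.0) -> float:
    """Sum of l2 distances over time steps whose discrete codes differ.

    Steps where the clean and noisy codes agree contribute nothing,
    even if the continuous latents are not identical.
    """
    total = 0.0
    for z_clean, z_noisy, c_clean, c_noisy in zip(
            clean_latents, noisy_latents, clean_codes, noisy_codes):
        if c_clean != c_noisy:  # mask is "on" only for mismatched codes
            total += math.dist(z_clean, z_noisy)
    return weight * total


# Hypothetical toy pair with three time steps.
clean_latents = [(0.0, 0.0), (1.0, 1.0), (3.0, 4.0)]
noisy_latents = [(0.5, 0.5), (1.0, 1.0), (0.0, 0.0)]
clean_codes = [0, 1, 2]
noisy_codes = [0, 1, 0]  # only the last step disagrees

full = matching_loss(clean_latents, noisy_latents, clean_codes, noisy_codes)
half = matching_loss(clean_latents, noisy_latents, clean_codes, noisy_codes,
                     weight=annealing_weight(500, 1000))
print(full, half)  # 5.0 2.5
```

Note that the first time step adds nothing to the loss even though its latents differ, because both encoders already map it to the same code. In a full training loop this term would be combined with the decoder and quantizer loss terms that claim 4 mentions, which are omitted here.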

Assignees

Baidu USA LLC

Classifications

  • Generative networks

  • Convolutional networks [CNN, ConvNet]

  • Auto-encoder networks; Encoder-decoder networks

  • Supervised learning

  • Quantised networks; Sparse networks; Compressed networks

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11875809B2 cover?
Developed and presented herein are embodiments of a new end-to-end approach for audio denoising, from a synthesis perspective. Instead of explicitly modelling the noise component in the input signal, embodiments directly synthesize the denoised audio from a generative model (or vocoder), as in text-to-speech systems. In one or more embodiments, to generate the phonetic contents for the autoregr…

Who is the assignee on this patent?
Baidu USA LLC

What technology area does this patent fall under?
Primary CPC classification G10L21/0208. Mapped technology areas include Physics.

When was this patent published?
Publication date Jan 16, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).