Systems and methods for noise reduction
US-11727926-B1 · Aug 15, 2023 · US
US12190896B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12190896-B2 |
| Application number | US-202217856292-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jul 1, 2022 |
| Priority date | Jul 2, 2021 |
| Publication date | Jan 7, 2025 |
| Grant date | Jan 7, 2025 |
Official abstract text for this publication.
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for processing an input audio waveform using a generator neural network to generate an output audio waveform. In one aspect, a method comprises: receiving an input audio waveform; processing the input audio waveform using an encoder neural network to generate a set of feature vectors representing the input audio waveform; and processing the set of feature vectors representing the input audio waveform using a decoder neural network to generate an output audio waveform that comprises a respective output audio sample for each of a plurality of output time steps.
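The encoder/decoder pipeline in the abstract can be illustrated with a toy, single-channel sketch. It assumes 1-D strided convolutions for down-sampling and strided transposed convolutions for up-sampling, which the claims list as one option. The function names, kernel values, strides, and block count are all hypothetical, and real feature vectors would carry multiple channels and learned parameters.

```python
# Illustrative sketch only: names, kernels, strides, and block counts are
# assumptions for demonstration, not values taken from the patent.

def strided_conv1d(x, kernel, stride):
    """Valid 1-D strided convolution (cross-correlation) over a list of floats."""
    k = len(kernel)
    return [
        sum(x[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(x) - k + 1, stride)
    ]


def strided_transposed_conv1d(x, kernel, stride):
    """1-D strided transposed convolution: each input scatters a scaled kernel."""
    k = len(kernel)
    out = [0.0] * ((len(x) - 1) * stride + k)
    for i, v in enumerate(x):
        for j in range(k):
            out[i * stride + j] += v * kernel[j]
    return out


def encode(waveform, num_blocks=3):
    """Sequence of encoder blocks; each block halves the number of time steps."""
    feats = list(waveform)
    for _ in range(num_blocks):
        feats = strided_conv1d(feats, [0.5, 0.5], stride=2)
    return feats


def decode(feats, num_blocks=3):
    """Sequence of decoder blocks; each block doubles the number of time steps."""
    out = list(feats)
    for _ in range(num_blocks):
        out = strided_transposed_conv1d(out, [1.0, 1.0], stride=2)
    return out


waveform = [float(t % 4) for t in range(16)]  # 16 input time steps
features = encode(waveform)                   # compact representation
restored = decode(features)                   # back to 16 output time steps
print(len(waveform), len(features), len(restored))  # 16 2 16
```

With three stride-2 blocks on each side, 16 input samples shrink to 2 feature values and expand back to 16 output samples. The sketch leaves out the conditioning vector and the skip connections between corresponding encoder and decoder blocks.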
Opening claim text (preview).
What is claimed is:
1. A method performed by one or more computers, the method comprising: receiving an input audio waveform that comprises a respective input audio sample for each of a plurality of input time steps; processing the input audio waveform using an encoder neural network to generate a set of feature vectors representing the input audio waveform, wherein the encoder neural network comprises a sequence of encoder blocks that are each configured to: process a respective set of input feature vectors in accordance with a set of encoder block parameters to generate a set of output feature vectors, comprising down-sampling the set of input feature vectors; and processing the set of feature vectors representing the input audio waveform using a decoder neural network to generate an output audio waveform that comprises a respective output audio sample for each of a plurality of output time steps, wherein the decoder neural network comprises a sequence of decoder blocks that are each configured to: process a respective set of input feature vectors in accordance with a set of decoder block parameters to generate a set of output feature vectors, comprising up-sampling the set of input feature vectors; wherein the output audio waveform represents a version of the input audio waveform that has been filtered to include only audio from a target audio source; and wherein the encoder neural network, the decoder neural network, or both additionally process a conditioning vector representing the target audio source.
2. The method of claim 1, wherein each encoder block in the sequence of encoder blocks down-samples the set of input feature vectors to the encoder block using a respective strided convolution operation.
3. The method of claim 2, wherein for each encoder block in the sequence of encoder blocks, the strided convolution operation is a one-dimensional or two-dimensional strided convolution operation.
4. The method of claim 1, wherein for each encoder block in the sequence of encoder blocks, a dimensionality of the output feature vectors generated by the encoder block is higher than a dimensionality of the input feature vectors processed by the encoder block.
5. The method of claim 1, wherein each decoder block in the sequence of decoder blocks up-samples the set of input feature vectors to the decoder block using a respective strided transposed convolution operation.
6. The method of claim 5, wherein for each decoder block in the sequence of decoder blocks, the strided transposed convolution operation is a one-dimensional or two-dimensional strided transposed convolution operation.
7. The method of claim 1, wherein for each decoder block in the sequence of decoder blocks, a dimensionality of the output feature vectors generated by the decoder block is lower than a dimensionality of the input feature vectors processed by the decoder block.
8. The method of claim 1, wherein for each encoder block that is after a first encoder block in the sequence of encoder blocks, the set of input feature vectors to the encoder block comprises a set of output feature vectors generated by a preceding encoder block in the sequence of encoder blocks.
9. The method of claim 1, wherein for each decoder block that is after a first decoder block in the sequence of decoder blocks, the set of input feature vectors to the decoder block comprises: (i) a set of output feature vectors of a corresponding encoder block, and (ii) a set of output feature vectors generated by a preceding decoder block in the sequence of decoder blocks.
10. The method of claim 1, wherein the encoder neural network comprises a transform layer prior to the sequence of encoder blocks, wherein the transform layer maps the input audio waveform to an alternative representation in an alternative domain.
11. The method of claim 10, wherein the transform layer maps the input audio waveform to an alternative representation in a time-frequency domain.
12. The method of claim 11, wherein the transform layer implements a Fourier transform operation.
13. The method of claim 10, wherein the decoder neural network comprises an inverse transform layer after the sequence of decoder blocks, wherein the inverse transform layer maps a representation of the output audio waveform in the alternative domain to a representation of the audio waveform in a time domain.
14. The method of claim 13, wherein the inverse transform layer implements an inverse Fourier transform operation.
15. The method of claim 1, wherein the encoder neural network and the decoder neural network are jointly trained, and the training comprises: obtaining a plurality of training examples that each include: (i) a respective input audio waveform, and (ii) a corresponding target audio waveform; processing the respective input audio waveform from each training example using the encoder neural network followed by the decoder neural network to generate an output audio waveform that is an estimate of the corresponding target audio waveform; determining gradients of an objective function that depends on the respective output waveform and respective target waveform for each training example; and using the gradients of the objective function to update a set of encoder neural network parameters and a set of decoder neural network parameters.
16. The method of claim 15, wherein the training further comprises, for each training example: processing data derived from the output audio waveform using a discriminator neural network to generate a set of one or more discriminator scores, wherein each discriminator score characterizes an estimated likelihood that the output audio waveform is an audio waveform that was generated using the encoder neural network and the decoder neural network; wherein the objective function comprises an adversarial loss that depends on the discriminator scores generated by the discriminator neural network.
17. The method of claim 16, wherein the data derived from the output audio waveform comprises the output audio waveform, a down-sampled version of the output audio waveform, or a Fourier-transformed version of the output audio waveform.
18. The method of claim 16, wherein the training further comprises, for each training example: generating a respective set of discriminator scores using each of a plurality of discriminator neural networks, wherein each discriminator neural network processes a respective version of the output audio waveform that has been down-sampled by a respective factor; wherein the adversarial loss depends on the discriminator scores generated by the plurality of discriminator neural networks.
19. The method of claim 16, wherein the discriminator neural network is trained to generate discriminator scores that distinguish between: (i) output audio waveforms generated using the encoder neural network and the decoder neural network, and (ii) target audio waveforms from training examples.
20. The method of claim 16, wherein the discriminator neural network is a convolutional neural network, and wherein a number of discriminator scores in the set of discriminator scores generated by the discriminator neural network is proportional to a length of the output audio waveform.
21. The method of claim 16, wherein the objective function comprises a reconstruction loss that, for each training example, measures an error between: (i) the output audio waveform, and (ii) the corresponding target audio waveform.
22. The method of c
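The training objective described in claims 15 to 21 combines a reconstruction loss with an adversarial loss built from discriminator scores at several down-sampling factors. The following is a minimal sketch under stated assumptions: the L1 error, the log-based adversarial term, the average-pool down-sampling, the weight `adv_weight`, and the placeholder `toy_discriminator` are all hypothetical choices for illustration, since the claims do not fix them. Gradient computation and parameter updates are omitted.

```python
import math

# Illustrative sketch only: loss forms, weights, and the toy discriminator
# are assumptions, not details stated in the claims.

def l1_reconstruction_loss(output, target):
    """Mean absolute error between output and target waveforms."""
    return sum(abs(o - t) for o, t in zip(output, target)) / len(target)


def downsample(waveform, factor):
    """Average-pool by `factor` (one assumed way to down-sample)."""
    return [
        sum(waveform[i:i + factor]) / factor
        for i in range(0, len(waveform) - factor + 1, factor)
    ]


def adversarial_loss(scores):
    """Generator-side loss: penalise scores near 1 ('looks generated')."""
    eps = 1e-7
    return -sum(math.log(1.0 - min(s, 1.0 - eps)) for s in scores) / len(scores)


def generator_objective(output, target, discriminators, factors, adv_weight=0.1):
    """Reconstruction loss plus weighted adversarial loss summed over scales."""
    recon = l1_reconstruction_loss(output, target)
    adv = 0.0
    for disc, factor in zip(discriminators, factors):
        adv += adversarial_loss(disc(downsample(output, factor)))
    return recon + adv_weight * adv


def toy_discriminator(waveform):
    """Placeholder: one score per sample, so score count scales with length."""
    return [0.5 for _ in waveform]


output = [0.0, 1.0, 0.0, 1.0]
target = [0.0, 0.5, 0.0, 0.5]
loss = generator_objective(
    output, target,
    discriminators=[toy_discriminator, toy_discriminator],
    factors=[1, 2],
)
print(round(loss, 4))  # 0.3886
```

In a real system the discriminators would be trained networks, and an automatic-differentiation framework would compute gradients of this objective with respect to the encoder and decoder parameters.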
Combinations of networks · CPC title
Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Generative networks · CPC title
Auto-encoder networks; Encoder-decoder networks · CPC title