Spectrogram to waveform synthesis using convolutional networks

US11462209B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11462209-B2
Application number: US-201916365673-A
Country: US
Kind code: B2
Filing date: Mar 27, 2019
Priority date: May 18, 2018
Publication date: Oct 4, 2022
Grant date: Oct 4, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

For the problem of waveform synthesis from spectrograms, presented herein are embodiments of an efficient neural network architecture, based on transposed convolutions to achieve a high compute intensity and fast inference. In one or more embodiments, for training of the convolutional vocoder architecture, losses are used that are related to perceptual audio quality, as well as a GAN framework to guide with a critic that discerns unrealistic waveforms. While yielding a high-quality audio, embodiments of the model can achieve more than 500 times faster than real-time audio synthesis. Multi-head convolutional neural network (MCNN) embodiments for waveform synthesis from spectrograms are also disclosed. MCNN embodiments enable significantly better utilization of modern multi-core processors than commonly-used iterative algorithms like Griffin-Lim and yield very fast (more than 300× real-time) waveform synthesis. Embodiments herein yield high-quality speech synthesis, without any iterative algorithms or autoregression in computations.
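To make the architecture in the abstract more concrete, the following is a minimal sketch of a multi-head stack of 1-D transposed convolutions that upsamples spectrogram frames to waveform samples. It is an illustration only, not the patented implementation. The function names, the constant toy weights, the use of ELU as the nonlinearity, and the form of the scaled softsign (`a * v / (1 + |b * v|)`) are all assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple

Signal = List[List[float]]        # [channels][time]
Weight = List[List[List[float]]]  # [c_in][c_out][kernel]


def transposed_conv1d(x: Signal, weight: Weight, stride: int) -> Signal:
    """Transposed 1-D convolution, no padding.

    Output length is (t - 1) * stride + kernel, so time is upsampled.
    """
    c_out, k = len(weight[0]), len(weight[0][0])
    t = len(x[0])
    y = [[0.0] * ((t - 1) * stride + k) for _ in range(c_out)]
    for ci, row in enumerate(x):
        for co in range(c_out):
            for n, value in enumerate(row):
                for j in range(k):
                    # Each input sample scatters a scaled copy of the kernel.
                    y[co][n * stride + j] += value * weight[ci][co][j]
    return y


def elu(v: float) -> float:
    # Assumed nonlinearity; the page only says "a nonlinearity operation".
    return v if v > 0 else math.exp(v) - 1.0


def run_head(spec: Signal, layers: Sequence[Tuple[Weight, int]]) -> List[float]:
    """One head: transposed convolutions separated by a nonlinearity."""
    x = spec
    for i, (weight, stride) in enumerate(layers):
        x = transposed_conv1d(x, weight, stride)
        if i < len(layers) - 1:  # nonlinearity only between layers
            x = [[elu(v) for v in channel] for channel in x]
    assert len(x) == 1, "the last layer must reduce to one channel"
    return x[0]


def scaled_softsign(v: float, a: float = 1.0, b: float = 1.0) -> float:
    # Assumed form; output magnitude stays below a / |b|.
    return a * v / (1.0 + abs(b * v))


def synthesize(spec: Signal, heads, head_weights: Sequence[float]) -> List[float]:
    """Weighted sum of head outputs followed by the scaling function."""
    outputs = [run_head(spec, layers) for layers in heads]
    mixed = [
        sum(w * out[n] for w, out in zip(head_weights, outputs))
        for n in range(len(outputs[0]))
    ]
    return [scaled_softsign(v) for v in mixed]


def constant_weight(c_in: int, c_out: int, k: int, value: float) -> Weight:
    # Hypothetical helper: toy weights instead of trained parameters.
    return [[[value] * k for _ in range(c_out)] for _ in range(c_in)]


# Toy input: 4 frequency channels, 3 time frames.
spec = [[0.1 * (c + 1)] * 3 for c in range(4)]
# Two heads with different parameters; each reduces 4 -> 2 -> 1 channels.
heads = [
    [(constant_weight(4, 2, 2, 0.5), 2), (constant_weight(2, 1, 2, 0.5), 2)],
    [(constant_weight(4, 2, 2, -0.25), 2), (constant_weight(2, 1, 2, 0.75), 2)],
]
waveform = synthesize(spec, heads, head_weights=[0.6, 0.4])
print(len(waveform))  # 3 frames become 12 samples after two stride-2 layers
```

Because every output sample is computed in a single feed-forward pass, with no iteration or autoregression, the work parallelizes across time steps and heads. That is consistent with the speed-ups the abstract reports, although this pure-Python version is written for clarity rather than performance.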

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for training a neural network model for spectrogram inversion comprising: inputting an input spectrogram comprising a number of frequency channels into a convolution neural network (CNN) comprising at least one head, in which a head comprises a set of transposed convolution layers in which each transposed convolution layer is separated by a nonlinearity operation and the set of transposed convolution layers reduces the number of frequency channels of the input spectrogram to one channel after a last transposed convolution layer of the set of transposed convolution layers; outputting from the CNN a synthesized waveform for the input spectrogram, the input spectrogram having a corresponding ground truth waveform; using the corresponding ground truth waveform, the synthesized waveform, and a loss function comprising at least one or more loss components selected from spectral convergence loss and log-scale short-time Fourier transform (STFT)-magnitude loss to obtain a loss for the CNN; and using the loss to update the CNN.

2. The computer-implemented method of claim 1 wherein the CNN comprises: a plurality of heads, in which each head receives the input spectrogram and comprises a set of transposed convolution layers in which each transposed convolution layer is separated by a nonlinearity operation and the set of transposed convolution layers reduces the number of frequency channels of the input spectrogram to one channel after a last transposed convolution layer of the set of transposed convolution layers.

3. The computer-implemented method of claim 2 wherein the heads of the plurality of heads are initialized with at least some different parameters to allow the CNN to focus on different portions of a waveform related to the input spectrogram during training.

4. The computer-implemented method of claim 2 wherein each head of the plurality of heads generates a head output waveform from the input spectrogram, the method further comprising: obtaining a combination of the head output waveforms, in which the head output waveforms or combined in a weighed combination using a trainable weight value for each head output waveform.

5. The computer-implemented method of claim 1 further comprising: applying a scaled softsign function to the weighted combination to obtain a final output waveform.

6. The computer-implemented method of claim 1 wherein the CNN further includes generative adversarial network (GAN) and the loss function further comprises a GAN loss component.

7. The computer-implemented method of claim 1 wherein the loss function further comprises one or more additional loss terms selected from instantaneous frequency loss, weighted phase loss, and waveform envelope loss.

8. The computer-implemented method of claim 1 wherein the CNN is trained with a large-scale multi-speaker dataset that produces a trained CNN for synthesizing a waveform for a speaker who was not included in the large-scale multi-speaker dataset or trained with a single speaker dataset that produces a trained CNN for synthesizing a waveform for the single speaker.

9. A computer-implemented method for using a trained convolutional neural network to generate a waveform from a spectrogram, the method comprising: inputting an input spectrogram comprising a number of frequency channels into a trained convolution neural network (CNN) comprising at least one head, in which a head comprises a set of transposed convolution layers in which each transposed convolution layer is separated by a nonlinearity operation and the set of transposed convolution layers reduces the number of frequency channels of the input spectrogram to one channel after a last transposed convolution layer of the set of transposed convolution layers and the head outputs an output waveform; applying a scaling function to the output waveform to obtain a final synthesized waveform; and outputting the final synthesized waveform corresponding to the input spectrogram.

10. The computer-implemented method of claim 9 wherein the trained CNN comprises: a plurality of heads, in which each head receives the input spectrogram, outputs a head output waveform, and comprises a set of transposed convolution layers in which each transposed convolution layer is separated by a nonlinearity operation and the set of transposed convolution layers reduces the number of frequency channels of the input spectrogram to one channel after a last transposed convolution layer of the set of transposed convolution layers.

11. The computer-implemented method of claim 10 wherein the output waveform is obtained by performing the step comprising: combining the head output waveforms of the plurality of heads into the output waveform using a weighed combination of the head output waveforms wherein a head's output waveform is weighted using a trained weight value for that head.

12. The computer-implemented method of claim 9 further comprising wherein the scaling function is a scaled softsign function.

13. The computer-implemented method of claim 9 wherein the trained CNN was trained using a loss function comprising at least one or more loss components selected from spectral convergence loss, log-scale short-time Fourier transform (STFT)-magnitude loss, instantaneous frequency loss, weighted phase loss, and waveform envelope loss.

14. The computer-implemented method of claim 13 wherein the trained CNN was trained using a generative adversarial network (GAN) and the loss function further comprised a GAN loss component.

15. The computer-implemented method of claim 9 further comprising converting the input spectrogram from a mel-spectrogram.

16. A non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by at least one processor, causes steps to be performed comprising: inputting an input spectrogram comprising a number of frequency channels into a trained convolution neural network (CNN) comprising at least one head, in which a head comprises a set of transposed convolution layers in which each transposed convolution layer is separated by a nonlinearity operation and the set of transposed convolution layers reduces the number of frequency channels of the input spectrogram to one channel after a last transposed convolution layer of the set of transposed convolution layers and the head outputs an output waveform; applying a scaling function to the output waveform to obtain a final synthesized waveform; and outputting the final synthesized waveform corresponding to the input spectrogram.

17. The non-transitory computer-readable medium or media of claim 16 wherein the trained CNN comprises: a plurality of heads, in which each head receives the input spectrogram, outputs a head output waveform, and comprises a set of transposed convolution layers in which each transposed convolution layer is separated by a nonlinearity operation and the set of transposed convolution layers reduces the number of frequency channels of the input spectrogram to one channel after a last transposed convolution layer of the set of transposed convolution layers.

18. The non-transitory computer-readable medium or media of claim 17 wherein the output waveform is obtained by performing the step comprising: combining the head output waveforms of the plurality of heads into the output waveform using a weighed combination of the head output waveforms wherein a head's output waveform is weighted using a trained weight value for that head.

19. The n
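Claim 1 names two loss components, spectral convergence loss and log-scale STFT-magnitude loss, without giving formulas on this page. The sketch below assumes one common formulation: the Frobenius norm of the magnitude difference divided by the Frobenius norm of the target magnitudes, and the L1 norm of the difference of log magnitudes. The naive Hann-windowed STFT, the frame and hop sizes, the `eps` constant, and all function names are illustrative assumptions, not details taken from the patent.

```python
import cmath
import math
from typing import List

Magnitudes = List[List[float]]  # [frames][frequency bins]


def stft_magnitude(signal: List[float], frame_len: int = 8, hop: int = 4) -> Magnitudes:
    """Naive STFT magnitudes using a Hann window and a direct DFT."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        frames.append([
            abs(sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len)))
            for k in range(frame_len // 2 + 1)
        ])
    return frames


def spectral_convergence(target: Magnitudes, estimate: Magnitudes) -> float:
    """Frobenius norm of the difference, relative to the target's norm."""
    num = math.sqrt(sum((t - e) ** 2
                        for tr, er in zip(target, estimate)
                        for t, e in zip(tr, er)))
    den = math.sqrt(sum(t ** 2 for tr in target for t in tr))
    return num / den


def log_stft_magnitude(target: Magnitudes, estimate: Magnitudes,
                       eps: float = 1e-6) -> float:
    """L1 norm of the difference of log magnitudes (eps avoids log(0))."""
    return sum(abs(math.log(t + eps) - math.log(e + eps))
               for tr, er in zip(target, estimate)
               for t, e in zip(tr, er))


# Toy ground-truth waveform and a "synthesized" waveform at half amplitude.
truth = [math.sin(2 * math.pi * 3 * n / 32) for n in range(32)]
estimate = [0.5 * v for v in truth]

mag_true = stft_magnitude(truth)
mag_half = stft_magnitude(estimate)

# Combined loss as a plain sum; any weighting between terms is an assumption.
loss = spectral_convergence(mag_true, mag_half) + log_stft_magnitude(mag_true, mag_half)
print(loss)
```

Both terms are zero when the synthesized magnitudes match the ground truth. The spectral convergence term is dominated by large spectral components, while the log-scale term is more sensitive to small-amplitude components, which is a plausible reason for listing them together. In training, the resulting loss would be differentiated with respect to the network parameters to perform the update that claim 1 describes; that step is omitted here.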

Assignees

Baidu USA LLC

Inventors

Classifications

  • of input or preprocessed data · CPC title

  • Classification; Matching · CPC title

  • Probabilistic or stochastic networks · CPC title

  • Recurrent networks, e.g. Hopfield networks · CPC title

  • Probabilistic graphical models, e.g. probabilistic networks · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11462209B2 cover?
For the problem of waveform synthesis from spectrograms, presented herein are embodiments of an efficient neural network architecture, based on transposed convolutions to achieve a high compute intensity and fast inference. In one or more embodiments, for training of the convolutional vocoder architecture, losses are used that are related to perceptual audio quality, as well as a GAN framework …
Who is the assignee on this patent?
Baidu USA LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/063. Mapped technology areas include Physics.
When was this patent published?
Publication date Oct 4, 2022 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).