Apparatus and method that detect wheel alignment condition
US-2019325290-A1 · Oct 24, 2019 · US
US11462209B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11462209-B2 |
| Application number | US-201916365673-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 27, 2019 |
| Priority date | May 18, 2018 |
| Publication date | Oct 4, 2022 |
| Grant date | Oct 4, 2022 |
Official abstract text for this publication.
For the problem of waveform synthesis from spectrograms, presented herein are embodiments of an efficient neural network architecture based on transposed convolutions, which achieve high compute intensity and fast inference. In one or more embodiments, the convolutional vocoder architecture is trained with losses related to perceptual audio quality, as well as a GAN framework in which a critic that discerns unrealistic waveforms guides training. While yielding high-quality audio, embodiments of the model can synthesize audio more than 500 times faster than real time. Multi-head convolutional neural network (MCNN) embodiments for waveform synthesis from spectrograms are also disclosed. MCNN embodiments enable significantly better utilization of modern multi-core processors than commonly used iterative algorithms like Griffin-Lim and yield very fast (more than 300× real-time) waveform synthesis. Embodiments herein yield high-quality speech synthesis without any iterative algorithms or autoregression in computations.
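The architecture named in the abstract (transposed convolutions, several heads, a combined output) can be illustrated with a small standard-library sketch. This is not the patented implementation: the ELU nonlinearity, stride of 2, kernel width of 4, channel plan `[8, 4, 1]`, random weights, and the softsign form `a*x / (1 + |b*x|)` with `a = b = 1` are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import math
import random


def transposed_conv1d(x, weight, stride):
    """Transposed 1-D convolution without padding.

    x: [c_in][t], weight: [c_in][c_out][k].
    Returns [c_out][(t - 1) * stride + k].
    """
    c_in, t = len(x), len(x[0])
    c_out, k = len(weight[0]), len(weight[0][0])
    out_len = (t - 1) * stride + k
    y = [[0.0] * out_len for _ in range(c_out)]
    for ci in range(c_in):
        for co in range(c_out):
            for i in range(t):
                for j in range(k):
                    # Each input sample scatters a scaled copy of the kernel.
                    y[co][i * stride + j] += x[ci][i] * weight[ci][co][j]
    return y


def elu(v):
    # Assumed nonlinearity placed between transposed convolution layers.
    return v if v > 0 else math.exp(v) - 1.0


def head_forward(spec, layers, stride):
    """Run one head: upsample in time while reducing channels to one."""
    h = spec
    for idx, w in enumerate(layers):
        h = transposed_conv1d(h, w, stride)
        if idx < len(layers) - 1:
            h = [[elu(v) for v in row] for row in h]
    return h[0]  # the last layer leaves a single channel


def scaled_softsign(v, a=1.0, b=1.0):
    # Assumed form; a and b would be trainable scalars in a real model.
    return a * v / (1.0 + abs(b * v))


def mcnn_forward(spec, heads, head_weights, stride=2):
    """Weighted sum of the head outputs followed by a scaled softsign."""
    outs = [head_forward(spec, layers, stride) for layers in heads]
    mixed = [sum(w * o[i] for w, o in zip(head_weights, outs))
             for i in range(len(outs[0]))]
    return [scaled_softsign(v) for v in mixed]


def random_weights(rng, c_in, c_out, k):
    return [[[rng.uniform(-0.1, 0.1) for _ in range(k)]
             for _ in range(c_out)] for _ in range(c_in)]


rng = random.Random(0)
spec = [[rng.random() for _ in range(5)] for _ in range(8)]  # 8 channels, 5 frames
channel_plan = [8, 4, 1]  # hypothetical: channels shrink to one
heads = [[random_weights(rng, channel_plan[i], channel_plan[i + 1], 4)
          for i in range(len(channel_plan) - 1)] for _ in range(2)]
waveform = mcnn_forward(spec, heads, head_weights=[0.5, 0.5])
print(len(waveform))
```

Because no layer depends on previously generated samples, all output samples can be computed in parallel, which is the property the abstract contrasts with iterative and autoregressive methods. Production implementations would normally use a tensor library and padding so that each layer upsamples by exactly the stride.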
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method for training a neural network model for spectrogram inversion comprising: inputting an input spectrogram comprising a number of frequency channels into a convolution neural network (CNN) comprising at least one head, in which a head comprises a set of transposed convolution layers in which each transposed convolution layer is separated by a nonlinearity operation and the set of transposed convolution layers reduces the number of frequency channels of the input spectrogram to one channel after a last transposed convolution layer of the set of transposed convolution layers; outputting from the CNN a synthesized waveform for the input spectrogram, the input spectrogram having a corresponding ground truth waveform; using the corresponding ground truth waveform, the synthesized waveform, and a loss function comprising at least one or more loss components selected from spectral convergence loss and log-scale short-time Fourier transform (STFT)-magnitude loss to obtain a loss for the CNN; and using the loss to update the CNN.
2. The computer-implemented method of claim 1 wherein the CNN comprises: a plurality of heads, in which each head receives the input spectrogram and comprises a set of transposed convolution layers in which each transposed convolution layer is separated by a nonlinearity operation and the set of transposed convolution layers reduces the number of frequency channels of the input spectrogram to one channel after a last transposed convolution layer of the set of transposed convolution layers.
3. The computer-implemented method of claim 2 wherein the heads of the plurality of heads are initialized with at least some different parameters to allow the CNN to focus on different portions of a waveform related to the input spectrogram during training.
4. The computer-implemented method of claim 2 wherein each head of the plurality of heads generates a head output waveform from the input spectrogram, the method further comprising: obtaining a combination of the head output waveforms, in which the head output waveforms are combined in a weighted combination using a trainable weight value for each head output waveform.
5. The computer-implemented method of claim 1 further comprising: applying a scaled softsign function to the weighted combination to obtain a final output waveform.
6. The computer-implemented method of claim 1 wherein the CNN further includes generative adversarial network (GAN) and the loss function further comprises a GAN loss component.
7. The computer-implemented method of claim 1 wherein the loss function further comprises one or more additional loss terms selected from instantaneous frequency loss, weighted phase loss, and waveform envelope loss.
8. The computer-implemented method of claim 1 wherein the CNN is trained with a large-scale multi-speaker dataset that produces a trained CNN for synthesizing a waveform for a speaker who was not included in the large-scale multi-speaker dataset or trained with a single speaker dataset that produces a trained CNN for synthesizing a waveform for the single speaker.
9. A computer-implemented method for using a trained convolutional neural network to generate a waveform from a spectrogram, the method comprising: inputting an input spectrogram comprising a number of frequency channels into a trained convolution neural network (CNN) comprising at least one head, in which a head comprises a set of transposed convolution layers in which each transposed convolution layer is separated by a nonlinearity operation and the set of transposed convolution layers reduces the number of frequency channels of the input spectrogram to one channel after a last transposed convolution layer of the set of transposed convolution layers and the head outputs an output waveform; applying a scaling function to the output waveform to obtain a final synthesized waveform; and outputting the final synthesized waveform corresponding to the input spectrogram.
10. The computer-implemented method of claim 9 wherein the trained CNN comprises: a plurality of heads, in which each head receives the input spectrogram, outputs a head output waveform, and comprises a set of transposed convolution layers in which each transposed convolution layer is separated by a nonlinearity operation and the set of transposed convolution layers reduces the number of frequency channels of the input spectrogram to one channel after a last transposed convolution layer of the set of transposed convolution layers.
11. The computer-implemented method of claim 10 wherein the output waveform is obtained by performing the step comprising: combining the head output waveforms of the plurality of heads into the output waveform using a weighted combination of the head output waveforms wherein a head's output waveform is weighted using a trained weight value for that head.
12. The computer-implemented method of claim 9 further comprising wherein the scaling function is a scaled softsign function.
13. The computer-implemented method of claim 9 wherein the trained CNN was trained using a loss function comprising at least one or more loss components selected from spectral convergence loss, log-scale short-time Fourier transform (STFT)-magnitude loss, instantaneous frequency loss, weighted phase loss, and waveform envelope loss.
14. The computer-implemented method of claim 13 wherein the trained CNN was trained using a generative adversarial network (GAN) and the loss function further comprised a GAN loss component.
15. The computer-implemented method of claim 9 further comprising converting the input spectrogram from a mel-spectrogram.
16. A non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by at least one processor, causes steps to be performed comprising: inputting an input spectrogram comprising a number of frequency channels into a trained convolution neural network (CNN) comprising at least one head, in which a head comprises a set of transposed convolution layers in which each transposed convolution layer is separated by a nonlinearity operation and the set of transposed convolution layers reduces the number of frequency channels of the input spectrogram to one channel after a last transposed convolution layer of the set of transposed convolution layers and the head outputs an output waveform; applying a scaling function to the output waveform to obtain a final synthesized waveform; and outputting the final synthesized waveform corresponding to the input spectrogram.
17. The non-transitory computer-readable medium or media of claim 16 wherein the trained CNN comprises: a plurality of heads, in which each head receives the input spectrogram, outputs a head output waveform, and comprises a set of transposed convolution layers in which each transposed convolution layer is separated by a nonlinearity operation and the set of transposed convolution layers reduces the number of frequency channels of the input spectrogram to one channel after a last transposed convolution layer of the set of transposed convolution layers.
18. The non-transitory computer-readable medium or media of claim 17 wherein the output waveform is obtained by performing the step comprising: combining the head output waveforms of the plurality of heads into the output waveform using a weighted combination of the head output waveforms wherein a head's output waveform is weighted using a trained weight value for that head.
19. The n
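The claims name spectral convergence loss and log-scale STFT-magnitude loss without giving formulas in this preview. The sketch below uses one common formulation (a Frobenius-norm ratio of magnitude differences, and a mean absolute difference of log magnitudes); that formulation, the Hann window, the frame length of 16, the hop of 4, the `eps` value, and all function names are assumptions for illustration, not text from the patent.

```python
import cmath
import math


def stft_magnitude(signal, frame_len=16, hop=4):
    """Magnitude STFT via a naive DFT over Hann-windowed frames."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        mags = []
        for k in range(frame_len // 2 + 1):  # non-negative frequencies only
            acc = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            mags.append(abs(acc))
        frames.append(mags)
    return frames


def spectral_convergence(ref_mag, est_mag):
    """||ref - est||_F / ||ref||_F over all time-frequency bins."""
    num = math.sqrt(sum((r - e) ** 2
                        for rf, ef in zip(ref_mag, est_mag)
                        for r, e in zip(rf, ef)))
    den = math.sqrt(sum(r ** 2 for rf in ref_mag for r in rf))
    return num / den


def log_stft_magnitude_loss(ref_mag, est_mag, eps=1e-6):
    """Mean absolute difference of log magnitudes; eps avoids log(0)."""
    diffs = [abs(math.log(r + eps) - math.log(e + eps))
             for rf, ef in zip(ref_mag, est_mag)
             for r, e in zip(rf, ef)]
    return sum(diffs) / len(diffs)


# Ground-truth waveform and a deliberately wrong estimate (half amplitude).
truth = [math.sin(2 * math.pi * 3 * n / 64) for n in range(64)]
estimate = [0.5 * v for v in truth]

ref_mag = stft_magnitude(truth)
half_mag = stft_magnitude(estimate)

# Hypothetical equal weighting of the two loss components.
loss = spectral_convergence(ref_mag, half_mag) + log_stft_magnitude_loss(ref_mag, half_mag)
print(loss)
```

In this formulation the spectral convergence term is dominated by high-energy bins, while the log-magnitude term gives low-energy bins comparable weight, so the two components are complementary. A training loop would differentiate the combined loss with respect to the CNN parameters, which requires an automatic differentiation library rather than plain Python.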
of input or preprocessed data · CPC title
Classification; Matching · CPC title
Probabilistic or stochastic networks · CPC title
Recurrent networks, e.g. Hopfield networks · CPC title
Probabilistic graphical models, e.g. probabilistic networks · CPC title