Spectrogram to waveform synthesis using convolutional networks
US-2019355347-A1 · Nov 21, 2019 · US
US12148444B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12148444-B2 |
| Application number | US-202117222736-A |
| Country | US |
| Kind code | B2 |
| Filing date | Apr 5, 2021 |
| Priority date | Aug 8, 2018 |
| Publication date | Nov 19, 2024 |
| Grant date | Nov 19, 2024 |
Official abstract text for this publication.
Methods, systems, and computer program products for generating, from an input character sequence, an output sequence of audio data representing the input character sequence. The output sequence of audio data includes a respective audio output sample for each of a number of time steps. One example method includes, for each of the time steps: generating a mel-frequency spectrogram for the time step by processing a representation of a respective portion of the input character sequence using a decoder neural network; generating a probability distribution over a plurality of possible audio output samples for the time step by processing the mel-frequency spectrogram for the time step using a vocoder neural network; and selecting the audio output sample for the time step from the possible audio output samples in accordance with the probability distribution.
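The per-time-step loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: `decoder_step` and `vocoder_distribution` are hypothetical stand-ins for the trained decoder and vocoder neural networks, and the constants `N_MELS` and `N_LEVELS`, the toy arithmetic inside the stand-ins, and the mapping of levels to amplitudes are all assumptions made only so the sketch runs.

```python
import math
import random

N_MELS = 80     # assumed number of mel bands (not stated in the abstract)
N_LEVELS = 256  # assumed number of possible audio output sample values

def decoder_step(text_repr, prev_frame):
    """Hypothetical stand-in for the decoder network: emits one mel frame."""
    base = sum(text_repr) / max(len(text_repr), 1)
    return [math.tanh(0.1 * (base + i) + 0.5 * prev_frame[i]) for i in range(N_MELS)]

def vocoder_distribution(mel_frame, history):
    """Hypothetical stand-in for the vocoder network.

    Returns a probability distribution over N_LEVELS possible sample values,
    conditioned on the mel frame and on the previously generated samples.
    """
    mean_mel = sum(mel_frame) / len(mel_frame)
    prev = history[-1] if history else 0.0
    center = (N_LEVELS - 1) * 0.5 * (1.0 + 0.5 * math.tanh(mean_mel) + 0.5 * prev)
    logits = [-((k - center) / 8.0) ** 2 for k in range(N_LEVELS)]
    top = max(logits)
    weights = [math.exp(x - top) for x in logits]  # numerically stable softmax
    total = sum(weights)
    return [w / total for w in weights]

def synthesize(text_repr, n_steps, seed=0):
    """Run the three steps from the abstract once per time step."""
    rng = random.Random(seed)
    prev_frame = [0.0] * N_MELS
    samples = []
    for _ in range(n_steps):
        # 1. Generate a mel-frequency spectrogram frame with the decoder.
        mel = decoder_step(text_repr, prev_frame)
        # 2. Generate a probability distribution over possible output samples.
        probs = vocoder_distribution(mel, samples)
        # 3. Select the output sample in accordance with that distribution.
        level = rng.choices(range(N_LEVELS), weights=probs, k=1)[0]
        samples.append(2.0 * level / (N_LEVELS - 1) - 1.0)  # map level to [-1, 1]
        prev_frame = mel
    return samples
```

Calling `synthesize([1.0, 2.0, 3.0], 50)` returns 50 amplitude values in the range [-1, 1]; the input list here is a placeholder for whatever representation of the character sequence a real system would use.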
Opening claim text (preview).
What is claimed:
1. A method for generating, from an input data representing a text input, an output sequence of audio data corresponding to the input data, wherein the output sequence of audio data comprises a respective audio output sample for each of a plurality of time steps, and wherein the method comprises, for each of the plurality of time steps: generating a mel-frequency spectrogram for a timestep of the plurality of timesteps by processing a representation of a respective portion of the input data using a decoder neural network, wherein the decoder neural network is an autoregressive neural network comprising an LSTM subnetwork, a linear transform, and a convolutional subnetwork; generating a probability distribution over a plurality of possible audio output samples for the time step by processing the mel-frequency spectrogram for the time step using a vocoder neural network; and selecting an audio output sample for the time step from the plurality of possible audio output samples in accordance with the probability distribution.
2. The method of claim 1, wherein the vocoder neural network is an autoregressive neural network, and wherein generating the probability distribution further comprises: conditioning the autoregressive neural network on a current output sequence of audio data comprising respective audio output samples for each preceding time step in the output sequence.
3. The method of claim 1, wherein processing the representation of the respective portion of the input data, comprises: processing, by an encoder neural network, the input data to generate a feature representation of the input data; and processing, by an attention network, the feature representation, to generate a fixed-length context vector for the time step.
4. The method of claim 3, wherein the encoder neural network comprises a convolutional subnetwork and a bidirectional long short-term memory (LSTM) layer, and wherein the feature representation is a sequential feature representation that represents a local structure of the input data around a particular character in the input data.
5. The method of claim 2, wherein the autoregressive neural network of the vocoder neural network, comprises: a convolutional subnetwork having one or more dilated convolutional layers; and a logistic output layer configured to generate a continuous probability distribution of possible audio output samples.
6. The method of claim 1, wherein the probability distribution is a logistic distribution.
7. The method of claim 1, wherein selecting one of the plurality of possible audio samples in accordance with the probability distribution for the time step comprises sampling from the probability distribution.
8. The method of claim 1, wherein: each of the plurality of time steps corresponds to a respective time in an audio waveform, and wherein the respective audio sample at each of the plurality of time steps is an amplitude value of the audio waveform at the corresponding time step; and a frame length of each mel-frequency spectrogram is 12.5 milliseconds.
9. A system comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising: generating, from an input data representing a text input, an output sequence of audio data representing the input data, wherein the output sequence of audio data comprises a respective audio output sample for each of a plurality of time steps, including, for each of the plurality of time steps: generating a mel-frequency spectrogram for a timestep of the plurality of timesteps by processing a representation of a respective portion of the input data using a decoder neural network, wherein the decoder neural network is an autoregressive neural network comprising an LSTM subnetwork, a linear transform, and a convolutional subnetwork; generating a probability distribution over a plurality of possible audio output samples for the time step by processing the mel-frequency spectrogram for the time step using a vocoder neural network; and selecting an audio output sample for the time step from the plurality of possible audio output samples in accordance with the probability distribution.
10. The system of claim 9, wherein the vocoder neural network is an autoregressive neural network, and wherein generating the probability distribution further comprises: conditioning the autoregressive neural network on a current output sequence of audio data comprising respective audio output samples for each preceding time step in the output sequence.
11. The system of claim 9, wherein processing the representation of the respective portion of the input data using the decoder neural network, comprises: processing, by an encoder neural network, the input data to generate a feature representation of a character sequence in the input data; and processing, by an attention network, the feature representation, to generate a fixed-length context vector for the time step.
12. The system of claim 11, wherein the encoder neural network comprises a convolutional subnetwork and a bidirectional long short-term memory (LSTM) layer, and wherein the feature representation is a sequential feature representation that represents a local structure of the input data around a particular character in an input character sequence included in the input data.
13. The system of claim 9, wherein the autoregressive neural network of the vocoder neural network comprises: a convolutional subnetwork having one or more dilated convolutional layers; and a logistic output layer configured to generate a continuous probability distribution of possible audio output samples.
14. The system of claim 9, wherein the probability distribution is a logistic distribution and wherein selecting one of the plurality of possible audio samples in accordance with the probability distribution for the time step comprises sampling from the probability distribution.
15. The system of claim 11, wherein each of the plurality of time steps corresponds to a respective time in an audio waveform, and wherein the respective audio sample at each of the plurality of time steps is an amplitude value of the audio waveform at the corresponding time step and wherein a frame length of each mel-frequency spectrogram is 12.5 milliseconds.
16. A method for generating, from an input data representing a text input, an output sequence of audio data corresponding to the input data, wherein the output sequence of audio data comprises a respective audio output sample for each of a plurality of time steps, and wherein the method comprises, for each of the plurality of time steps: generating a mel-frequency spectrogram for a timestep of the plurality of timesteps by processing a representation of a respective portion of the input data using a decoder neural network, wherein the decoder neural network is an autoregressive neural network comprising an LSTM subnetwork, a linear transform, and a convolutional subnetwork; determining, using an autoregressive vocoder neural network and based on the mel-frequency spectrogram, an audio output sample for the time step from among a plurality of possible audio output samples for the time step; and generating an output sequence of audio data based on the audio output samples for the plurality of time steps.
17. The method of claim 16, further comprising: conditioning the autoregressive vocoder neural network on a current output sequence of audio data comprising res
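
Two details in the dependent claims lend themselves to a short worked example: sampling from a logistic distribution (claims 6, 7, and 14) and the 12.5 millisecond frame length (claims 8 and 15). The sketch below is illustrative only. Inverse-CDF sampling is one standard way to draw from a logistic distribution and is not stated in the claims; the function names, the parameter values, and the 24 kHz sample rate are assumptions introduced here.

```python
import math
import random

def sample_logistic(mu, s, rng):
    """Draw one value from a logistic distribution with location mu and scale s.

    Uses the inverse CDF: x = mu + s * ln(u / (1 - u)) for u uniform on (0, 1).
    """
    u = rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep u strictly inside (0, 1)
    return mu + s * math.log(u / (1.0 - u))

def samples_per_frame(frame_ms, sample_rate_hz):
    """Number of audio samples covered by one spectrogram frame."""
    return round(frame_ms * sample_rate_hz / 1000)

# Assumed values for illustration; a real vocoder would predict mu and s
# at every time step from the mel-frequency spectrogram.
rng = random.Random(0)
draws = [sample_logistic(0.2, 0.05, rng) for _ in range(20000)]
mean_draw = sum(draws) / len(draws)

# At an assumed 24 kHz sample rate, a 12.5 ms frame spans 300 samples.
frame_samples = samples_per_frame(12.5, 24000)
```

Under these assumptions each spectrogram frame conditions several hundred consecutive audio output samples, which is why the vocoder runs many more steps than the decoder.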
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Auto-encoder networks; Encoder-decoder networks · CPC title
Supervised learning · CPC title
Probabilistic graphical models, e.g. probabilistic networks · CPC title