US12100382B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12100382-B2 |
| Application number | US-202117492543-A |
| Country | US |
| Kind code | B2 |
| Filing date | Oct 1, 2021 |
| Priority date | Oct 2, 2020 |
| Publication date | Sep 24, 2024 |
| Grant date | Sep 24, 2024 |
Official abstract text for this publication.
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for synthesizing audio data from text data using duration prediction. One of the methods includes processing an input text sequence that includes a respective text element at each of multiple input time steps using a first neural network to generate a modified input sequence comprising, for each input time step, a representation of the corresponding text element in the input text sequence; processing the modified input sequence using a second neural network to generate, for each input time step, a predicted duration of the corresponding text element in the output audio sequence; upsampling the modified input sequence according to the predicted durations to generate an intermediate sequence comprising a respective intermediate element at each of a plurality of intermediate time steps; and generating an output audio sequence using the intermediate sequence.
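The upsampling step in the abstract can be illustrated with a minimal sketch. It assumes the simplest possible scheme, in which each per-token representation is repeated for its predicted number of frames; the abstract does not prescribe this, and claim 1 describes a Gaussian-weighted variant instead. The function name `upsample_by_repetition` and the toy vectors are hypothetical, and the durations are taken as given rather than predicted by a network.

```python
from typing import List, Sequence


def upsample_by_repetition(
    representations: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Repeat each token representation for its (assumed integer) duration in frames."""
    if len(representations) != len(durations):
        raise ValueError("need exactly one duration per representation")
    intermediate: List[List[float]] = []
    for rep, dur in zip(representations, durations):
        if dur < 0:
            raise ValueError("durations must be non-negative")
        # One copy of the representation per intermediate time step it covers.
        intermediate.extend(list(rep) for _ in range(dur))
    return intermediate


# Toy example: three text elements with hypothetical 2-dimensional representations.
reps = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
durs = [2, 1, 3]  # stand-ins for the second network's predicted durations
frames = upsample_by_repetition(reps, durs)
```

The intermediate sequence has one element per frame (here 2 + 1 + 3 = 6), which is what lets a downstream decoder produce all frames without attending back to the text at each step.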
Opening claim text (preview).
What is claimed is:

1. A method for generating an output audio sequence from an input text sequence, wherein the input text sequence comprises a respective text element at each of a plurality of input time steps and the output audio sequence comprises a respective audio sample at each of a plurality of output time steps, the method comprising: processing the input text sequence using a first neural network to generate a modified input sequence comprising, for each of the plurality of input time steps, a representation of the corresponding text element in the input text sequence; processing the modified input sequence using a second neural network to generate, for each input time step, a predicted duration of the corresponding text element in the output audio sequence; upsampling the modified input sequence according to the predicted durations to generate an intermediate sequence comprising a respective intermediate element at each of a plurality of intermediate time steps, the upsampling comprising: determining, for each representation in the modified sequence and using the predicted durations of the corresponding text elements in the output audio sequence, parameters of a distribution for the representation that assigns a respective value to each intermediate element that models an influence of the representation on the intermediate element based on the predicted durations for the corresponding text elements, wherein the distribution for the representation is a Gaussian distribution, and wherein a center of the Gaussian distribution corresponds to a center of the predicted duration of the representation; and generating each intermediate element of the intermediate sequence based on the distributions for the representations in the modified sequence, the generating comprising, for each particular intermediate element: determining a respective weight for each representation from the value assigned to the particular intermediate element in the distribution generated for the representation; and generating the particular intermediate element by determining a weighted sum of the representations, wherein each representation is weighted according to the respective weight for the representation; and generating the output audio sequence using the intermediate sequence.

2. The method of claim 1, wherein the center of the Gaussian distribution for a particular representation is:

$$c_i = \frac{d_i}{2} + \sum_{j=1}^{i-1} d_j,$$

wherein $c_i$ is the center of the Gaussian distribution for the particular representation, $d_i$ is the predicted duration of the particular representation, and each $d_j$ is the predicted duration of a respective representation that precedes the particular representation in the modified input sequence.

3. The method of claim 1, wherein a variance of the Gaussian distribution for each respective representation is generated by processing the modified input sequence using a fourth neural network.

4. The method of claim 3, wherein processing the modified input sequence using the fourth neural network comprises: combining, for each representation in the modified input sequence, the representation with the predicted duration of the representation to generate a respective combined representation; and processing the combined representations using the fourth neural network to generate the respective variance of the Gaussian distribution for each representation.

5. The method of claim 1, wherein upsampling the modified input sequence to generate an intermediate sequence comprises: upsampling the modified input sequence to generate an upsampled sequence comprising a respective upsampled representation at each of the plurality of intermediate time steps; and generating the intermediate sequence from the upsampled sequence, comprising combining, for each upsampled representation in the upsampled text sequence, the upsampled representation with a positional embedding of the upsampled representation.

6. The method of claim 5, wherein the positional embedding of an upsampled representation identifies a position of the upsampled representation in a subsequence of upsampled representations corresponding to the same representation in the modified input sequence.

7. The method of claim 1, wherein generating the output audio sequence using the intermediate sequence comprises: processing the intermediate sequence using a third neural network to generate a mel-spectrogram comprising a respective spectrogram frame at each of the plurality of intermediate time steps; and processing the mel-spectrogram to generate the output audio sequence.

8. The method of claim 7, wherein the first neural network, the second neural network, and the third neural network have been trained concurrently.

9. The method of claim 8, wherein the neural networks are trained using a loss term that includes one or more of: a first term characterizing an error in the predicted durations of the representations in the modified input sequence; or a second term characterizing an error in the generated mel-spectrogram.

10. The method of claim 8, wherein the training comprises teacher forcing using ground-truth durations for each representation in the modified input sequence.

11. The method of claim 8, wherein the training comprises training the neural networks without any ground-truth durations for representations in the modified input sequence.

12. The method of claim 11, wherein the training comprises: obtaining a training input text sequence comprising a respective training text element at each of a plurality of training input time steps; processing the training input text sequence using a first subnetwork of the first neural network to generate an embedding of the training input text sequence; obtaining a ground-truth mel-spectrogram corresponding to the training input text sequence; processing the ground-truth mel-spectrogram using a second subnetwork of the first neural network to generate an embedding of the ground-truth mel-spectrogram; combining i) the embedding of the training input text sequence and ii) the embedding of the ground-truth mel-spectrogram to generate a training modified input sequence comprising, for each of the plurality of training input time steps, a representation of the corresponding training text element in the training input text sequence; and processing the training modified input sequence using the second neural network to generate, for each representation in the training modified input sequence, a predicted duration of the representation.

13. The method of claim 12, wherein combining i) the embedding of the training input text sequence and ii) the embedding of the ground-truth mel-spectrogram comprises processing i) the embedding of the training input text sequence and
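The Gaussian upsampling of claims 1 and 2 can be sketched in a few lines. This is an illustration under stated assumptions, not the patented implementation: the function names are hypothetical, the standard deviations are passed in directly (claim 3 has a fourth neural network produce the variances), each frame is evaluated at its midpoint `t + 0.5`, and the weights are normalized to sum to one per frame. The claims only require that the weights be determined from the Gaussian values, so the last two choices are assumptions.

```python
import math
from typing import List, Sequence


def gaussian_centers(durations: Sequence[float]) -> List[float]:
    """Centers per claim 2: c_i = d_i / 2 + sum of the durations preceding i."""
    centers: List[float] = []
    elapsed = 0.0
    for d in durations:
        centers.append(elapsed + d / 2.0)
        elapsed += d
    return centers


def gaussian_upsample(
    representations: Sequence[Sequence[float]],
    durations: Sequence[float],
    std_devs: Sequence[float],
    num_frames: int,
) -> List[List[float]]:
    """Build each intermediate element as a Gaussian-weighted sum of all representations."""
    centers = gaussian_centers(durations)
    dim = len(representations[0])
    output: List[List[float]] = []
    for t in range(num_frames):
        position = t + 0.5  # assumption: evaluate each frame at its midpoint
        # Value each representation's Gaussian assigns to this frame.
        densities = [
            math.exp(-0.5 * ((position - c) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
            for c, s in zip(centers, std_devs)
        ]
        total = sum(densities)
        if total == 0.0:
            raise ValueError("all Gaussian values underflowed; use larger std_devs")
        # Assumption: normalize so the weights for each frame sum to one.
        weights = [v / total for v in densities]
        output.append(
            [sum(w * rep[k] for w, rep in zip(weights, representations)) for k in range(dim)]
        )
    return output


# Toy example: two text elements with hypothetical one-hot representations.
example_durations = [2.0, 4.0]
centers = gaussian_centers(example_durations)
frames = gaussian_upsample([[1.0, 0.0], [0.0, 1.0]], example_durations, [1.0, 1.0], 6)
```

Unlike hard repetition, every frame here depends smoothly on the durations and variances, so the upsampling is differentiable with respect to both.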
Details of speech synthesis systems, e.g. synthesiser structure or memory management · CPC title
using neural networks · CPC title
Duration · CPC title
Concept to speech synthesisers; Generation of natural phrases from machine-based concepts (generation of parameters for speech synthesis out of text G10L13/08) · CPC title
Prosody rules derived from text; Stress or intonation · CPC title