What technology area does this patent fall under?

Primary CPC classification G10L13/027. Mapped technology areas include Physics.

When was this patent published?

Publication date: Sep 24, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Text-to-speech using duration prediction

Patent metadata
Field	Value
Publication number	US-12100382-B2
Application number	US-202117492543-A
Country	US
Kind code	B2
Filing date	Oct 1, 2021
Priority date	Oct 2, 2020
Publication date	Sep 24, 2024
Grant date	Sep 24, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for synthesizing audio data from text data using duration prediction. One of the methods includes processing an input text sequence that includes a respective text element at each of multiple input time steps using a first neural network to generate a modified input sequence comprising, for each input time step, a representation of the corresponding text element in the input text sequence; processing the modified input sequence using a second neural network to generate, for each input time step, a predicted duration of the corresponding text element in the output audio sequence; upsampling the modified input sequence according to the predicted durations to generate an intermediate sequence comprising a respective intermediate element at each of a plurality of intermediate time steps; and generating an output audio sequence using the intermediate sequence.
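To make the "upsampling according to the predicted durations" step concrete, here is a minimal Python sketch of the simplest possible reading: each per-token representation is repeated for as many intermediate time steps as its predicted duration. This is an illustration only. Claim 1 describes a Gaussian-weighted scheme rather than plain repetition, and the function name `upsample_by_repetition`, the rounding of durations to whole steps, and the toy vectors are all assumptions that do not come from the patent.

```python
from typing import List, Sequence


def upsample_by_repetition(
    representations: Sequence[Sequence[float]],
    durations: Sequence[float],
) -> List[List[float]]:
    """Repeat each representation for round(duration) intermediate steps.

    Hypothetical helper: plain repetition is a stand-in for the upsampling
    step, not the Gaussian-weighted method recited in claim 1.
    """
    if len(representations) != len(durations):
        raise ValueError("need exactly one duration per representation")
    intermediate: List[List[float]] = []
    for vector, duration in zip(representations, durations):
        steps = max(0, round(duration))  # assumption: durations are in steps
        intermediate.extend(list(vector) for _ in range(steps))
    return intermediate


# Toy stand-ins for the first network's output (one vector per text element)
# and the second network's output (one predicted duration per text element).
reps = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
durs = [2.0, 1.0, 3.0]
frames = upsample_by_repetition(reps, durs)
print(len(frames))  # 2 + 1 + 3 = 6 intermediate time steps
```

The point of the sketch is the change in sequence length: three text elements become six intermediate elements, which a later stage would turn into audio.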

First claim

Opening claim text (preview).

What is claimed is:

1. A method for generating an output audio sequence from an input text sequence, wherein the input text sequence comprises a respective text element at each of a plurality of input time steps and the output audio sequence comprises a respective audio sample at each of a plurality of output time steps, the method comprising: processing the input text sequence using a first neural network to generate a modified input sequence comprising, for each of the plurality of input time steps, a representation of the corresponding text element in the input text sequence; processing the modified input sequence using a second neural network to generate, for each input time step, a predicted duration of the corresponding text element in the output audio sequence; upsampling the modified input sequence according to the predicted durations to generate an intermediate sequence comprising a respective intermediate element at each of a plurality of intermediate time steps, the upsampling comprising: determining, for each representation in the modified sequence and using the predicted durations of the corresponding text elements in the output audio sequence, parameters of a distribution for the representation that assigns a respective value to each intermediate element that models an influence of the representation on the intermediate element based on the predicted durations for the corresponding text elements wherein the distribution for the representation is a Gaussian distribution, and wherein a center of the Gaussian distribution corresponds to a center of the predicted duration of the representation; and generating each intermediate element of the intermediate sequence based on the distributions for the representations in the modified sequence, the generating comprising, for each particular intermediate element: determining a respective weight for each representation from the value assigned to the particular intermediate element in the distribution generated for the representation; and generating the particular intermediate element by determining a weighted sum of the representations, wherein each representation is weighted according to the respective weight for the representation; and generating the output audio sequence using the intermediate sequence.

2. The method of claim 1, wherein the center of the Gaussian distribution for a particular representation is: $c_i = \frac{d_i}{2} + \sum_{j=1}^{i-1} d_j$, wherein $c_i$ is the center of the Gaussian distribution for the particular representation, $d_i$ is the predicted duration of the particular representation, and each $d_j$ is the predicted duration of a respective representation that precedes the particular representation in the modified input sequence.

3. The method of claim 1, wherein a variance of the Gaussian distribution for each respective representation is generated by processing the modified input sequence using a fourth neural network.

4. The method of claim 3, wherein processing the modified input sequence using the fourth neural network comprises: combining, for each representation in the modified input sequence, the representation with the predicted duration of the representation to generate a respective combined representation; and processing the combined representations using the fourth neural network to generate the respective variance of the Gaussian distribution for each representation.

5. The method of claim 1, wherein upsampling the modified input sequence to generate an intermediate sequence comprises: upsampling the modified input sequence to generate an upsampled sequence comprising a respective upsampled representation at each of the plurality of intermediate time steps; and generating the intermediate sequence from the upsampled sequence, comprising combining, for each upsampled representation in the upsampled text sequence, the upsampled representation with a positional embedding of the upsampled representation.

6. The method of claim 5, wherein the positional embedding of an upsampled representation identifies a position of the upsampled representation in a subsequence of upsampled representations corresponding to the same representation in the modified input sequence.

7. The method of claim 1, wherein generating the output audio sequence using the intermediate sequence comprises: processing the intermediate sequence using a third neural network to generate a mel-spectrogram comprising a respective spectrogram frame at each of the plurality of intermediate time steps; and processing the mel-spectrogram to generate the output audio sequence.

8. The method of claim 7, wherein the first neural network, the second neural network, and the third neural network have been trained concurrently.

9. The method of claim 8, wherein the neural networks are trained using a loss term that includes one or more of: a first term characterizing an error in the predicted durations of the representations in the modified input sequence; or a second term characterizing an error in the generated mel-spectrogram.

10. The method of claim 8, wherein the training comprises teacher forcing using ground-truth durations for each representation in the modified input sequence.

11. The method of claim 8, wherein the training comprises training the neural networks without any ground-truth durations for representations in the modified input sequence.

12. The method of claim 11, wherein the training comprises: obtaining a training input text sequence comprising a respective training text element at each of a plurality of training input time steps; processing the training input text sequence using a first subnetwork of the first neural network to generate an embedding of the training input text sequence; obtaining a ground-truth mel-spectrogram corresponding to the training input text sequence; processing the ground-truth mel-spectrogram using a second subnetwork of the first neural network to generate an embedding of the ground-truth mel-spectrogram; combining i) the embedding of the training input text sequence and ii) the embedding of the ground-truth mel-spectrogram to generate a training modified input sequence comprising, for each of the plurality of training input time steps, a representation of the corresponding training text element in the training input text sequence; and processing the training modified input sequence using the second neural network to generate, for each representation in the training modified input sequence, a predicted duration of the representation.

13. The method of claim 12, wherein combining i) the embedding of the training input text sequence and ii) the embedding of the ground-truth mel-spectrogram comprises processing i) the embedding of the training input text sequence and
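The Gaussian upsampling recited in claims 1 and 2 can be sketched in a few lines of plain Python. The sketch below computes each center as half the element's own duration plus the sum of all preceding durations (claim 2), evaluates one Gaussian per representation at every intermediate time step, and forms each intermediate element as a weighted sum of the representations (claim 1). Several details are assumptions made for illustration and are not stated in the claim text: the function names, evaluating each step at its midpoint `t + 0.5`, normalizing the weights so they sum to one, passing standard deviations in directly (claim 3 has a separate network produce the variances), and the toy inputs.

```python
import math
from typing import List, Sequence


def gaussian_centers(durations: Sequence[float]) -> List[float]:
    """Center of element i: d_i / 2 plus the sum of all earlier durations."""
    centers: List[float] = []
    elapsed = 0.0
    for d in durations:
        centers.append(elapsed + d / 2.0)
        elapsed += d
    return centers


def gaussian_upsample(
    representations: Sequence[Sequence[float]],
    durations: Sequence[float],
    std_devs: Sequence[float],
    num_steps: int,
) -> List[List[float]]:
    """Build each intermediate element as a Gaussian-weighted sum.

    Hypothetical sketch: weights are normalized per step (an assumption).
    """
    centers = gaussian_centers(durations)
    dim = len(representations[0])
    output: List[List[float]] = []
    for t in range(num_steps):
        position = t + 0.5  # assumption: evaluate at the midpoint of step t
        densities = [
            math.exp(-0.5 * ((position - c) / s) ** 2)
            / (s * math.sqrt(2.0 * math.pi))
            for c, s in zip(centers, std_devs)
        ]
        total = sum(densities)
        if total == 0.0:
            raise ValueError("all Gaussian values underflowed to zero")
        weights = [p / total for p in densities]
        output.append(
            [
                sum(w * rep[k] for w, rep in zip(weights, representations))
                for k in range(dim)
            ]
        )
    return output


# Toy inputs: one-hot representations make each output frame equal to its
# weight vector, so the influence of every text element is easy to read off.
reps = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
durs = [2.0, 1.0, 3.0]
print(gaussian_centers(durs))  # [1.0, 2.5, 4.5]
frames = gaussian_upsample(reps, durs, std_devs=[1.0, 1.0, 1.0], num_steps=6)
```

Because the weights vary smoothly with position, neighboring text elements blend into each other near their boundaries instead of switching abruptly, which is the practical difference from simply repeating each representation.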

Assignees

Google LLC

Inventors

Classifications

G10L13/04
Details of speech synthesis systems, e.g. synthesiser structure or memory management · CPC title
G10L25/30
using neural networks · CPC title
G10L2013/105
Duration · CPC title
G10L13/027 (Primary)
Concept to speech synthesisers; Generation of natural phrases from machine-based concepts (generation of parameters for speech synthesis out of text G10L13/08) · CPC title
G10L13/10 (Primary)
Prosody rules derived from text; Stress or intonation · CPC title

Patent family

Related publications grouped by family.

Patent family ID: 78463954

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12100382B2 cover?: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for synthesizing audio data from text data using duration prediction. One of the methods includes processing an input text sequence that includes a respective text element at each of multiple input time steps using a first neural network to generate a modified input sequence comprising, for each input ti…
Who is the assignee on this patent?: Google LLC

Related patents

Parallel neural text-to-speech

Synthetic speech processing

Text-to-speech (TTS) processing

Systems and methods for parallel wave generation in end-to-end text-to-speech

Systems and methods for real-time neural text-to-speech