US11705107B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11705107-B2 |
| Application number | US-202017061433-A |
| Country | US |
| Kind code | B2 |
| Filing date | Oct 1, 2020 |
| Priority date | Feb 24, 2017 |
| Publication date | Jul 18, 2023 |
| Grant date | Jul 18, 2023 |
Official abstract text for this publication.
Embodiments of a production-quality text-to-speech (TTS) system constructed from deep neural networks are described. System embodiments comprise five major building blocks: a segmentation model for locating phoneme boundaries, a grapheme-to-phoneme conversion model, a phoneme duration prediction model, a fundamental frequency prediction model, and an audio synthesis model. For embodiments of the segmentation model, phoneme boundary detection was performed with deep neural networks using Connectionist Temporal Classification (CTC) loss. For embodiments of the audio synthesis model, a variant of WaveNet was created that requires fewer parameters and trains faster than the original. By using a neural network for each component, system embodiments are simpler and more flexible than traditional TTS systems, where each component requires laborious feature engineering and extensive domain expertise. Inference with system embodiments may be performed faster than real time.
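The abstract's final statement, that inference may run faster than real time, is commonly checked by comparing the duration of the audio produced with the wall-clock time spent producing it. The following is a minimal sketch of that measurement, not anything taken from the patent: `real_time_speedup`, the stand-in synthesizer `fake_synthesize`, and the 16 kHz sample rate are all assumptions made for illustration.

```python
import time
from typing import Callable, List, Sequence


def real_time_speedup(
    synthesize: Callable[[str], Sequence[float]],
    text: str,
    sample_rate_hz: int,
) -> float:
    """Return seconds of audio produced per second of wall-clock time.

    A value above 1.0 means synthesis ran faster than real time.
    """
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate_hz
    # Guard against a zero elapsed time on very coarse clocks.
    return audio_seconds / max(elapsed, 1e-9)


def fake_synthesize(text: str) -> List[float]:
    """Hypothetical stand-in for a TTS system: 0.1 s of silence per character
    at an assumed 16 kHz sample rate."""
    return [0.0] * (1600 * len(text))


speedup = real_time_speedup(fake_synthesize, "hello", sample_rate_hz=16_000)
print(f"speedup over real time: {speedup:.1f}x")
```

Replacing `fake_synthesize` with a call into an actual synthesis pipeline gives the same ratio for a real system; a result below 1.0 would mean the audio takes longer to generate than to play.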
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method for using a text-to-speech (TTS) system to synthesize human speech from text, comprising: using a trained grapheme-to-phoneme model to convert written text to phonemes corresponding to the written text; inputting the phonemes into either: (1) a trained phoneme duration and fundamental frequency model or (2) a trained phoneme duration model and a trained fundamental frequency model, to output for a phoneme: a phoneme duration; a probability that the phoneme is voiced; and a fundamental frequency profile; and using a trained neural network audio synthesis model that receives the phonemes, the phoneme durations, the fundamental frequency profiles for the phonemes, and for each phoneme, a probability whether the phoneme is voiced as an input to generate a signal representing synthesized human speech of the written text.
2. The computer-implemented method of claim 1 wherein the step of using a trained grapheme-to-phoneme model to convert written text to phonemes corresponding to the written text comprises: using, for one or more words in the written text, a phoneme dictionary look-up to convert the one or more words to phonemes.
3. The computer-implemented method of claim 1 further comprising utilizing one or more computational efficiencies to help the text-to-speech system produce the signal representing synthesized human speech of the written text in real-time or faster than real-time.
4. The computer-implemented method of claim 3 wherein one of the one or more computation efficiencies comprises the trained neural network audio synthesis model using multiple threads and overlapping computation on those threads to produce the signal representing synthesized human speech.
5. The computer-implemented method of claim 1 wherein the fundamental frequency profile for a phoneme is a set of fundamental frequencies values equally spaced in a time domain across the phoneme duration for the phoneme.
6. The computer-implemented method of claim 1 wherein the phoneme represent phoneme with stresses, when applicable to the phoneme.
7. The computer-implemented method of claim 6 wherein the trained neural network audio synthesis model comprises a conditioner network for biasing every layer with a per-timestep conditioning vector generated from a lower-frequency input signal comprising features obtained at least from the phonemes, including phoneme stresses, and the fundamental frequency profiles.
8. A non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by at least one processor, causes steps for synthesizing human speech from text, comprising: converting written text to a set of phonemes corresponding to the written text using a trained grapheme-to-phoneme model; using either: (1) a trained phoneme duration and fundamental frequency model, or (2) a trained phoneme duration model and a trained fundamental frequency model, to obtain for each phoneme from the set of the phonemes: a phoneme duration; a probability that the phoneme is voiced; and a fundamental frequency profile; and generating a signal representing synthesized human speech of the written text using a trained neural network audio synthesis model that receives the set of phonemes, the phoneme durations, the fundamental frequency profiles, and the probabilities whether the phonemes are voiced.
9. The non-transitory computer-readable medium or media of claim 8 wherein the step of converting convert written text to a set of phonemes corresponding to the written text using a trained grapheme-to-phoneme model comprises: using, for one or more words in the written text, a phoneme dictionary look-up to convert the one or more words to phonemes.
10. The non-transitory computer-readable medium or media of claim 8 further comprising one or more sequences of instructions which, when executed by at least one processor, causes one or more steps comprising: employing one or more computational efficiencies to produce the signal representing synthesized human speech of the written text in real-time or faster than real-time.
11. The non-transitory computer-readable medium or media of claim 10 wherein one of the one or more computation efficiencies comprises the trained neural network audio synthesis model using multiple threads and overlapping computation on those threads to produce the signal representing synthesized human speech.
12. The non-transitory computer-readable medium or media of claim 8 wherein the phoneme represent phoneme with stresses, when applicable to the phoneme.
13. The non-transitory computer-readable medium or media of claim 12 wherein the trained neural network audio synthesis model comprises a conditioner network for one or more layers with a per-timestep conditioning vector generated from a lower-frequency input signal comprising features obtained at least from the phonemes, including phoneme stresses, and the fundamental frequency profiles.
14. A system comprising: one or more processors; and a non-transitory computer-readable medium or media comprising one or more sets of instructions which, when executed by at least one of the one or more processors, causes steps to be performed comprising: using a trained grapheme-to-phoneme model to convert written text to phonemes corresponding to the written text; inputting the phonemes into either: (1) a trained phoneme duration and fundamental frequency model or (2) a trained phoneme duration model and a trained fundamental frequency model, to output for a phoneme: a phoneme duration; a probability that the phoneme is voiced; and a fundamental frequency profile; and using a trained neural network audio synthesis model that receives the phonemes, the phoneme durations, the fundamental frequency profiles for the phonemes, and for each phoneme, a probability whether the phoneme is voiced as an input to generate a signal representing synthesized human speech of the written text.
15. The system of claim 14 wherein the step of using a trained grapheme-to-phoneme model to convert written text to phonemes corresponding to the written text comprises: using, for one or more words in the written text, a phoneme dictionary look-up to convert the one or more words to phonemes.
16. The system of claim 14 further comprising utilizing one or more computational efficiencies to help the system produce the signal representing synthesized human speech of the written text in real-time or faster than real-time.
17. The system of claim 16 wherein one of the one or more computation efficiencies comprises the trained neural network audio synthesis model using multiple threads and overlapping computation on those threads to produce the signal representing synthesized human speech.
18. The system of claim 14 wherein the fundamental frequency profile for a phoneme is a set of fundamental frequencies values equally spaced in a time domain across the phoneme duration for the phoneme.
19. The system of claim 14 wherein the phoneme represent phoneme with stresses, when applicable to the phoneme.
20. The system of claim 19 wherein the trained neural network audio synthesis model comprises a conditioner network for biasing every layer with a per-timestep conditioning vector generated from a lower-frequency input signal comprising features obtained at least from the phonemes, including phoneme stresses, and the fundamental frequency profiles.
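The data flow recited in claim 1, together with the dictionary look-up of claim 2 and the equally spaced fundamental frequency profile of claim 5, can be sketched roughly as below. Every name here is hypothetical: the two-word `PHONEME_DICT`, the placeholder `g2p_model_fallback`, the constant values returned by `predict_prosody`, and the helper `f0_sample_times` stand in for trained models whose internals the claims do not spell out. Whether the equally spaced samples include both phoneme endpoints is also an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical dictionary using ARPAbet-style phonemes with stress digits.
PHONEME_DICT: Dict[str, List[str]] = {
    "hello": ["HH", "AH0", "L", "OW1"],
    "world": ["W", "ER1", "L", "D"],
}


def g2p_model_fallback(word: str) -> List[str]:
    """Placeholder for a trained grapheme-to-phoneme model (assumption)."""
    return [ch.upper() for ch in word if ch.isalpha()]


def text_to_phonemes(text: str) -> List[str]:
    """Use the dictionary look-up where possible (claim 2), else the model."""
    phonemes: List[str] = []
    for word in text.lower().split():
        phonemes.extend(PHONEME_DICT.get(word) or g2p_model_fallback(word))
    return phonemes


@dataclass
class PhonemeProsody:
    phoneme: str
    duration_s: float           # phoneme duration
    voiced_prob: float          # probability that the phoneme is voiced
    f0_profile_hz: List[float]  # fundamental frequency profile


def f0_sample_times(duration_s: float, n_points: int) -> List[float]:
    """Times equally spaced across the phoneme duration (claim 5).

    Assumes the first and last samples sit on the phoneme boundaries.
    """
    if n_points < 1:
        raise ValueError("n_points must be at least 1")
    if n_points == 1:
        return [0.0]
    step = duration_s / (n_points - 1)
    return [i * step for i in range(n_points)]


def predict_prosody(phonemes: List[str], n_f0_points: int = 4) -> List[PhonemeProsody]:
    """Placeholder for the trained duration and F0 model(s); values are made up."""
    return [
        PhonemeProsody(p, duration_s=0.09, voiced_prob=0.5,
                       f0_profile_hz=[120.0] * n_f0_points)
        for p in phonemes
    ]


phonemes = text_to_phonemes("hello world")
prosody = predict_prosody(phonemes)
# Per-phoneme records like these are what claim 1 feeds to the audio synthesis model.
for item in prosody[:2]:
    print(item.phoneme, item.duration_s, item.voiced_prob,
          f0_sample_times(item.duration_s, len(item.f0_profile_hz)))
```

The sketch stops at the inputs to the audio synthesis model; the synthesis network itself (the WaveNet variant mentioned in the abstract) is outside its scope.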
Recurrent networks, e.g. Hopfield networks · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Generative networks · CPC title
Supervised learning · CPC title