What technology area does this patent fall under?

Primary CPC classification G10L13/08. Mapped technology areas include Physics.

When was this patent published?

Publication date Dec 22, 2020 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Systems and methods for real-time neural text-to-speech

US10872598B2 · US · B2

Patent metadata
Field	Value
Publication number	US-10872598-B2
Application number	US-201815882926-A
Country	US
Kind code	B2
Filing date	Jan 29, 2018
Priority date	Feb 24, 2017
Publication date	Dec 22, 2020
Grant date	Dec 22, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments of a production-quality text-to-speech (TTS) system constructed from deep neural networks are described. System embodiments comprise five major building blocks: a segmentation model for locating phoneme boundaries, a grapheme-to-phoneme conversion model, a phoneme duration prediction model, a fundamental frequency prediction model, and an audio synthesis model. For embodiments of the segmentation model, phoneme boundary detection was performed with deep neural networks using Connectionist Temporal Classification (CTC) loss. For embodiments of the audio synthesis model, a variant of WaveNet was created that requires fewer parameters and trains faster than the original. By using a neural network for each component, system embodiments are simpler and more flexible than traditional TTS systems, where each component requires laborious feature engineering and extensive domain expertise. Inference with system embodiments may be performed faster than real time.
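The abstract names the building blocks but not how data moves between them. The sketch below is one possible reading of the inference path, not the patented implementation: text goes to a grapheme-to-phoneme step, then to duration and fundamental frequency prediction, then to audio synthesis. That ordering, and the omission of the segmentation model at inference time, are assumptions drawn from the claims, where segmentation appears only as a way to label training audio. Every name here (`PhonemeFeatures`, `synthesize`, the `toy_*` functions) is hypothetical, and each neural network is replaced by a trivial stand-in so that the example runs.

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PhonemeFeatures:
    phoneme: str
    duration_s: float  # predicted duration in seconds
    f0_hz: float       # predicted fundamental frequency; 0.0 marks unvoiced


def synthesize(
    text: str,
    g2p: Callable[[str], List[str]],
    predict_duration: Callable[[str], float],
    predict_f0: Callable[[str], float],
    vocoder: Callable[[List[PhonemeFeatures]], List[float]],
) -> List[float]:
    """Chain the stages: text -> phonemes -> per-phoneme features -> audio samples."""
    phonemes = g2p(text)
    feats = [PhonemeFeatures(p, predict_duration(p), predict_f0(p)) for p in phonemes]
    return vocoder(feats)


# --- Toy stand-ins (assumptions, NOT neural models) -------------------------
TOY_LEXICON = {"hi": ["HH", "AY"]}  # hypothetical two-phoneme dictionary entry


def toy_g2p(text: str) -> List[str]:
    """Dictionary lookup only; a real model would generalize to unseen words."""
    out: List[str] = []
    for word in text.lower().split():
        out.extend(TOY_LEXICON[word])
    return out


def toy_duration(phoneme: str) -> float:
    return 0.1  # constant 100 ms per phoneme


def toy_f0(phoneme: str) -> float:
    return 0.0 if phoneme == "HH" else 120.0  # treat "HH" as unvoiced


def toy_vocoder(feats: List[PhonemeFeatures], sample_rate: int = 16000) -> List[float]:
    """Emit a sine wave at f0 for voiced phonemes and silence for unvoiced ones."""
    samples: List[float] = []
    for f in feats:
        for i in range(round(f.duration_s * sample_rate)):
            if f.f0_hz > 0:
                samples.append(math.sin(2 * math.pi * f.f0_hz * i / sample_rate))
            else:
                samples.append(0.0)
    return samples


audio = synthesize("hi", toy_g2p, toy_duration, toy_f0, toy_vocoder)
```

Passing each stage as a callable mirrors the abstract's point that every component is a separate model; swapping a stand-in for a trained network would not change `synthesize` itself. A real system would predict duration and fundamental frequency from phoneme context rather than one phoneme at a time.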

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for training a text-to-speech (TTS) system to synthesize human speech from text, comprising: training a grapheme-to-phoneme model to convert written text to phonemes corresponding to the written text; using the trained grapheme-to-phoneme model to convert written text, which is a transcription corresponding to training audio, to phonemes corresponding to the written text and training audio; using the training audio and the corresponding phonemes to train a segmentation model to output phoneme durations by identifying phoneme boundaries in the training audio by aligning it with the corresponding phonemes; given a ground truth dataset comprising ground truth written text representing a transcription of ground truth training audio, using the trained grapheme-to-phoneme model to produce phonemes; given the ground truth training audio and the corresponding phonemes, using the trained segmentation model to produce phoneme durations; and using the ground truth training audio, the phonemes, the phoneme durations, and fundamental frequencies of the ground truth training audio to train an audio synthesis model that outputs a signal representing synthesized human speech of the ground truth written text.

2. The computer-implemented method of claim 1 further comprising: extracting the fundamental frequencies for the ground truth training audio; and training a phoneme duration and fundamental frequency model using the fundamental frequencies, the phonemes, and the phoneme durations to output for each phoneme: a phoneme duration; a probability that the phoneme is voiced; and a fundamental frequency profile.

3. The computer-implemented method of claim 1 wherein the phonemes comprise the phoneme previously obtained as an input to train the segmentation model.

4. The computer-implemented method of claim 1 wherein one or more of the steps of using the trained grapheme-to-phoneme model to convert written text to phonemes and using the trained grapheme-to-phoneme model to produce phonemes comprises: using a phoneme dictionary or the trained grapheme-to-phoneme model to convert written text to phonemes, where the grapheme-to-phoneme model is trained using training data from a phoneme dictionary to generalize to unseen text.

5. The computer-implemented method of claim 1 wherein training the segmentation model to output phoneme durations by identifying phoneme boundaries in the training audio by aligning it with the corresponding phonemes comprises using a connectionist temporal classification (CTC) loss to predict sequences of phoneme pairs.

6. The computer-implemented method of claim 1 wherein the audio synthesis model uses multiple threads and overlapping computation on those threads to produce the synthesized human speech.

7. The computer-implemented method of claim 1 wherein the fundamental frequency profile for a phoneme is a set of fundamental frequencies values equally spaced in a time domain across the phoneme duration for the phoneme.

8. A non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by at least one processor, causes steps to be performed comprising: training a grapheme-to-phoneme model to convert written text to phonemes corresponding to the written text; using the trained grapheme-to-phoneme model to convert written text, which is a transcription corresponding to training audio, to phonemes corresponding to the written text and training audio; using the training audio and the corresponding phonemes to train a segmentation model to output phoneme durations by identifying phoneme boundaries in the training audio by aligning it with the corresponding phonemes; given a ground truth dataset comprising ground truth written text representing a transcription of ground truth training audio, using the trained grapheme-to-phoneme model to produce phonemes; given the ground truth training audio and the corresponding phonemes, using the trained segmentation model to produce phoneme durations; and using the ground truth training audio, the phonemes, the phoneme durations, and fundamental frequencies of the ground truth training audio to train an audio synthesis model that outputs a signal representing synthesized human speech of the ground truth written text.

9. The non-transitory computer-readable medium or media of claims 8 further comprising one or more sequences of instructions which, when executed by at least one processor, causes steps to be performed comprising: extracting the fundamental frequencies for the ground truth training audio; and training a phoneme duration and fundamental frequency model using the fundamental frequencies, the phonemes, and the phoneme durations to output for each phoneme: a phoneme duration; a probability that the phoneme is voiced; and a fundamental frequency profile.

10. The non-transitory computer-readable medium or media of claims 8 wherein the phonemes comprise the phoneme previously obtained as an input to train the segmentation model.

11. The non-transitory computer-readable medium or media of claims 8 wherein one or more of the steps of using the trained grapheme-to-phoneme model to convert written text to phonemes and using the trained grapheme-to-phoneme model to produce phonemes comprise: using a phoneme dictionary or the trained grapheme-to-phoneme model to convert written text to phonemes, wherein the grapheme-to-phoneme model is trained using training data from a phoneme dictionary to generalize to unseen text.

12. The non-transitory computer-readable medium or media of claims 8 wherein training the segmentation model to output phoneme durations by identifying phoneme boundaries in the training audio by aligning it with the corresponding phonemes comprises using a connectionist temporal classification (CTC) loss.

13. The non-transitory computer-readable medium or media of claims 8 wherein the segmentation model is trained to predict sequences of phoneme pairs.

14. The non-transitory computer-readable medium or media of claims 8 wherein the fundamental frequency profile for a phoneme is a set of fundamental frequencies values equally spaced in a time domain across the phoneme duration for the phoneme.

15. A system comprising: one or more processors; and a non-transitory computer-readable medium or media comprising one or more sets of instructions which, when executed by at least one of the one or more processors, causes steps to be performed comprising: training a grapheme-to-phoneme model to convert written text to phonemes corresponding to the written text; using the trained grapheme-to-phoneme model to convert written text, which is a transcription corresponding to training audio, to phonemes corresponding to the written text and training audio; using the training audio and the corresponding phonemes to train a segmentation model to output phoneme durations by identifying phoneme boundaries in the training audio by aligning it with the corresponding phonemes; given a ground truth dataset comprising ground truth written text representing a transcription of ground truth training audio, using the trained grapheme-to-phoneme model to produce phonemes; given the ground truth training audio and the corresponding phonemes, using the trained segmentation model to produce phoneme durations; and using the ground truth training audio, the phonemes, the phoneme durations, and fundamental frequencies of the ground truth training audio to train an audio synthesis model that outputs a signal representing synthesized human speech of
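Two of the dependent claims describe small data transformations that are easier to see with concrete values: the "sequences of phoneme pairs" that the segmentation model predicts (claims 5 and 13) and the "equally spaced" fundamental frequency profile (claims 7 and 14). The sketch below is an illustration under stated assumptions, not the patent's method. The function names, the `"sil"` boundary symbol used to pad the sequence, and the choice to sample at the midpoint of each equal sub-interval are all hypothetical; the claims require only that the values be equally spaced in time across the phoneme duration.

```python
from typing import List, Sequence, Tuple


def phoneme_pairs(phonemes: Sequence[str], boundary: str = "sil") -> List[Tuple[str, str]]:
    """Turn a phoneme sequence into adjacent pairs.

    Assumption: the sequence is padded with a boundary symbol on both ends,
    so every phoneme appears once as the left and once as the right element.
    """
    padded = [boundary, *phonemes, boundary]
    return list(zip(padded, padded[1:]))


def f0_sample_times(start_s: float, duration_s: float, n: int) -> List[float]:
    """Return n equally spaced time points across one phoneme's duration.

    Assumption: each point sits at the midpoint of one of n equal
    sub-intervals, so consecutive points are duration_s / n apart.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    step = duration_s / n
    return [start_s + (i + 0.5) * step for i in range(n)]


pairs = phoneme_pairs(["HH", "AY"])
# pairs == [("sil", "HH"), ("HH", "AY"), ("AY", "sil")]

times = f0_sample_times(start_s=0.0, duration_s=0.1, n=4)
# four time points, 25 ms apart, all inside the 100 ms phoneme
```

A phoneme duration and fundamental frequency model of the kind recited in claim 2 would then supply one frequency value per returned time point, alongside the phoneme duration and the probability that the phoneme is voiced.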

Assignees

Baidu USA LLC

Inventors

Classifications

Code	CPC title
G06N3/047	Probabilistic or stochastic networks
G06N3/045	Combinations of networks
G06N3/044	Recurrent networks, e.g. Hopfield networks
G06N3/0464	Convolutional networks [CNN, ConvNet]
G06N3/0442	characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]

Patent family

Related publications grouped by family.

View patent family 63246423

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10872598B2 cover?: Embodiments of a production-quality text-to-speech (TTS) system constructed from deep neural networks are described. System embodiments comprise five major building blocks: a segmentation model for locating phoneme boundaries, a grapheme-to-phoneme conversion model, a phoneme duration prediction model, a fundamental frequency prediction model, and an audio synthesis model. For embodiments of th…
Who is the assignee on this patent?: Baidu USA LLC

Related patents

Artificial intelligence-based text-to-speech system and method

Word generation for speech recognition

Deployed end-to-end speech recognition

Active learning for lexical annotations

Text-to-speech with emotional content

Method and system for efficient spoken term detection using confusion networks

Voice font speaker and prosody interpolation
