What technology area does this patent fall under?

Primary CPC classification G10L13/08. Mapped technology areas include Physics.

When was this patent published?

Publication date: Jul 18, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Real-time neural text-to-speech

US11705107B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11705107-B2
Application number	US-202017061433-A
Country	US
Kind code	B2
Filing date	Oct 1, 2020
Priority date	Feb 24, 2017
Publication date	Jul 18, 2023
Grant date	Jul 18, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments of a production-quality text-to-speech (TTS) system constructed from deep neural networks are described. System embodiments comprise five major building blocks: a segmentation model for locating phoneme boundaries, a grapheme-to-phoneme conversion model, a phoneme duration prediction model, a fundamental frequency prediction model, and an audio synthesis model. For embodiments of the segmentation model, phoneme boundary detection was performed with deep neural networks using Connectionist Temporal Classification (CTC) loss. For embodiments of the audio synthesis model, a variant of WaveNet was created that requires fewer parameters and trains faster than the original. By using a neural network for each component, system embodiments are simpler and more flexible than traditional TTS systems, where each component requires laborious feature engineering and extensive domain expertise. Inference with system embodiments may be performed faster than real time.
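The data flow in the abstract can be sketched as a chain of three stages: text to phonemes, phonemes to per-phoneme duration, voicing probability and fundamental frequency, and those features to a waveform. The sketch below is illustrative only and is not the patented implementation. Every name in it (`TOY_LEXICON`, `PhonemeFeatures`, `predict_prosody`, `synthesize`, `SAMPLE_RATE`) is hypothetical, the "models" are hard-coded rules rather than trained neural networks, and a plain sine generator stands in for the WaveNet variant. The segmentation model is left out because the steps recited in claim 1 do not use it.

```python
import math
from dataclasses import dataclass
from typing import List

SAMPLE_RATE = 16000  # assumed output rate in Hz; not stated in the abstract


@dataclass
class PhonemeFeatures:
    phoneme: str
    duration_s: float    # phoneme duration in seconds
    voiced_prob: float   # probability that the phoneme is voiced
    f0_hz: List[float]   # fundamental frequency profile


# Toy lexicon standing in for a trained grapheme-to-phoneme model.
TOY_LEXICON = {
    "hello": ["HH", "AH0", "L", "OW1"],
    "world": ["W", "ER1", "L", "D"],
}


def grapheme_to_phoneme(text: str) -> List[str]:
    """Convert text to phonemes by dictionary look-up (unknown words map to <unk>)."""
    phonemes: List[str] = []
    for word in text.lower().split():
        phonemes.extend(TOY_LEXICON.get(word, ["<unk>"]))
    return phonemes


def predict_prosody(phonemes: List[str]) -> List[PhonemeFeatures]:
    """Hard-coded stand-in for the duration and fundamental frequency models."""
    features = []
    for p in phonemes:
        is_vowel = p[-1].isdigit()  # a trailing stress digit marks a vowel here
        features.append(PhonemeFeatures(
            phoneme=p,
            duration_s=0.12 if is_vowel else 0.06,
            voiced_prob=0.95 if is_vowel else 0.30,
            f0_hz=[120.0, 125.0, 120.0] if is_vowel else [0.0, 0.0, 0.0],
        ))
    return features


def synthesize(features: List[PhonemeFeatures]) -> List[float]:
    """Sine-wave stand-in for the neural audio synthesis model."""
    audio: List[float] = []
    for feat in features:
        n = round(feat.duration_s * SAMPLE_RATE)
        if feat.voiced_prob > 0.5:
            f0 = sum(feat.f0_hz) / len(feat.f0_hz)  # mean F0 of the profile
            audio.extend(math.sin(2 * math.pi * f0 * i / SAMPLE_RATE) for i in range(n))
        else:
            audio.extend([0.0] * n)  # treat unvoiced phonemes as silence
    return audio


def text_to_speech(text: str) -> List[float]:
    return synthesize(predict_prosody(grapheme_to_phoneme(text)))
```

Calling `text_to_speech("hello world")` returns a list of float samples whose length is the sum of the predicted phoneme durations multiplied by the assumed sample rate. The point of the sketch is only the interface between stages, which is what lets each stage be swapped for a separately trained network.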

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for using a text-to-speech (TTS) system to synthesize human speech from text, comprising: using a trained grapheme-to-phoneme model to convert written text to phonemes corresponding to the written text; inputting the phonemes into either: (1) a trained phoneme duration and fundamental frequency model or (2) a trained phoneme duration model and a trained fundamental frequency model, to output for a phoneme: a phoneme duration; a probability that the phoneme is voiced; and a fundamental frequency profile; and using a trained neural network audio synthesis model that receives the phonemes, the phoneme durations, the fundamental frequency profiles for the phonemes, and for each phoneme, a probability whether the phoneme is voiced as an input to generate a signal representing synthesized human speech of the written text.

2. The computer-implemented method of claim 1 wherein the step of using a trained grapheme-to-phoneme model to convert written text to phonemes corresponding to the written text comprises: using, for one or more words in the written text, a phoneme dictionary look-up to convert the one or more words to phonemes.

3. The computer-implemented method of claim 1 further comprising utilizing one or more computational efficiencies to help the text-to-speech system produce the signal representing synthesized human speech of the written text in real-time or faster than real-time.

4. The computer-implemented method of claim 3 wherein one of the one or more computation efficiencies comprises the trained neural network audio synthesis model using multiple threads and overlapping computation on those threads to produce the signal representing synthesized human speech.

5. The computer-implemented method of claim 1 wherein the fundamental frequency profile for a phoneme is a set of fundamental frequencies values equally spaced in a time domain across the phoneme duration for the phoneme.

6. The computer-implemented method of claim 1 wherein the phoneme represent phoneme with stresses, when applicable to the phoneme.

7. The computer-implemented method of claim 6 wherein the trained neural network audio synthesis model comprises a conditioner network for biasing every layer with a per-timestep conditioning vector generated from a lower-frequency input signal comprising features obtained at least from the phonemes, including phoneme stresses, and the fundamental frequency profiles.

8. A non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by at least one processor, causes steps for synthesizing human speech from text, comprising: converting written text to a set of phonemes corresponding to the written text using a trained grapheme-to-phoneme model; using either: (1) a trained phoneme duration and fundamental frequency model, or (2) a trained phoneme duration model and a trained fundamental frequency model, to obtain for each phoneme from the set of the phonemes: a phoneme duration; a probability that the phoneme is voiced; and a fundamental frequency profile; and generating a signal representing synthesized human speech of the written text using a trained neural network audio synthesis model that receives the set of phonemes, the phoneme durations, the fundamental frequency profiles, and the probabilities whether the phonemes are voiced.

9. The non-transitory computer-readable medium or media of claim 8 wherein the step of converting convert written text to a set of phonemes corresponding to the written text using a trained grapheme-to-phoneme model comprises: using, for one or more words in the written text, a phoneme dictionary look-up to convert the one or more words to phonemes.

10. The non-transitory computer-readable medium or media of claim 8 further comprising one or more sequences of instructions which, when executed by at least one processor, causes one or more steps comprising: employing one or more computational efficiencies to produce the signal representing synthesized human speech of the written text in real-time or faster than real-time.

11. The non-transitory computer-readable medium or media of claim 10 wherein one of the one or more computation efficiencies comprises the trained neural network audio synthesis model using multiple threads and overlapping computation on those threads to produce the signal representing synthesized human speech.

12. The non-transitory computer-readable medium or media of claim 8 wherein the phoneme represent phoneme with stresses, when applicable to the phoneme.

13. The non-transitory computer-readable medium or media of claim 12 wherein the trained neural network audio synthesis model comprises a conditioner network for one or more layers with a per-timestep conditioning vector generated from a lower-frequency input signal comprising features obtained at least from the phonemes, including phoneme stresses, and the fundamental frequency profiles.

14. A system comprising: one or more processors; and a non-transitory computer-readable medium or media comprising one or more sets of instructions which, when executed by at least one of the one or more processors, causes steps to be performed comprising: using a trained grapheme-to-phoneme model to convert written text to phonemes corresponding to the written text; inputting the phonemes into either: (1) a trained phoneme duration and fundamental frequency model or (2) a trained phoneme duration model and a trained fundamental frequency model, to output for a phoneme: a phoneme duration; a probability that the phoneme is voiced; and a fundamental frequency profile; and using a trained neural network audio synthesis model that receives the phonemes, the phoneme durations, the fundamental frequency profiles for the phonemes, and for each phoneme, a probability whether the phoneme is voiced as an input to generate a signal representing synthesized human speech of the written text.

15. The system of claim 14 wherein the step of using a trained grapheme-to-phoneme model to convert written text to phonemes corresponding to the written text comprises: using, for one or more words in the written text, a phoneme dictionary look-up to convert the one or more words to phonemes.

16. The system of claim 14 further comprising utilizing one or more computational efficiencies to help the system produce the signal representing synthesized human speech of the written text in real-time or faster than real-time.

17. The system of claim 16 wherein one of the one or more computation efficiencies comprises the trained neural network audio synthesis model using multiple threads and overlapping computation on those threads to produce the signal representing synthesized human speech.

18. The system of claim 14 wherein the fundamental frequency profile for a phoneme is a set of fundamental frequencies values equally spaced in a time domain across the phoneme duration for the phoneme.

19. The system of claim 14 wherein the phoneme represent phoneme with stresses, when applicable to the phoneme.

20. The system of claim 19 wherein the trained neural network audio synthesis model comprises a conditioner network for biasing every layer with a per-timestep conditioning vector generated from a lower-frequency input signal comprising features obtained at least from the phonemes, including phoneme stresses, and the fundamental frequency profiles.
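Claims 5 and 18 describe the fundamental frequency profile as values equally spaced in time across the phoneme duration, and claims 7, 13 and 20 describe turning a lower-frequency input into a per-timestep conditioning vector. A minimal sketch of both ideas follows, under stated assumptions: the function names `profile_times` and `upsample_profile` are hypothetical, the spacing is assumed to include both endpoints of the phoneme (the claims do not say whether it does), and plain linear interpolation is swapped in for the learned conditioner network purely to show the change of rate.

```python
from typing import List


def profile_times(duration_s: float, n_values: int) -> List[float]:
    """Timestamps of n_values equally spaced points across one phoneme.

    Assumption: the first point sits at the phoneme start and the last at
    its end; a single value is placed at the midpoint.
    """
    if n_values < 1:
        raise ValueError("n_values must be at least 1")
    if n_values == 1:
        return [duration_s / 2]
    step = duration_s / (n_values - 1)
    return [i * step for i in range(n_values)]


def upsample_profile(f0_hz: List[float], n_steps: int) -> List[float]:
    """Linearly interpolate a coarse F0 profile to n_steps per-timestep values."""
    if not f0_hz or n_steps < 1:
        raise ValueError("need a non-empty profile and n_steps >= 1")
    if len(f0_hz) == 1 or n_steps == 1:
        return [f0_hz[0]] * n_steps
    last = len(f0_hz) - 1
    out = []
    for t in range(n_steps):
        pos = t * last / (n_steps - 1)  # fractional index into the coarse profile
        lo = min(int(pos), last - 1)    # left neighbour, clamped at the final segment
        frac = pos - lo
        out.append(f0_hz[lo] * (1 - frac) + f0_hz[lo + 1] * frac)
    return out
```

For a 0.1 s phoneme with three profile values, `profile_times(0.1, 3)` places them at the start, midpoint and end of the phoneme, and `upsample_profile([100.0, 200.0], 5)` fills in evenly stepped values between 100 Hz and 200 Hz. A real conditioner would learn this mapping jointly with phoneme and stress features rather than interpolate a single signal.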

Assignees

Baidu USA LLC

Inventors

Classifications

G06N3/044
Recurrent networks, e.g. Hopfield networks · CPC title
G06N3/0464
Convolutional networks [CNN, ConvNet] · CPC title
G06N3/0442
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
G06N3/0475
Generative networks · CPC title
G06N3/09
Supervised learning · CPC title

Patent family

Related publications grouped by family.

View patent family 63246423

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11705107B2 cover?: Embodiments of a production-quality text-to-speech (TTS) system constructed from deep neural networks are described. System embodiments comprise five major building blocks: a segmentation model for locating phoneme boundaries, a grapheme-to-phoneme conversion model, a phoneme duration prediction model, a fundamental frequency prediction model, and an audio synthesis model. For embodiments of th…
Who is the assignee on this patent?: Baidu USA LLC