What technology area does this patent fall under?

Primary CPC classification G10L13/00. Mapped technology areas include Physics.

When was this patent published?

Publication date: Dec 22, 2020 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Systems and methods for parallel wave generation in end-to-end text-to-speech

US10872596B2 · US · B2

Patent metadata
Field	Value
Publication number	US-10872596-B2
Application number	US-201916277919-A
Country	US
Kind code	B2
Filing date	Feb 15, 2019
Priority date	Oct 19, 2017
Publication date	Dec 22, 2020
Grant date	Dec 22, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Described herein are embodiments of an end-to-end text-to-speech (TTS) system with parallel wave generation. In one or more embodiments, a Gaussian inverse autoregressive flow is distilled from an autoregressive WaveNet by minimizing a novel regularized Kullback-Leibler (KL) divergence between their highly-peaked output distributions. Embodiments of the methodology computes the KL divergence in a closed-form, which simplifies the training process and provides very efficient distillation. Embodiments of a novel text-to-wave neural architecture for speech synthesis are also described, which are fully convolutional and enable fast end-to-end training from scratch. These embodiments significantly outperform the previous pipeline that connects a text-to-spectrogram model to a separately trained WaveNet. Also, a parallel waveform synthesizer embodiment conditioned on the hidden representation in an embodiment of this end-to-end model were successfully distilled.
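The abstract says the KL divergence between the two Gaussian output distributions is computed in closed form. As a rough illustration only, the sketch below evaluates the standard closed-form KL divergence between two univariate Gaussians. The function name `gaussian_kl` and its parameter names are hypothetical, and the patent's regularization term is not specified on this page, so it is omitted here. The direction KL(q || p), with q as the distilled model and p as the autoregressive model, is an assumption based on the "reverse KL divergence" wording in the claims.

```python
import math


def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(q || p) for q = N(mu_q, sigma_q^2) and p = N(mu_p, sigma_p^2).

    Hypothetical helper for illustration; not taken from the patent text.
    """
    if sigma_q <= 0 or sigma_p <= 0:
        raise ValueError("standard deviations must be positive")
    return (
        math.log(sigma_p / sigma_q)
        + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
        - 0.5
    )


# Identical distributions have zero divergence.
print(gaussian_kl(0.0, 1.0, 0.0, 1.0))  # 0.0

# Two narrow ("highly peaked") Gaussians with the same mean but different widths.
print(gaussian_kl(0.0, 1e-3, 0.0, 1e-2))
```

Because the expression depends only on the means and standard deviations, no sampling is needed to evaluate it, which is one plausible reading of why a closed form would simplify training.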

First claim

Opening claim text (preview).

What is claimed is:

1. A text-to-speech system comprising: one or more processors; and a non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by at least one of the one or more processors, causes steps to be performed comprising: converting, using an encoder, textual features of an input text into encoder hidden representations; decoding, using a decoder, the encoder hidden representations with attention into frame-level hidden representations in an autoregressive manner; processing, using a convolutional processing block, the frame-level hidden representations into sample-level hidden representations; and synthesizing, using a non-autoregressive distilled vocoder distilled from an autoregressive vocoder, waveforms corresponding to the input text conditioned on the sample-level hidden representations.
2. The text-to-speech system of claim 1 wherein the non-autoregressive distilled vocoder is an inverse autoregressive flow (IAF) trained using at least a regularized Kullback-Leibler (KL) divergence between output distributions of the autoregressive vocoder and the non-autoregressive distilled vocoder, the regularized KL divergence incorporates a regularization in addition to a reverse KL divergence to stabilize the training process.
3. The text-to-speech system of claim 2 wherein the output distribution of the autoregressive vocoder is a single Gaussian distribution.
4. The text-to-speech system of claim 2 wherein the regularized KL divergence is obtained in a closed-form.
5. The text-to-speech system of claim 2 wherein the non-autoregressive distilled vocoder is trained using a frame-level loss between output from the non-autoregressive distilled vocoder and corresponding ground-truth audio, in combination with the regularized KL divergence.
6. The text-to-speech system of claim 5 wherein the frame-level loss is a spectrogram frame loss.
7. The text-to-speech system of claim 1 wherein the encoder, the decoder, and the convolutional processing block are pre-trained with the autoregressive vocoder to fix the parameters of the encoder, the decoder, and the convolutional processing block.
8. The text-to-speech system of claim 1 wherein the convolutional processing block is non-causal and enabled to apply non-causal convolution to utilize future temporal information.
9. A computer-implemented method for training an end-to-end text-to-speech system to synthesize speech from an input text, comprising: encoding, via an encoder comprising one or more convolutional blocks, the input text into encoder hidden representations including key representations and value representations; autoregressively decoding, using a decoder, the encoder hidden representations with attention into frame-level hidden representations; processing, using a convolutional processing block, the frame-level hidden representations into sample-level hidden representations; generating, using an autoregressive vocoder, synthesized waveforms corresponding to the input text conditioned on the sample-level hidden representations; and distilling the autoregressive vocoder to obtain a distilled parallel vocoder based on Gaussian inverse autoregressive flow (IAF) using at least a regularized Kullback-Leibler (KL) divergence between output distributions of the autoregressive vocoder and the distilled parallel vocoder.
10. The computer-implemented method of claim 9 wherein the encoder, the decoder, and the convolutional processing block are pre-trained and have parameters fixed during the distillation.
11. The computer-implemented method of claim 9 wherein the output distribution of the autoregressive vocoder is a single Gaussian distribution.
12. The computer-implemented method of claim 11 wherein the regularized KL divergence is obtained in a closed-form.
13. The computer-implemented method of claim 9 wherein the distilled parallel vocoder is trained using a frame-level loss between output from the distilled vocoder and corresponding ground-truth audio, in combination with the regularized KL divergence.
14. The computer-implemented method of claim 9 wherein the distilled parallel vocoder is a non-autoregressive model.
15. The computer-implemented method of claim 9 wherein the convolutional processing block is non-causal and enabled to apply non-causal convolution to utilize future temporal information.
16. A computer-implemented method for training a text-to-speech system to synthesize speech from ground-truth spectrograms, comprising: receiving, at an autoregressive vocoder, ground-truth spectrograms for waveform synthesizing; receiving, at a parallel vocoder distilled from the autoregressive vocoder based on Gaussian inverse autoregressive flow (IAF), the ground-truth spectrograms for waveform synthesizing; and training the parallel vocoder using a loss function having a linear combination of a frame-level loss and a regularized Kullback-Leibler (KL) divergence between waveform distributions of the autoregressive vocoder and the parallel vocoder.
17. The computer-implemented method of claim 16 wherein the regularized KL divergence incorporates a regularization in addition to a reverse KL divergence to stabilize the training process.
18. The computer-implemented method of claim 16 wherein the output distribution of the autoregressive vocoder is a single Gaussian distribution.
19. The computer-implemented method of claim 16 wherein the regularized KL divergence is obtained in a closed-form.
20. The computer-implemented method of claim 16 wherein the frame-level loss is a spectrogram frame loss obtained, using a ground-truth data set, from difference between output of the parallel vocoder and corresponding ground-truth audio.
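Claims 16 and 20 describe a training loss that linearly combines a spectrogram frame loss with the regularized KL divergence. The sketch below shows one way such a combination could look. Everything in it is an assumption made for illustration: the function names, the frame length and hop size, the Hann window, the mean-absolute-difference form of the frame loss, and the two weights. The claims do not state any of these details, and the KL term is passed in as a precomputed number rather than derived here.

```python
import numpy as np


def magnitude_frames(wave, frame_len=256, hop=64):
    """Return FFT magnitudes of overlapping, Hann-windowed frames (shape: frames x bins)."""
    wave = np.asarray(wave, dtype=np.float64)
    if len(wave) < frame_len:
        raise ValueError("waveform is shorter than one frame")
    n_frames = 1 + (len(wave) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack(
        [wave[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def spectrogram_frame_loss(generated, ground_truth, frame_len=256, hop=64):
    """Mean absolute difference between magnitude frames (assumed form of the frame loss)."""
    gen = magnitude_frames(generated, frame_len, hop)
    ref = magnitude_frames(ground_truth, frame_len, hop)
    return float(np.mean(np.abs(gen - ref)))


def combined_loss(kl_term, frame_term, kl_weight=1.0, frame_weight=1.0):
    """Linear combination of the two terms; the weights are hypothetical."""
    return kl_weight * kl_term + frame_weight * frame_term


# Toy usage: a sine wave as "ground truth" and a quieter copy as the "generated" audio.
target = np.sin(2.0 * np.pi * 440.0 * np.arange(1024) / 16000.0)
generated = 0.9 * target
frame_term = spectrogram_frame_loss(generated, target)
print(combined_loss(kl_term=0.25, frame_term=frame_term))
```

The frame loss compares audio at the level of short spectral frames rather than individual samples, which matches the "frame-level" wording of the claims, although the exact spectral representation used in the patent is not given on this page.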

Assignees

Baidu USA LLC

Inventors

Classifications

G10L13/00 · Primary
Speech synthesis; Text to speech systems · CPC title
G06F9/30003
Arrangements for executing specific machine instructions · CPC title
G10L13/08 · Primary
Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination · CPC title

Patent family

Related publications grouped by family.

View patent family 66697137

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10872596B2 cover?: Described herein are embodiments of an end-to-end text-to-speech (TTS) system with parallel wave generation. In one or more embodiments, a Gaussian inverse autoregressive flow is distilled from an autoregressive WaveNet by minimizing a novel regularized Kullback-Leibler (KL) divergence between their highly-peaked output distributions. Embodiments of the methodology computes the KL divergence in…
Who is the assignee on this patent?: Baidu USA LLC

Related patents

Artificial intelligence-based text-to-speech system and method

Word generation for speech recognition

Deployed end-to-end speech recognition

Active learning for lexical annotations

Text-to-speech with emotional content

Method and system for efficient spoken term detection using confusion networks

Voice font speaker and prosody interpolation
