What technology area does this patent fall under?

Primary CPC classification G10L13/033. Mapped technology areas include Physics.

When was this patent published?

Publication date Tue Feb 01 2022 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Systems and methods for neural voice cloning with a few samples

US11238843B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11238843-B2
Application number	US-201816143330-A
Country	US
Kind code	B2
Filing date	Sep 26, 2018
Priority date	Feb 9, 2018
Publication date	Feb 1, 2022
Grant date	Feb 1, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Voice cloning is a highly desired capability for personalized speech interfaces. Neural network-based speech synthesis has been shown to generate high quality speech for a large number of speakers. Neural voice cloning systems that take a few audio samples as input are presented herein. Two approaches, speaker adaptation and speaker encoding, are disclosed. Speaker adaptation embodiments are based on fine-tuning a multi-speaker generative model with a few cloning samples. Speaker encoding embodiments are based on training a separate model to directly infer a new speaker embedding from cloning audios, which is used in or with a multi-speaker generative model. Both approaches achieve good performance in terms of naturalness of the speech and its similarity to original speaker—even with very few cloning audios.

First claim

Opening claim text (preview).

What is claimed is: 1. A computer-implemented method for synthesizing audio from an input text, comprising: given a limited set of one or more audios of a new speaker that was not part of training data used to train a neural multi-speaker generative model, using a neural speaker encoder model comprising a first set of trained model parameters to obtain a speaker embedding for the new speaker given the limited set of one or more audios as an input to the neural speaker encoder model; and using the neural multi-speaker generative model comprising a second set of trained model parameters, the input text, and the speaker embedding for the new speaker generated by the neural speaker encoder model comprising the first set of trained model parameters to generate a synthesized audio representation for the input text in which the synthesized audio includes speech characteristics of the new speaker, wherein the neural multi-speaker generative model comprising the second set of trained parameters was trained using as inputs, for a speaker, (1) a training set of text-audio pairs, in which a text-audio pair comprises a text and a corresponding audio of that text by the speaker, and (2) a speaker embedding corresponding to a speaker identifier for that speaker. 2. The computer-implemented method of claim 1 wherein the first set of trained model parameters for the neural speaker encoder model and the second sets of trained model parameters for the neural multi-speaker generative model were obtain by performing the steps comprising: training the neural multi-speaker generative model, using as inputs, for a speaker, the training set of text-audio pairs and a speaker embedding corresponding to the speaker identifier for that speaker, to obtain the second set of trained model parameters for the neural multi-speaker generative model and to obtain a set of speaker embeddings corresponding to the speaker identifiers; and training the neural speaker encoder model, using a set of audios selected from the training set of text-audio pairs and corresponding speaker embeddings for the speakers of the set of audios from the set of speaker embeddings, to obtain the first set of trained model parameters for the neural speaker encoder model. 3. The computer-implemented method of claim 1 wherein the first set of trained model parameters for the neural speaker encoder model and the second set of trained model parameters for the neural multi-speaker generative model were obtain by performing the steps comprising: training the neural multi-speaker generative model, using as inputs, for a speaker, the training set of text-audio pairs and a speaker embedding corresponding to the speaker identifier for that speaker, to obtain a third set of trained model parameters for the neural multi-speaker generative model and to obtain a set of speaker embeddings corresponding to the speaker identifiers; training the neural speaker encoder model, using a set of audios selected from the training set of text-audio pairs and corresponding speaker embeddings for the speakers of the set of audios from the first set of speaker embeddings, to obtain a fourth set of trained model parameters for the neural speaker encoder model; and performing joint training the neural multi-speaker generative model comprising the third set of trained model parameters and the neural speaker encoder model comprising the fourth set of trained model parameters to adjust at least some of the third and fourth trained model parameters to obtain the first set of trained model parameters for the neural speaker encoder model and the second set of trained model parameters for the neural multi-speaker generative model by comparing synthesized audios generated by the neural multi-speaker generative model using speaker embeddings from the neural speaker encoder model to ground truth audios corresponding to the synthesized audios. 4. The computer-implemented method of claim 3 further comprising, as part of the joint training, adjusting at least some of parameters of the set of speaker embeddings. 5. The computer-implemented method of claim 1 wherein the first set of trained model parameters for the neural speaker encoder model and the second sets of trained model parameters for the neural multi-speaker generative model were obtain by performing the steps comprising: performing joint training of the neural multi-speaker generative model and the neural speaker encoder model to obtain the first set of trained model parameters for the neural speaker encoder model and the second set of trained model parameters for the neural multi-speaker generative model by comparing synthesized audios generated by the neural multi-speaker generative model using speaker embeddings from the neural speaker encoder model to ground truth audios corresponding to the synthesized audios. 6. The computer-implemented method of claim 1 wherein the neural speaker encoder model comprises a neural network architecture comprising: a spectral processing network component that computes a spectral audio representation for input audio and passes the spectral audio representation to a prenet component comprising one or more fully-connected layers with one or more non-linearity units for feature transformation; a temporal processing network component in which temporal contexts are incorporated using a plurality of convolutional layers with gated linear unit and residual connections; and a cloning sample attention network component comprising a multi-head self-attention mechanism that determines weights for different audios and obtains aggregated speaker embeddings. 7. A generative text-to-speech system comprising: one or more processors; and a non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by at least one of the one or more processors, causes steps to be performed comprising: given a limited set of one or more audios of a new speaker that was not part of training data used to train a neural multi-speaker generative model, using a speaker encoder model comprising a first set of trained model parameters to obtain a speaker embedding for the new speaker given the limited set of one or more audios as an input to the speaker encoder model; and using the neural multi-speaker generative model comprising a second set of trained model parameters, an input text, and the speaker embedding for the new speaker generated by the speaker encoder model comprising the first set of trained model parameters to generate a synthesized audio representation for the input text in which the synthesized audio includes speech characteristics of the new speaker, wherein the neural multi-speaker generative model comprising the second set of trained parameters was trained using as inputs, for a speaker, (1) a training set of text-audio pairs, in which a text-audio pair comprises a text and a corresponding audio of that text by the speaker, and (2) a speaker embedding corresponding to a speaker identifier for that speaker. 8. The generative text-to-speech system of claim 7 wherein the first set of trained model parameters for the speaker encoder model and the second sets of trained model parameters for the neural multi-speaker generative model were obtain by performing the steps comprising: training the neural multi-speaker generative model, using as inputs, for a speaker, the training set of text-audio pairs and a speaker embedding corresponding to the speaker identifier for that speaker, to obtain the second set of trained model parameters for the neural multi-speaker generative model and to obtain a set of speaker embeddings corresponding to the speaker identifiers; and training the speaker encoder model, using a set of audios se

Assignees

Baidu Usa Llc

Inventors

Classifications

G10L13/08
Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination · CPC title
G10L25/60
for measuring the quality of voice signals · CPC title
G10L13/027
Concept to speech synthesisers; Generation of natural phrases from machine-based concepts (generation of parameters for speech synthesis out of text G10L13/08) · CPC title
G10L13/033Primary
Voice editing, e.g. manipulating the voice of the synthesiser · CPC title
G10L13/047Primary
Architecture of speech synthesisers · CPC title

Patent family

Related publications grouped by family.

View patent family 67541088

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11238843B2 cover?: Voice cloning is a highly desired capability for personalized speech interfaces. Neural network-based speech synthesis has been shown to generate high quality speech for a large number of speakers. Neural voice cloning systems that take a few audio samples as input are presented herein. Two approaches, speaker adaptation and speaker encoding, are disclosed. Speaker adaptation embodiments are ba…
Who is the assignee on this patent?: Baidu Usa Llc
What technology area does this patent fall under?: Primary CPC classification G10L13/033. Mapped technology areas include Physics.
When was this patent published?: Publication date Tue Feb 01 2022 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).