Systems and methods for neural voice cloning with a few samples

US11238843B2 · US · B2

Patent metadata
Publication number: US-11238843-B2
Application number: US-201816143330-A
Country: US
Kind code: B2
Filing date: Sep 26, 2018
Priority date: Feb 9, 2018
Publication date: Feb 1, 2022
Grant date: Feb 1, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Voice cloning is a highly desired capability for personalized speech interfaces. Neural network-based speech synthesis has been shown to generate high quality speech for a large number of speakers. Neural voice cloning systems that take a few audio samples as input are presented herein. Two approaches, speaker adaptation and speaker encoding, are disclosed. Speaker adaptation embodiments are based on fine-tuning a multi-speaker generative model with a few cloning samples. Speaker encoding embodiments are based on training a separate model to directly infer a new speaker embedding from cloning audios, which is used in or with a multi-speaker generative model. Both approaches achieve good performance in terms of naturalness of the speech and its similarity to original speaker—even with very few cloning audios.
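The difference between the two approaches named in the abstract can be illustrated with a toy sketch. Everything here is hypothetical and not taken from the patent: the "generative model" is a one-line function that shifts text features by a scalar speaker embedding, `adapt_embedding` fine-tunes only the embedding (a simplification; the abstract speaks of fine-tuning the generative model), and `fit_encoder` stands in for a separately trained speaker encoder by fitting a linear map from audio statistics to embeddings.

```python
from statistics import mean

def synthesize(text_feats, emb):
    """Toy stand-in for a multi-speaker generative model:
    one audio frame per text feature, shifted by the speaker embedding."""
    return [t + emb for t in text_feats]

def adapt_embedding(text_feats, audio, steps=200, lr=0.1):
    """Speaker adaptation (embedding-only variant, an assumption):
    gradient descent on the mean squared error over the cloning samples."""
    emb = 0.0
    n = len(audio)
    for _ in range(steps):
        pred = synthesize(text_feats, emb)
        grad = sum(2.0 * (p - a) for p, a in zip(pred, audio)) / n
        emb -= lr * grad
    return emb

def fit_encoder(train_audios, train_embs):
    """Speaker encoding: fit emb ~ a * mean(audio) + b by least squares
    over the training speakers, then infer embeddings with no
    gradient steps at cloning time."""
    xs = [mean(audio) for audio in train_audios]
    mx, my = mean(xs), mean(train_embs)
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, train_embs))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return lambda audio: a * mean(audio) + b

# Training speakers with known embeddings (hypothetical values).
text = [0.0, 1.0, 2.0]
train_embs = [-1.0, 0.5, 2.0]
train_audios = [synthesize(text, e) for e in train_embs]
encode = fit_encoder(train_audios, train_embs)

# A new speaker that was not part of the training data.
new_audio = synthesize(text, 1.25)
emb_adapt = adapt_embedding(text, new_audio)   # iterative fine-tuning
emb_enc = encode(new_audio)                    # single forward pass
```

In this toy setting both routes recover the same embedding; the practical trade-off the sketch hints at is that adaptation needs optimization steps per new speaker, while encoding needs only one pass through an already trained encoder.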

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for synthesizing audio from an input text, comprising: given a limited set of one or more audios of a new speaker that was not part of training data used to train a neural multi-speaker generative model, using a neural speaker encoder model comprising a first set of trained model parameters to obtain a speaker embedding for the new speaker given the limited set of one or more audios as an input to the neural speaker encoder model; and using the neural multi-speaker generative model comprising a second set of trained model parameters, the input text, and the speaker embedding for the new speaker generated by the neural speaker encoder model comprising the first set of trained model parameters to generate a synthesized audio representation for the input text in which the synthesized audio includes speech characteristics of the new speaker, wherein the neural multi-speaker generative model comprising the second set of trained parameters was trained using as inputs, for a speaker, (1) a training set of text-audio pairs, in which a text-audio pair comprises a text and a corresponding audio of that text by the speaker, and (2) a speaker embedding corresponding to a speaker identifier for that speaker.

2. The computer-implemented method of claim 1 wherein the first set of trained model parameters for the neural speaker encoder model and the second sets of trained model parameters for the neural multi-speaker generative model were obtain by performing the steps comprising: training the neural multi-speaker generative model, using as inputs, for a speaker, the training set of text-audio pairs and a speaker embedding corresponding to the speaker identifier for that speaker, to obtain the second set of trained model parameters for the neural multi-speaker generative model and to obtain a set of speaker embeddings corresponding to the speaker identifiers; and training the neural speaker encoder model, using a set of audios selected from the training set of text-audio pairs and corresponding speaker embeddings for the speakers of the set of audios from the set of speaker embeddings, to obtain the first set of trained model parameters for the neural speaker encoder model.

3. The computer-implemented method of claim 1 wherein the first set of trained model parameters for the neural speaker encoder model and the second set of trained model parameters for the neural multi-speaker generative model were obtain by performing the steps comprising: training the neural multi-speaker generative model, using as inputs, for a speaker, the training set of text-audio pairs and a speaker embedding corresponding to the speaker identifier for that speaker, to obtain a third set of trained model parameters for the neural multi-speaker generative model and to obtain a set of speaker embeddings corresponding to the speaker identifiers; training the neural speaker encoder model, using a set of audios selected from the training set of text-audio pairs and corresponding speaker embeddings for the speakers of the set of audios from the first set of speaker embeddings, to obtain a fourth set of trained model parameters for the neural speaker encoder model; and performing joint training the neural multi-speaker generative model comprising the third set of trained model parameters and the neural speaker encoder model comprising the fourth set of trained model parameters to adjust at least some of the third and fourth trained model parameters to obtain the first set of trained model parameters for the neural speaker encoder model and the second set of trained model parameters for the neural multi-speaker generative model by comparing synthesized audios generated by the neural multi-speaker generative model using speaker embeddings from the neural speaker encoder model to ground truth audios corresponding to the synthesized audios.

4. The computer-implemented method of claim 3 further comprising, as part of the joint training, adjusting at least some of parameters of the set of speaker embeddings.

5. The computer-implemented method of claim 1 wherein the first set of trained model parameters for the neural speaker encoder model and the second sets of trained model parameters for the neural multi-speaker generative model were obtain by performing the steps comprising: performing joint training of the neural multi-speaker generative model and the neural speaker encoder model to obtain the first set of trained model parameters for the neural speaker encoder model and the second set of trained model parameters for the neural multi-speaker generative model by comparing synthesized audios generated by the neural multi-speaker generative model using speaker embeddings from the neural speaker encoder model to ground truth audios corresponding to the synthesized audios.

6. The computer-implemented method of claim 1 wherein the neural speaker encoder model comprises a neural network architecture comprising: a spectral processing network component that computes a spectral audio representation for input audio and passes the spectral audio representation to a prenet component comprising one or more fully-connected layers with one or more non-linearity units for feature transformation; a temporal processing network component in which temporal contexts are incorporated using a plurality of convolutional layers with gated linear unit and residual connections; and a cloning sample attention network component comprising a multi-head self-attention mechanism that determines weights for different audios and obtains aggregated speaker embeddings.

7. A generative text-to-speech system comprising: one or more processors; and a non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by at least one of the one or more processors, causes steps to be performed comprising: given a limited set of one or more audios of a new speaker that was not part of training data used to train a neural multi-speaker generative model, using a speaker encoder model comprising a first set of trained model parameters to obtain a speaker embedding for the new speaker given the limited set of one or more audios as an input to the speaker encoder model; and using the neural multi-speaker generative model comprising a second set of trained model parameters, an input text, and the speaker embedding for the new speaker generated by the speaker encoder model comprising the first set of trained model parameters to generate a synthesized audio representation for the input text in which the synthesized audio includes speech characteristics of the new speaker, wherein the neural multi-speaker generative model comprising the second set of trained parameters was trained using as inputs, for a speaker, (1) a training set of text-audio pairs, in which a text-audio pair comprises a text and a corresponding audio of that text by the speaker, and (2) a speaker embedding corresponding to a speaker identifier for that speaker.

8. The generative text-to-speech system of claim 7 wherein the first set of trained model parameters for the speaker encoder model and the second sets of trained model parameters for the neural multi-speaker generative model were obtain by performing the steps comprising: training the neural multi-speaker generative model, using as inputs, for a speaker, the training set of text-audio pairs and a speaker embedding corresponding to the speaker identifier for that speaker, to obtain the second set of trained model parameters for the neural multi-speaker generative model and to obtain a set of speaker embeddings corresponding to the speaker identifiers; and training the speaker encoder model, using a set of audios se…
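Two of the encoder components recited in claim 6 can be sketched in plain Python to make them concrete. This is an illustrative simplification under stated assumptions, not the patented implementation: the spectral processing and prenet stages are omitted, the temporal block works on a single scalar feature per frame with a width-3 kernel, and the attention heads use slices of the embedding directly instead of learned projections. The function names and all numeric values are hypothetical.

```python
import math

def glu_conv_residual(frames, w_a, w_b):
    """One temporal block: width-3 convolution over time, gated linear
    unit (a * sigmoid(b)), plus a residual connection."""
    padded = [0.0] + list(frames) + [0.0]   # zero-pad to keep the length
    out = []
    for i in range(len(frames)):
        window = padded[i:i + 3]
        a = sum(w * x for w, x in zip(w_a, window))   # linear path
        b = sum(w * x for w, x in zip(w_b, window))   # gate path
        gate = 1.0 / (1.0 + math.exp(-b))
        out.append(frames[i] + a * gate)              # residual add
    return out

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_aggregate(sample_embs, n_heads=2):
    """Self-attention across cloning samples. Each head attends over its
    own slice of the embedding (no learned projections, an assumption).
    Returns per-sample weights and the weighted-sum speaker embedding.
    The embedding size must be divisible by n_heads."""
    n, d = len(sample_embs), len(sample_embs[0])
    hd = d // n_heads
    weights = [0.0] * n
    for h in range(n_heads):
        heads = [e[h * hd:(h + 1) * hd] for e in sample_embs]
        for q in heads:
            scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(hd)
                      for k in heads]
            # Average the attention each sample receives over queries and heads.
            for j, p in enumerate(softmax(scores)):
                weights[j] += p / (n * n_heads)
    agg = [sum(w * e[i] for w, e in zip(weights, sample_embs))
           for i in range(d)]
    return weights, agg

# Hypothetical per-sample embeddings for three cloning audios.
embs = [[1.0, 0.0, 0.5, 0.5],
        [0.9, 0.1, 0.4, 0.6],
        [0.0, 1.0, 0.5, 0.5]]
weights, speaker_emb = attention_aggregate(embs)
smoothed = glu_conv_residual([0.2, 0.4, 0.1], [0.1, 0.5, 0.1], [0.0, 1.0, 0.0])
```

The weights form a distribution over the cloning audios, so the aggregated embedding is a convex combination of the per-sample embeddings; a trained encoder would learn the projections and kernels that this sketch fixes by hand.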

Assignees

Baidu USA LLC

Inventors

Classifications

  • Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination · CPC title

  • for measuring the quality of voice signals · CPC title

  • Concept to speech synthesisers; Generation of natural phrases from machine-based concepts (generation of parameters for speech synthesis out of text G10L13/08) · CPC title

  • G10L13/033 (Primary)

    Voice editing, e.g. manipulating the voice of the synthesiser · CPC title

  • G10L13/047 (Primary)

    Architecture of speech synthesisers · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11238843B2 cover?
It covers neural voice cloning systems that synthesize speech in a new speaker's voice from a few audio samples. Two approaches are disclosed: speaker adaptation, which fine-tunes a multi-speaker generative model with the cloning samples, and speaker encoding, which trains a separate model to infer a new speaker embedding directly from the cloning audios.
Who is the assignee on this patent?
Baidu USA LLC
What technology area does this patent fall under?
Primary CPC classification G10L13/033. Mapped technology areas include Physics.
When was this patent published?
Publication date: Feb 1, 2022 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).