Two-level speech prosody transfer

US11514888B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11514888-B2
Application number: US-202016992410-A
Country: US
Kind code: B2
Filing date: Aug 13, 2020
Priority date: Aug 13, 2020
Publication date: Nov 29, 2022
Grant date: Nov 29, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method includes receiving an input text utterance to be synthesized into expressive speech having an intended prosody and a target voice and generating, using a first text-to-speech (TTS) model, an intermediate synthesized speech representation for the input text utterance. The intermediate synthesized speech representation possesses the intended prosody. The method also includes providing the intermediate synthesized speech representation to a second TTS model that includes an encoder portion and a decoder portion. The encoder portion is configured to encode the intermediate synthesized speech representation into an utterance embedding that specifies the intended prosody. The decoder portion is configured to process the input text utterance and the utterance embedding to generate an output audio signal of expressive speech that has the intended prosody specified by the utterance embedding and speaker characteristics of the target voice.
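
The data flow described in the abstract can be pictured with a toy sketch. Everything below is a hypothetical stand-in rather than the patented models: `first_tts`, `encode`, `decode`, and `synthesize` are invented names, each frame is assumed to hold just a made-up pitch and energy value, and mean pooling is assumed only to show how a variable-length input can become a fixed-length utterance embedding.

```python
from typing import List

Frame = List[float]  # one acoustic frame; assumed here to be [pitch, energy]


def first_tts(text: str) -> List[Frame]:
    """Hypothetical first TTS model: emits an intermediate representation
    (one toy frame per character) that carries the intended prosody."""
    frames = []
    for ch in text:
        pitch = 100.0 + (ord(ch) % 32)          # made-up pitch value
        energy = 1.0 if ch.isalpha() else 0.2   # made-up energy value
        frames.append([pitch, energy])
    return frames


def encode(frames: List[Frame]) -> List[float]:
    """Hypothetical encoder portion: pools the intermediate frames into a
    fixed-length utterance embedding (mean pooling is an assumption)."""
    if not frames:
        raise ValueError("cannot encode an empty representation")
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def decode(text: str, utterance_embedding: List[float],
           speaker_embedding: List[float]) -> List[Frame]:
    """Hypothetical decoder portion: conditions every output frame on the
    prosody embedding and on the target speaker embedding."""
    return [utterance_embedding + speaker_embedding for _ in text]


def synthesize(text: str, speaker_embedding: List[float]) -> List[Frame]:
    """Two-level flow: first model -> encoder -> decoder."""
    intermediate = first_tts(text)               # has the intended prosody
    embedding = encode(intermediate)             # specifies that prosody
    return decode(text, embedding, speaker_embedding)


output = synthesize("hi!", speaker_embedding=[0.5])
```

The point of the split is visible in `synthesize`: the intermediate output is only used as a prosody reference, so its voice quality does not reach the final signal, which takes its speaker characteristics from `speaker_embedding`.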

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: receiving, at data processing hardware, an input text utterance to be synthesized into expressive speech having an intended prosody and a target voice; generating, by the data processing hardware, using a first text-to-speech (TTS) model, an intermediate synthesized speech representation for the input text utterance, the intermediate synthesized speech representation possessing the intended prosody; and providing, by the data processing hardware, the intermediate synthesized speech representation to a second TTS model, the second TTS model comprising: an encoder portion configured to encode the intermediate synthesized speech representation into an utterance embedding that specifies the intended prosody; and a decoder portion configured to process the input text utterance and the utterance embedding to generate an output audio signal of expressive speech, the output audio signal having the intended prosody specified by the utterance embedding and speaker characteristics of the target voice.

2. The method of claim 1, further comprising: sampling, by the data processing hardware, from the intermediate synthesized speech representation, a sequence of fixed-length reference frames providing prosodic features that represent the intended prosody possessed by the intermediate synthesized speech representation, wherein providing the intermediate synthesized speech representation to the second TTS model comprises providing the sequence of fixed-length reference frames sampled from the intermediate synthesized speech representation to the encoder portion, the encoder portion configured to encode the sequence of fixed-length reference frames into the utterance embedding.

3. The method of claim 2, wherein the prosodic features that represent the intended prosody possessed by the intermediate synthesized speech representation comprise one or more of duration, pitch contour, energy contour, or mel-frequency spectrogram contour.

4. The method of claim 2, wherein the encoder portion is configured to encode the sequence of fixed-length reference frames into the utterance embedding by, for each syllable in the intermediate synthesized speech representation: encoding phoneme-level linguistic features associated with each phoneme in the syllable into a phoneme feature-based syllable embedding; encoding the fixed-length reference frames associated with the syllable into a frame-based syllable embedding, the frame-based syllable embedding indicative of one or more of a duration, pitch, or energy associated with the corresponding syllable; and encoding, into a corresponding prosodic syllable embedding for the syllable, the phoneme feature-based and the frame-based syllable embedding with syllable-level linguistic features associated with the syllable, sentence-level linguistic features associated with the intermediate synthesized speech representation, and word-level linguistic features associated with a word that includes the corresponding syllable.

5. The method of claim 4, wherein the word-level linguistic features comprise a wordpiece embedding obtained from a sequence of wordpiece embeddings generated by a Bidirectional Encoder Representations from Transformers (BERT) model from the input text utterance.

6. The method of claim 2, wherein the decoder portion is configured to process the input text utterance and the utterance embedding to generate the output audio signal by decoding, using the input text utterance, the corresponding utterance embedding into a sequence of fixed-length predicted frames providing a prosodic representation of the input text utterance, the prosodic representation representing the intended prosody specified by the utterance embedding.

7. The method of claim 6, wherein the second TTS model is trained so that a number of the fixed-length predicted frames decoded by the decoder portion is equal to a number of the fixed-length reference frames sampled from the intermediate synthesized speech representation.

8. The method of claim 1, wherein the utterance embedding comprises a fixed-length numerical vector.

9. The method of claim 1, wherein: the intermediate synthesized speech representation comprises an audio waveform or a sequence of mel-frequency spectrograms that captures the intended prosody; and providing the intermediate synthesized speech representation to the second TTS model comprises providing the audio waveform or the sequence of mel-frequency spectrograms to the encoder portion, the encoder portion configured to encode the audio waveform or the sequence of mel-frequency spectrograms into the utterance embedding.

10. The method of claim 1, further comprising: obtaining, by the data processing hardware, a speaker embedding representing the speaker characteristics of the target voice; and providing, by the data processing hardware, the speaker embedding to the decoder portion of the second TTS model, the decoder portion configured to process the input text utterance, the utterance embedding, and the speaker embedding to generate the output audio signal of expressive speech.

11. The method of claim 1, wherein the intermediate synthesized speech representation generated using the first TTS model comprises an intermediate voice that lacks the speaker characteristics of the target voice and comprises one or more undesirable acoustic artifacts.

12. The method of claim 1, further comprising: receiving, at the data processing hardware, training data including a plurality of training audio signals and corresponding transcripts, each training audio signal comprising an utterance of human speech having the intended prosody spoken by a corresponding speaker in a prosodic domain/vertical associated with the intended prosody, each transcript comprising a textual representation of the corresponding training audio signal; and for each corresponding transcript of the training data: training, by the data prosody hardware, the first TTS model to generate a corresponding reference audio signal comprising a training synthesized speech representation that captures the intended prosody of the corresponding utterance of human speech; training, by the data processing hardware, the encoder portion of the second TTS model by encoding the corresponding training synthesized speech representation into a corresponding utterance embedding representing the intended prosody captured by the training synthesized speech representation; training, by the data processing hardware, using the corresponding transcript of the training data, the decoder portion of the second TTS model by decoding the corresponding utterance embedding encoded by the encoder portion into a predicted output audio signal of expressive speech having the intended prosody; generating gradients/losses between the predicted output audio signal and the corresponding reference audio signal; and back-propagating the gradients/losses through the second TTS model.

13. The method of claim 1, wherein the first TTS model and the second TTS model are trained separately.

14. The method of claim 1, wherein the first TTS model includes a first neural network architecture and the second TTS model includes a second neural network architecture that is different than the first neural network architecture.

15. The method of claim 1, wherein the first TTS model and the second TTS model include a same neural network architecture.

16. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions t
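
Claims 2, 3, 7, and 12 refer to fixed-length reference frames, an energy contour, matching frame counts, and a loss between predicted and reference signals. The sketch below illustrates those ideas under stated assumptions: the function names are invented, frames are taken as non-overlapping slices with zero padding, energy is taken as mean squared amplitude, and mean absolute error is a placeholder loss, since the claim text shown here does not name a specific loss or framing scheme.

```python
from typing import List

Frame = List[float]


def sample_reference_frames(signal: List[float], frame_len: int) -> List[Frame]:
    """Split a signal into non-overlapping fixed-length frames.
    The last frame is zero-padded (an assumption, not from the claims)."""
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    frames = []
    for start in range(0, len(signal), frame_len):
        frame = signal[start:start + frame_len]
        frame = frame + [0.0] * (frame_len - len(frame))
        frames.append(frame)
    return frames


def energy_contour(frames: List[Frame]) -> List[float]:
    """One energy value per frame, taken here as mean squared amplitude."""
    return [sum(x * x for x in frame) / len(frame) for frame in frames]


def frame_loss(predicted: List[Frame], reference: List[Frame]) -> float:
    """Mean absolute error between predicted and reference frames.
    Requires equal frame counts, mirroring the condition in claim 7."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("predicted and reference frame counts must match")
    total, count = 0.0, 0
    for p_frame, r_frame in zip(predicted, reference):
        for p, r in zip(p_frame, r_frame):
            total += abs(p - r)
            count += 1
    return total / count


reference = sample_reference_frames([1.0, 2.0, 3.0, 4.0, 5.0], frame_len=2)
contour = energy_contour(reference)
loss = frame_loss(reference, reference)  # identical inputs give zero loss
```

In a real system the loss would be back-propagated through the second TTS model; that step is omitted here because it depends on the model architecture, which this page does not describe.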

Assignees

Google LLC

Inventors

Classifications

  • Probabilistic or stochastic networks · CPC title

  • Combinations of networks · CPC title

  • Recurrent networks, e.g. Hopfield networks · CPC title

  • Non-supervised learning, e.g. competitive learning · CPC title

  • G10L13/047 · Primary

    Architecture of speech synthesisers · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11514888B2 cover?
A method includes receiving an input text utterance to be synthesized into expressive speech having an intended prosody and a target voice and generating, using a first text-to-speech (TTS) model, an intermediate synthesized speech representation for the input text utterance. The intermediate synthesized speech representation possesses the intended prosody. The method also includes providing th…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L13/047. Mapped technology areas include Physics.
When was this patent published?
Publication date Nov 29, 2022 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).