Synthesis of speech from text in a voice of a target speaker using neural networks

US2025095630A1 · US · A1

Patent metadata
Publication number: US-2025095630-A1
Application number: US-202418966088-A
Country: US
Kind code: A1
Filing date: Dec 2, 2024
Priority date: May 17, 2018
Publication date: Mar 20, 2025
Grant date: (none listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech synthesis. The methods, systems, and apparatus include actions of obtaining an audio representation of speech of a target speaker, obtaining input text for which speech is to be synthesized in a voice of the target speaker, generating a speaker vector by providing the audio representation to a speaker encoder engine that is trained to distinguish speakers from one another, generating an audio representation of the input text spoken in the voice of the target speaker by providing the input text and the speaker vector to a spectrogram generation engine that is trained using voices of reference speakers to generate audio representations, and providing the audio representation of the input text spoken in the voice of the target speaker for output.
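The data flow in the abstract can be sketched in a few lines. This is a minimal illustration only, not the patent's implementation: the function names, the toy feature math, and the three-value frames are all assumptions, and simple arithmetic stands in for the trained neural networks. It shows only the order of operations: reference audio goes to a speaker encoder, which yields a speaker vector; the text and that vector go to a spectrogram generator, which yields the output audio representation.

```python
import math
from typing import List

Frame = List[float]


def speaker_encoder(audio_frames: List[Frame]) -> List[float]:
    """Hypothetical stand-in for the trained speaker encoder engine.

    Averages the reference frames and L2-normalizes the result, so any
    length of reference audio maps to one fixed-size speaker vector.
    """
    dim = len(audio_frames[0])
    mean = [sum(f[i] for f in audio_frames) / len(audio_frames) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0
    return [x / norm for x in mean]


def spectrogram_generator(text: str, speaker_vector: List[float]) -> List[Frame]:
    """Hypothetical stand-in for the spectrogram generation engine.

    Emits one toy frame per input character, conditioned on the speaker
    vector by simple addition.
    """
    frames = []
    for ch in text:
        char_feature = (ord(ch) % 32) / 32.0
        frames.append([char_feature + s for s in speaker_vector])
    return frames


def synthesize(reference_audio: List[Frame], text: str) -> List[Frame]:
    """Chain the two stages in the order the abstract describes."""
    vector = speaker_encoder(reference_audio)
    return spectrogram_generator(text, vector)


# Toy reference audio for a target speaker (two frames, three values each).
reference_audio = [[0.1, 0.4, 0.2], [0.3, 0.0, 0.2]]
speaker_vector = speaker_encoder(reference_audio)
spectrogram = synthesize(reference_audio, "hello")
print(len(spectrogram), "frames of size", len(spectrogram[0]))
```

In this sketch the speaker vector is computed once from the reference audio and reused for every output frame, so the same text paired with a different reference recording produces a different output.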

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method that when executed on data processing hardware causes the data processing hardware to perform operations comprising: obtaining training data pairs each comprising training text and a corresponding audio representation of speech of the training text spoken by a target speaker in a first language; training a speech synthesis system on the training data pairs to teach the speech synthesis system to learn how to synthesize speech in a voice of the target speaker; receiving an input text utterance in a second language different than the first language; and generating, using the trained speech synthesis system, by processing the input text utterance in the second language, a synthesized audio representation of the input text utterance in the second language and spoken in the voice of the target speaker.

2. The computer-implemented method of claim 1, wherein the input text utterance is characterized by a sequence of graphemes.

3. The computer-implemented method of claim 1, wherein the input text utterance is characterized by a sequence of phonemes.

4. The computer-implemented method of claim 1, wherein the speech synthesis system comprises a speaker encoder network and a spectrogram generation network.

5. The computer-implemented method of claim 4, wherein the speaker encoder network is trained to extract speaker embedding vectors from the corresponding audio representations of speech of the training text spoken by the target speaker in a first language.

6. The computer-implemented method of claim 4, wherein the speaker encoder network comprises a long short-term memory (LSTM) neural network.

7. The computer-implemented method of claim 4, wherein training the speech synthesis system comprises training the speaker encoder network separately from training the spectrogram generation network.

8. The computer-implemented method of claim 7, wherein, during training of the spectrogram generation network, parameters of the speaker encoder network are fixed.

9. The computer-implemented method of claim 4, wherein the spectrogram generation neural network comprises an encoder neural network and a decoder neural network.

10. The computer-implemented method of claim 9, wherein the spectrogram generation neural network further comprises an attention layer.

11. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware and storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising: obtaining training data pairs each comprising training text and a corresponding audio representation of speech of the training text spoken by a target speaker in a first language; training a speech synthesis system on the training data pairs to teach the speech synthesis system to learn how to synthesize speech in a voice of the target speaker; receiving an input text utterance in a second language different than the first language; and generating, using the trained speech synthesis system, by processing the input text utterance in the second language, a synthesized audio representation of the input text utterance in the second language and spoken in the voice of the target speaker.

12. The system of claim 11, wherein the input text utterance is characterized by a sequence of graphemes.

13. The system of claim 11, wherein the input text utterance is characterized by a sequence of phonemes.

14. The system of claim 11, wherein the speech synthesis system comprises a speaker encoder network and a spectrogram generation network.

15. The system of claim 14, wherein the speaker encoder network is trained to extract speaker embedding vectors from the corresponding audio representations of speech of the training text spoken by the target speaker in a first language.

16. The system of claim 14, wherein the speaker encoder network comprises a long short-term memory (LSTM) neural network.

17. The system of claim 14, wherein training the speech synthesis system comprises training the speaker encoder network separately from training the spectrogram generation network.

18. The system of claim 17, wherein, during training of the spectrogram generation network, parameters of the speaker encoder network are fixed.

19. The system of claim 14, wherein the spectrogram generation neural network comprises an encoder neural network and a decoder neural network.

20. The system of claim 19, wherein the spectrogram generation neural network further comprises an attention layer.
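Claims 7, 8, 17, and 18 describe training the speaker encoder separately and keeping its parameters fixed while the spectrogram generation network is trained. The sketch below illustrates that training arrangement with a deliberately tiny linear model. Everything in it is an assumption made for illustration: the class names, the scalar "audio" and "text" features, the squared-error loss, and the learning rate are hypothetical and do not come from the patent, which claims LSTM and encoder-decoder neural networks rather than linear models.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SpeakerEncoder:
    # Assumed to have been trained separately beforehand; never updated below.
    weight: float = 0.5

    def embed(self, audio: List[float]) -> float:
        return self.weight * sum(audio) / len(audio)


@dataclass
class SpectrogramGenerator:
    text_weight: float = 0.0
    speaker_weight: float = 0.0

    def predict(self, text_feature: float, speaker_embedding: float) -> float:
        return self.text_weight * text_feature + self.speaker_weight * speaker_embedding


Example = Tuple[float, List[float], float]  # (text feature, reference audio, target)


def train_generator(encoder: SpeakerEncoder, generator: SpectrogramGenerator,
                    examples: List[Example], lr: float = 0.05,
                    epochs: int = 200) -> List[float]:
    """Stochastic gradient descent on squared error.

    Only the generator's parameters receive updates; the encoder is used
    purely as a fixed feature extractor.
    """
    losses = []
    for _ in range(epochs):
        total = 0.0
        for text_feature, audio, target in examples:
            emb = encoder.embed(audio)  # frozen: no update to encoder.weight
            err = generator.predict(text_feature, emb) - target
            total += err * err
            generator.text_weight -= lr * 2 * err * text_feature
            generator.speaker_weight -= lr * 2 * err * emb
        losses.append(total / len(examples))
    return losses


# Toy data whose targets follow target = 2 * text_feature + 3 * embedding.
examples = [
    (1.0, [1.0, 3.0], 5.0),
    (0.5, [0.0, 2.0], 2.5),
    (0.0, [1.0, 3.0], 3.0),
    (1.0, [0.0, 2.0], 3.5),
]
encoder = SpeakerEncoder()
generator = SpectrogramGenerator()
losses = train_generator(encoder, generator, examples)
print("first epoch loss:", losses[0], "last epoch loss:", losses[-1])
print("encoder weight unchanged:", encoder.weight == 0.5)
```

Because the encoder's weight is never touched, the embeddings for a given reference recording stay identical across epochs, and the generator fits against a stable conditioning signal.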

Assignees

Google LLC

Inventors

Classifications

CPC classification titles:

  • Transfer learning

  • Supervised learning

  • Auto-encoder networks; Encoder-decoder networks

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]

  • Overlap-add techniques

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2025095630A1 cover?
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech synthesis. The methods, systems, and apparatus include actions of obtaining an audio representation of speech of a target speaker, obtaining input text for which speech is to be synthesized in a voice of the target speaker, generating a speaker vector by providing the audio representati…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L13/04. Mapped technology areas include Physics.
When was this patent published?
Publication date: Mar 20, 2025 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).