Synthesis of speech from text in a voice of a target speaker using neural networks

US11488575B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11488575-B2
Application number	US-201917055951-A
Country	US
Kind code	B2
Filing date	May 17, 2019
Priority date	May 17, 2018
Publication date	Nov 1, 2022
Grant date	Nov 1, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech synthesis. The methods, systems, and apparatus include actions of obtaining an audio representation of speech of a target speaker, obtaining input text for which speech is to be synthesized in a voice of the target speaker, generating a speaker vector by providing the audio representation to a speaker encoder engine that is trained to distinguish speakers from one another, generating an audio representation of the input text spoken in the voice of the target speaker by providing the input text and the speaker vector to a spectrogram generation engine that is trained using voices of reference speakers to generate audio representations, and providing the audio representation of the input text spoken in the voice of the target speaker for output.
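For readers who find code easier to follow than abstract language, the data flow described above can be sketched roughly as below. This is an illustrative Python sketch only, not the patented implementation. The names `synthesize`, `toy_speaker_encoder`, and `toy_spectrogram_generator` are hypothetical, and the two toy functions are trivial stand-ins for the trained engines the abstract refers to.

```python
from typing import Callable, List

Vector = List[float]
Spectrogram = List[List[float]]


def synthesize(
    reference_audio: List[float],
    text: str,
    speaker_encoder: Callable[[List[float]], Vector],
    spectrogram_generator: Callable[[str, Vector], Spectrogram],
) -> Spectrogram:
    """Chain the two stages named in the abstract.

    1. The speaker encoder turns target-speaker audio into a speaker vector.
    2. The spectrogram generator takes the input text plus that vector and
       returns an audio representation of the text in the target voice.
    """
    speaker_vector = speaker_encoder(reference_audio)
    return spectrogram_generator(text, speaker_vector)


def toy_speaker_encoder(audio: List[float]) -> Vector:
    # Hypothetical placeholder: a real encoder is a trained model.
    # Here the "speaker vector" is just [mean, range] of the samples.
    return [sum(audio) / len(audio), max(audio) - min(audio)]


def toy_spectrogram_generator(text: str, speaker_vector: Vector) -> Spectrogram:
    # Hypothetical placeholder: emit one "frame" per character, made of the
    # character code followed by the speaker vector, to show the conditioning.
    return [[float(ord(ch))] + list(speaker_vector) for ch in text]
```

Calling `synthesize([0.0, 0.5, 1.0], "hi", toy_speaker_encoder, toy_spectrogram_generator)` returns two frames, one per character, each carrying the same speaker vector. The point of the sketch is only that the speaker vector is computed once from reference audio and then passed alongside the text.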

First claim

Opening claim text (preview).

What is claimed is:
1. A computer-implemented method comprising: obtaining an audio representation of speech of a target speaker; obtaining input text for which speech is to be synthesized in a voice of the target speaker; generating a speaker embedding vector by providing the audio representation to a speaker verification neural network that is trained to distinguish speakers from one another; generating an audio representation of the input text spoken in the voice of the target speaker by providing the input text and the speaker embedding vector to a spectrogram generation neural network that is trained using voices of reference speakers to generate audio representations; and providing the audio representation of the input text spoken in the voice of the target speaker to a vocoder to generate a time domain representation of the input text spoken in the voice of the target speaker; and providing the time domain representation for playback to a user.
2. The method of claim 1, wherein the speaker verification neural network is trained to generate speaker embedding vectors of audio representations of speech from the same speaker that are close together in an embedding space while generating speaker embedding vectors of audio representations of speech from different speakers that are distant from each other.
3. The method of claim 1, wherein the speaker verification neural network is trained separately from the spectrogram generation neural network.
4. The method of claim 1, wherein the speaker verification neural network is a long short-term memory (LSTM) neural network.
5. A computer-implemented method comprising: obtaining an audio representation of speech of a target speaker; obtaining input text for which speech is to be synthesized in a voice of the target speaker; generating a speaker embedding vector by: providing a plurality of overlapping sliding windows of the audio representation to a speaker verification neural network to generate a plurality of individual vector embeddings, the speaker verification neural network trained to distinguish speakers from one another; and generating the speaker embedding vector by computing an average of the individual vector embeddings; generating an audio representation of the input text spoken in the voice of the target speaker by providing the input text and the speaker embedding vector to a spectrogram generation neural network that is trained using voices of reference speakers to generate audio representations; and providing the audio representation of the input text spoken in the voice of the target speaker for output.
6. The method of claim 1, wherein the vocoder comprises a vocoder neural network.
7. The method of claim 1, wherein the spectrogram generation neural network is a sequence-to-sequence attention neural network that is trained to predict mel spectrograms from a sequence of phoneme or grapheme inputs.
8. The method of claim 7, wherein the spectrogram generation neural network includes an encoder neural network, an attention layer, and a decoder neural network.
9. The method of claim 8, wherein the spectrogram generation neural network concatenates the speaker embedding vector with outputs of the encoder neural network that are provided as input to the attention layer.
10. The method of claim 1, wherein the speaker embedding vector is different from any speaker embedding vectors used during the training of the speaker verification neural network or the spectrogram generation neural network.
11. The method of claim 1, wherein, during the training of the spectrogram generation neural network, parameters of the speaker verification neural network are fixed.
12. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform the operations of the method of claim 1.
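Two of the claimed steps are easy to picture in code: averaging per-window embeddings into one speaker embedding vector (claim 5) and concatenating that vector with encoder outputs (claim 9). The following is a minimal Python sketch under stated assumptions, not the implementation described in the patent. The window size and hop, the function names, and `toy_embed` (a trivial stand-in for the speaker verification neural network) are all hypothetical, and the sketch assumes the vector is appended to every encoder output step.

```python
from typing import Callable, List, Sequence

Frame = List[float]
Vector = List[float]


def sliding_windows(frames: Sequence[Frame], size: int, hop: int) -> List[List[Frame]]:
    """Split frames into fixed-size windows; windows overlap when hop < size."""
    if not frames or size <= 0 or hop <= 0:
        raise ValueError("need at least one frame and positive size/hop")
    if len(frames) < size:
        return [list(frames)]  # too short to slide: use a single window
    return [list(frames[s:s + size]) for s in range(0, len(frames) - size + 1, hop)]


def average_embedding(
    windows: Sequence[Sequence[Frame]],
    embed: Callable[[Sequence[Frame]], Vector],
) -> Vector:
    """Embed each window, then take the element-wise mean (claim 5)."""
    embeddings = [embed(w) for w in windows]
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]


def condition_encoder_outputs(
    encoder_outputs: Sequence[Vector], speaker_embedding: Vector
) -> List[Vector]:
    """Append the same speaker embedding to every encoder step (claim 9).

    The concatenated vectors are what would be passed to the attention layer.
    """
    return [list(step) + list(speaker_embedding) for step in encoder_outputs]


def toy_embed(window: Sequence[Frame]) -> Vector:
    # Hypothetical placeholder for the trained speaker verification network:
    # returns the per-dimension mean of the frames in the window.
    dim = len(window[0])
    return [sum(f[i] for f in window) / len(window) for i in range(dim)]
```

With four two-dimensional frames, a window size of 2, and a hop of 1, `sliding_windows` yields three overlapping windows, and `average_embedding` reduces their three embeddings to a single vector of the same dimension.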

Assignees

Google LLC

Inventors

Classifications

G10L13/04 · Primary
Details of speech synthesis systems, e.g. synthesiser structure or memory management · CPC title
G10L25/18
the extracted parameters being spectral information of each sub-band · CPC title
G10L13/033 · Primary
Voice editing, e.g. manipulating the voice of the synthesiser · CPC title
G10L2013/021
Overlap-add techniques · CPC title
G10L25/30
using neural networks · CPC title

Patent family

Related publications grouped by family.

View patent family 66770584

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11488575B2 cover?: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech synthesis. The methods, systems, and apparatus include actions of obtaining an audio representation of speech of a target speaker, obtaining input text for which speech is to be synthesized in a voice of the target speaker, generating a speaker vector by providing the audio representati…
Who is the assignee on this patent?: Google LLC
What technology area does this patent fall under?: Primary CPC classification G10L13/04. Mapped technology areas include Physics.
When was this patent published?: Publication date Nov 1, 2022 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).