Synthesis of speech from text in a voice of a target speaker using neural networks

US11848002B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11848002-B2
Application number	US-202217813361-A
Country	US
Kind code	B2
Filing date	Jul 19, 2022
Priority date	May 17, 2018
Publication date	Dec 19, 2023
Grant date	Dec 19, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection. Read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech synthesis. The methods, systems, and apparatus include actions of obtaining an audio representation of speech of a target speaker, obtaining input text for which speech is to be synthesized in a voice of the target speaker, generating a speaker vector by providing the audio representation to a speaker encoder engine that is trained to distinguish speakers from one another, generating an audio representation of the input text spoken in the voice of the target speaker by providing the input text and the speaker vector to a spectrogram generation engine that is trained using voices of reference speakers to generate audio representations, and providing the audio representation of the input text spoken in the voice of the target speaker for output.
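The data flow in the abstract (reference audio to speaker vector, then text plus speaker vector to an output audio representation) can be sketched as follows. This is only an illustration of how the pieces connect: `toy_speaker_encoder`, `toy_spectrogram_generator`, and `synthesize` are hypothetical names, and both "engines" are simple arithmetic stand-ins, not the trained neural networks the patent describes.

```python
import math
from typing import List

Frame = List[float]  # one time step of an audio representation


def toy_speaker_encoder(reference_audio: List[Frame]) -> List[float]:
    """Hypothetical stand-in for the trained speaker encoder engine.

    Averages the frames and L2-normalizes the result, so reference audio
    of any length maps to a fixed-size speaker vector.
    """
    dims = len(reference_audio[0])
    count = len(reference_audio)
    mean = [sum(frame[d] for frame in reference_audio) / count for d in range(dims)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0
    return [x / norm for x in mean]


def toy_spectrogram_generator(text: str, speaker_vector: List[float]) -> List[Frame]:
    """Hypothetical stand-in for the spectrogram generation engine.

    Emits one frame per character; every frame is conditioned on the
    speaker vector by adding a character-dependent offset to it.
    """
    frames = []
    for ch in text:
        offset = (ord(ch) % 32) / 32.0
        frames.append([offset + v for v in speaker_vector])
    return frames


def synthesize(reference_audio: List[Frame], text: str) -> List[Frame]:
    """Chain the two stages in the order the abstract lists them."""
    speaker_vector = toy_speaker_encoder(reference_audio)
    return toy_spectrogram_generator(text, speaker_vector)


# Made-up reference audio: two identical 2-dimensional frames.
reference = [[3.0, 4.0], [3.0, 4.0]]
output = synthesize(reference, "hi")
```

The point of the split is that the speaker vector is computed once from the target speaker's audio and then reused as a conditioning input for any text.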

First claim

Opening claim text (preview).

What is claimed is:
1. A computer-implemented method that when executed on data processing hardware causes the data processing hardware to perform operations comprising: obtaining a speech spectrogram corresponding to an utterance spoken of a target speaker; obtaining an input sequence of phonemes to be synthesized into speech; extracting, using a speaker encoder network, a speaker embedding vector characterizing a voice of the target speaker from the speech spectrogram; generating, using a synthesizer configured to receive the input sequence of phonemes and the speaker embedding vector as input, a mel spectrogram representation of the input sequence of phonemes in the voice of the target speaker; and providing the mel spectrogram representation of the input sequence of phonemes in the voice of the target speaker for output.
2. The method of claim 1, wherein the speech spectrogram corresponding to the utterance spoken by the target speaker comprises an arbitrary length mel spectrogram.
3. The method of claim 1, wherein the speaker encoder network is trained to extract speaker embedding vectors from speech spectrograms corresponding to utterances spoken by the same speaker that are close together in an embedding space.
4. The method of claim 1, wherein the speaker encoder network is trained to extract speaker embedding vectors from speech spectrograms corresponding to utterances spoken by different speakers that are distant from each other.
5. The method of claim 1, wherein the speaker encoder network is trained separately from the synthesizer.
6. The method of claim 5, wherein, during training of the synthesizer, parameters of the speaker encoder network are fixed.
7. The method of claim 1, wherein the synthesizer comprises a spectrogram generation neural network that is trained to predict mel spectrograms from a sequence of phoneme inputs.
8. The method of claim 7, wherein the spectrogram generation neural network comprises a sequence-to-sequence attention neural network.
9. The method of claim 7, wherein the spectrogram generation neural network comprises an encoder neural network and a decoder neural network.
10. The method of claim 9, wherein the spectrogram generation neural network further comprises an attention layer.
11. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware and storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising: obtaining a speech spectrogram corresponding to an utterance spoken of a target speaker; obtaining an input sequence of phonemes to be synthesized into speech; extracting, using a speaker encoder network, a speaker embedding vector characterizing a voice of the target speaker from the speech spectrogram; generating, using a synthesizer configured to receive the input sequence of phonemes and the speaker embedding vector as input, a mel spectrogram representation of the input sequence of phonemes in the voice of the target speaker; and providing the mel spectrogram representation of the input sequence of phonemes in the voice of the target speaker for output.
12. The system of claim 11, wherein the speech spectrogram corresponding to the utterance spoken by the target speaker comprises an arbitrary length mel spectrogram.
13. The system of claim 11, wherein the speaker encoder network is trained to extract speaker embedding vectors from speech spectrograms corresponding to utterances spoken by the same speaker that are close together in an embedding space.
14. The system of claim 11, wherein the speaker encoder network is trained to extract speaker embedding vectors from speech spectrograms corresponding to utterances spoken by different speakers that are distant from each other.
15. The system of claim 11, wherein the speaker encoder network is trained separately from the synthesizer.
16. The system of claim 15, wherein, during training of the synthesizer, parameters of the speaker encoder network are fixed.
17. The system of claim 11, wherein the synthesizer comprises a spectrogram generation neural network that is trained to predict mel spectrograms from a sequence of phoneme inputs.
18. The system of claim 17, wherein the spectrogram generation neural network comprises a sequence-to-sequence attention neural network.
19. The system of claim 17, wherein the spectrogram generation neural network comprises an encoder neural network and a decoder neural network.
20. The system of claim 19, wherein the spectrogram generation neural network further comprises an attention layer.
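Claims 3, 4, 13, and 14 describe an embedding space in which utterances from the same speaker land close together and utterances from different speakers land far apart. A minimal sketch of what that property looks like, assuming cosine similarity as the closeness measure (the claims do not name a metric) and using made-up three-dimensional vectors rather than real encoder output:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical speaker embeddings, invented for illustration only.
speaker_a_utt1 = [0.9, 0.1, 0.0]
speaker_a_utt2 = [0.8, 0.2, 0.1]
speaker_b_utt1 = [0.0, 0.2, 0.9]

same_speaker = cosine_similarity(speaker_a_utt1, speaker_a_utt2)
different_speaker = cosine_similarity(speaker_a_utt1, speaker_b_utt1)

# A well-trained speaker encoder should satisfy this ordering.
print(same_speaker > different_speaker)
```

With these invented vectors the same-speaker pair scores far higher than the cross-speaker pair, which is the ordering the claimed training objective is meant to produce.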

Assignees

Google LLC

Inventors

Classifications

G06N3/096 · Transfer learning
G06N3/09 · Supervised learning
G06N3/0455 · Auto-encoder networks; Encoder-decoder networks
G06N3/0442 · characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
G10L2013/021 · Overlap-add techniques

Patent family

Related publications grouped by family.

View patent family 66770584

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11848002B2 cover?: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech synthesis. The methods, systems, and apparatus include actions of obtaining an audio representation of speech of a target speaker, obtaining input text for which speech is to be synthesized in a voice of the target speaker, generating a speaker vector by providing the audio representati…
Who is the assignee on this patent?: Google LLC
What technology area does this patent fall under?: Primary CPC classification G10L13/033. Mapped technology areas include Physics.
When was this patent published?: Publication date Dec 19, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Related patents

Systems and methods for parallel wave generation in end-to-end text-to-speech

Speaker recognition

Speech recognition by selecting and refining hot words

Digital watermark embedding device, digital watermark embedding method, and computer-readable recording medium
