Phonemes and graphemes for neural text-to-speech

US12020685B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12020685-B2
Application number	US-202117643684-A
Country	US
Kind code	B2
Filing date	Dec 10, 2021
Priority date	Mar 26, 2021
Publication date	Jun 25, 2024
Grant date	Jun 25, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method includes receiving a text input including a sequence of words represented as an input encoder embedding. The input encoder embedding includes a plurality of tokens, with the plurality of tokens including a first set of grapheme tokens representing the text input as respective graphemes and a second set of phoneme tokens representing the text input as respective phonemes. The method also includes, for each respective phoneme token of the second set of phoneme tokens: identifying a respective word of the sequence of words corresponding to the respective phoneme token and determining a respective grapheme token representing the respective word of the sequence of words corresponding to the respective phoneme token. The method also includes generating an output encoder embedding based on a relationship between each respective phoneme token and the corresponding grapheme token determined to represent a same respective word as the respective phoneme token.
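The matching step in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented encoder: the `Token` class, the function names, the toy wordpieces and phoneme symbols, and the choice to place the grapheme segment before the phoneme segment are all assumptions made for the example. It shows how a shared word position lets each phoneme token be paired with the grapheme tokens of the same word.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str       # wordpiece or phoneme symbol
    segment: str    # "grapheme" or "phoneme"
    word_pos: int   # index of the word this token belongs to
    pos: int        # overall index in the combined token sequence


def build_tokens(grapheme_words, phoneme_words):
    """Concatenate the grapheme segment and the phoneme segment (order assumed)."""
    tokens = []
    for segment, words in (("grapheme", grapheme_words), ("phoneme", phoneme_words)):
        for word_pos, pieces in enumerate(words):
            for piece in pieces:
                tokens.append(Token(piece, segment, word_pos, len(tokens)))
    return tokens


def align_phonemes_to_graphemes(tokens):
    """Map each phoneme token's position to the grapheme positions of the same word."""
    graphemes_by_word = {}
    for tok in tokens:
        if tok.segment == "grapheme":
            graphemes_by_word.setdefault(tok.word_pos, []).append(tok.pos)
    return {
        tok.pos: graphemes_by_word.get(tok.word_pos, [])
        for tok in tokens
        if tok.segment == "phoneme"
    }


# Hypothetical toy input: two words, split into wordpieces and phonemes.
grapheme_words = [["hel", "##lo"], ["world"]]
phoneme_words = [["HH", "AH", "L", "OW"], ["W", "ER", "L", "D"]]

tokens = build_tokens(grapheme_words, phoneme_words)
alignment = align_phonemes_to_graphemes(tokens)
```

In this toy layout the grapheme tokens occupy positions 0 to 2 and the phoneme tokens positions 3 to 10, so `alignment[3]` is `[0, 1]` (the two wordpieces of the first word) and `alignment[10]` is `[2]`. In a trained model the pairing would be learned through the embeddings rather than looked up in a dictionary.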

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method when executed by data processing hardware causes the data processing hardware to perform operations comprising: receiving, at an encoder of a speech synthesis model, a text input comprising a sequence of words represented as an input encoder embedding, the input encoder embedding comprising a plurality of tokens, the plurality of tokens comprising a first set of grapheme tokens representing the text input as respective graphemes and a second set of phoneme tokens representing the text input as respective phonemes, each grapheme token of the first set of grapheme tokens comprising a respective wordpiece sub-word of a respective word in the sequence of words, wherein each corresponding token of the plurality of tokens of the input encoder embedding represents a combination of: a respective word position embedding for each respective word in the sequence of words, the respective word position embedding representing sub-word level positions for both one or more of the grapheme tokens from the first set of grapheme tokens that correspond to the respective word and one or more of the phoneme tokens from the second set of phoneme tokens that correspond to the respective word; and a position embedding representing an overall index of position for each token of the plurality of tokens of the input encoder embedding; for each respective phoneme token of the second set of phoneme tokens: identifying, by the encoder, a respective word of the sequence of words corresponding to the respective phoneme token based on the respective word position embedding that represents the sub-word level position for the respective phoneme token that corresponds to the respective word; and determining, by the encoder, a respective grapheme token representing the respective word of the sequence of words corresponding to the respective phoneme token by determining that the sub-word level position for the respective grapheme token that corresponds to the respective word is represented by the same respective word position embedding as the respective word position embedding representing the sub-word level position for the respective phoneme token; and generating, by the encoder, an output encoder embedding based on a relationship between each respective phoneme token and the respective grapheme token determined to represent a same respective word as the respective phoneme token.

2. The method of claim 1, wherein the combination representing each token of the plurality of tokens of the input encoder embedding further comprises: one of a grapheme token embedding or a phoneme token embedding; and a segment embedding.

3. The method of claim 1, wherein the speech synthesis model comprises an attention mechanism in communication with the encoder.

4. The method of claim 1, wherein the speech synthesis model comprises a duration-based upsampler in communication with the encoder.

5. The method of claim 1, wherein the plurality of tokens of the input encoder embedding comprises a special token identifying a language of the input text.

6. The method of claim 1, wherein the operations further comprise: pre-training the encoder of the speech synthesis model by: feeding the encoder a plurality of training examples, each training example represented as a sequence of training grapheme tokens corresponding to a training sequence of words and a sequence of training phoneme tokens corresponding to the same training sequence of words; masking a training phoneme token from the sequence of training phoneme tokens for a respective word from the training sequence of words; and masking a training grapheme token from the sequence of training phoneme tokens for the respective word from the training sequence of words.

7. The method of claim 1, wherein: the speech synthesis model comprises a multilingual speech synthesis model; and the operations further comprise pre-training the encoder of the speech synthesis model using a classification objective to predict a classification token of the plurality of tokens of the input encoder embedding, the classification token comprising a language identifier.

8. The method of claim 1, wherein: the speech synthesis model comprises a multilingual speech synthesis model; and the output encoder embedding comprises a sequence of encoder tokens, each encoder token comprising language information about the input text.

9. The method of claim 1, wherein: the speech synthesis model comprises a multi-accent speech synthesis model; and the operations further comprise pre-training the encoder of the speech synthesis model using a classification objective to predict a classification token of the plurality of tokens of the input encoder embedding, the classification token comprising an accent identifier.

10. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising: receiving, at an encoder of a speech synthesis model, a text input comprising a sequence of words represented as an input encoder embedding, the input encoder embedding comprising a plurality of tokens, the plurality of tokens comprising a first set of grapheme tokens representing the text input as respective graphemes and a second set of phoneme tokens representing the text input as respective phonemes, each grapheme token of the first set of grapheme tokens comprising a respective wordpiece sub-word of a respective word in the sequence of words, wherein each corresponding token of the plurality of tokens of the input encoder embedding represents a combination of: a respective word position embedding for each respective word in the sequence of words, the respective word position embedding representing sub-word level positions for both one or more of the grapheme tokens from the first set of grapheme tokens that correspond to the respective word and one or more of the phoneme tokens from the second set of phoneme tokens that correspond to the respective word; and a position embedding representing an overall index of position for each token of the plurality of tokens of the input encoder embedding; for each respective phoneme token of the second set of phoneme tokens: identifying, by the encoder, a respective word of the sequence of words corresponding to the respective phoneme token based on the respective word position embedding that represents the sub-word level position for the respective phoneme token that corresponds to the respective word; and determining, by the encoder, a respective grapheme token representing the respective word of the sequence of words corresponding to the respective phoneme token by determining that the sub-word level position for the respective grapheme token that corresponds to the respective word is represented by the same respective word position embedding as the respective word position embedding representing the sub-word level position for the respective phoneme token; and generating, by the encoder, an output encoder embedding based on a relationship between each respective phoneme token and the respective grapheme token determined to represent a same respective word as the respective phoneme token.

11. The system of claim 10, wherein the combination representing each token of the plurality of tokens of the input encoder embedding further comprises: one of a grapheme token embedding or a phoneme token embedding; and a segment embedding.

12. The system of claim 10, wherein the speech synthesis model co
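The pre-training step in claim 6 masks tokens for the same word in both the phoneme sequence and the grapheme sequence. A minimal Python sketch of that idea follows. It is an illustration under stated assumptions, not the claimed procedure: the `[MASK]` symbol, the function names, the toy data, and the decision to mask every token of the selected word (rather than a single token) are all hypothetical.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the patent


def mask_word(grapheme_words, phoneme_words, word_index):
    """Mask every token of one word in both the grapheme and phoneme sequences."""
    def mask(words):
        return [
            [MASK] * len(pieces) if i == word_index else list(pieces)
            for i, pieces in enumerate(words)
        ]
    return mask(grapheme_words), mask(phoneme_words)


def mask_random_word(grapheme_words, phoneme_words, rng):
    """Pick one word at random and mask it consistently in both sequences."""
    word_index = rng.randrange(len(grapheme_words))
    masked_g, masked_p = mask_word(grapheme_words, phoneme_words, word_index)
    return word_index, masked_g, masked_p


# Hypothetical toy training example: the same two words in both views.
train_graphemes = [["hel", "##lo"], ["world"]]
train_phonemes = [["HH", "AH", "L", "OW"], ["W", "ER", "L", "D"]]

masked_graphemes, masked_phonemes = mask_word(train_graphemes, train_phonemes, 1)
chosen, rand_graphemes, rand_phonemes = mask_random_word(
    train_graphemes, train_phonemes, random.Random(0)
)
```

Masking both views of the same word means the model cannot recover a masked phoneme simply by reading the spelling of that word, or the reverse, so it has to rely on the surrounding context.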

Assignees

Google LLC

Inventors

Classifications

G06N3/0455 · Auto-encoder networks; Encoder-decoder networks
G06N3/0895 · Weakly supervised learning, e.g. semi-supervised or self-supervised learning
G10L13/047 (Primary) · Architecture of speech synthesisers
G06N3/08 · Learning methods
G06F40/263 · Language identification
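A CPC symbol such as G10L13/047 is hierarchical: a section letter, a two-digit class, a subclass letter, a main group, and a subgroup after the slash. The Python sketch below splits a symbol into those parts and looks up the section title, which is how G10L13/047 lands under Physics. The function name and the regular expression are assumptions written for this example (a simplified pattern, not an official validator), and section Y is left out of the lookup table.

```python
import re

# Titles of CPC sections A to H (section Y, a cross-sectional tagging section, is omitted here).
CPC_SECTIONS = {
    "A": "Human Necessities",
    "B": "Performing Operations; Transporting",
    "C": "Chemistry; Metallurgy",
    "D": "Textiles; Paper",
    "E": "Fixed Constructions",
    "F": "Mechanical Engineering; Lighting; Heating; Weapons; Blasting",
    "G": "Physics",
    "H": "Electricity",
}

# Simplified pattern: section letter, 2-digit class, subclass letter, main group, "/", subgroup.
CPC_PATTERN = re.compile(r"^([A-HY])(\d{2})([A-Z])(\d{1,4})/(\d{2,6})$")


def parse_cpc(symbol):
    """Split a CPC symbol into its hierarchical parts (hypothetical helper)."""
    match = CPC_PATTERN.match(symbol.strip())
    if match is None:
        raise ValueError(f"not a recognised CPC symbol: {symbol!r}")
    section, class_digits, subclass_letter, main_group, subgroup = match.groups()
    return {
        "section": section,
        "section_title": CPC_SECTIONS.get(section, "unknown"),
        "class": section + class_digits,
        "subclass": section + class_digits + subclass_letter,
        "main_group": main_group,
        "subgroup": subgroup,
    }


primary = parse_cpc("G10L13/047")
```

Here `primary["subclass"]` is `"G10L"` and `primary["section_title"]` is `"Physics"`. All five classifications listed for this patent begin with G, so they fall in the same section.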

Patent family

Related publications grouped by family.

Patent family ID: 79282928

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12020685B2 cover?: A method includes receiving a text input including a sequence of words represented as an input encoder embedding. The input encoder embedding includes a plurality of tokens, with the plurality of tokens including a first set of grapheme tokens representing the text input as respective graphemes and a second set of phoneme tokens representing the text input as respective phonemes. The method als…
Who is the assignee on this patent?: Google LLC
What technology area does this patent fall under?: The primary CPC classification is G10L13/047 (Architecture of speech synthesisers), which belongs to CPC section G, Physics.
When was this patent published?: Publication date: Jun 25, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Related patents

Multilingual neural text-to-speech synthesis

Neural text-to-speech synthesis with multi-level text information

Systems and methods for real-time neural text-to-speech
