Unsupervised alignment for text to speech synthesis using neural networks

US11769481B2 · US · B2

Patent metadata
Publication number: US-11769481-B2
Application number: US-202117496569-A
Country: US
Kind code: B2
Filing date: Oct 7, 2021
Priority date: Oct 7, 2021
Publication date: Sep 26, 2023
Grant date: Sep 26, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Generation of synthetic speech from an input text sequence may be difficult when durations of individual phonemes forming the input text sequence are unknown. A predominantly parallel process may model speech rhythm as a separate generative distribution such that phoneme duration may be sampled at inference. Additional information such as pitch or energy may also be sampled to provide improved diversity for synthetic speech generation.
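
To make the idea of sampling phoneme durations at inference concrete, here is a minimal sketch. It is not the patented model: the log-normal distribution, the per-phoneme parameters, and the function name `sample_durations` are all assumptions chosen for illustration, standing in for whatever learned generative distribution the disclosure uses.

```python
import math
import random


def sample_durations(log_means, log_stds, rng):
    """Draw one duration (in mel frames) per phoneme.

    Assumption: each phoneme's duration is log-normal. This is a stand-in
    for a learned generative duration distribution, not the patent's model.
    """
    durations = []
    for mu, sigma in zip(log_means, log_stds):
        frames = math.exp(rng.gauss(mu, sigma))
        durations.append(max(1, round(frames)))  # at least one frame each
    return durations


# Hypothetical parameters for a four-phoneme input.
log_means = [1.5, 2.0, 1.0, 2.3]
log_stds = [0.3, 0.3, 0.3, 0.3]

# Two independent draws for the same text give two different rhythms.
take_a = sample_durations(log_means, log_stds, random.Random(0))
take_b = sample_durations(log_means, log_stds, random.Random(1))
print(take_a, sum(take_a))
print(take_b, sum(take_b))
```

Because durations are drawn rather than predicted as a single value, repeated synthesis of the same text can vary in rhythm, which is the diversity the abstract describes. Pitch and energy could be sampled in the same way from their own distributions.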

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method, comprising: determining, from a plurality of audio segments, respective phoneme durations, phoneme pitches, and phoneme energies; determining a first alignment between a sequence of text and a total speech duration corresponding to probable locations for the respective phoneme durations; determining a second alignment for an audio segment of synthesized speech based, at least in part, on a first distribution corresponding to the phoneme durations and the first alignment; and generating, for the sequence of text, an audio segment comprising a synthesized recitation of the sequence of text based, at least in part, on the second alignment and at least one of a second distribution corresponding to the phoneme pitches, or a third distribution corresponding to the phoneme energies.

2. The computer-implemented method of claim 1, further comprising: generating a fourth distribution corresponding to one or more properties associated with the synthesized recitation based, at least in part, on the fourth distribution.

3. The computer-implemented method of claim 1, further comprising: applying, to the second alignment, a prior distribution to exclude pairs of phonemes and durations from the plurality of audio segments that are outside of a specified range.

4. The computer-implemented method of claim 3, wherein the prior distribution is cigar-shaped.

5. The computer-implemented method of claim 3, wherein the prior distribution is constructed from a beta-binomial distribution.

6. The computer-implemented method of claim 1, further comprising: determining, from the sequence of text, a plurality of text tokens; and aligning each of the plurality of text tokens to a respective mel frame based, at least in part, on the second alignment.

7. The computer-implemented method of claim 6, wherein the second alignment is based, at least in part, on an L2 distance between the mel frame at a first time and a text phoneme in the sequence of text.

8. The computer-implemented method of claim 1, wherein the synthesized recitation is generative such that a first synthesized recitation is different from a second synthesized recitation, each of the first synthesized recitation and the second synthesized recitation based on the sequence of text.

9. A method, comprising: determining, from a plurality of audio samples including human speech, alignments between text of the plurality of audio samples, a duration of the plurality of audio samples, and at least one of a pitch of the audio samples or an energy of the audio samples; generating an alignment distribution based, at least in part, on the alignments; determining a soft alignment between a first text sequence from the text and mel-frames of the alignment distribution, the soft alignment normalizing probability distributions for the alignments across the duration of the plurality of audio samples; and determining a hard alignment between the first text sequence and the mel-frames of the alignment distribution, the hard alignment concentrating the probability distributions from the soft alignment to a symbol for the alignments across the duration of the plurality of audio samples; determining one or more vectors, based on the hard alignment, corresponding to one or more speaker characteristics; receiving a second text sequence; and generating, based at least in part on the second text sequence and the one or more vectors, a synthetic audio clip corresponding to the second text sequence.

10. The method of claim 9, wherein the alignment is based, at least in part, on an alignment matrix with a beta-binomial distribution.

11. The method of claim 9, wherein an encoder and a decoder for generating the synthetic audio clip operate in parallel.

12. The method of claim 9, further comprising: generating a second synthetic audio clip from the second text sequence, the second synthetic audio clip being different from the first synthetic audio clip.

13. The method of claim 9, further comprising: generating at least one of a phoneme distribution, a pitch distribution, or an energy distribution; and sampling from at least one of the phoneme distribution, the pitch distribution, or the energy distribution.

14. A processor, comprising: one or more processing units to: receive an audio clip of human speech represented as a mel-spectrogram; determine an alignment matrix for the audio clip normalized to a probability distribution; apply, to the alignment matrix, a prior distribution to exclude pairs of phonemes and mel-frames in the audio clip from the alignment matrix; determine, from the alignment matrix, an alignment for a first text sequence within the audio clip and a plurality of mel-frames representing a duration of the audio clip; determine one or more distributions for a pitch and an energy associated with the audio clip; receive a second text sequence; and generate a second audio clip of the second text sequence based, at least in part, on the alignment and the one or more distributions.

15. The processor of claim 14, wherein the one or more processing units are further to implement an encoder and decoder for generating the second audio clip, wherein the encoder and decoder operate in parallel.

16. The processor of claim 14, wherein the one or more processing units are further to: fit, onto the alignment matrix, a beta-binomial distribution.

17. The processor of claim 16, wherein the beta-binomial distribution excludes pairs of phonemes and mel-frames outside of a specified range.

18. The processor of claim 14, wherein the one or more processing units are further to generate a third audio clip from the second text sequence, the second audio clip being different from the third audio clip.
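
The alignment steps recited in the claims (a beta-binomial prior, an L2-distance-based alignment between mel frames and text tokens, and soft and hard alignments) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names, the toy 2-D encodings, the parameterization of the prior, and the use of a monotonic Viterbi search for the hard alignment. The claims do not specify these details, and this prior down-weights off-diagonal pairs rather than strictly excluding them.

```python
import math


def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(k, n, a, b):
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))


def alignment_prior(num_tokens, num_frames, scale=1.0):
    """prior[t][k]: prior weight of token k at frame t, peaked near the diagonal."""
    prior = []
    for t in range(1, num_frames + 1):
        a, b = scale * t, scale * (num_frames + 1 - t)
        prior.append([beta_binomial_pmf(k, num_tokens - 1, a, b)
                      for k in range(num_tokens)])
    return prior


def soft_alignment(frames, tokens, prior):
    """Per frame: softmax of negative L2 distance to each token, times the prior."""
    soft = []
    for frame, prior_row in zip(frames, prior):
        dists = [math.dist(frame, tok) for tok in tokens]
        nearest = min(dists)  # subtract the minimum for numerical stability
        weights = [math.exp(-(d - nearest)) * p for d, p in zip(dists, prior_row)]
        total = sum(weights)
        soft.append([w / total for w in weights])
    return soft


def hard_alignment(soft):
    """Most likely monotonic path: one token per frame, advancing by 0 or 1."""
    num_frames, num_tokens = len(soft), len(soft[0])
    neg_inf = float("-inf")
    score = [[neg_inf] * num_tokens for _ in range(num_frames)]
    back = [[0] * num_tokens for _ in range(num_frames)]
    score[0][0] = math.log(soft[0][0] + 1e-30)
    for t in range(1, num_frames):
        for k in range(min(t + 1, num_tokens)):
            stay = score[t - 1][k]
            move = score[t - 1][k - 1] if k > 0 else neg_inf
            best, back[t][k] = (stay, k) if stay >= move else (move, k - 1)
            score[t][k] = best + math.log(soft[t][k] + 1e-30)
    path = [num_tokens - 1]  # the path must end on the last token
    for t in range(num_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Toy encodings: 3 text tokens and 6 mel frames in a shared 2-D space.
tokens = [(0.0, 0.0), (4.0, 0.0), (8.0, 0.0)]
frames = [(0.1, 0.0), (0.2, 0.0), (3.9, 0.0), (4.0, 0.0), (4.1, 0.0), (7.9, 0.0)]

prior = alignment_prior(len(tokens), len(frames))
soft = soft_alignment(frames, tokens, prior)
path = hard_alignment(soft)
durations = [path.count(k) for k in range(len(tokens))]
print(path)       # [0, 0, 1, 1, 1, 2]
print(durations)  # [2, 3, 1]
```

The hard path assigns each mel frame to exactly one token, so counting frames per token yields the per-phoneme durations that a duration distribution could then be fitted to. The search assumes there are at least as many frames as tokens.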

Assignees

Nvidia Corp

Inventors

Classifications

  • G10L13/047 · Primary

    Architecture of speech synthesisers · CPC title

  • Combinations of networks · CPC title

  • Learning methods · CPC title

  • Pitch control · CPC title

  • Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11769481B2 cover?
Generation of synthetic speech from an input text sequence may be difficult when durations of individual phonemes forming the input text sequence are unknown. A predominantly parallel process may model speech rhythm as a separate generative distribution such that phoneme duration may be sampled at inference. Additional information such as pitch or energy may also be sampled to provide improved …
Who is the assignee on this patent?
Nvidia Corp
What technology area does this patent fall under?
Primary CPC classification G10L13/047. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 26, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).