Normalizing flows with neural splines for high-quality speech synthesis

US12488778B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12488778-B2
Application number  US-202318099840-A
Country             US
Kind code           B2
Filing date         Jan 20, 2023
Priority date       Jul 26, 2022
Publication date    Dec 2, 2025
Grant date          Dec 2, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed are apparatuses, systems, and techniques that may use machine learning for implementing generative text-to-speech models. The techniques include identifying a mapping of speech characteristics (SC) on a target distribution of a latent variable using a non-linear transformation for at least a subset of the SC. Parameters of the non-linear transformation are determined using a neural network that approximates a statistics of the SC with a statistics predicted for the SC based on the identified mapping and the target distribution of the latent variable.
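The abstract describes an invertible, non-linear mapping from speech characteristics onto a latent variable with a known target distribution. The sketch below illustrates one way such a mapping can be scored and inverted. It is not the patented implementation. The function names, the interval half-width `B`, the single quadratic segment, and the `conditioner` stand-in (a fixed formula used here in place of a trained neural network) are all assumptions made for illustration, and a standard Gaussian is assumed as the target distribution.

```python
import math

B = 3.0  # assumed half-width of the interval where the transform is non-linear


def conditioner(frozen):
    """Hypothetical stand-in for the neural network: maps the unchanged
    half of the input to a quadratic coefficient a in (-0.9, 0.9)."""
    return 0.9 * math.tanh(sum(frozen) / len(frozen))


def quad_forward(x, a):
    """Monotonic second-order polynomial on [-B, B], identity outside.
    Returns the transformed value and log|dy/dx|."""
    if abs(x) >= B:
        return x, 0.0
    u = (x + B) / (2 * B)              # rescale to [0, 1]
    v = a * u * u + (1 - a) * u        # monotonic because |a| < 1
    slope = 2 * a * u + (1 - a)        # dv/du, always positive
    return 2 * B * v - B, math.log(slope)


def quad_inverse(y, a):
    """Exact inverse of quad_forward (numerically stable quadratic root)."""
    if abs(y) >= B:
        return y
    v = (y + B) / (2 * B)
    b = 1 - a
    u = 2 * v / (b + math.sqrt(b * b + 4 * a * v))
    return 2 * B * u - B


def coupling_forward(x):
    """Transform the first half of x; keep the second half unchanged."""
    half = len(x) // 2
    active, frozen = x[:half], x[half:]
    a = conditioner(frozen)
    out, logdet = [], 0.0
    for xi in active:
        yi, ld = quad_forward(xi, a)
        out.append(yi)
        logdet += ld
    return out + frozen, logdet


def coupling_inverse(z):
    """Invert coupling_forward; the frozen half reproduces the same a."""
    half = len(z) // 2
    active, frozen = z[:half], z[half:]
    a = conditioner(frozen)
    return [quad_inverse(zi, a) for zi in active] + frozen


def log_likelihood(x):
    """Change of variables: log p(x) = log N(z; 0, 1) + log|det J|."""
    z, logdet = coupling_forward(x)
    log_gauss = sum(-0.5 * zi * zi - 0.5 * math.log(2 * math.pi) for zi in z)
    return log_gauss + logdet


if __name__ == "__main__":
    x = [0.5, -1.2, 4.0, 0.3, -0.7, 1.1]   # made-up values, not real speech data
    z, _ = coupling_forward(x)
    print(z, coupling_inverse(z), log_likelihood(x))
```

In a trained model the coefficient would come from a network fitted by maximizing this log-likelihood over recorded speech, and several such steps would typically be stacked so that every element is transformed at least once.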

First claim

Opening claim text (preview).

What is claimed is:

1. A method to obtain a speech model, the method comprising: filling, with synthetic values, one or more gaps in a time series of a speech characteristics (SC); identifying, using one or more iterations, a mapping of the time series of the SC on a target distribution of a latent variable, wherein each of the one or more iterations comprises a non-linear invertible transformation of at least a subset of the time series of the SC, and wherein parameters of the non-linear invertible transformations are determined using a neural network that approximates a statistics of the time series of the SC with a statistics predicted for the SC based on the identified mapping and the target distribution of the latent variable; and generating, using the identified mapping, a speech signal corresponding to an input text.

2. The method of claim 1, wherein the non-linear invertible transformation comprises a plurality of non-linear transformations, each of the plurality of non-linear transformations used for a respective domain of a plurality of domains of the SC.

3. The method of claim 2, wherein each of the plurality of non-linear transformations comprises a second-order polynomial transformation.

4. The method of claim 1, wherein the target distribution is a Gaussian distribution.

5. The method of claim 1, wherein the subset of the time series of the SC comprises a first half of the time series of the SC, and wherein each of the one or more iterations keeps unchanged a second half of the time series of the SC.

6. The method of claim 1, further comprising: identifying an additional mapping of a time series of an additional SC on an additional target distribution of an additional latent variable, wherein identifying the additional mapping comprises identifying an additional non-linear invertible transformation of at least a subset of the time series of the additional SC.

7. The method of claim 6, wherein the SC comprises a representation of a frequency of a speech, and wherein the additional SC comprises a representation of an amplitude of the speech.

8. The method of claim 1, wherein the synthetic values for each gap of the one or more gaps are determined based on a local neighborhood of the SC adjacent to a respective gap of the one or more gaps.

9. The method of claim 1, wherein the synthetic values for each gap of the one or more gaps are determined using a context neural network that correlates a respective gap of the one or more gaps with a spoken phoneme sequence.

10. The method of claim 9, wherein an output of the context neural network is modified using a mask that identifies individual frames of the time series as one of a voiced frame or an unvoiced frame.

11. The method of claim 1, further comprising: grouping the time series of the SC into data units comprising values of the SC associated with two or more different times.

12. The method of claim 11, wherein each of the data units further comprises one or more discrete time derivatives of the SC.

13. The method of claim 1, wherein the neural network is trained to approximate the statistics of the times series of the SC in view of a spoken phoneme sequence.

14. The method of claim 1, wherein generating the speech signal comprises: probabilistically sampling the SC using the target distribution of the latent variable and the identified mapping.

15. A system comprising: a memory device; and one or more processing devices, communicatively coupled to the memory device, to: fill, with synthetic values, one or more gaps in a time series of a speech characteristics (SC); identify, using one or more iterations, a mapping of a time series of a speech characteristics (SC) on a target distribution of a latent variable, wherein each of the one or more iterations comprises a non-linear invertible transformation of at least a subset of the time series of the SC, and wherein parameters of the non-linear invertible transformation are determined using a neural network that approximates a statistics of the time series of the SC with a statistics predicted for the SC based on the identified mapping and the target distribution of the latent variable; and generate, using the identified mapping, a speech signal corresponding to an input text.

16. The system of claim 15, wherein the non-linear invertible transformation comprises a plurality of non-linear transformations, each of the plurality of non-linear transformations used for a respective domain of a plurality of domains of the SC.

17. The system of claim 15, wherein the one or more processing devices are further to: group the time series of the SC into data units comprising values of the SC associated with two or more different times.

18. The system of claim 17, wherein each of the data units further comprises one or more discrete time derivatives of the SC.

19. A non-transitory computer-readable medium storing instructions thereon, wherein the instructions, when executed by a processing device, cause the processing device to: fill, with synthetic values, one or more gaps in a time series of a speech characteristics (SC); identify, using one or more iterations, a mapping of a time series of a speech characteristics (SC) on a target distribution of a latent variable, wherein each of the one or more iterations comprises a non-linear invertible transformation of at least a subset of the time series of the SC, and wherein parameters of the non-linear invertible transformation are determined using a neural network that approximates a statistics of the time series of the SC with a statistics predicted for the SC based on the identified mapping and the target distribution of the latent variable; and generate, using the identified mapping, a speech signal corresponding to an input text.
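The preprocessing steps recited in the claims (filling gaps from a local neighborhood, then grouping frames into data units that carry discrete time derivatives) can be pictured with a small sketch. This is only one possible reading. Linear interpolation between the nearest observed neighbors, the use of `None` to mark a gap, the unit size of two frames, and the function names `fill_gaps` and `group_with_deltas` are all assumptions for illustration and are not taken from the patent.

```python
def fill_gaps(series):
    """Replace None entries with synthetic values interpolated linearly
    between the nearest observed neighbours; gaps at either edge copy
    the nearest observed value."""
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            w = (i - left) / (right - left)
            filled[i] = (1 - w) * series[left] + w * series[right]
    return filled


def group_with_deltas(series, size=2):
    """Group consecutive frames into units of `size` values and append
    the first-order differences between neighbouring values in the unit.
    Trailing frames that do not fill a whole unit are dropped."""
    units = []
    for start in range(0, len(series) - size + 1, size):
        values = series[start:start + size]
        deltas = [b - a for a, b in zip(values, values[1:])]
        units.append(values + deltas)
    return units


if __name__ == "__main__":
    # Made-up pitch-like contour; None marks frames with no measured value.
    contour = [None, 100.0, None, None, 130.0, 120.0, None, None]
    filled = fill_gaps(contour)
    print(filled)
    print(group_with_deltas(filled, size=2))
```

Filling the gaps first gives the flow a continuous series to model, and the grouped units let each transformation step see values from more than one time frame together with their local rate of change.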

Assignees

Nvidia Corp

Inventors

Classifications

  • Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination · CPC title

  • G10L25/30 (Primary)

    using neural networks · CPC title

  • Architecture of speech synthesisers · CPC title

  • G10L13/027 (Primary)

    Concept to speech synthesisers; Generation of natural phrases from machine-based concepts (generation of parameters for speech synthesis out of text G10L13/08) · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12488778B2 cover?
Disclosed are apparatuses, systems, and techniques that may use machine learning for implementing generative text-to-speech models. The techniques include identifying a mapping of speech characteristics (SC) on a target distribution of a latent variable using a non-linear transformation for at least a subset of the SC. Parameters of the non-linear transformation are determined using a neural ne…
Who is the assignee on this patent?
Nvidia Corp
What technology area does this patent fall under?
Primary CPC classification G10L25/30. Mapped technology areas include Physics.
When was this patent published?
Publication date Dec 2, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).