US12488778B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12488778-B2 |
| Application number | US-202318099840-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 20, 2023 |
| Priority date | Jul 26, 2022 |
| Publication date | Dec 2, 2025 |
| Grant date | Dec 2, 2025 |
Official abstract text for this publication.
Disclosed are apparatuses, systems, and techniques that may use machine learning for implementing generative text-to-speech models. The techniques include identifying a mapping of speech characteristics (SC) on a target distribution of a latent variable using a non-linear transformation for at least a subset of the SC. Parameters of the non-linear transformation are determined using a neural network that approximates a statistics of the SC with a statistics predicted for the SC based on the identified mapping and the target distribution of the latent variable.
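One way to picture the mapping in the abstract is a coupling-style normalizing flow: an invertible step transforms part of the SC vector, conditioned on the part it leaves unchanged, and the result is scored against a Gaussian latent distribution. The sketch below is illustrative only and is not the patented implementation. It uses an affine coupling step, which is simpler than the second-order polynomial transformations named in the claims, and a small fixed random network in place of a trained one. The names `toy_conditioner`, `weights`, and all array shapes are hypothetical, and treating the training objective as a change-of-variables log-likelihood is an assumption.

```python
import numpy as np

def toy_conditioner(x_fixed, weights):
    """Hypothetical stand-in for the neural network that predicts the
    transformation parameters (log-scale and shift) from the unchanged half."""
    hidden = np.tanh(x_fixed @ weights["w_in"])
    return hidden @ weights["w_scale"], hidden @ weights["w_shift"]

def coupling_forward(x, weights):
    """Map SC values x toward the latent space; return z and log|det J|."""
    half = x.shape[-1] // 2  # assumes an even feature dimension
    x_first, x_second = x[..., :half], x[..., half:]
    log_scale, shift = toy_conditioner(x_second, weights)
    z_first = x_first * np.exp(log_scale) + shift  # transform the first half only
    log_det = log_scale.sum(axis=-1)               # Jacobian is triangular
    return np.concatenate([z_first, x_second], axis=-1), log_det

def coupling_inverse(z, weights):
    """Exact inverse of coupling_forward (latent -> SC)."""
    half = z.shape[-1] // 2
    z_first, z_second = z[..., :half], z[..., half:]
    log_scale, shift = toy_conditioner(z_second, weights)
    x_first = (z_first - shift) * np.exp(-log_scale)
    return np.concatenate([x_first, z_second], axis=-1)

def gaussian_log_likelihood(x, weights):
    """log p(x) = log N(z; 0, I) + log|det J| (change of variables)."""
    z, log_det = coupling_forward(x, weights)
    dim = z.shape[-1]
    log_pz = -0.5 * (z ** 2).sum(axis=-1) - 0.5 * dim * np.log(2.0 * np.pi)
    return log_pz + log_det

# Toy data: 3 sequences of 8 SC values each (made-up numbers, not real speech).
rng = np.random.default_rng(0)
weights = {
    "w_in": rng.normal(scale=0.1, size=(4, 16)),
    "w_scale": rng.normal(scale=0.1, size=(16, 4)),
    "w_shift": rng.normal(scale=0.1, size=(16, 4)),
}
x = rng.normal(size=(3, 8))
z, log_det = coupling_forward(x, weights)

# Generation direction: draw from the Gaussian target, then invert the mapping.
z_sampled = rng.normal(size=(3, 8))
x_sampled = coupling_inverse(z_sampled, weights)
```

Because the second half passes through unchanged, the same conditioner output can be recomputed during inversion, which is what makes the step exactly invertible. A practical flow would stack several such steps and alternate which half is transformed.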
Opening claim text (preview).
What is claimed is:
1. A method to obtain a speech model, the method comprising: filling, with synthetic values, one or more gaps in a time series of a speech characteristics (SC); identifying, using one or more iterations, a mapping of the time series of the SC on a target distribution of a latent variable, wherein each of the one or more iterations comprises a non-linear invertible transformation of at least a subset of the time series of the SC, and wherein parameters of the non-linear invertible transformations are determined using a neural network that approximates a statistics of the time series of the SC with a statistics predicted for the SC based on the identified mapping and the target distribution of the latent variable; and generating, using the identified mapping, a speech signal corresponding to an input text.
2. The method of claim 1, wherein the non-linear invertible transformation comprises a plurality of non-linear transformations, each of the plurality of non-linear transformations used for a respective domain of a plurality of domains of the SC.
3. The method of claim 2, wherein each of the plurality of non-linear transformations comprises a second-order polynomial transformation.
4. The method of claim 1, wherein the target distribution is a Gaussian distribution.
5. The method of claim 1, wherein the subset of the time series of the SC comprises a first half of the time series of the SC, and wherein each of the one or more iterations keeps unchanged a second half of the time series of the SC.
6. The method of claim 1, further comprising: identifying an additional mapping of a time series of an additional SC on an additional target distribution of an additional latent variable, wherein identifying the additional mapping comprises identifying an additional non-linear invertible transformation of at least a subset of the time series of the additional SC.
7. The method of claim 6, wherein the SC comprises a representation of a frequency of a speech, and wherein the additional SC comprises a representation of an amplitude of the speech.
8. The method of claim 1, wherein the synthetic values for each gap of the one or more gaps are determined based on a local neighborhood of the SC adjacent to a respective gap of the one or more gaps.
9. The method of claim 1, wherein the synthetic values for each gap of the one or more gaps are determined using a context neural network that correlates a respective gap of the one or more gaps with a spoken phoneme sequence.
10. The method of claim 9, wherein an output of the context neural network is modified using a mask that identifies individual frames of the time series as one of a voiced frame or an unvoiced frame.
11. The method of claim 1, further comprising: grouping the time series of the SC into data units comprising values of the SC associated with two or more different times.
12. The method of claim 11, wherein each of the data units further comprises one or more discrete time derivatives of the SC.
13. The method of claim 1, wherein the neural network is trained to approximate the statistics of the times series of the SC in view of a spoken phoneme sequence.
14. The method of claim 1, wherein generating the speech signal comprises: probabilistically sampling the SC using the target distribution of the latent variable and the identified mapping.
15. A system comprising: a memory device; and one or more processing devices, communicatively coupled to the memory device, to: fill, with synthetic values, one or more gaps in a time series of a speech characteristics (SC); identify, using one or more iterations, a mapping of a time series of a speech characteristics (SC) on a target distribution of a latent variable, wherein each of the one or more iterations comprises a non-linear invertible transformation of at least a subset of the time series of the SC, and wherein parameters of the non-linear invertible transformation are determined using a neural network that approximates a statistics of the time series of the SC with a statistics predicted for the SC based on the identified mapping and the target distribution of the latent variable; and generate, using the identified mapping, a speech signal corresponding to an input text.
16. The system of claim 15, wherein the non-linear invertible transformation comprises a plurality of non-linear transformations, each of the plurality of non-linear transformations used for a respective domain of a plurality of domains of the SC.
17. The system of claim 15, wherein the one or more processing devices are further to: group the time series of the SC into data units comprising values of the SC associated with two or more different times.
18. The system of claim 17, wherein each of the data units further comprises one or more discrete time derivatives of the SC.
19. A non-transitory computer-readable medium storing instructions thereon, wherein the instructions, when executed by a processing device, cause the processing device to: fill, with synthetic values, one or more gaps in a time series of a speech characteristics (SC); identify, using one or more iterations, a mapping of a time series of a speech characteristics (SC) on a target distribution of a latent variable, wherein each of the one or more iterations comprises a non-linear invertible transformation of at least a subset of the time series of the SC, and wherein parameters of the non-linear invertible transformation are determined using a neural network that approximates a statistics of the time series of the SC with a statistics predicted for the SC based on the identified mapping and the target distribution of the latent variable; and generate, using the identified mapping, a speech signal corresponding to an input text.
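The preprocessing steps in the claims (filling gaps from a local neighborhood, then grouping values from several times into data units with discrete time derivatives) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's procedure: gaps are represented as NaN, the fill is plain linear interpolation between the nearest known neighbors, the derivative is a finite difference from `np.gradient`, and the function names `fill_gaps` and `group_with_derivative` and the sample pitch values are hypothetical.

```python
import numpy as np

def fill_gaps(sc):
    """Replace NaN gaps with values interpolated from neighboring known frames.
    Assumes at least one frame is known."""
    sc = np.asarray(sc, dtype=float)
    idx = np.arange(sc.size)
    known = ~np.isnan(sc)
    filled = sc.copy()
    # np.interp interpolates linearly inside the known range and holds the
    # nearest known value for gaps at either end of the series.
    filled[~known] = np.interp(idx[~known], idx[known], sc[known])
    return filled

def group_with_derivative(sc, unit=2):
    """Group consecutive frames into data units of `unit` values each and
    append a finite-difference time derivative for every frame in the unit."""
    sc = np.asarray(sc, dtype=float)
    n = (sc.size // unit) * unit          # drop trailing frames that do not fit
    values = sc[:n].reshape(-1, unit)
    delta = np.gradient(sc)[:n].reshape(-1, unit)
    return np.concatenate([values, delta], axis=-1)

# Made-up pitch contour in Hz; NaN marks frames with no measured value.
f0 = np.array([100.0, np.nan, np.nan, 130.0, 120.0, np.nan, 110.0, 105.0])
filled = fill_gaps(f0)
units = group_with_derivative(filled, unit=2)
```

With `unit=2`, each row of `units` holds two consecutive SC values followed by their two derivative estimates, so an 8-frame series becomes a 4 x 4 array. A learned, phoneme-aware fill as in claim 9 would replace `fill_gaps` with a model; that variant is not sketched here.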
Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination · CPC title
using neural networks · CPC title
Architecture of speech synthesisers · CPC title
Concept to speech synthesisers; Generation of natural phrases from machine-based concepts (generation of parameters for speech synthesis out of text G10L13/08) · CPC title