Acoustic model generation method and device, and speech synthesis method
US-10614795-B2 · Apr 7, 2020 · US
US12579991B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12579991-B2 |
| Application number | US-202118248808-A |
| Country | US |
| Kind code | B2 |
| Filing date | Oct 15, 2021 |
| Priority date | Oct 16, 2020 |
| Publication date | Mar 17, 2026 |
| Grant date | Mar 17, 2026 |
Official abstract text for this publication.
A neural network system is provided, implementing a generative model for autoregressively generating a distribution for a plurality of current filter-bank samples of an audio signal, wherein the current samples correspond to a current time slot, and each current sample corresponds to a channel of the filter-bank. The system includes a hierarchy of a plurality of neural network processing tiers ordered from a top to a bottom tier, each tier trained to generate conditioning information based on previous filter-bank samples and, for at least each tier but the top tier, also on the conditioning information from a tier higher up in the hierarchy, and an output stage trained to generate the probability distribution based on previous samples for one or more previous time slots and the conditioning information from the lowest processing tier.
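The data flow described in the abstract can be illustrated with a toy sketch. This is not the patented implementation: the trained networks are replaced by simple averages and a Gaussian, and every name here (`NUM_CHANNELS`, `tier`, `output_stage`, `generate`, the `windows` tuple) is hypothetical. A real system would also run the tiers at different temporal resolutions, which this sketch ignores. It only shows the ordering: each tier builds conditioning from previous filter-bank samples plus the conditioning from the tier above, and the output stage samples the current time slot from previous samples plus the lowest tier's conditioning.

```python
import random

NUM_CHANNELS = 4  # hypothetical filter-bank size


def tier(prev_slots, cond_from_above, window):
    """Toy stand-in for one trained tier: average the last `window` time
    slots per channel, then add the conditioning passed down from above."""
    recent = prev_slots[-window:]
    summary = [sum(slot[c] for slot in recent) / len(recent)
               for c in range(NUM_CHANNELS)]
    if cond_from_above is None:  # the top tier has no tier above it
        return summary
    return [s + u for s, u in zip(summary, cond_from_above)]


def output_stage(prev_slots, cond, rng):
    """Toy output stage: a per-channel Gaussian whose mean depends on the
    previous slot and the lowest tier's conditioning, sampled once."""
    last = prev_slots[-1]
    means = [0.5 * last[c] + 0.5 * cond[c] for c in range(NUM_CHANNELS)]
    return [rng.gauss(m, 0.1) for m in means]


def generate(num_slots, windows=(8, 4, 2), seed=0):
    """Autoregressive loop: each new slot is appended to the history that
    the next iteration conditions on."""
    rng = random.Random(seed)
    slots = [[0.0] * NUM_CHANNELS]  # assumed all-zero initial history
    for _ in range(num_slots):
        cond = None
        for w in windows:  # top tier first, bottom tier last
            cond = tier(slots, cond, w)
        slots.append(output_stage(slots, cond, rng))
    return slots[1:]
```

Calling `generate(5)` returns five time slots of `NUM_CHANNELS` values each. The shrinking `windows` values are an illustrative choice, not something stated in the abstract.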
Opening claim text (preview).
The invention claimed is:

1. A computer implemented neural network system for autoregressively generating a plurality of current filter-bank samples of a filter-bank representation of an audio signal, wherein the current filter-bank samples correspond to a current time slot, and wherein each current filter-bank sample corresponds to a respective channel of the filter-bank, including: a hierarchy of a plurality of neural network processing tiers ordered from a top processing tier to a bottom processing tier, wherein each processing tier has been trained to generate conditioning information based on previous filter-bank samples of the filter-bank representation and, for at least each processing tier but the top tier, also on the conditioning information generated by a processing tier higher up in the hierarchy, and an output stage that has been trained to generate a probability distribution for said plurality of current filter-bank samples based on previous filter-bank samples corresponding to one or more previous time slots of the filter-bank representation and the conditioning information generated from the lowest processing tier, said output stage being configured to sample the probability distribution to obtain said plurality of current filter bank samples, wherein the output stage includes the bottom processing tier, and wherein the bottom processing tier is subdivided into a plurality of sequentially executed sub-layers, wherein each sub-layer has been trained to generate the probability distribution for one or more current filter-bank samples corresponding to a true subset of the channels of the filter-bank and, at least for all but a first executed sub-layer, each sub-layer has been trained to generate the probability distribution also based on current filter-bank samples generated by one or more previously executed sub-layers.
2. The system of claim 1, where each processing tier has been trained to generate the conditioning information also based on additional side information provided for the current time slot.
3. The system of claim 1, further including means configured for generating the plurality of current filter-bank samples of the filter-bank representation by sampling from the probability distribution.
4. The system of claim 3, wherein the probability distribution for the current filter-bank samples is obtained using a mixture model.
5. The system of claim 4, wherein generating the probability distribution includes providing an update of a linear transformation for a mixture coefficient of the mixture model, wherein the linear transformation is defined by a triangular matrix with ones on its main diagonal, and wherein the triangular matrix has a number of non-zero diagonals greater than one and smaller than the number of channels of the filter-bank.
6. The system of claim 1, wherein each processing tier includes convolutional modules configured for receiving the previous filter-bank samples of the filter-bank representation, wherein each convolutional module has a same number of input channels as a number of channels of the filter-bank, and wherein kernel sizes of the convolutional modules decrease from the top processing tier to the bottom processing tier in the hierarchy.
7. The system of claim 6, wherein each processing tier includes at least one recurrent unit configured for receiving as its input a sum of the outputs from the convolutional modules, and, for at least each processing tier but the lowest processing tier, at least one learned upsampling module configured to receive as its input an output from the at least one recurrent unit and to generate as its output the conditioning information.
8. The system of claim 7, further including an additional recurrent unit common to all sub-layers of the bottom processing tier and configured for receiving as its input a mix of i) the sum of the outputs from the convolutional modules and ii) the output of the at least one recurrent unit, and to based thereon generate additional side information to a respective sub-output stage of each sub-layer.
9. The system of claim 1, wherein the first executed sub-layer generates one or more current filter-bank samples corresponding to at least the lowest channel of the filter-bank, and wherein the last executed sub-layer generates one or more current filter-bank samples corresponding to at least the highest channel of the filter-bank.
10. The system of claim 1, wherein the probability distribution for the current filter-bank samples is obtained using a mixture model.
11. The system of claim 10, wherein generating the probability distribution includes providing an update of a linear transformation for a mixture coefficient of the mixture model, wherein the linear transformation is defined by a triangular matrix with ones on its main diagonal, and wherein the triangular matrix has a number of non-zero diagonals greater than one and smaller than the number of channels of the filter-bank.
12. The system of claim 5, wherein the sampling includes a transformation with the linear transformation.
13. A non-transitory computer readable medium storing instructions operable, when executed by at least one computer processor belonging to a computer hardware, to implement the system according to claim 1 using said computer hardware.
14. A computer implemented neural network system for autoregressively generating a plurality of current filter-bank samples of a filter-bank representation of an audio signal, wherein the current filter-bank samples correspond to a current time slot, and wherein each current filter-bank sample corresponds to a respective channel of the filter-bank, including: a hierarchy of a plurality of neural network processing tiers ordered from a top processing tier to a bottom processing tier, wherein each processing tier has been trained to generate conditioning information based on previous filter-bank samples of the filter-bank representation and, for at least each processing tier but the top tier, also on the conditioning information generated by a processing tier higher up in the hierarchy, and an output stage that has been trained to generate a probability distribution for said plurality of current filter-bank samples based on previous filter-bank samples corresponding to one or more previous time slots for the filter-bank representation and the conditioning information generated from the lowest processing tier, said output stage being configured to sample said probability distribution to obtain said plurality of current filter bank samples, wherein each processing tier includes convolutional modules configured for receiving the previous filter-bank samples of the filter-bank representation, wherein each convolutional module has a same number of input channels as a number of channels of the filter-bank, and wherein kernel sizes of the convolutional modules decrease from the top processing tier to the bottom processing tier in the hierarchy.
15. A method for autoregressively generating a plurality of current filter-bank samples of a filter-bank representation of an audio signal, wherein the current filter-bank samples correspond to a current time slot, and wherein each current filter-bank sample corresponds to a respective channel of the filter-bank, including generating and sampling a probability distribution by using the system of any one of the preceding claims.
16. The method of claim 15, comprising the steps of: using the plurality of neural network processing tiers to generate conditioning information, wherein the conditioning informa
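Two structures recited in the claims lend themselves to a small illustration: the sequentially executed sub-layers of claims 1 and 9, and the banded triangular matrix with a unit main diagonal of claims 5 and 11. The sketch below is one possible reading under loud assumptions, not the claimed system. The function names, the constant `coeff`, the Gaussian used in place of a mixture model, and the averaging used as "context" are all hypothetical placeholders for quantities a trained network would supply, and "true subset" is read here as a proper subset of the channels.

```python
import random


def sample_slot_by_sublayers(groups, base_means, rng, spread=0.1):
    """Run sub-layers one after another. Each sub-layer handles a subset of
    the channels and conditions on channels sampled by earlier sub-layers."""
    slot = {}
    for group in groups:  # e.g. lowest channels first, highest last
        # Context is built only from channels generated by earlier groups.
        context = sum(slot.values()) / len(slot) if slot else 0.0
        for ch in group:
            slot[ch] = rng.gauss(base_means[ch] + 0.5 * context, spread)
    return [slot[ch] for ch in sorted(slot)]


def banded_unit_lower_triangular(n, bands, coeff=0.3):
    """n x n lower-triangular matrix with ones on the main diagonal and
    `bands` non-zero diagonals in total, where 1 < bands < n."""
    if not 1 < bands < n:
        raise ValueError("need 1 < bands < n")
    return [[1.0 if r == c else (coeff if 0 < r - c < bands else 0.0)
             for c in range(n)] for r in range(n)]


def transform(matrix, vec):
    """Plain matrix-vector product, applied to a vector of channel values."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]
```

For example, `sample_slot_by_sublayers([[0, 1], [2, 3]], [0.0] * 4, random.Random(0))` samples channels 0 and 1 first and lets channels 2 and 3 depend on them. `banded_unit_lower_triangular(4, 2)` keeps the main diagonal and one sub-diagonal, so each transformed channel mixes in only its immediate lower neighbour, which is cheaper than a full triangular matrix.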
using neural networks · CPC title
Supervised learning · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Generative networks · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title