Synthetic speech processing
US-11735156-B1 · Aug 22, 2023 · US
US12027151B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12027151-B2 |
| Application number | US-202117455667-A |
| Country | US |
| Kind code | B2 |
| Filing date | Nov 18, 2021 |
| Priority date | Dec 11, 2020 |
| Publication date | Jul 2, 2024 |
| Grant date | Jul 2, 2024 |
Official abstract text for this publication.
A linguistic content and speaking style disentanglement model includes a content encoder, a style encoder, and a decoder. The content encoder is configured to receive input speech as input and generate a latent representation of linguistic content for the input speech as output. The content encoder is trained to disentangle speaking style information from the latent representation of linguistic content. The style encoder is configured to receive the input speech as input and generate a latent representation of speaking style for the input speech as output. The style encoder is trained to disentangle linguistic content information from the latent representation of speaking style. The decoder is configured to generate output speech based on the latent representation of linguistic content for the input speech and the latent representation of speaking style for the same or different input speech.
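The data flow described in the abstract can be sketched with a toy example. This is only an illustration under loud assumptions, not the patented model: `style_encoder`, `content_encoder`, and `decoder` are hypothetical stand-ins (a global mean, rounded per-frame offsets, and simple addition) that replace the trained neural networks. The sketch shows only that content and style are encoded separately and can be recombined from the same or a different utterance.

```python
from statistics import fmean

def style_encoder(frames):
    """Toy stand-in: one global value summarising the whole utterance."""
    return fmean(frames)

def content_encoder(frames):
    """Toy stand-in: discrete per-timestep codes with the global level removed."""
    level = fmean(frames)
    return [round(x - level) for x in frames]

def decoder(content, style):
    """Toy stand-in: recombine per-timestep content with one global style value."""
    return [c + style for c in content]

# Two hypothetical "utterances" given as per-frame feature values.
utt_a = [101.0, 103.0, 99.0, 97.0]
utt_b = [52.0, 48.0, 50.0]

# Same utterance for content and style: the decoder reconstructs the input.
recon = decoder(content_encoder(utt_a), style_encoder(utt_a))

# Content from utt_a, style from utt_b: content is kept, style is swapped.
transfer = decoder(content_encoder(utt_a), style_encoder(utt_b))

print(recon)     # [101.0, 103.0, 99.0, 97.0]
print(transfer)  # [51.0, 53.0, 49.0, 47.0]
```

In this toy, the per-frame pattern of `utt_a` survives the transfer while the global level comes from `utt_b`, which mirrors the "same or different input speech" wording.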
Opening claim text (preview).
What is claimed is:
1. A linguistic content and speaking style disentanglement model, the model comprising: a content encoder configured to: receive, as input, input speech; and generate, as output, a latent representation of linguistic content for the input speech, the content encoder trained to disentangle speaking style information from the latent representation of linguistic content; a style encoder configured to: receive, as input, the same or different input speech; and generate, as output, a latent representation of speaking style for the same or different input speech, the style encoder trained to disentangle linguistic content information from the latent representation of speaking style; and a decoder configured to generate output speech based on the latent representation of linguistic content for the input speech and the latent representation of speaking style for the same or different input speech; wherein the content encoder comprises: one or more convolutional layers configured to receive the input speech as input and generate an initial discrete per-timestep latent representation of the linguistic content; and a vector quantization (VQ) layer configured to apply an information bottleneck with straight-through gradients on each initial discrete per-timestep latent representation of the linguistic content to generate the latent representation of linguistic content as a sequence of latent variables representing the linguistic content from the input speech; and wherein the style encoder comprises: one or more convolutional layers configured to receive the input speech as input; and a variational layer with Gaussian posterior configured to summarize an output from the one or more convolutional layers with a global average pooling operation across the time-axis to extract a global latent style variable that corresponds to the latent representation of speaking style; and wherein the content encoder and the style encoder are trained using a mutual information loss to minimize mutual information captured in the latent representations of linguistic content and speaking style.
2. The model of claim 1, wherein the content encoder generates the latent representation of linguistic content as a discrete per-timestep latent representation of linguistic content that discards speaking style variations in the input speech.
3. The model of claim 1, wherein the content encoder is trained using a content VQ loss based on the latent representations of linguistic content generated for each timestep, the content VQ loss encouraging the content encoder to minimize a distance between an output and a nearest codebook.
4. The model of claim 1, wherein: during training, the global style latent variable is sampled from a mean and variance of style latent variables predicted by the style encoder; and during inference, the global style latent variable is sampled from the mean of the global latent style variables predicted by the style encoder.
5. The model of claim 1, wherein the style encoder is trained using a style regularization loss based on a mean and variance of style latent variables predicted by the style encoder, the style encoder using the style regularization loss to minimize a Kullback-Leibler (KL) divergence between a Gaussian posterior with a unit Gaussian prior.
6. The model of claim 1, wherein the decoder is configured to: receive, as input, the latent representation of linguistic content for the input speech and the latent representation of speaking style for the same input speech; and generate, as output, the output speech comprising a reconstruction of the input speech.
7. The model of claim 6, wherein the model is trained using a reconstruction loss between the input speech and the reconstruction of the input speech output from the decoder.
8. The model of claim 1, wherein the decoder is configured to: receive, as input, the latent representation of linguistic content for the input speech and the latent representation of speaking style for the different input speech; and generate, as output, the output speech comprising linguistic content information specified by the input speech and speaking style information specified by the different input speech.
9. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations comprising: receiving input speech; processing, using a content encoder, the input speech to generate a latent representation of linguistic content for the input speech, wherein the content encoder is trained to disentangle speaking style information from the latent representation of linguistic content; processing, using a style encoder, the same or different input speech to generate a latent representation of speaking style for the same or different input speech, wherein the style encoder is trained to disentangle linguistic content information from the latent representation of speaking style; and processing, using a decoder, the latent representation of linguistic content for the input speech and the latent representation of speaking style for the same or different input speech to generate output speech; wherein the content encoder comprises: one or more convolutional layers configured to receive the input speech as input and generate an initial discrete per-timestep latent representation of the linguistic content; and a vector quantization (VQ) layer configured to apply an information bottleneck with straight-through gradients on each initial discrete per-timestep latent representation of the linguistic content to generate the latent representation of linguistic content as a sequence of latent variables representing the linguistic content from the input speech; and wherein the style encoder comprises: one or more convolutional layers configured to receive the input speech as input; and a variational layer with Gaussian posterior configured to summarize an output from the one or more convolutional layers with a global average pooling operation across the time-axis to extract a global latent style variable that corresponds to the latent representation of speaking style; and wherein the content encoder and the style encoder are trained using a mutual information loss to minimize mutual information captured in the latent representations of linguistic content and speaking style.
10. The computer-implemented method of claim 9, wherein processing the input speech to generate the latent representation of linguistic content comprises processing the input speech to generate the latent representation of linguistic content as a discrete per-timestep latent representation of linguistic content that discards speaking style variations in the input speech.
11. The computer-implemented method of claim 9, wherein the content encoder is trained using a content VQ loss based on the latent representations of linguistic content generated for each timestep, the content VQ loss encouraging the content encoder to minimize a distance between an output and a nearest codebook.
12. The computer-implemented method of claim 9, wherein the operations further comprise: during training, sampling the global style latent variable from a mean and variance of style latent variables predicted by the style encoder; and during inference, sampling the global style latent variable from the mean of the global latent style variables predicted by the style encoder.
13. The computer-implemented method of claim 9, wherein the style encoder is trained using a style regularization loss based on a mean and variance of style latent variables predicted by the style encoder, t
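Several of the claimed operations can be illustrated numerically: the nearest-codebook lookup and content VQ loss (claims 3 and 11), global average pooling across the time axis, the training versus inference sampling rule (claims 4 and 12), and the KL divergence to a unit Gaussian (claim 5). The sketch below is a minimal pure-Python illustration, not the claimed implementation. All function names are hypothetical. The convolutional layers, the straight-through gradient, and the mutual information loss are omitted because they need an autodiff framework. Splitting the pooled vector into a mean half and a log-variance half is an assumption made only for this example.

```python
import math
import random

def quantize(frame, codebook):
    """Return (index, code) of the codebook entry nearest to `frame` (squared L2)."""
    def sq_dist(code):
        return sum((f - c) ** 2 for f, c in zip(frame, code))
    index = min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))
    return index, codebook[index]

def vq_loss(frames, codebook):
    """Mean squared distance between each per-timestep output and its nearest code."""
    total = 0.0
    for frame in frames:
        _, code = quantize(frame, codebook)
        total += sum((f - c) ** 2 for f, c in zip(frame, code))
    return total / len(frames)

def global_average_pool(frames):
    """Average per-timestep feature vectors across the time axis."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]

def sample_style(mean, log_var, training, rng=random):
    """Training: draw mean + std * noise. Inference: return the mean itself."""
    if not training:
        return list(mean)
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]

def kl_to_unit_gaussian(mean, log_var):
    """KL divergence from N(mean, diag(exp(log_var))) to the unit Gaussian N(0, I)."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mean, log_var))

# Content path: hypothetical 2-dim per-timestep outputs and a 2-entry codebook.
content_frames = [[0.9, 0.1], [0.2, 1.1], [1.0, 0.0]]
codebook = [[1.0, 0.0], [0.0, 1.0]]
indices = [quantize(f, codebook)[0] for f in content_frames]
content_loss = vq_loss(content_frames, codebook)

# Style path: pool 4-dim features over time, then split (an assumption made
# for this sketch) into a 2-dim mean and a 2-dim log-variance.
style_features = [[0.5, -0.5, 0.0, 0.0], [1.5, 0.5, 0.0, 0.0]]
pooled = global_average_pool(style_features)
mean, log_var = pooled[:2], pooled[2:]
style_at_inference = sample_style(mean, log_var, training=False)
style_loss = kl_to_unit_gaussian(mean, log_var)

print(indices)             # [0, 1, 0]
print(style_at_inference)  # [1.0, 0.0]
print(style_loss)          # 0.5
```

The content path yields one discrete code index per timestep, while the style path collapses the whole utterance into a single vector, which matches the per-timestep versus global distinction drawn in claim 1.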
Auto-encoder networks; Encoder-decoder networks · CPC title
Quantised networks; Sparse networks; Compressed networks · CPC title
Generative networks · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title