Unsupervised learning of disentangled speech content and style representation

US12027151B2 · US · B2

Patent metadata
Publication number: US-12027151-B2
Application number: US-202117455667-A
Country: US
Kind code: B2
Filing date: Nov 18, 2021
Priority date: Dec 11, 2020
Publication date: Jul 2, 2024
Grant date: Jul 2, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A linguistic content and speaking style disentanglement model includes a content encoder, a style encoder, and a decoder. The content encoder is configured to receive input speech as input and generate a latent representation of linguistic content for the input speech output. The content encoder is trained to disentangle speaking style information from the latent representation of linguistic content. The style encoder is configured to receive the input speech as input and generate a latent representation of speaking style for the input speech as output. The style encoder is trained to disentangle linguistic content information from the latent representation of speaking style. The decoder is configured to generate output speech based on the latent representation of linguistic content for the input speech and the latent representation of speaking style for the same or different input speech.
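The data flow in the abstract can be illustrated with a toy sketch. The three functions below are hand-written stand-ins, not the trained networks of the patent: they only assume that content is a per-timestep sequence, that style is a single global vector per utterance, and that the decoder combines the two. All names and numbers are hypothetical.

```python
from typing import List

Frames = List[List[float]]  # one feature vector per timestep


def style_encoder(speech: Frames) -> List[float]:
    """Toy stand-in: one global vector, the mean over the time axis."""
    n = len(speech)
    return [sum(col) / n for col in zip(*speech)]


def content_encoder(speech: Frames) -> Frames:
    """Toy stand-in: a per-timestep sequence with the global mean removed."""
    mean = style_encoder(speech)
    return [[x - m for x, m in zip(frame, mean)] for frame in speech]


def decoder(content: Frames, style: List[float]) -> Frames:
    """Toy stand-in: broadcast the global style vector over every timestep."""
    return [[c + s for c, s in zip(frame, style)] for frame in content]


# Hypothetical feature sequences for two utterances.
speech_a = [[1.0, 2.0], [3.0, 6.0], [5.0, 4.0]]
speech_b = [[10.0, 0.0], [14.0, 2.0]]

# Same input for both encoders: the output reconstructs the input.
reconstruction = decoder(content_encoder(speech_a), style_encoder(speech_a))

# Different inputs: content taken from A, style taken from B.
transferred = decoder(content_encoder(speech_a), style_encoder(speech_b))
```

In this toy setting the transferred output keeps the per-timestep content of the first utterance and the global style vector of the second, which mirrors the "same or different input speech" wording of the abstract. A real model would learn both encoders and the decoder from data.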

First claim

Opening claim text (preview).

What is claimed is:

1. A linguistic content and speaking style disentanglement model, the model comprising: a content encoder configured to: receive, as input, input speech; and generate, as output, a latent representation of linguistic content for the input speech, the content encoder trained to disentangle speaking style information from the latent representation of linguistic content; a style encoder configured to: receive, as input, the same or different input speech; and generate, as output, a latent representation of speaking style for the same or different input speech, the style encoder trained to disentangle linguistic content information from the latent representation of speaking style; and a decoder configured to generate output speech based on the latent representation of linguistic content for the input speech and the latent representation of speaking style for the same or different input speech; wherein the content encoder comprises: one or more convolutional layers configured to receive the input speech as input and generate an initial discrete per-timestep latent representation of the linguistic content; and a vector quantization (VQ) layer configured to apply an information bottleneck with straight-through gradients on each initial discrete per-timestep latent representation of the linguistic content to generate the latent representation of linguistic content as a sequence of latent variables representing the linguistic content from the input speech; and wherein the style encoder comprises: one or more convolutional layers configured to receive the input speech as input; and a variational layer with Gaussian posterior configured to summarize an output from the one or more convolutional layers with a global average pooling operation across the time-axis to extract a global latent style variable that corresponds to the latent representation of speaking style; and wherein the content encoder and the style encoder are trained using a mutual information loss to minimize mutual information captured in the latent representations of linguistic content and speaking style.

2. The model of claim 1, wherein the content encoder generates the latent representation of linguistic content as a discrete per-timestep latent representation of linguistic content that discards speaking style variations in the input speech.

3. The model of claim 1, wherein the content encoder is trained using a content VQ loss based on the latent representations of linguistic content generated for each timestep, the content VQ loss encouraging the content encoder to minimize a distance between an output and a nearest codebook.

4. The model of claim 1, wherein: during training, the global style latent variable is sampled from a mean and variance of style latent variables predicted by the style encoder; and during inference, the global style latent variable is sampled from the mean of the global latent style variables predicted by the style encoder.

5. The model of claim 1, wherein the style encoder is trained using a style regularization loss based on a mean and variance of style latent variables predicted by the style encoder, the style encoder using the style regularization loss to minimize a Kullback-Leibler (KL) divergence between a Gaussian posterior with a unit Gaussian prior.

6. The model of claim 1, wherein the decoder is configured to: receive, as input, the latent representation of linguistic content for the input speech and the latent representation of speaking style for the same input speech; and generate, as output, the output speech comprising a reconstruction of the input speech.

7. The model of claim 6, wherein the model is trained using a reconstruction loss between the input speech and the reconstruction of the input speech output from the decoder.

8. The model of claim 1, wherein the decoder is configured to: receive, as input, the latent representation of linguistic content for the input speech and the latent representation of speaking style for the different input speech; and generate, as output, the output speech comprising linguistic content information specified by the input speech and speaking style information specified by the different input speech.

9. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations comprising: receiving input speech; processing, using a content encoder, the input speech to generate a latent representation of linguistic content for the input speech, wherein the content encoder is trained to disentangle speaking style information from the latent representation of linguistic content; processing, using a style encoder, the same or different input speech to generate a latent representation of speaking style for the same or different input speech, wherein the style encoder is trained to disentangle linguistic content information from the latent representation of speaking style; and processing, using a decoder, the latent representation of linguistic content for the input speech and the latent representation of speaking style for the same or different input speech to generate output speech; wherein the content encoder comprises: one or more convolutional layers configured to receive the input speech as input and generate an initial discrete per-timestep latent representation of the linguistic content; and a vector quantization (VQ) layer configured to apply an information bottleneck with straight-through gradients on each initial discrete per-timestep latent representation of the linguistic content to generate the latent representation of linguistic content as a sequence of latent variables representing the linguistic content from the input speech; and wherein the style encoder comprises: one or more convolutional layers configured to receive the input speech as input; and a variational layer with Gaussian posterior configured to summarize an output from the one or more convolutional layers with a global average pooling operation across the time-axis to extract a global latent style variable that corresponds to the latent representation of speaking style; and wherein the content encoder and the style encoder are trained using a mutual information loss to minimize mutual information captured in the latent representations of linguistic content and speaking style.

10. The computer-implemented method of claim 9, wherein processing the input speech to generate the latent representation of linguistic content comprises processing the input speech to generate the latent representation of linguistic content as a discrete per-timestep latent representation of linguistic content that discards speaking style variations in the input speech.

11. The computer-implemented method of claim 9, wherein the content encoder is trained using a content VQ loss based on the latent representations of linguistic content generated for each timestep, the content VQ loss encouraging the content encoder to minimize a distance between an output and a nearest codebook.

12. The computer-implemented method of claim 9, wherein the operations further comprise: during training, sampling the global style latent variable from a mean and variance of style latent variables predicted by the style encoder; and during inference, sampling the global style latent variable from the mean of the global latent style variables predicted by the style encoder.

13. The computer-implemented method of claim 9, wherein the style encoder is trained using a style regularization loss based on a mean and variance of style latent variables predicted by the style encoder, t
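Several mechanisms named in the claims (nearest-codebook vector quantization, straight-through gradients, global average pooling, a Gaussian posterior regularized toward a unit Gaussian) have standard textbook forms. The sketch below shows those forms in plain Python as a rough illustration only. It is not the patented implementation: the function names, the toy codebook, and the use of two separate feature streams for mean and log-variance are assumptions made for the example, and the mutual information loss is not covered.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = List[float]


def nearest_code(z: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Index of the codebook entry closest to z (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(z, e)) for e in codebook]
    return dists.index(min(dists))


def vector_quantize(
    latents: Sequence[Sequence[float]], codebook: Sequence[Sequence[float]]
) -> Tuple[List[int], List[Vector], float]:
    """Quantize each per-timestep latent and report the mean squared distance
    between encoder outputs and their chosen codebook entries."""
    codes = [nearest_code(z, codebook) for z in latents]
    quantized = [list(codebook[k]) for k in codes]
    vq_loss = sum(
        sum((a - b) ** 2 for a, b in zip(z, q)) for z, q in zip(latents, quantized)
    ) / len(latents)
    return codes, quantized, vq_loss


def straight_through_grad(grad_wrt_quantized: Sequence[Sequence[float]]) -> List[Vector]:
    """Straight-through estimator: the lookup is treated as the identity in the
    backward pass, so the gradient is copied unchanged to the encoder output."""
    return [list(g) for g in grad_wrt_quantized]


def global_average_pool(features: Sequence[Sequence[float]]) -> Vector:
    """Average per-timestep features across the time axis."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def sample_style(mean: Vector, log_var: Vector, training: bool, rng: random.Random) -> Vector:
    """Training: reparameterized sample from N(mean, var). Inference: the mean."""
    if not training:
        return list(mean)
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mean, log_var)]


def kl_to_unit_gaussian(mean: Vector, log_var: Vector) -> float:
    """Closed-form KL divergence between N(mean, var) and N(0, I)."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mean, log_var))


# Hypothetical toy values for illustration.
codebook = [[0.0, 0.0], [1.0, 1.0]]
latents = [[0.1, -0.1], [0.9, 1.2], [0.6, 0.6]]
codes, quantized, vq_loss = vector_quantize(latents, codebook)

rng = random.Random(0)
mean = global_average_pool([[0.5, -1.0], [1.5, 1.0]])       # assumed mean stream
log_var = global_average_pool([[-2.0, -1.0], [0.0, -1.0]])  # assumed log-variance stream
style_train = sample_style(mean, log_var, training=True, rng=rng)
style_infer = sample_style(mean, log_var, training=False, rng=rng)
style_kl = kl_to_unit_gaussian(mean, log_var)
```

In an automatic-differentiation framework the straight-through step is usually written as adding a gradient-stopped difference between the quantized and unquantized vectors. The explicit gradient copy above is an equivalent way to state the same idea without a framework.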

Assignees

Google LLC

Inventors

Classifications

  • Auto-encoder networks; Encoder-decoder networks · CPC title

  • Quantised networks; Sparse networks; Compressed networks · CPC title

  • Generative networks · CPC title

  • Convolutional networks [CNN, ConvNet] · CPC title

  • Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12027151B2 cover?
A linguistic content and speaking style disentanglement model includes a content encoder, a style encoder, and a decoder. The content encoder is configured to receive input speech as input and generate a latent representation of linguistic content for the input speech output. The content encoder is trained to disentangle speaking style information from the latent representation of linguistic co…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L13/027. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 2, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).