Normalizing flows with neural splines for high-quality speech synthesis
US-2024038212-A1 · Feb 1, 2024 · US
US12562148B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-12562148-B1 |
| Application number | US-202318128766-A |
| Country | US |
| Kind code | B1 |
| Filing date | Mar 30, 2023 |
| Priority date | Feb 9, 2023 |
| Publication date | Feb 24, 2026 |
| Grant date | Feb 24, 2026 |
Official abstract text for this publication.
An expressive speech translation system may process source speech in a source language and output synthesized speech in a target language while retaining vocal performance characteristics such as intonation, emphasis, rhythm, style, and/or emotion. The system may receive a transcript of the source speech, translate it, and generate transcript data. To generate the synthesized speech, the system may process the transcript data with a language embedding representing language-dependent speech characteristics of the target language, a speaker embedding representing speaker-dependent voice identity characteristics of a speaker, and a performance embedding representing the vocal performance characteristics of the source speech. The system may control the duration of segments of the synthesized speech to better align with corresponding segments of the source speech for the purpose of dubbing multimedia content with synthesized speech in a language different from that of the original audio.
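The duration control mentioned at the end of the abstract is spelled out in claims 2 and 6: each transcript embedding gets a predicted duration, the durations are modified using the duration of the source speech, each modified duration is converted to a number of audio frames, and each embedding is repeated that many times. The following is a minimal Python sketch of those steps, not the patented implementation. The function names, the proportional rescaling rule, and the 0.1 s frame length are assumptions made for illustration; the claims do not state how the modified durations are computed.

```python
from typing import List, Sequence


def rescale_durations(predicted_s: Sequence[float], target_total_s: float) -> List[float]:
    """Scale predicted durations so they sum to the target (source-speech) duration.

    Proportional scaling is an assumption; the claims only say the modified
    durations are determined "using the duration data".
    """
    total = sum(predicted_s)
    if total <= 0:
        raise ValueError("predicted durations must sum to a positive value")
    scale = target_total_s / total
    return [d * scale for d in predicted_s]


def durations_to_frames(durations_s: Sequence[float], frame_s: float) -> List[int]:
    """Convert durations to integer frame counts.

    Rounding the cumulative boundary (rather than each duration) keeps the
    total frame count from drifting away from the target.
    """
    frames: List[int] = []
    elapsed_s = 0.0
    frames_used = 0
    for d in durations_s:
        elapsed_s += d
        boundary = round(elapsed_s / frame_s)
        frames.append(boundary - frames_used)
        frames_used = boundary
    return frames


def expand_embeddings(
    embeddings: Sequence[Sequence[float]], frame_counts: Sequence[int]
) -> List[List[float]]:
    """Repeat each transcript embedding once per audio frame it should occupy."""
    expanded: List[List[float]] = []
    for emb, n in zip(embeddings, frame_counts):
        expanded.extend(list(emb) for _ in range(n))
    return expanded


# Hypothetical example: two transcript embeddings predicted at 0.30 s and 0.50 s,
# stretched to match a 1.0 s source segment at 0.1 s per frame.
embeddings = [[1.0, 0.0], [0.0, 1.0]]
modified = rescale_durations([0.30, 0.50], target_total_s=1.0)
frame_counts = durations_to_frames(modified, frame_s=0.1)
frame_level = expand_embeddings(embeddings, frame_counts)
print(frame_counts, len(frame_level))
```

The expanded, frame-level sequence plays the role of the "second transcript embedding data" in claim 1: it has the same content as the original transcript embeddings but occupies the duration of the source speech.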
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method comprising: receiving first multimedia content including video data and first audio data representing first speech spoken by a first speaker in source language; receiving first speaker embedding data representing first voice identity characteristics of a second speaker different from the first speaker; processing the first audio data using a first encoder to generate first performance embedding data representing first vocal performance characteristics of the first speech; receiving first data representing a first transcript to be output as synthesized speech, wherein the first data is in a target language different from the source language; receiving language embedding data representing language-dependent speech characteristics of the target language; processing the first data using a second encoder, the first performance embedding data, and the language embedding data to generate first transcript embedding data, the first transcript embedding data corresponding to a first duration; receiving duration data indicating that the first speech corresponds to a second duration different from the first duration; generating, using the first transcript embedding data and the duration data, second transcript embedding data corresponding to the second duration; processing the second transcript embedding data using a first transformation and the first speaker embedding data to generate acoustic embedding data corresponding to the first voice identity characteristics, the first transformation representing an invertible flow; processing the acoustic embedding data using a decoder and the first speaker embedding data to generate second audio data representing the synthesized speech in the target language, the synthesized speech having the first voice identity characteristics, the first vocal performance characteristics, and the second duration; and generating, using the video data and the second audio data, second multimedia content representing the video data dubbed with the second audio data.
2. The computer-implemented method of claim 1, wherein the first transcript embedding data includes a first transcript embedding corresponding to a first representation of the synthesized speech and a second transcript embedding corresponding to a second representation of the synthesized speech, further comprising: determining that the first transcript embedding corresponds to a first predicted duration; determining that the second transcript embedding corresponds to a second predicted duration; determining, using the duration data, a first modified duration for the first transcript embedding data; determining, using the duration data, a second modified duration for the second transcript embedding data; determining that the first modified duration corresponds to a first number of audio frames; determining that second first modified duration corresponds to a second number of audio frames; generating a first plurality of transcript embeddings using the first transcript embedding and the first number; generating a second plurality of transcript embeddings using the second transcript embedding and the second number; and generating the second transcript embedding data using the first plurality of transcript embeddings and the second plurality of transcript embeddings.
3. The computer-implemented method of claim 1, further comprising: processing the first audio data using a first component to generate third audio data representing the first audio data with at least a portion of noise content removed; processing the third audio data using the first encoder to generate second performance embedding data; processing fourth audio data using the first component to generate fifth audio data representing a noise content of the fourth audio data, the fourth audio data representing speech recorded in a low-noise environment; processing the fifth audio data using a third encoder to generate noise embedding data; and determining the first performance embedding data using the second performance embedding data and the noise embedding data.
4. The computer-implemented method of claim 1, further comprising: receiving third audio data representing sample speech from a training dataset; processing the third audio data using a third encoder to generate second speaker embedding data representing voice identity characteristics of a speaker of the sample speech; processing the third audio data using a fourth encoder and the second speaker embedding data to generate acoustic embedding data representing the sample speech with voice identity characteristics retained; processing the acoustic embedding data using a second transformation and the second speaker embedding data to generate first data representing the sample speech with voice identity characteristics suppressed; determining second data representing a second transcript of the sample speech; and training the second transformation using the first data and the second data to determine a third transformation, wherein the first transformation represents an inverse of the third transformation.
5. A computer-implemented method comprising: receiving first multimedia content including video data and first audio data representing first speech in source language; receiving first data representing first voice identity characteristics for synthesizing second speech; determining, using the first audio data, second data representing first vocal performance characteristics of the first speech; receiving third data representing a first transcript of the second speech in a target language; determining fourth data using the third data and the second data, the fourth data representing the first transcript and corresponding to the first vocal performance characteristics; generating, using the fourth data, the first data, and a machine learning model, fifth data representing acoustic embeddings for generating the second speech corresponding to the first voice identity characteristics; determining, using the fifth data, second audio data representing the second speech; and generating, using the video data and the second audio data, second multimedia content representing the video data dubbed with the second audio data.
6. The computer-implemented method of claim 5, wherein the fourth data includes a first transcript embedding corresponding to a first representation of the second speech and a second transcript embedding corresponding to a second representation of the second speech, and the fourth data corresponds to a first duration, the method further comprising: receiving duration data indicating that the first speech corresponds to a second duration different from the first duration; determining that the first transcript embedding corresponds to a first predicted duration; determining that the second transcript embedding corresponds to a second predicted duration; determining, using the duration data, a first modified duration for the first transcript embedding; determining, using the duration data, a second modified duration for the second transcript embedding; determining that the first modified duration corresponds to a first number of audio frames; determining that second first modified duration corresponds to a second number of audio frames; generating a first plurality of transcript embeddings using the first transcript embedding and the first number; generating a second plurality of transcript embeddings using the second transcript embedding and the second number; and generating the second data using the first plurality of transcript embeddings and the second plurality of transcript embeddings.
7. The computer-implemented method of claim 5, f
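Claims 1 and 4 rely on an invertible flow conditioned on a speaker embedding: during training, a transformation maps acoustic embeddings to a representation with voice identity suppressed, and at synthesis time its inverse maps speaker-independent transcript embeddings back to acoustic embeddings carrying the target voice. The sketch below illustrates that invertibility property only. It uses a simple element-wise affine flow rather than the neural splines named in the title, and the class name, weight matrices, and example vectors are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


class SpeakerConditionedAffineFlow:
    """Element-wise affine flow whose shift and log-scale depend on a speaker embedding.

    Illustrative stand-in for the claimed invertible flow; the weights here are
    fixed by hand, whereas a real system would learn them.
    """

    def __init__(
        self,
        shift_weights: Sequence[Sequence[float]],
        log_scale_weights: Sequence[Sequence[float]],
    ) -> None:
        # Each matrix has one row per acoustic dimension and one column per
        # speaker-embedding dimension.
        self.shift_weights = shift_weights
        self.log_scale_weights = log_scale_weights

    def _params(self, speaker: Sequence[float]) -> Tuple[List[float], List[float]]:
        shift = [sum(w * s for w, s in zip(row, speaker)) for row in self.shift_weights]
        log_scale = [sum(w * s for w, s in zip(row, speaker)) for row in self.log_scale_weights]
        return shift, log_scale

    def forward(self, acoustic: Sequence[float], speaker: Sequence[float]) -> List[float]:
        """Training direction (claim 4): remove the speaker-dependent shift and scale."""
        shift, log_scale = self._params(speaker)
        return [(x - m) * math.exp(-ls) for x, m, ls in zip(acoustic, shift, log_scale)]

    def inverse(self, latent: Sequence[float], speaker: Sequence[float]) -> List[float]:
        """Synthesis direction (claim 1): re-apply the speaker-dependent shift and scale."""
        shift, log_scale = self._params(speaker)
        return [z * math.exp(ls) + m for z, m, ls in zip(latent, shift, log_scale)]


# Hypothetical 3-dimensional acoustic space and 2-dimensional speaker embeddings.
flow = SpeakerConditionedAffineFlow(
    shift_weights=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    log_scale_weights=[[0.1, 0.0], [0.0, 0.1], [0.05, 0.05]],
)
speaker_a = [1.0, 0.0]
speaker_b = [0.0, 1.0]
acoustic_a = [0.2, -0.4, 1.5]

latent = flow.forward(acoustic_a, speaker_a)     # voice identity suppressed
restored = flow.inverse(latent, speaker_a)       # round trip recovers the input
converted = flow.inverse(latent, speaker_b)      # same content, different voice
```

Because the forward and inverse passes share the same parameters, training only the forward direction is enough to obtain the synthesis direction, which matches the relationship claim 4 draws between the trained transformation and the one used in claim 1.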
Speech to text systems (G10L15/08 takes precedence) · CPC title
Language recognition · CPC title
for estimating an emotional state · CPC title
using artificial neural networks · CPC title
involving special audio data, e.g. different tracks for different languages · CPC title