Speech translation with performance characteristics

US12562148B1 · US · B1

Patent metadata
Field               Value
Publication number  US-12562148-B1
Application number  US-202318128766-A
Country             US
Kind code           B1
Filing date         Mar 30, 2023
Priority date       Feb 9, 2023
Publication date    Feb 24, 2026
Grant date          Feb 24, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An expressive speech translation system may process source speech in a source language and output synthesized speech in a target language while retaining vocal performance characteristics such as intonation, emphasis, rhythm, style, and/or emotion. The system may receive a transcript of the source speech, translate it, and generate transcript data. To generate the synthesized speech, the system may process the transcript data with a language embedding representing language-dependent speech characteristics of the target language, a speaker embedding representing speaker-dependent voice identity characteristics of a speaker, and a performance embedding representing the vocal performance characteristics of the source speech. The system may control the duration of segments of the synthesized speech to better align with corresponding segments of the source speech for the purpose of dubbing multimedia content with synthesized speech in a language different from that of the original audio.
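The duration control mentioned at the end of the abstract can be illustrated with a small sketch. The claims on this page describe it in terms of per-embedding predicted durations, modified durations derived from duration data, and a number of audio frames for each transcript embedding. The code below is one possible reading, not the patented implementation: the uniform scaling rule, the largest-remainder rounding, the repetition of each embedding, and the names `frames_for_target`, `expand_embeddings`, and `frame_seconds` are all assumptions made for illustration.

```python
from typing import List, Sequence


def frames_for_target(predicted: Sequence[float], target_total: float,
                      frame_seconds: float = 0.0125) -> List[int]:
    """Scale predicted per-embedding durations (seconds) so they sum to
    target_total, then convert each one to a whole number of audio frames.

    frame_seconds is an assumed frame hop; the page does not state one.
    """
    total = sum(predicted)
    if total <= 0 or frame_seconds <= 0:
        raise ValueError("durations and frame length must be positive")
    scale = target_total / total                  # uniform stretch or squeeze
    target_frames = round(target_total / frame_seconds)
    exact = [d * scale / frame_seconds for d in predicted]
    counts = [int(x) for x in exact]              # floor (values are >= 0)
    # Largest-remainder rounding: give leftover frames to the embeddings
    # whose fractional part was largest, so the total matches the target.
    leftover = max(target_frames - sum(counts), 0)
    order = sorted(range(len(exact)),
                   key=lambda i: exact[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


def expand_embeddings(embeddings: Sequence[Sequence[float]],
                      counts: Sequence[int]) -> List[Sequence[float]]:
    """Repeat each transcript embedding once per assigned audio frame."""
    expanded: List[Sequence[float]] = []
    for emb, n in zip(embeddings, counts):
        expanded.extend([emb] * n)
    return expanded


if __name__ == "__main__":
    # Three toy embeddings predicted to last 1.0 s in total, stretched to
    # fit a 2.0 s source segment at an assumed 0.1 s per frame.
    counts = frames_for_target([0.2, 0.3, 0.5], 2.0, frame_seconds=0.1)
    frames = expand_embeddings([[1.0], [2.0], [3.0]], counts)
    print(counts, len(frames))
```

Rounding each duration independently could leave the synthesized segment a few frames longer or shorter than the source segment, which is why this sketch distributes the leftover frames explicitly.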

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising: receiving first multimedia content including video data and first audio data representing first speech spoken by a first speaker in source language; receiving first speaker embedding data representing first voice identity characteristics of a second speaker different from the first speaker; processing the first audio data using a first encoder to generate first performance embedding data representing first vocal performance characteristics of the first speech; receiving first data representing a first transcript to be output as synthesized speech, wherein the first data is in a target language different from the source language; receiving language embedding data representing language-dependent speech characteristics of the target language; processing the first data using a second encoder, the first performance embedding data, and the language embedding data to generate first transcript embedding data, the first transcript embedding data corresponding to a first duration; receiving duration data indicating that the first speech corresponds to a second duration different from the first duration; generating, using the first transcript embedding data and the duration data, second transcript embedding data corresponding to the second duration; processing the second transcript embedding data using a first transformation and the first speaker embedding data to generate acoustic embedding data corresponding to the first voice identity characteristics, the first transformation representing an invertible flow; processing the acoustic embedding data using a decoder and the first speaker embedding data to generate second audio data representing the synthesized speech in the target language, the synthesized speech having the first voice identity characteristics, the first vocal performance characteristics, and the second duration; and generating, using the video data and the second audio data, second multimedia content representing the video data dubbed with the second audio data.

2. The computer-implemented method of claim 1, wherein the first transcript embedding data includes a first transcript embedding corresponding to a first representation of the synthesized speech and a second transcript embedding corresponding to a second representation of the synthesized speech, further comprising: determining that the first transcript embedding corresponds to a first predicted duration; determining that the second transcript embedding corresponds to a second predicted duration; determining, using the duration data, a first modified duration for the first transcript embedding data; determining, using the duration data, a second modified duration for the second transcript embedding data; determining that the first modified duration corresponds to a first number of audio frames; determining that second first modified duration corresponds to a second number of audio frames; generating a first plurality of transcript embeddings using the first transcript embedding and the first number; generating a second plurality of transcript embeddings using the second transcript embedding and the second number; and generating the second transcript embedding data using the first plurality of transcript embeddings and the second plurality of transcript embeddings.

3. The computer-implemented method of claim 1, further comprising: processing the first audio data using a first component to generate third audio data representing the first audio data with at least a portion of noise content removed; processing the third audio data using the first encoder to generate second performance embedding data; processing fourth audio data using the first component to generate fifth audio data representing a noise content of the fourth audio data, the fourth audio data representing speech recorded in a low-noise environment; processing the fifth audio data using a third encoder to generate noise embedding data; and determining the first performance embedding data using the second performance embedding data and the noise embedding data.

4. The computer-implemented method of claim 1, further comprising: receiving third audio data representing sample speech from a training dataset; processing the third audio data using a third encoder to generate second speaker embedding data representing voice identity characteristics of a speaker of the sample speech; processing the third audio data using a fourth encoder and the second speaker embedding data to generate acoustic embedding data representing the sample speech with voice identity characteristics retained; processing the acoustic embedding data using a second transformation and the second speaker embedding data to generate first data representing the sample speech with voice identity characteristics suppressed; determining second data representing a second transcript of the sample speech; and training the second transformation using the first data and the second data to determine a third transformation, wherein the first transformation represents an inverse of the third transformation.

5. A computer-implemented method comprising: receiving first multimedia content including video data and first audio data representing first speech in source language; receiving first data representing first voice identity characteristics for synthesizing second speech; determining, using the first audio data, second data representing first vocal performance characteristics of the first speech; receiving third data representing a first transcript of the second speech in a target language; determining fourth data using the third data and the second data, the fourth data representing the first transcript and corresponding to the first vocal performance characteristics; generating, using the fourth data, the first data, and a machine learning model, fifth data representing acoustic embeddings for generating the second speech corresponding to the first voice identity characteristics; determining, using the fifth data, second audio data representing the second speech; and generating, using the video data and the second audio data, second multimedia content representing the video data dubbed with the second audio data.

6. The computer-implemented method of claim 5, wherein the fourth data includes a first transcript embedding corresponding to a first representation of the second speech and a second transcript embedding corresponding to a second representation of the second speech, and the fourth data corresponds to a first duration, the method further comprising: receiving duration data indicating that the first speech corresponds to a second duration different from the first duration; determining that the first transcript embedding corresponds to a first predicted duration; determining that the second transcript embedding corresponds to a second predicted duration; determining, using the duration data, a first modified duration for the first transcript embedding; determining, using the duration data, a second modified duration for the second transcript embedding; determining that the first modified duration corresponds to a first number of audio frames; determining that second first modified duration corresponds to a second number of audio frames; generating a first plurality of transcript embeddings using the first transcript embedding and the first number; generating a second plurality of transcript embeddings using the second transcript embedding and the second number; and generating the second data using the first plurality of transcript embeddings and the second plurality of transcript embeddings.

7. The computer-implemented method of claim 5, f
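Claims 1 and 4 describe the first transformation as an invertible flow conditioned on speaker embedding data, and as the inverse of a transformation trained to suppress voice identity characteristics. This page does not say how the flow is built. As a hedged illustration only, the sketch below uses a single affine coupling layer, which is one common invertible construction and is an assumption here. The toy `conditioner` and the function names `flow_forward` and `flow_inverse` are hypothetical, not taken from the patent.

```python
import math
from typing import List, Sequence, Tuple


def conditioner(x_a: Sequence[float],
                speaker: Sequence[float]) -> Tuple[float, float]:
    """Toy stand-in for a learned network: maps the untouched half of the
    input plus a speaker embedding to a log-scale and a shift.
    Invertibility of the layer does not depend on what this computes."""
    h = sum(x_a) + 0.5 * sum(speaker)
    log_scale = math.tanh(h)   # bounded, so exp() stays well-behaved
    shift = 0.1 * h
    return log_scale, shift


def flow_forward(x: Sequence[float], speaker: Sequence[float]) -> List[float]:
    """Training-time direction in this sketch: transform the second half
    of x using parameters computed from the first half and the speaker."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_scale, shift = conditioner(x_a, speaker)
    y_b = [v * math.exp(log_scale) + shift for v in x_b]
    return list(x_a) + y_b


def flow_inverse(y: Sequence[float], speaker: Sequence[float]) -> List[float]:
    """Synthesis-time direction in this sketch: exactly undo flow_forward.
    The first half passed through unchanged, so the same log-scale and
    shift can be recomputed and reversed."""
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_scale, shift = conditioner(y_a, speaker)
    x_b = [(v - shift) * math.exp(-log_scale) for v in y_b]
    return list(y_a) + x_b


if __name__ == "__main__":
    acoustic = [0.3, -1.2, 0.8, 2.0]   # made-up acoustic embedding
    speaker = [0.5, -0.25]             # made-up speaker embedding
    hidden = flow_forward(acoustic, speaker)
    restored = flow_inverse(hidden, speaker)
    print(max(abs(a - b) for a, b in zip(acoustic, restored)))
```

Because the inverse is exact, a transformation trained in one direction can be run in the other direction at synthesis time without separate training. That matches the relationship claim 4 states between the first and third transformations, although the actual architecture may differ.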

Assignees

Amazon Tech Inc

Inventors

Classifications

  • Speech to text systems (G10L15/08 takes precedence)

  • Language recognition

  • for estimating an emotional state

  • using artificial neural networks

  • involving special audio data, e.g. different tracks for different languages

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12562148B1 cover?
An expressive speech translation system may process source speech in a source language and output synthesized speech in a target language while retaining vocal performance characteristics such as intonation, emphasis, rhythm, style, and/or emotion. The system may receive a transcript of the source speech, translate it, and generate transcript data. To generate the synthesized speech, the system…
Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification H04N21/8106. Mapped technology areas include Electricity.
When was this patent published?
Publication date Feb 24, 2026 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).