Multilingual speech translation with adaptive speech synthesis and adaptive physiognomy

US11545134B1 · US · B1

Patent metadata
Field               Value
Publication number  US-11545134-B1
Application number  US-201916709792-A
Country             US
Kind code           B1
Filing date         Dec 10, 2019
Priority date       Dec 10, 2019
Publication date    Jan 3, 2023
Grant date          Jan 3, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques for the generation of dubbed audio for an audio/video are described. An exemplary approach is to receive a request to generate dubbed speech for an audio/visual file; and in response to the request to: extract speech segments from an audio track of the audio/visual file associated with identified speakers; translate the extracted speech segments into a target language; determine a machine learning model per identified speaker, the trained machine learning models to be used to generate a spoken version of the translated, extracted speech segments based on the identified speaker; generate, per translated, extracted speech segment, a spoken version of the translated, extracted speech segments using a trained machine learning model that corresponds to the identified speaker of the translated, extracted speech segment and prosody information for the extracted speech segments; and replace the extracted speech segments from the audio track of the audio/visual file with the spoken versions of the translated, extracted speech segments to generate a modified audio track.
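
The flow in the abstract can be pictured as a loop over speech segments. The following is a minimal sketch only, not the patented implementation: the Segment type, the dub function, the toy lexicon, and the string-returning "voices" are all hypothetical stand-ins, and segment duration is used as a crude proxy for the prosody information the abstract mentions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Segment:
    speaker: str   # identified speaker label
    start: float   # segment start, in seconds
    end: float     # segment end, in seconds
    text: str      # source-language transcript of the segment


# Hypothetical voice signature: (translated text, target duration) -> audio.
# Audio is represented as a plain string here so the sketch stays runnable.
Voice = Callable[[str, float], str]


def dub(segments: List[Segment],
        translate: Callable[[str], str],
        voices: Dict[str, Voice]) -> List[Tuple[float, float, str]]:
    """Translate each segment and synthesize it with its speaker's model."""
    dubbed = []
    for seg in segments:
        translated = translate(seg.text)
        voice = voices[seg.speaker]       # one model per identified speaker
        duration = seg.end - seg.start    # crude stand-in for prosody information
        dubbed.append((seg.start, seg.end, voice(translated, duration)))
    return dubbed


# Toy stand-ins (assumptions); real translation and speech models would replace these.
lexicon = {"hello": "hola", "thanks": "gracias"}
voices = {
    "A": lambda text, dur: f"A:{text}@{dur:.1f}s",
    "B": lambda text, dur: f"B:{text}@{dur:.1f}s",
}
track = [Segment("A", 0.0, 1.5, "hello"), Segment("B", 2.0, 3.0, "thanks")]
result = dub(track, lambda t: lexicon[t], voices)
```

Each entry in result keeps the original start and end times, so a later step could place the synthesized clip where the source speech was removed.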

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising: receiving a request, sent by a requester, to generate dubbed speech for an audio/visual file; and in response to the request, using a machine learning model to separate speech from background noise in audio of an audio track of the audio/visual file, extracting speech segments from the audio track of the audio/visual file and annotating the speech segments with a plurality of speaker identifiers, transcribing the extracted speech segments into a transcript containing text and timing information, using an artificial neural machine translation model to machine translate the transcript into a target language, wherein using the artificial neural machine translation model to machine translate the transcript into the target language comprises biasing the artificial neural machine translation model to generate text segments in the target language of desired lengths for inclusion in the translated transcript, prosodically aligning the translated transcript, the translated transcript comprising the generated text segments in the target language of desired lengths, extracting paralinguistic information from the extracted speech segments and the prosodically aligned translated transcript, determining, based upon the extracted speech segments and paralinguistic information, a plurality of trained machine learning models to use for the plurality of identified speakers, wherein, for each identified speaker of the plurality of identified speakers, a different respective one of the plurality of trained machine learning models is determined for the identified speaker, generating a spoken version of the prosodically aligned translated transcript using the plurality of trained machine learning models determined for the plurality of identified speakers, wherein, for each identified speaker of the plurality of identified speakers, the respective trained machine learning model determined for the identified speaker is used to generate speech segments corresponding to the identified speaker for inclusion in the spoken version of the prosodically aligned translated transcript, incorporating the background noise into the spoken version of the prosodically aligned translated transcript, generating a modified audio track by replacing the extracted speech segments from the audio track of the audio/visual file with the spoken version of the prosodically aligned translated transcript, and providing the audio/visual file with the modified audio track to the requester.

2. The computer-implemented method of claim 1, wherein the paralinguistic information includes at least one of timbre, pitch, length of sounds, and loudness.

3. The computer-implemented method of claim 1, wherein the extracted speech segments are annotated with a length and the machine translating is to generate translated, extracted speech segments of similar length.

4. A computer-implemented method comprising: receiving a request, sent by a requester, to generate dubbed speech for an audio/visual file; and in response to the request, using a machine learning model to separate speech from background noise in audio of an audio track of the audio/visual file, using an artificial neural machine translation model to machine translate speech segments of the audio track of the audio/visual file into a target language to generate a translated transcript, wherein using the artificial neural machine translation model to translate the speech segments comprises biasing the artificial neural machine translation model to generate text segments in the target language of desired lengths for inclusion in the translated transcript, prosodically aligning the translated transcript, determining, based upon the speech segments, a plurality of trained machine learning models for a plurality of identified speakers of the audio track, wherein, for each identified speaker of the plurality of identified speakers, a different respective one of plurality of trained machine learning models is determined for the identified speaker, generating a spoken version of the prosodically aligned translated transcript using the plurality of trained machine learning models determined for the plurality of identified speakers, wherein, for each identified speaker of the plurality of identified speakers, the respective trained machine learning model determined for the identified speaker is used to generate speech segments corresponding to the identified speaker for inclusion in the spoken version of the prosodically aligned translated transcript, incorporating the background noise into the spoken version of the prosodically aligned translated transcript, generating a modified audio track by replacing the speech segments from the audio track of the audio/visual file with the generated spoken version of the prosodically aligned translated transcript to generate a modified audio track, and providing the audio/visual file with the modified audio track to the requester.

5. The computer-implemented method of claim 4, further comprising: extracting the speech segments from an audio track of the audio/visual file and annotating the speech segments with a speaker identifier.

6. The computer-implemented method of claim 4, further comprising: extracting corresponding text for the speech segments from captioning data of the audio/visual file prior.

7. The computer-implemented method of claim 4, wherein the speech segments are annotated with a length and the machine translating is to generate translated speech segments of similar length.

8. The computer-implemented method of claim 4, wherein determining a machine learning model per speaker of the audio track comprises: training a machine learning model per speaker based at least in part on utterances.

9. The computer-implemented method of claim 4, wherein determining a machine learning model per speaker comprises: identifying an existing machine learning model per speaker based on identification of the speaker.

10. The computer-implemented method of claim 9, wherein identifying an existing machine learning model per identified speaker comprises: utilizing facial information from a video portion of the audio/visual file to detect the speaker; and querying for the existing machine learning model corresponding to the detected speaker.

11. The computer-implemented method of claim 4, wherein prosodically aligning the translated transcript includes using facial information from a video portion of the audio/visual file to determine if the speaker's face or mouth is visible.

12. The computer-implemented method of claim 4, further comprising: extracting paralinguistic information from the speech segments and prosodically aligned translated transcript.

13. The computer-implemented method of claim 4, wherein the audio/visual file is a container.

14. A system comprising: storage for an audio/video file; and one or more electronic devices to implement a dubbing service, the dubbing service including instructions that upon execution cause the dubbing service to: receive a request, sent by a requester, to generate dubbed speech for an audio/visual file; and in response to the request to, separate speech from background or non-speech noise in audio of an audio track of the audio/visual file using a machine learning model, extract speech segments from the audio track of the audio/visual file associated with a plurality of identified speakers, use an artificial neural machine translation model to machine translate the extracted speech segments into a target language including biasing the artificial neural machine
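
Two steps in the claims lend themselves to a small illustration: producing target-language text of a desired length, and incorporating the separated background audio into the dubbed speech. The sketch below is an assumption-laden approximation, not the claimed method. The claims bias the translation model itself; here that is swapped for a simpler reranking of finished candidates by a length penalty. The function names pick_translation and mix_dub, the character-count length unit, the weight parameter, and the list-of-floats audio representation are all hypothetical.

```python
from typing import List, Tuple


def pick_translation(candidates: List[Tuple[str, float]],
                     target_len: int,
                     weight: float = 1.0) -> str:
    """Pick the candidate with the best length-penalized score.

    candidates: (text, log_probability) pairs from any translation system.
    target_len: desired length in characters (an assumed unit).
    """
    def score(item: Tuple[str, float]) -> float:
        text, log_prob = item
        deviation = abs(len(text) - target_len) / max(target_len, 1)
        return log_prob - weight * deviation
    return max(candidates, key=score)[0]


def mix_dub(background: List[float],
            clips: List[Tuple[float, List[float]]],
            sample_rate: int) -> List[float]:
    """Overlay synthesized clips on the separated background track."""
    out = list(background)
    for start_s, samples in clips:
        offset = int(round(start_s * sample_rate))
        for i, sample in enumerate(samples):
            if offset + i < len(out):      # drop anything past the track end
                out[offset + i] += sample
    return out


candidates = [("Hallo", -1.0), ("Guten Tag zusammen", -0.8)]
short = pick_translation(candidates, target_len=5)               # length-biased
unbiased = pick_translation(candidates, target_len=5, weight=0.0)
mixed = mix_dub([0.5] * 6, [(0.5, [0.25, 0.25])], sample_rate=4)
```

With the penalty active, the shorter candidate wins even though its log-probability is lower; with weight set to 0.0 the choice falls back to the raw model score. The mixing step simply adds each clip to the background at its start offset, which is one plausible reading of "incorporating the background noise".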

Assignees

Amazon Tech Inc

Inventors

Classifications

  • Speech synthesis; Text to speech systems · CPC title

  • Detection; Localisation; Normalisation · CPC title

  • Training · CPC title

  • Speech to text systems (G10L15/08 takes precedence) · CPC title

  • Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11545134B1 cover?
Techniques for the generation of dubbed audio for an audio/video are described. An exemplary approach is to receive a request to generate dubbed speech for an audio/visual file; and in response to the request to: extract speech segments from an audio track of the audio/visual file associated with identified speakers; translate the extracted speech segments into a target language; determine a ma…
Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification G10L13/10. Mapped technology areas include Physics.
When was this patent published?
Publication date Jan 3, 2023 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).