Multilingual speech translation with adaptive speech synthesis and adaptive physiognomy

US11545134B1 · US · B1

Patent metadata
Field	Value
Publication number	US-11545134-B1
Application number	US-201916709792-A
Country	US
Kind code	B1
Filing date	Dec 10, 2019
Priority date	Dec 10, 2019
Publication date	Jan 3, 2023
Grant date	Jan 3, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques for the generation of dubbed audio for an audio/video are described. An exemplary approach is to receive a request to generate dubbed speech for an audio/visual file; and in response to the request to: extract speech segments from an audio track of the audio/visual file associated with identified speakers; translate the extracted speech segments into a target language; determine a machine learning model per identified speaker, the trained machine learning models to be used to generate a spoken version of the translated, extracted speech segments based on the identified speaker; generate, per translated, extracted speech segment, a spoken version of the translated, extracted speech segments using a trained machine learning model that corresponds to the identified speaker of the translated, extracted speech segment and prosody information for the extracted speech segments; and replace the extracted speech segments from the audio track of the audio/visual file with the spoken versions of the translated, extracted speech segments to generate a modified audio track.
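The flow in the abstract (extract per-speaker segments, translate them, then voice each one with a model tied to its speaker) can be sketched roughly as follows. This is an illustrative outline only, not the patented implementation: `Segment`, `dub`, `translate`, and `voices` are hypothetical names, and the translation and synthesis steps are stand-in callables rather than real models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Segment:
    """One extracted speech segment (hypothetical structure)."""
    speaker: str   # speaker identifier attached during extraction
    start: float   # start time in seconds within the audio track
    end: float     # end time in seconds within the audio track
    text: str      # source-language text of the segment


def dub(
    segments: List[Segment],
    translate: Callable[[str], str],
    voices: Dict[str, Callable[[str], str]],
) -> List[dict]:
    """Translate each segment and voice it with its own speaker's model.

    `translate` and the callables in `voices` are placeholders for the
    translation and speech-synthesis models; here they just map strings
    to strings so the control flow can be run end to end.
    """
    dubbed = []
    for seg in segments:
        translated = translate(seg.text)
        synthesize = voices[seg.speaker]  # one model per identified speaker
        dubbed.append(
            {
                "speaker": seg.speaker,
                "start": seg.start,  # original timing is kept so the new
                "end": seg.end,      # speech can replace the old in place
                "audio": synthesize(translated),
            }
        )
    return dubbed
```

In this sketch the per-speaker lookup is a plain dictionary; the abstract does not say how models are stored or selected, so that choice is an assumption made for illustration.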

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising: receiving a request, sent by a requester, to generate dubbed speech for an audio/visual file; and in response to the request, using a machine learning model to separate speech from background noise in audio of an audio track of the audio/visual file, extracting speech segments from the audio track of the audio/visual file and annotating the speech segments with a plurality of speaker identifiers, transcribing the extracted speech segments into a transcript containing text and timing information, using an artificial neural machine translation model to machine translate the transcript into a target language, wherein using the artificial neural machine translation model to machine translate the transcript into the target language comprises biasing the artificial neural machine translation model to generate text segments in the target language of desired lengths for inclusion in the translated transcript, prosodically aligning the translated transcript, the translated transcript comprising the generated text segments in the target language of desired lengths, extracting paralinguistic information from the extracted speech segments and the prosodically aligned translated transcript, determining, based upon the extracted speech segments and paralinguistic information, a plurality of trained machine learning models to use for the plurality of identified speakers, wherein, for each identified speaker of the plurality of identified speakers, a different respective one of the plurality of trained machine learning models is determined for the identified speaker, generating a spoken version of the prosodically aligned translated transcript using the plurality of trained machine learning models determined for the plurality of identified speakers, wherein, for each identified speaker of the plurality of identified speakers, the respective trained machine learning model determined for the identified speaker is used to generate speech segments corresponding to the identified speaker for inclusion in the spoken version of the prosodically aligned translated transcript, incorporating the background noise into the spoken version of the prosodically aligned translated transcript, generating a modified audio track by replacing the extracted speech segments from the audio track of the audio/visual file with the spoken version of the prosodically aligned translated transcript, and providing the audio/visual file with the modified audio track to the requester.

2. The computer-implemented method of claim 1, wherein the paralinguistic information includes at least one of timbre, pitch, length of sounds, and loudness.

3. The computer-implemented method of claim 1, wherein the extracted speech segments are annotated with a length and the machine translating is to generate translated, extracted speech segments of similar length.

4. A computer-implemented method comprising: receiving a request, sent by a requester, to generate dubbed speech for an audio/visual file; and in response to the request, using a machine learning model to separate speech from background noise in audio of an audio track of the audio/visual file, using an artificial neural machine translation model to machine translate speech segments of the audio track of the audio/visual file into a target language to generate a translated transcript, wherein using the artificial neural machine translation model to translate the speech segments comprises biasing the artificial neural machine translation model to generate text segments in the target language of desired lengths for inclusion in the translated transcript, prosodically aligning the translated transcript, determining, based upon the speech segments, a plurality of trained machine learning models for a plurality of identified speakers of the audio track, wherein, for each identified speaker of the plurality of identified speakers, a different respective one of plurality of trained machine learning models is determined for the identified speaker, generating a spoken version of the prosodically aligned translated transcript using the plurality of trained machine learning models determined for the plurality of identified speakers, wherein, for each identified speaker of the plurality of identified speakers, the respective trained machine learning model determined for the identified speaker is used to generate speech segments corresponding to the identified speaker for inclusion in the spoken version of the prosodically aligned translated transcript, incorporating the background noise into the spoken version of the prosodically aligned translated transcript, generating a modified audio track by replacing the speech segments from the audio track of the audio/visual file with the generated spoken version of the prosodically aligned translated transcript to generate a modified audio track, and providing the audio/visual file with the modified audio track to the requester.

5. The computer-implemented method of claim 4, further comprising: extracting the speech segments from an audio track of the audio/visual file and annotating the speech segments with a speaker identifier.

6. The computer-implemented method of claim 4, further comprising: extracting corresponding text for the speech segments from captioning data of the audio/visual file prior.

7. The computer-implemented method of claim 4, wherein the speech segments are annotated with a length and the machine translating is to generate translated speech segments of similar length.

8. The computer-implemented method of claim 4, wherein determining a machine learning model per speaker of the audio track comprises: training a machine learning model per speaker based at least in part on utterances.

9. The computer-implemented method of claim 4, wherein determining a machine learning model per speaker comprises: identifying an existing machine learning model per speaker based on identification of the speaker.

10. The computer-implemented method of claim 9, wherein identifying an existing machine learning model per identified speaker comprises: utilizing facial information from a video portion of the audio/visual file to detect the speaker; and querying for the existing machine learning model corresponding to the detected speaker.

11. The computer-implemented method of claim 4, wherein prosodically aligning the translated transcript includes using facial information from a video portion of the audio/visual file to determine if the speaker's face or mouth is visible.

12. The computer-implemented method of claim 4, further comprising: extracting paralinguistic information from the speech segments and prosodically aligned translated transcript.

13. The computer-implemented method of claim 4, wherein the audio/visual file is a container.

14. A system comprising: storage for an audio/video file; and one or more electronic devices to implement a dubbing service, the dubbing service including instructions that upon execution cause the dubbing service to: receive a request, sent by a requester, to generate dubbed speech for an audio/visual file; and in response to the request to, separate speech from background or non-speech noise in audio of an audio track of the audio/visual file using a machine learning model, extract speech segments from the audio track of the audio/visual file associated with a plurality of identified speakers, use an artificial neural machine translation model to machine translate the extracted speech segments into a target language including biasing the artificial neural machine
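Claims 8 and 9 describe two ways to determine a per-speaker model: training one from the speaker's utterances, or identifying one that already exists. The sketch below combines the two as "reuse if found, otherwise train", which is an assumption made here for illustration; the claims present them as separate options and do not prescribe this fallback order. `resolve_models`, `registry`, and `train` are hypothetical names, and models are represented by plain strings.

```python
from typing import Callable, Dict, List


def resolve_models(
    utterances_by_speaker: Dict[str, List[str]],
    registry: Dict[str, str],
    train: Callable[[str, List[str]], str],
) -> Dict[str, str]:
    """Pick one model per identified speaker.

    `registry` stands in for a store of existing models keyed by speaker
    identity; `train` stands in for training a new model from utterances.
    Both are placeholders, not a real model API.
    """
    models: Dict[str, str] = {}
    for speaker, utterances in utterances_by_speaker.items():
        if speaker in registry:
            # An existing model is identified for this speaker.
            models[speaker] = registry[speaker]
        else:
            # No existing model: train one from this speaker's utterances.
            models[speaker] = train(speaker, utterances)
    return models
```

Each speaker ends up with exactly one entry, which mirrors the requirement in the independent claims that a respective model is determined for each identified speaker.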

Assignees

Amazon Tech Inc

Inventors

Classifications

G10L13/00: Speech synthesis; Text to speech systems
G06V40/161: Detection; Localisation; Normalisation
G10L15/063: Training
G10L15/26: Speech to text systems (G10L15/08 takes precedence)
G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue

Patent family

Related publications grouped by family.

View patent family 84693400

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11545134B1 cover?: Techniques for the generation of dubbed audio for an audio/video are described. An exemplary approach is to receive a request to generate dubbed speech for an audio/visual file; and in response to the request to: extract speech segments from an audio track of the audio/visual file associated with identified speakers; translate the extracted speech segments into a target language; determine a ma…
Who is the assignee on this patent?: Amazon Tech Inc
What technology area does this patent fall under?: Primary CPC classification G10L13/10. Mapped technology areas include Physics.
When was this patent published?: Publication date Jan 3, 2023 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Related patents

Generating videos with a character indicating a region of an image

Automatic dubbing method and apparatus

System and method for rendering textual messages using customized natural voice

System and method for audio dubbing and translation of a video