US11545134B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-11545134-B1 |
| Application number | US-201916709792-A |
| Country | US |
| Kind code | B1 |
| Filing date | Dec 10, 2019 |
| Priority date | Dec 10, 2019 |
| Publication date | Jan 3, 2023 |
| Grant date | Jan 3, 2023 |
Official abstract text for this publication.
Techniques for the generation of dubbed audio for an audio/video file are described. An exemplary approach is to receive a request to generate dubbed speech for an audio/visual file; and in response to the request: extract speech segments from an audio track of the audio/visual file associated with identified speakers; translate the extracted speech segments into a target language; determine a machine learning model per identified speaker, the trained machine learning models to be used to generate a spoken version of the translated, extracted speech segments based on the identified speaker; generate, per translated, extracted speech segment, a spoken version of the translated, extracted speech segments using a trained machine learning model that corresponds to the identified speaker of the translated, extracted speech segment and prosody information for the extracted speech segments; and replace the extracted speech segments from the audio track of the audio/visual file with the spoken versions of the translated, extracted speech segments to generate a modified audio track.
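The stages listed in the abstract can be sketched as a small pipeline. This is a minimal illustration only, not the patented implementation: `SpeechSegment`, `dub_segments`, `make_voice`, and the toy dictionary are hypothetical names, the translator is a lookup table standing in for a machine translation model, and each "voice" returns a descriptive string where a real system would return synthesized audio.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class SpeechSegment:
    speaker: str   # speaker identifier attached when the segment was extracted
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds
    text: str      # source-language transcript of the segment


# Hypothetical per-speaker synthesizer: (translated text, target duration) -> audio stand-in.
Voice = Callable[[str, float], str]


def dub_segments(
    segments: List[SpeechSegment],
    translate: Callable[[str], str],
    voices: Dict[str, Voice],
) -> List[Tuple[float, float, str]]:
    """Translate each segment and voice it with the model chosen for its speaker."""
    dubbed = []
    for seg in segments:
        translated = translate(seg.text)      # translate the extracted segment
        voice = voices[seg.speaker]           # one model per identified speaker
        duration = seg.end - seg.start        # keep the original time slot
        dubbed.append((seg.start, seg.end, voice(translated, duration)))
    return dubbed


def make_voice(name: str) -> Voice:
    """Build a stand-in voice that only describes what it would synthesize."""
    def speak(text: str, duration: float) -> str:
        return f"<{name} says '{text}' in {duration:.1f}s>"
    return speak


# Toy data: a two-word dictionary plays the role of the translation model.
toy_dictionary = {"hello": "hola", "goodbye": "adiós"}
segments = [
    SpeechSegment("spk0", 0.0, 1.5, "hello"),
    SpeechSegment("spk1", 2.0, 3.0, "goodbye"),
]
voices = {"spk0": make_voice("spk0"), "spk1": make_voice("spk1")}
dubbed = dub_segments(segments, lambda text: toy_dictionary[text], voices)
```

Each entry in `dubbed` keeps the start and end time of the segment it replaces, which is what lets a later step drop the new speech into the original audio track.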
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method comprising: receiving a request, sent by a requester, to generate dubbed speech for an audio/visual file; and in response to the request, using a machine learning model to separate speech from background noise in audio of an audio track of the audio/visual file, extracting speech segments from the audio track of the audio/visual file and annotating the speech segments with a plurality of speaker identifiers, transcribing the extracted speech segments into a transcript containing text and timing information, using an artificial neural machine translation model to machine translate the transcript into a target language, wherein using the artificial neural machine translation model to machine translate the transcript into the target language comprises biasing the artificial neural machine translation model to generate text segments in the target language of desired lengths for inclusion in the translated transcript, prosodically aligning the translated transcript, the translated transcript comprising the generated text segments in the target language of desired lengths, extracting paralinguistic information from the extracted speech segments and the prosodically aligned translated transcript, determining, based upon the extracted speech segments and paralinguistic information, a plurality of trained machine learning models to use for the plurality of identified speakers, wherein, for each identified speaker of the plurality of identified speakers, a different respective one of the plurality of trained machine learning models is determined for the identified speaker, generating a spoken version of the prosodically aligned translated transcript using the plurality of trained machine learning models determined for the plurality of identified speakers, wherein, for each identified speaker of the plurality of identified speakers, the respective trained machine learning model determined for the identified speaker is used to generate speech segments corresponding to the identified speaker for inclusion in the spoken version of the prosodically aligned translated transcript, incorporating the background noise into the spoken version of the prosodically aligned translated transcript, generating a modified audio track by replacing the extracted speech segments from the audio track of the audio/visual file with the spoken version of the prosodically aligned translated transcript, and providing the audio/visual file with the modified audio track to the requester.
2. The computer-implemented method of claim 1, wherein the paralinguistic information includes at least one of timbre, pitch, length of sounds, and loudness.
3. The computer-implemented method of claim 1, wherein the extracted speech segments are annotated with a length and the machine translating is to generate translated, extracted speech segments of similar length.
4. A computer-implemented method comprising: receiving a request, sent by a requester, to generate dubbed speech for an audio/visual file; and in response to the request, using a machine learning model to separate speech from background noise in audio of an audio track of the audio/visual file, using an artificial neural machine translation model to machine translate speech segments of the audio track of the audio/visual file into a target language to generate a translated transcript, wherein using the artificial neural machine translation model to translate the speech segments comprises biasing the artificial neural machine translation model to generate text segments in the target language of desired lengths for inclusion in the translated transcript, prosodically aligning the translated transcript, determining, based upon the speech segments, a plurality of trained machine learning models for a plurality of identified speakers of the audio track, wherein, for each identified speaker of the plurality of identified speakers, a different respective one of the plurality of trained machine learning models is determined for the identified speaker, generating a spoken version of the prosodically aligned translated transcript using the plurality of trained machine learning models determined for the plurality of identified speakers, wherein, for each identified speaker of the plurality of identified speakers, the respective trained machine learning model determined for the identified speaker is used to generate speech segments corresponding to the identified speaker for inclusion in the spoken version of the prosodically aligned translated transcript, incorporating the background noise into the spoken version of the prosodically aligned translated transcript, generating a modified audio track by replacing the speech segments from the audio track of the audio/visual file with the generated spoken version of the prosodically aligned translated transcript to generate a modified audio track, and providing the audio/visual file with the modified audio track to the requester.
5. The computer-implemented method of claim 4, further comprising: extracting the speech segments from an audio track of the audio/visual file and annotating the speech segments with a speaker identifier.
6. The computer-implemented method of claim 4, further comprising: extracting corresponding text for the speech segments from captioning data of the audio/visual file prior.
7. The computer-implemented method of claim 4, wherein the speech segments are annotated with a length and the machine translating is to generate translated speech segments of similar length.
8. The computer-implemented method of claim 4, wherein determining a machine learning model per speaker of the audio track comprises: training a machine learning model per speaker based at least in part on utterances.
9. The computer-implemented method of claim 4, wherein determining a machine learning model per speaker comprises: identifying an existing machine learning model per speaker based on identification of the speaker.
10. The computer-implemented method of claim 9, wherein identifying an existing machine learning model per identified speaker comprises: utilizing facial information from a video portion of the audio/visual file to detect the speaker; and querying for the existing machine learning model corresponding to the detected speaker.
11. The computer-implemented method of claim 4, wherein prosodically aligning the translated transcript includes using facial information from a video portion of the audio/visual file to determine if the speaker's face or mouth is visible.
12. The computer-implemented method of claim 4, further comprising: extracting paralinguistic information from the speech segments and prosodically aligned translated transcript.
13. The computer-implemented method of claim 4, wherein the audio/visual file is a container.
14. A system comprising: storage for an audio/video file; and one or more electronic devices to implement a dubbing service, the dubbing service including instructions that upon execution cause the dubbing service to: receive a request, sent by a requester, to generate dubbed speech for an audio/visual file; and in response to the request to, separate speech from background or non-speech noise in audio of an audio track of the audio/visual file using a machine learning model, extract speech segments from the audio track of the audio/visual file associated with a plurality of identified speakers, use an artificial neural machine translation model to machine translate the extracted speech segments into a target language including biasing the artificial neural machine
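Two of the claimed steps lend themselves to a short illustration: producing translations of a desired length (claims 1, 3, 4, and 7) and incorporating the background noise into the spoken version (claims 1 and 4). The sketch below is an assumption-laden stand-in, not the claimed method. It reranks already-generated candidate translations by character count instead of biasing the translation model itself, and it mixes audio as plain lists of samples in the range [-1, 1]. The function names `pick_by_length` and `mix_with_background` are hypothetical.

```python
from typing import List, Sequence


def pick_by_length(source_text: str, candidates: Sequence[str]) -> str:
    """Return the candidate whose character count is closest to the source's.

    Character count is a crude proxy for spoken duration; a real system
    would bias the translation model directly, as the claims describe.
    """
    if not candidates:
        raise ValueError("need at least one candidate translation")
    target = len(source_text)
    return min(candidates, key=lambda c: abs(len(c) - target))


def mix_with_background(
    speech: List[float],
    background: List[float],
    speech_gain: float = 1.0,
) -> List[float]:
    """Add synthesized speech onto the background, sample by sample.

    Samples past the end of the speech keep the background unchanged,
    and every mixed sample is clipped to the range [-1.0, 1.0].
    """
    if len(speech) > len(background):
        raise ValueError("speech must fit inside the background slot")
    mixed = list(background)
    for i, sample in enumerate(speech):
        mixed[i] = max(-1.0, min(1.0, background[i] + speech_gain * sample))
    return mixed


# Toy inputs: two candidate translations and a three-sample background bed.
best = pick_by_length(
    "good morning",
    ["buenos días", "muy buenos días a todos ustedes"],
)
mixed = mix_with_background([0.5, -0.25], [0.25, 0.25, 0.25])
```

Here the shorter candidate is selected because its length is nearer to that of the source phrase, and the final background sample passes through untouched because the speech ends before the slot does.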
Speech synthesis; Text to speech systems · CPC title
Detection; Localisation; Normalisation · CPC title
Training · CPC title
Speech to text systems (G10L15/08 takes precedence) · CPC title
Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title