US2017040017A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2017040017-A1 |
| Application number | US-201514820410-A |
| Country | US |
| Kind code | A1 |
| Filing date | Aug 6, 2015 |
| Priority date | Aug 6, 2015 |
| Publication date | Feb 9, 2017 |
| Grant date | — |
Official abstract text for this publication.
There are provided systems and methods for generating a visually consistent alternative audio for redubbing visual speech using a processor configured to sample a dynamic viseme sequence corresponding to a given utterance by a speaker in a video, identify a plurality of phonemes corresponding to the dynamic viseme sequence, construct a graph of the plurality of phonemes that synchronize with a sequence of lip movements of a mouth of the speaker in the dynamic viseme sequence, and use the graph to generate an alternative phrase that substantially matches the sequence of lip movements of the mouth of the speaker in the video.
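The steps named in the abstract (viseme sequence, candidate phonemes, phoneme graph, alternative phrases) can be illustrated with a minimal Python sketch. It is not the implementation disclosed in the publication. The viseme labels, the `VISEME_TO_PHONEMES` table, the `LEXICON` pronunciation dictionary, and all function names are hypothetical. The sketch also assumes each viseme spans exactly one phoneme, which is a simplification, since a dynamic viseme may cover several phonemes.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical viseme classes mapped to phonemes that look alike on the lips.
VISEME_TO_PHONEMES: Dict[str, Set[str]] = {
    "V_bilabial": {"p", "b", "m"},
    "V_open": {"ae", "aa"},
    "V_alveolar": {"t", "d", "n"},
}

# Hypothetical pronunciation dictionary: word -> ordered phonemes.
LEXICON: Dict[str, Tuple[str, ...]] = {
    "bat": ("b", "ae", "t"),
    "mat": ("m", "ae", "t"),
    "pad": ("p", "ae", "d"),
    "man": ("m", "ae", "n"),
    "at": ("ae", "t"),
    "dog": ("d", "ao", "g"),
}


def build_lattice(visemes: List[str]) -> List[Set[str]]:
    """Build a simple phoneme graph: one set of candidate phonemes per viseme."""
    return [VISEME_TO_PHONEMES[v] for v in visemes]


def words_at(lattice: List[Set[str]], start: int) -> List[Tuple[str, int]]:
    """Return (word, end_position) for lexicon words that fit the lattice at start."""
    matches = []
    for word, phones in LEXICON.items():
        end = start + len(phones)
        if end <= len(lattice) and all(
            p in lattice[start + i] for i, p in enumerate(phones)
        ):
            matches.append((word, end))
    return matches


def alternative_phrases(lattice: List[Set[str]]) -> List[str]:
    """Enumerate word sequences that cover the whole lattice, in sorted order."""
    results: List[str] = []

    def extend(pos: int, words: List[str]) -> None:
        if pos == len(lattice):
            results.append(" ".join(words))
            return
        for word, end in words_at(lattice, pos):
            extend(end, words + [word])

    extend(0, [])
    return sorted(results)


if __name__ == "__main__":
    sequence = ["V_bilabial", "V_open", "V_alveolar"]
    print(alternative_phrases(build_lattice(sequence)))
```

In this toy setup, the words found by `words_at` play the role of the "first set" of matching words, and the output of `alternative_phrases` plays the role of the "second set" of phrases. A real system would also need a language model or grammar check to keep only valid sentences.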
Opening claim text (preview).
What is claimed is:
1. A system for redubbing of a video, the system comprising: a memory for storing a redubbing application; a processor configured to execute the redubbing application to: sample a dynamic viseme sequence corresponding to a given utterance by a speaker in the video; identify a plurality of phonemes corresponding to the dynamic viseme sequence; construct a graph of the plurality of phonemes corresponding to the dynamic viseme sequence; generate, using the graph of the plurality of phonemes, a first set including at least one word that substantially matches a sequence of lip movements of a mouth of the speaker in the video; and construct a second set including at least one alternative phrase, the at least one alternative phrase formed by the at least one word of the first set that substantially matches the sequence of lip movements of the mouth of the speaker in the video.
2. The system of claim 1, further comprising a display, wherein the processor is further configured to display the video synchronized with a candidate alternative phrase from the second set to replace an original audio of the video.
3. The system of claim 1, wherein the first set includes valid words in a target language.
4. The system of claim 1, wherein the second set includes valid sentences in a target language.
5. The system of claim 4, wherein the target language is a different language than an original language of the video.
6. The system of claim 1, wherein the processor is further configured to: select a candidate alternative phrase from the second set; and insert the candidate alternative phrase as a substitute audio for the dynamic viseme sequence.
7. The system of claim 1, wherein the processor is further configured to: score each alternative phrase of the plurality of alternative phrases in the second set based on how closely each alternative phrase matches the sequence of lip movements of the mouth of the speaker in the video; and rank the alternative phrases based on the score.
8. The system of claim 1, further comprising a user interface, wherein the processor is further configured to: receive, from a user via the user interface, a suggested alternative phrase; transcribe the suggested alternative phrase into an ordered phoneme list; compare the ordered phoneme list to the dynamic viseme sequence; and score how well the suggested alternative phrase matches the lip movements of the mouth of the speaker in the video corresponding to the dynamic viseme sequence.
9. The system of claim 8, wherein the processor is further configured to: suggest a synonym of a word in the alternative phrase, wherein replacing the word in the alternative phrase with the synonym will increase the score.
10. The system of claim 1, wherein the first set is a complete set including every phoneme that corresponds to the sequence of dynamic visemes.
11. A method for use by a system having a memory and a processor for redubbing of a video, the method comprising: sampling, using the processor, a dynamic viseme sequence corresponding to a given utterance by a speaker in the video; identifying, using the processor, a plurality of phonemes corresponding to the dynamic viseme sequence; constructing, using the processor, a graph of the plurality of phonemes corresponding to the dynamic viseme sequence; generating, using the processor, a first set including at least one word that substantially matches a sequence of lip movements of a mouth of the speaker in the video using the graph of the plurality of phonemes; and constructing, using the processor, a second set including at least one alternative phrase, the at least one alternative phrase formed by the at least one word of the first set that substantially matches the sequence of lip movements of the mouth of the speaker in the video.
12. The method of claim 11, wherein the system further comprises a display, the method further comprising: displaying the video synchronized with an alternative phrase from the second set to replace an original audio of the video on the display.
13. The method of claim 11, wherein the first set includes valid words in a target language.
14. The method of claim 11, wherein the second set includes valid sentences in a target language.
15. The method of claim 14, wherein the target language is a different language than an original language of the video.
16. The method of claim 11, wherein the second set includes a plurality of alternative phrases, the method further comprising: selecting, using the processor, a candidate alternative phrase from the second set; and inserting, using the processor, the candidate alternative phrase as a substitute audio for the dynamic viseme sequence.
17. The method of claim 11, wherein the second set includes a plurality of alternative phrases, the method further comprising: scoring, using the processor, each alternative phrase of the plurality of alternative phrases in the second set; and ranking, using the processor, each alternative phrase of the plurality of alternative phrases in the second set according to how well the pronounced phonemes of each alternative phrase of the plurality of alternative phrases match the dynamic viseme sequence.
18. The method of claim 11, wherein the system includes a user interface, the method further comprising: receiving, from a user via the user interface, a suggested alternative phrase; transcribing, using the processor, the suggested alternative phrase into an ordered phoneme list; comparing, using the processor, the ordered phoneme list to the dynamic viseme sequence; and scoring, using the processor, how well the suggested alternative phrase matches the lip movements of the mouth of the speaker in the video corresponding to the dynamic viseme sequence.
19. The method of claim 18, further comprising: suggesting, using the processor, a synonym of a word in the suggested alternative phrase, wherein replacing the word of the suggested alternative phrase with the synonym will increase the score.
20. The method of claim 11, wherein the first set is a complete set including every phoneme that corresponds to the sequence of dynamic visemes.
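The scoring and ranking steps recited in claims 7, 8, 17, and 18 (transcribe a phrase into an ordered phoneme list, compare it to the viseme sequence, score the match, rank candidates) might look roughly like the Python sketch below. The claims do not specify a scoring formula, so the position-by-position match fraction used here is an assumption, as are the viseme labels, the `VISEME_TO_PHONEMES` and `LEXICON` tables, and every function name. The sketch further assumes one phoneme per viseme.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical viseme classes mapped to visually similar phonemes.
VISEME_TO_PHONEMES: Dict[str, Set[str]] = {
    "V_bilabial": {"p", "b", "m"},
    "V_open": {"ae", "aa"},
    "V_alveolar": {"t", "d", "n"},
}

# Hypothetical pronunciation dictionary used for transcription.
LEXICON: Dict[str, Tuple[str, ...]] = {
    "bat": ("b", "ae", "t"),
    "bad": ("b", "ae", "d"),
    "bag": ("b", "ae", "g"),
    "dog": ("d", "ao", "g"),
}


def transcribe(phrase: str) -> List[str]:
    """Turn a phrase into an ordered phoneme list via dictionary lookup."""
    phones: List[str] = []
    for word in phrase.lower().split():
        phones.extend(LEXICON[word])  # raises KeyError for unknown words
    return phones


def score(phrase: str, visemes: List[str]) -> float:
    """Fraction of positions where the phoneme is compatible with the viseme.

    Dividing by the longer of the two sequences means extra or missing
    phonemes count as mismatches (an assumed, illustrative scoring rule).
    """
    phones = transcribe(phrase)
    longest = max(len(phones), len(visemes))
    if longest == 0:
        return 0.0
    hits = sum(
        1 for p, v in zip(phones, visemes) if p in VISEME_TO_PHONEMES[v]
    )
    return hits / longest


def rank(phrases: List[str], visemes: List[str]) -> List[str]:
    """Order candidate phrases from best to worst lip-movement match."""
    return sorted(phrases, key=lambda ph: score(ph, visemes), reverse=True)


if __name__ == "__main__":
    sequence = ["V_bilabial", "V_open", "V_alveolar"]
    for phrase in rank(["dog", "bag", "bat"], sequence):
        print(phrase, round(score(phrase, sequence), 2))
```

The synonym suggestion of claims 9 and 19 could be layered on top by re-scoring a phrase with each candidate synonym substituted and keeping substitutions that raise the score. That would require a thesaurus resource, which is not shown here.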
for processing of video signals · CPC title
Transforming into visible information · CPC title
Synthesis of the lips movements from speech, e.g. for talking heads · CPC title
for synchronising with other signals, e.g. video signals · CPC title
using position of the lips, movement of the lips or face analysis · CPC title