System and method for animated lip synchronization
US-2018253881-A1 · Sep 6, 2018 · US
US10699705B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10699705-B2 |
| Application number | US-201816016418-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 22, 2018 |
| Priority date | Jun 22, 2018 |
| Publication date | Jun 30, 2020 |
| Grant date | Jun 30, 2020 |
Official abstract text for this publication.
Disclosed systems and methods predict visemes from an audio sequence. A viseme-generation application accesses a first set of training data that includes a first audio sequence representing a sentence spoken by a first speaker and a sequence of visemes. Each viseme is mapped to a respective audio sample of the first audio sequence. The viseme-generation application creates a second set of training data adjusting a second audio sequence spoken by a second speaker speaking the sentence such that the second and first sequences have the same length and at least one phoneme occurs at the same time stamp in the first sequence and in the second sequence. The viseme-generation application maps the sequence of visemes to the second audio sequence and trains a viseme prediction model to predict a sequence of visemes from an audio sequence.
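The alignment step in the abstract can be illustrated with a small sketch. The abstract does not name the alignment technique. Dynamic time warping (DTW) is assumed here as one common way to stretch one recording onto another. Each "audio sequence" is reduced to a toy list of one number per frame, and the function names `dtw_path` and `warp_to_reference` are hypothetical, not taken from the patent.

```python
from typing import List, Sequence, Tuple


def dtw_path(ref: Sequence[float], other: Sequence[float]) -> List[Tuple[int, int]]:
    """Classic DTW: return a monotonic alignment path of (ref_idx, other_idx) pairs."""
    n, m = len(ref), len(other)
    inf = float("inf")
    # cost[i][j] = cheapest cumulative cost of aligning ref[:i] with other[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(ref[i - 1] - other[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end, always stepping to the cheapest predecessor.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def warp_to_reference(ref: Sequence[float], other: Sequence[float]) -> List[float]:
    """Resample `other` onto the time axis of `ref`, so both have len(ref) frames."""
    warped = [0.0] * len(ref)
    for ref_idx, other_idx in dtw_path(ref, other):
        warped[ref_idx] = other[other_idx]  # the last matched frame wins
    return warped


# Toy data (assumed): one number per frame, e.g. a loudness value.
first_speaker = [0.0, 1.0, 1.0, 0.0]
first_visemes = ["sil", "Ah", "Ah", "sil"]       # one label per frame of first_speaker
second_speaker = [0.0, 0.0, 1.0, 0.0, 0.0]       # same sentence, different timing

second_aligned = warp_to_reference(first_speaker, second_speaker)
# After warping, the labels of the first recording can be reused frame by frame.
second_training_pairs = list(zip(second_aligned, first_visemes))
```

In this sketch the warped second recording has the same number of frames as the first, so the existing viseme labels apply to it without re-annotation. A real system would align multi-dimensional audio features rather than single numbers.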
Opening claim text (preview).
What is claimed is:
1. A method of predicting visemes from an audio sequence, the method comprising: accessing a first set of training data comprising: (i) a first audio sequence of samples representing a sentence spoken by a first speaker and having a first length, wherein the audio sequence represents a sequence of phonemes, and (ii) a sequence of visemes, wherein each viseme is mapped to a respective audio sample of the first audio sequence of samples; creating a second set of training data by: accessing a second audio sequence of samples representing the same sentence spoken by a second speaker and having a second length, wherein the second audio sequence of samples comprises the sequence of phonemes; adjusting the second audio sequence of samples such that (i) a second sequence length is equal to the first length and (ii) at least one phoneme occurs at an identical time stamp in the first audio sequence of samples and in the second audio sequence of samples; mapping the sequence of visemes to the second audio sequence of samples; and training a viseme prediction model to predict a sequence of visemes from the first set of training data and the second set of training data.
2. The method of claim 1, wherein training the viseme prediction model comprises: determining a feature vector for each sample of the respective audio sequence of each set of training data; providing the feature vectors to the viseme prediction model; receiving, from the viseme prediction model, a predicted viseme; calculating a loss function by calculating a difference between the predicted viseme and an expected viseme; and adjusting internal parameters of the viseme prediction model to minimize the loss function.
3. The method of claim 2, wherein the feature vector comprises: a set of mel-frequency cepstrum coefficients for the samples, a logarithm of a mean energy of the samples, and a first temporal derivative of the samples.
4. The method of claim 1, further comprising: accessing a plurality of speech samples corresponding to a time period, wherein a present subset of the speech samples corresponds to a present time period and a past subset of the speech samples corresponds to a past time period; computing a feature vector representing the plurality of speech samples; determining a sequence of predicted visemes representing speech for the present subset by applying the feature vector to the viseme prediction model trained to predict a viseme from a plurality of predetermined visemes, wherein the sequence of predicted visemes is based on the past subset and the present subset; and providing a visualization corresponding to the sequence of predicted visemes, wherein providing the visualization comprises: accessing a list of visualizations, mapping the viseme to a listed visualization, and configuring a display device to display the listed visualization.
5. The method of claim 4, further comprising: mapping each of the sequence of visemes to a frame rate; determining that a particular viseme of the sequence of visemes corresponds to a frame of video; and removing the particular viseme from the sequence of visemes.
6. The method of claim 4, further comprising: mapping each of the sequence of visemes to a frame rate; delaying an output of the sequence of predicted visemes by a predetermined number of frames; and responsive to determining that (i) a current frame includes a particular viseme and (ii) a subsequent frame and a previous frame lack the particular viseme, mapping the viseme of the previous frame to the current frame.
7. The method of claim 4, further comprising: mapping each of the sequence of visemes to a frame rate; and representing the sequence of visemes on a graphical timeline according to the frame rate.
8. A system comprising: a non-transitory computer-readable medium storing computer-executable program instructions and a processing device communicatively coupled to the non-transitory computer-readable medium for executing the computer-executable program instructions, wherein executing the computer-executable program instructions configures the processing device to perform operations comprising: accessing a plurality of speech samples corresponding to a time period, wherein a present subset of the speech samples corresponds to a present time period and a past subset of the speech samples corresponds to a past time period; computing a feature vector representing the plurality of speech samples; determining a sequence of predicted visemes representing speech for the present subset by applying the feature vector to a viseme prediction model trained with a second training data set comprising a second audio sequence spoken by a second speaker and a sequence of visemes, wherein the second training data set is created by mapping the second audio sequence to a first audio sequence; and providing a visualization corresponding to the sequence of predicted visemes, wherein providing the visualization comprises: accessing a list of visualizations, mapping each viseme of the predicted sequence of visemes to a listed visualization, and configuring a display device to display the listed visualization.
9. The system of claim 8, further comprising: increasing an amplitude of each of the plurality of speech samples; determining, from the plurality of speech samples, a speech sample that has an amplitude greater than a threshold; and reducing the amplitude of the speech sample.
10. The system of claim 8, wherein computing the feature vector further comprises: calculating a set of mel-frequency cepstrum coefficients for the plurality of speech samples, calculating a logarithm of a mean energy of the plurality of speech samples, and calculating a first temporal derivative of the plurality of speech samples.
11. The system of claim 8, the operations further comprising: mapping each of the sequence of visemes to a frame rate; delaying an output of the sequence of predicted visemes by a predetermined number of frames; and responsive to determining that (i) a current frame includes a particular viseme and (ii) a subsequent frame and a previous frame lack the particular viseme, mapping a viseme of the previous frame to the current frame.
12. The system of claim 8, the operations further comprising: mapping the sequence of predicted visemes to a frame rate; and representing the sequence of predicted visemes on a graphical timeline according to the frame rate.
13. A non-transitory computer-readable storage medium storing computer-executable program instructions, wherein when executed by a processing device, the computer-executable program instructions cause the processing device to perform operations comprising: accessing a first set of training data comprising: (i) a first audio sequence representing a sentence spoken by a first speaker and having a first length, wherein the first audio sequence represents a sequence of phonemes and has a first length, and (ii) a sequence of visemes, wherein each viseme is mapped to a respective audio sample of the first audio sequence; creating a second set of training data by: accessing a second audio sequence representing the sentence spoken by a second speaker and having a second length, wherein the second audio sequence comprises the sequence of phonemes; adjusting the first audio sequence such that (i) the first length is equal to the second length and (ii) at least one phoneme occurs at an identical time stamp in the first audio sequence and in the second audio sequence; mapping the sequence of visemes to the adjusted first audio sequence; and
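The frame-level filtering recited in claims 6 and 11 can be sketched as follows. This is a minimal illustration, not the patented implementation. It assumes one viseme label per video frame and a one-frame look-ahead, which corresponds to delaying the output by one frame. The function name `suppress_single_frame_visemes` and the viseme labels are hypothetical.

```python
from typing import List


def suppress_single_frame_visemes(frames: List[str]) -> List[str]:
    """Replace a viseme that appears for exactly one frame with the previous frame's viseme.

    A frame is treated as isolated when neither the previous output frame nor the
    next input frame carries the same viseme. The first and last frames are left
    unchanged because they lack a previous or a subsequent frame to compare with.
    """
    out = list(frames)
    for i in range(1, len(out) - 1):
        previous_differs = out[i] != out[i - 1]
        next_differs = frames[i] != frames[i + 1]
        if previous_differs and next_differs:
            out[i] = out[i - 1]  # carry the previous frame's viseme forward
    return out


# Assumed toy prediction, one label per frame; "M" flickers in for a single frame.
predicted = ["sil", "Ah", "Ah", "M", "Ah", "Ah", "sil"]
smoothed = suppress_single_frame_visemes(predicted)
```

Comparing against the already filtered previous frame is a design choice of this sketch: it prevents a run of several different one-frame visemes from being copied forward one after another.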
Recurrent networks, e.g. Hopfield networks · CPC title
Combinations of networks · CPC title
Learning methods · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Supervised learning · CPC title
Related publications grouped by family.