System and method for animated lip synchronization
US-2018253881-A1 · Sep 6, 2018 · US
US11211060B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11211060-B2 |
| Application number | US-202016887418-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 29, 2020 |
| Priority date | Jun 22, 2018 |
| Publication date | Dec 28, 2021 |
| Grant date | Dec 28, 2021 |
Official abstract text for this publication.
Disclosed systems and methods predict visemes from an audio sequence. In an example, a viseme-generation application accesses a first audio sequence that is mapped to a sequence of visemes. The first audio sequence has a first length and represents phonemes. The application adjusts a second length of a second audio sequence such that the second length equals the first length and represents the phonemes. The application adjusts the sequence of visemes to the second audio sequence such that phonemes in the second audio sequence correspond to the phonemes in the first audio sequence. The application trains a machine-learning model with the second audio sequence and the sequence of visemes. The machine-learning model predicts an additional sequence of visemes based on an additional sequence of audio.
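The length adjustment described in the abstract can be illustrated with a small sketch. The abstract does not name an alignment technique; dynamic time warping (DTW) is assumed here purely for illustration. The function names `dtw_path` and `warp_to_reference`, the one-number-per-frame "audio" values, and the viseme labels are all hypothetical. The idea shown: a second recording of the same phonemes is warped to the length of the first, so the viseme labels of the first can be reused frame by frame as training targets.

```python
from typing import List, Sequence, Tuple


def dtw_path(ref: Sequence[float], other: Sequence[float]) -> List[Tuple[int, int]]:
    """Return a minimum-cost DTW alignment path between two 1-D sequences."""
    n, m = len(ref), len(other)
    inf = float("inf")
    # cost[i][j] = best accumulated cost of aligning ref[:i] with other[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(ref[i - 1] - other[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end of both sequences to recover the path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return path


def warp_to_reference(ref: Sequence[float], other: Sequence[float]) -> List[float]:
    """Resample `other` so that it has exactly one frame per frame of `ref`."""
    aligned = {}
    for i, j in dtw_path(ref, other):
        aligned[i] = j  # keep the last frame of `other` aligned to ref frame i
    return [other[aligned[i]] for i in range(len(ref))]


# Toy data (assumed): one feature value per frame, not real audio.
ref = [0.0, 0.0, 1.0, 1.0, 0.0]                  # first sequence, 5 frames
visemes = ["sil", "sil", "Ah", "Ah", "sil"]      # labels mapped to `ref`
other = [0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]      # same content, 7 frames

warped = warp_to_reference(ref, other)
training_pairs = list(zip(warped, visemes))      # (audio frame, viseme) examples
```

After warping, `warped` has the same number of frames as `ref`, so each viseme label lines up with one frame of the second sequence. A real system would align multi-dimensional audio features rather than single numbers.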
Opening claim text (preview).
What is claimed is:
1. A method of predicting visemes from audio, the method comprising: accessing a first audio sequence that is mapped to a sequence of visemes, wherein the first audio sequence has a first length and represents phonemes; adjusting a second length of a second audio sequence such that the second length equals the first length and represents the phonemes; adjusting the sequence of visemes to the second audio sequence such that phonemes in the second audio sequence correspond to the phonemes in the first audio sequence; and training a machine-learning model with the second audio sequence and the sequence of visemes, wherein, when trained, the machine-learning model predicts an additional sequence of visemes based on an additional sequence of audio.
2. The method of claim 1, further comprising: determining, for each of the first audio sequence and the second audio sequence, a respective feature vector that comprises: a set of mel-frequency cepstrum coefficients for the respective sequence, a logarithm of a mean energy of samples in the respective sequence, and a first temporal derivative of samples in the respective sequence; and providing the feature vectors to the machine-learning model.
3. The method of claim 2, wherein generating a first temporal derivative comprises calculating a difference between a first mel-frequency cepstrum coefficient that represents audio samples prior to the respective sequence and a second mel-frequency cepstrum coefficient that represents audio samples subsequent to the respective sequence.
4. The method of claim 1, wherein training the machine-learning model comprises providing the first audio sequence to the machine-learning model.
5. The method of claim 1, further comprising providing, in real-time, the additional sequence of visemes to a display device and the additional sequence of audio to an audio device.
6. The method of claim 1, further comprising generating a visualization that corresponds to the additional sequence of visemes, wherein the generating comprises, for each viseme: accessing a list of visualizations; mapping the viseme to a visualization of list of visualizations; and configuring a display device to display the visualization.
7. The method of claim 1, wherein training the machine-learning model comprises, iteratively: receiving a sliding window of samples from the additional sequence of audio; providing the sliding window of samples to the machine-learning model; receiving, from the machine-learning model, a prediction of a viseme; and adjusting the sliding window of samples to a subsequent set of samples from the additional sequence of audio.
8. The method of claim 1, wherein the machine-learning model outputs the additional sequence of visemes at a first frame rate, the method further comprising adjusting the first frame rate of the additional sequence of visemes to match a second frame rate corresponding to an animated sequence and outputting the animated sequence on a display device.
9. The method of claim 1, wherein training the machine-learning model comprises: providing a predicted viseme to a user device; receiving, from the user device, feedback that indicates (i) whether the predicted viseme is correct or (ii) whether the predicted viseme is incorrect; and adjusting the machine-learning model based on the feedback.
10. A system comprising: a non-transitory computer-readable medium storing computer-executable program instructions; a processing device communicatively coupled to the non-transitory computer-readable medium for executing the computer-executable program instructions, wherein executing the computer-executable program instructions configures the processing device to perform operations comprising: accessing a first audio sequence that is mapped to a sequence of visemes, wherein the first audio sequence has a first length and represents phonemes; adjusting a second length of a second audio sequence such that the second length equals the first length and represents the phonemes; adjusting the sequence of visemes to the second audio sequence such that phonemes in the second audio sequence correspond to the phonemes in the first audio sequence; and training a machine-learning model with the second audio sequence and the sequence of visemes, wherein, when trained, the machine-learning model predicts an additional sequence of visemes based on an additional sequence of audio.
11. The system of claim 10, wherein the operations further comprise: determining, for each of the first audio sequence and the second audio sequence, a respective feature vector that comprises: a set of mel-frequency cepstrum coefficients for the respective sequence, a logarithm of a mean energy of samples in the respective sequence, and a first temporal derivative of samples in the respective sequence; and providing the feature vectors to the machine-learning model.
12. The system of claim 11, wherein generating a first temporal derivative comprises calculating a difference between a first mel-frequency cepstrum coefficient that represents audio samples prior to the respective sequence and a second mel-frequency cepstrum coefficient that represents audio samples subsequent to the respective sequence.
13. The system of claim 10, wherein training the machine-learning model comprises providing the first audio sequence to the machine-learning model.
14. The system of claim 10, wherein the operations further comprise providing, in real-time, the additional sequence of visemes to a display device and the additional sequence of audio to an audio device.
15. The system of claim 10, wherein the operations further comprise: generating a visualization that corresponds to the additional sequence of visemes, wherein the generating comprises, for each viseme: accessing a list of visualizations; mapping the viseme to a visualization of list of visualizations; and configuring a display device to display the listed visualization.
16. The system of claim 10, wherein training the machine-learning model comprises, iteratively: receiving a sliding window of samples from the additional sequence of audio; providing the sliding window of samples to the machine-learning model; receiving, from the machine-learning model, a prediction of a viseme; and adjusting the sliding window of samples to a subsequent set of samples from the additional sequence of audio.
17. A non-transitory computer-readable storage medium storing computer-executable program instructions, wherein when executed by a processing device, the computer-executable program instructions cause the processing device to perform operations comprising: accessing a first audio sequence that is mapped to a sequence of visemes, wherein the first audio sequence has a first length and represents phonemes; adjusting a second length of a second audio sequence such that the second length equals the first length and represents the phonemes; adjusting the sequence of visemes to the second audio sequence such that phonemes in the second audio sequence correspond to the phonemes in the first audio sequence; and training a machine-learning model with the second audio sequence and the sequence of visemes, wherein, when trained, the machine-learning model predicts an additional sequence of visemes based on an additional sequence of audio.
18. The non-transitory computer-readable storage medium of claim 17, wherein the machine-learning model outputs the additional sequence of visemes at a first frame rate, the operations f
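Two of the dependent claims above describe mechanics that are easy to picture in code: the sliding window of samples (claims 7 and 16) and the adjustment of the viseme frame rate to an animation frame rate (claim 8). The following is a minimal sketch under stated assumptions, not the claimed implementation. The function names, the window and hop sizes, the frame rates, and `toy_model` (an energy threshold standing in for a trained network) are all hypothetical. The `log_mean_energy` helper loosely mirrors the "logarithm of a mean energy" feature of claim 2; mel-frequency cepstrum coefficients are omitted.

```python
import math
from typing import Callable, List, Sequence


def log_mean_energy(window: Sequence[float]) -> float:
    """Logarithm of the mean energy (mean squared amplitude) of a window."""
    mean_energy = sum(s * s for s in window) / len(window)
    return math.log(mean_energy + 1e-10)  # small offset avoids log(0) on silence


def predict_visemes(
    samples: Sequence[float],
    window_size: int,
    hop: int,
    model: Callable[[Sequence[float]], str],
) -> List[str]:
    """Slide a window over the audio and ask `model` for one viseme per step."""
    visemes = []
    start = 0
    while start + window_size <= len(samples):
        window = samples[start:start + window_size]
        visemes.append(model(window))
        start += hop  # move the window to the subsequent set of samples
    return visemes


def retime(visemes: Sequence[str], src_fps: float, dst_fps: float) -> List[str]:
    """Resample a viseme sequence from src_fps to dst_fps.

    Each output frame takes the source frame that is active at its start time.
    """
    if not visemes:
        return []
    n_out = round(len(visemes) * dst_fps / src_fps)
    last = len(visemes) - 1
    return [visemes[min(last, int(k * src_fps / dst_fps))] for k in range(n_out)]


def toy_model(window: Sequence[float]) -> str:
    """Hypothetical stand-in for a trained model: thresholds the window energy."""
    return "Ah" if log_mean_energy(window) > math.log(0.01) else "sil"


# Toy input (assumed): 8 silent samples followed by 8 louder samples.
samples = [0.0] * 8 + [0.5] * 8
visemes = predict_visemes(samples, window_size=4, hop=4, model=toy_model)
animation_frames = retime(visemes, src_fps=100, dst_fps=50)
```

With these toy values the model emits four visemes, and `retime` halves them to two animation frames. Nearest-earlier-frame selection is only one possible retiming choice; a production system might filter or vote over neighboring frames instead.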
Recurrent networks, e.g. Hopfield networks · CPC title
Combinations of networks · CPC title
Learning methods · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Supervised learning · CPC title