What technology area does this patent fall under?

Primary CPC classification G06N3/08. Mapped technology areas include Physics.

When was this patent published?

Publication date Dec 28, 2021 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Using machine-learning models to determine movements of a mouth corresponding to live speech

US11211060B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11211060-B2
Application number	US-202016887418-A
Country	US
Kind code	B2
Filing date	May 29, 2020
Priority date	Jun 22, 2018
Publication date	Dec 28, 2021
Grant date	Dec 28, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed systems and methods predict visemes from an audio sequence. In an example, a viseme-generation application accesses a first audio sequence that is mapped to a sequence of visemes. The first audio sequence has a first length and represents phonemes. The application adjusts a second length of a second audio sequence such that the second length equals the first length and represents the phonemes. The application adjusts the sequence of visemes to the second audio sequence such that phonemes in the second audio sequence correspond to the phonemes in the first audio sequence. The application trains a machine-learning model with the second audio sequence and the sequence of visemes. The machine-learning model predicts an additional sequence of visemes based on an additional sequence of audio.
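The length-adjustment step in the abstract can be sketched in a few lines. The abstract does not name the alignment technique, so this sketch assumes dynamic time warping (DTW) over one scalar feature per audio frame. The function names `dtw_path` and `warp_to_reference` and the toy data are hypothetical and do not come from the patent.

```python
def dtw_path(a, b):
    """Return a minimum-cost monotone alignment path between sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the cheapest cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end of both sequences to recover the path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return path


def warp_to_reference(reference, other):
    """Warp `other` onto the timeline of `reference` so both have equal length."""
    warped = [None] * len(reference)
    for i, j in dtw_path(reference, other):
        # If several frames of `other` match reference frame i, keep the last one.
        warped[i] = other[j]
    return warped


if __name__ == "__main__":
    # Toy data (assumption): one feature value per frame, one viseme label per frame.
    first_audio = [0.0, 1.0, 1.0, 0.0]
    visemes = ["sil", "Ah", "Ah", "sil"]
    second_audio = [0.0, 0.0, 1.0, 0.0, 0.0]  # same content, different timing

    warped = warp_to_reference(first_audio, second_audio)
    # The warped second sequence now shares the first sequence's viseme labels.
    training_pairs = list(zip(warped, visemes))
    print(training_pairs)
```

Because the warped sequence has the same number of frames as the first one, the existing viseme labels can be reused for it without relabeling, which is the effect the abstract describes. Real audio would use multi-dimensional frame features and a vector distance in place of `abs`.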

First claim

Opening claim text (preview).

What is claimed is:

1. A method of predicting visemes from audio, the method comprising: accessing a first audio sequence that is mapped to a sequence of visemes, wherein the first audio sequence has a first length and represents phonemes; adjusting a second length of a second audio sequence such that the second length equals the first length and represents the phonemes; adjusting the sequence of visemes to the second audio sequence such that phonemes in the second audio sequence correspond to the phonemes in the first audio sequence; and training a machine-learning model with the second audio sequence and the sequence of visemes, wherein, when trained, the machine-learning model predicts an additional sequence of visemes based on an additional sequence of audio.

2. The method of claim 1, further comprising: determining, for each of the first audio sequence and the second audio sequence, a respective feature vector that comprises: a set of mel-frequency cepstrum coefficients for the respective sequence, a logarithm of a mean energy of samples in the respective sequence, and a first temporal derivative of samples in the respective sequence; and providing the feature vectors to the machine-learning model.

3. The method of claim 2, wherein generating a first temporal derivative comprises calculating a difference between a first mel-frequency cepstrum coefficient that represents audio samples prior to the respective sequence and a second mel-frequency cepstrum coefficient that represents audio samples subsequent to the respective sequence.

4. The method of claim 1, wherein training the machine-learning model comprises providing the first audio sequence to the machine-learning model.

5. The method of claim 1, further comprising providing, in real-time, the additional sequence of visemes to a display device and the additional sequence of audio to an audio device.

6. The method of claim 1, further comprising generating a visualization that corresponds to the additional sequence of visemes, wherein the generating comprises, for each viseme: accessing a list of visualizations; mapping the viseme to a visualization of list of visualizations; and configuring a display device to display the visualization.

7. The method of claim 1, wherein training the machine-learning model comprises, iteratively: receiving a sliding window of samples from the additional sequence of audio; providing the sliding window of samples to the machine-learning model; receiving, from the machine-learning model, a prediction of a viseme; and adjusting the sliding window of samples to a subsequent set of samples from the additional sequence of audio.

8. The method of claim 1, wherein the machine-learning model outputs the additional sequence of visemes at a first frame rate, the method further comprising adjusting the first frame rate of the additional sequence of visemes to match a second frame rate corresponding to an animated sequence and outputting the animated sequence on a display device.

9. The method of claim 1, wherein training the machine-learning model comprises: providing a predicted viseme to a user device; receiving, from the user device, feedback that indicates (i) whether the predicted viseme is correct or (ii) whether the predicted viseme is incorrect; and adjusting the machine-learning model based on the feedback.

10. A system comprising: a non-transitory computer-readable medium storing computer-executable program instructions; a processing device communicatively coupled to the non-transitory computer-readable medium for executing the computer-executable program instructions, wherein executing the computer-executable program instructions configures the processing device to perform operations comprising: accessing a first audio sequence that is mapped to a sequence of visemes, wherein the first audio sequence has a first length and represents phonemes; adjusting a second length of a second audio sequence such that the second length equals the first length and represents the phonemes; adjusting the sequence of visemes to the second audio sequence such that phonemes in the second audio sequence correspond to the phonemes in the first audio sequence; and training a machine-learning model with the second audio sequence and the sequence of visemes, wherein, when trained, the machine-learning model predicts an additional sequence of visemes based on an additional sequence of audio.

11. The system of claim 10, wherein the operations further comprise: determining, for each of the first audio sequence and the second audio sequence, a respective feature vector that comprises: a set of mel-frequency cepstrum coefficients for the respective sequence, a logarithm of a mean energy of samples in the respective sequence, and a first temporal derivative of samples in the respective sequence; and providing the feature vectors to the machine-learning model.

12. The system of claim 11, wherein generating a first temporal derivative comprises calculating a difference between a first mel-frequency cepstrum coefficient that represents audio samples prior to the respective sequence and a second mel-frequency cepstrum coefficient that represents audio samples subsequent to the respective sequence.

13. The system of claim 10, wherein training the machine-learning model comprises providing the first audio sequence to the machine-learning model.

14. The system of claim 10, wherein the operations further comprise providing, in real-time, the additional sequence of visemes to a display device and the additional sequence of audio to an audio device.

15. The system of claim 10, wherein the operations further comprise: generating a visualization that corresponds to the additional sequence of visemes, wherein the generating comprises, for each viseme: accessing a list of visualizations; mapping the viseme to a visualization of list of visualizations; and configuring a display device to display the listed visualization.

16. The system of claim 10, wherein training the machine-learning model comprises, iteratively: receiving a sliding window of samples from the additional sequence of audio; providing the sliding window of samples to the machine-learning model; receiving, from the machine-learning model, a prediction of a viseme; and adjusting the sliding window of samples to a subsequent set of samples from the additional sequence of audio.

17. A non-transitory computer-readable storage medium storing computer-executable program instructions, wherein when executed by a processing device, the computer-executable program instructions cause the processing device to perform operations comprising: accessing a first audio sequence that is mapped to a sequence of visemes, wherein the first audio sequence has a first length and represents phonemes; adjusting a second length of a second audio sequence such that the second length equals the first length and represents the phonemes; adjusting the sequence of visemes to the second audio sequence such that phonemes in the second audio sequence correspond to the phonemes in the first audio sequence; and training a machine-learning model with the second audio sequence and the sequence of visemes, wherein, when trained, the machine-learning model predicts an additional sequence of visemes based on an additional sequence of audio.

18. The non-transitory computer-readable storage medium of claim 17, wherein the machine-learning model outputs the additional sequence of visemes at a first frame rate, the operations f
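The feature vector recited in claims 2 and 3 and the sliding window recited in claim 7 can be illustrated with a small sketch. It assumes the mel-frequency cepstrum coefficients are already computed per frame by some other tool. The function names, the `eps` guard against taking the logarithm of zero, the edge-frame clamping, and the sign convention of the derivative (following frame minus preceding frame) are all assumptions made for illustration, not details taken from the claims.

```python
import math


def log_mean_energy(samples, eps=1e-10):
    """Logarithm of the mean energy (mean squared amplitude) of one frame.

    `eps` is an assumption added here so silent frames do not hit log(0).
    """
    mean_energy = sum(s * s for s in samples) / len(samples)
    return math.log(mean_energy + eps)


def delta(coeffs, t):
    """First temporal derivative at frame t.

    Computed as the coefficients of the following frame minus those of the
    preceding frame; edge frames are clamped (an assumption).
    """
    prev = coeffs[max(t - 1, 0)]
    nxt = coeffs[min(t + 1, len(coeffs) - 1)]
    return [b - a for a, b in zip(prev, nxt)]


def feature_vector(coeffs, frames, t):
    """Concatenate coefficients, log mean energy, and the derivative for frame t."""
    return list(coeffs[t]) + [log_mean_energy(frames[t])] + delta(coeffs, t)


def sliding_windows(items, size):
    """Yield consecutive windows of `size` items, advancing one item at a time."""
    for start in range(len(items) - size + 1):
        yield items[start:start + size]


if __name__ == "__main__":
    # Toy inputs (assumption): 3 frames, 2 precomputed coefficients per frame.
    coeffs = [[1.0, 2.0], [2.0, 4.0], [4.0, 8.0]]
    frames = [[0.1, -0.1], [0.5, -0.5], [1.0, -1.0]]
    features = [feature_vector(coeffs, frames, t) for t in range(len(coeffs))]
    for window in sliding_windows(features, 2):
        # A trained model would receive `window` and return one viseme.
        print(len(window), "frames in window")
```

In use, each window would be passed to the trained model to obtain one viseme prediction before the window advances to the next set of samples, as claim 7 describes.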

Assignees

Adobe Inc

Inventors

Classifications

G06N3/044
Recurrent networks, e.g. Hopfield networks · CPC title
G06N3/045
Combinations of networks · CPC title
G06N3/08 · Primary
Learning methods · CPC title
G06N3/0442
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
G06N3/09
Supervised learning · CPC title

Patent family

Related publications grouped by family.

View patent family 66381236

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11211060B2 cover?: Disclosed systems and methods predict visemes from an audio sequence. In an example, a viseme-generation application accesses a first audio sequence that is mapped to a sequence of visemes. The first audio sequence has a first length and represents phonemes. The application adjusts a second length of a second audio sequence such that the second length equals the first length and represents the …
Who is the assignee on this patent?: Adobe Inc