Using machine-learning models to determine movements of a mouth corresponding to live speech

US10699705B2 · US · B2

Patent metadata
Publication number: US-10699705-B2
Application number: US-201816016418-A
Country: US
Kind code: B2
Filing date: Jun 22, 2018
Priority date: Jun 22, 2018
Publication date: Jun 30, 2020
Grant date: Jun 30, 2020

Abstract

Official abstract text for this publication.

Disclosed systems and methods predict visemes from an audio sequence. A viseme-generation application accesses a first set of training data that includes a first audio sequence representing a sentence spoken by a first speaker and a sequence of visemes. Each viseme is mapped to a respective audio sample of the first audio sequence. The viseme-generation application creates a second set of training data by adjusting a second audio sequence, spoken by a second speaker speaking the sentence, such that the second and first sequences have the same length and at least one phoneme occurs at the same time stamp in the first sequence and in the second sequence. The viseme-generation application maps the sequence of visemes to the second audio sequence and trains a viseme prediction model to predict a sequence of visemes from an audio sequence.
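The adjustment step in the abstract can be illustrated with a small sketch. The text on this page does not name the alignment algorithm, so the example below assumes dynamic time warping (DTW) over one scalar feature per frame; the function names `dtw_path` and `warp_to_reference`, the toy feature values, and the viseme labels are all hypothetical. The second speaker's sequence is warped to the length of the first, after which the first speaker's viseme labels can be reused frame by frame.

```python
def dtw_path(ref, other):
    """Return the DTW alignment path between two 1-D feature sequences."""
    n, m = len(ref), len(other)
    inf = float("inf")
    # cost[i][j] = cheapest cumulative cost of aligning ref[:i] with other[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(ref[i - 1] - other[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end, always stepping to the cheapest predecessor.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return path


def warp_to_reference(ref, other):
    """Resample `other` so it has one frame per frame of `ref`."""
    matched = {}
    for i, j in dtw_path(ref, other):
        matched[i] = j  # keep the last match for each reference frame
    return [other[matched[i]] for i in range(len(ref))]


# Toy data (assumed): one feature per frame, plus labels for speaker 1 only.
first_audio = [0, 0, 1, 1, 1, 0]
first_visemes = ["S", "S", "Aa", "Aa", "Aa", "S"]
second_audio = [0, 1, 1, 0, 0]  # same sentence, different timing and length

warped_second = warp_to_reference(first_audio, second_audio)
# The warped sequence now has the first length, so labels carry over directly.
second_training_set = list(zip(warped_second, first_visemes))
```

In this sketch the labels never move; only the second recording is stretched or compressed. That is one reading of "mapping the sequence of visemes to the second audio sequence", not necessarily the patented procedure.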

First claim

Opening claim text (preview).

What is claimed is:

1. A method of predicting visemes from an audio sequence, the method comprising: accessing a first set of training data comprising: (i) a first audio sequence of samples representing a sentence spoken by a first speaker and having a first length, wherein the audio sequence represents a sequence of phonemes, and (ii) a sequence of visemes, wherein each viseme is mapped to a respective audio sample of the first audio sequence of samples; creating a second set of training data by: accessing a second audio sequence of samples representing the same sentence spoken by a second speaker and having a second length, wherein the second audio sequence of samples comprises the sequence of phonemes; adjusting the second audio sequence of samples such that (i) a second sequence length is equal to the first length and (ii) at least one phoneme occurs at an identical time stamp in the first audio sequence of samples and in the second audio sequence of samples; mapping the sequence of visemes to the second audio sequence of samples; and training a viseme prediction model to predict a sequence of visemes from the first set of training data and the second set of training data.

2. The method of claim 1, wherein training the viseme prediction model comprises: determining a feature vector for each sample of the respective audio sequence of each set of training data; providing the feature vectors to the viseme prediction model; receiving, from the viseme prediction model, a predicted viseme; calculating a loss function by calculating a difference between the predicted viseme and an expected viseme; and adjusting internal parameters of the viseme prediction model to minimize the loss function.

3. The method of claim 2, wherein the feature vector comprises: a set of mel-frequency cepstrum coefficients for the samples, a logarithm of a mean energy of the samples, and a first temporal derivative of the samples.

4. The method of claim 1, further comprising: accessing a plurality of speech samples corresponding to a time period, wherein a present subset of the speech samples corresponds to a present time period and a past subset of the speech samples corresponds to a past time period; computing a feature vector representing the plurality of speech samples; determining a sequence of predicted visemes representing speech for the present subset by applying the feature vector to the viseme prediction model trained to predict a viseme from a plurality of predetermined visemes, wherein the sequence of predicted visemes is based on the past subset and the present subset; and providing a visualization corresponding to the sequence of predicted visemes, wherein providing the visualization comprises: accessing a list of visualizations, mapping the viseme to a listed visualization, and configuring a display device to display the listed visualization.

5. The method of claim 4, further comprising: mapping each of the sequence of visemes to a frame rate; determining that a particular viseme of the sequence of visemes corresponds to a frame of video; and removing the particular viseme from the sequence of visemes.

6. The method of claim 4, further comprising: mapping each of the sequence of visemes to a frame rate; delaying an output of the sequence of predicted visemes by a predetermined number of frames; and responsive to determining that (i) a current frame includes a particular viseme and (ii) a subsequent frame and a previous frame lack the particular viseme, mapping the viseme of the previous frame to the current frame.

7. The method of claim 4, further comprising: mapping each of the sequence of visemes to a frame rate; and representing the sequence of visemes on a graphical timeline according to the frame rate.

8. A system comprising: a non-transitory computer-readable medium storing computer-executable program instructions and a processing device communicatively coupled to the non-transitory computer-readable medium for executing the computer-executable program instructions, wherein executing the computer-executable program instructions configures the processing device to perform operations comprising: accessing a plurality of speech samples corresponding to a time period, wherein a present subset of the speech samples corresponds to a present time period and a past subset of the speech samples corresponds to a past time period; computing a feature vector representing the plurality of speech samples; determining a sequence of predicted visemes representing speech for the present subset by applying the feature vector to a viseme prediction model trained with a second training data set comprising a second audio sequence spoken by a second speaker and a sequence of visemes, wherein the second training data set is created by mapping the second audio sequence to a first audio sequence; and providing a visualization corresponding to the sequence of predicted visemes, wherein providing the visualization comprises: accessing a list of visualizations, mapping each viseme of the predicted sequence of visemes to a listed visualization, and configuring a display device to display the listed visualization.

9. The system of claim 8, further comprising: increasing an amplitude of each of the plurality of speech samples; determining, from the plurality of speech samples, a speech sample that has an amplitude greater than a threshold; and reducing the amplitude of the speech sample.

10. The system of claim 8, wherein computing the feature vector further comprises: calculating a set of mel-frequency cepstrum coefficients for the plurality of speech samples, calculating a logarithm of a mean energy of the plurality of speech samples, and calculating a first temporal derivative of the plurality of speech samples.

11. The system of claim 8, the operations further comprising: mapping each of the sequence of visemes to a frame rate; delaying an output of the sequence of predicted visemes by a predetermined number of frames; and responsive to determining that (i) a current frame includes a particular viseme and (ii) a subsequent frame and a previous frame lack the particular viseme, mapping a viseme of the previous frame to the current frame.

12. The system of claim 8, the operations further comprising: mapping the sequence of predicted visemes to a frame rate; and representing the sequence of predicted visemes on a graphical timeline according to the frame rate.

13. A non-transitory computer-readable storage medium storing computer-executable program instructions, wherein when executed by a processing device, the computer-executable program instructions cause the processing device to perform operations comprising: accessing a first set of training data comprising: (i) a first audio sequence representing a sentence spoken by a first speaker and having a first length, wherein the first audio sequence represents a sequence of phonemes and has a first length, and (ii) a sequence of visemes, wherein each viseme is mapped to a respective audio sample of the first audio sequence; creating a second set of training data by: accessing a second audio sequence representing the sentence spoken by a second speaker and having a second length, wherein the second audio sequence comprises the sequence of phonemes; adjusting the first audio sequence such that (i) the first length is equal to the second length and (ii) at least one phoneme occurs at an identical time stamp in the first audio sequence and in the second audio sequence; mapping the sequence of visemes to the adjusted first audio sequence; and
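The frame-level clean-up recited in claims 6 and 11 can be sketched as follows. This is a minimal illustration, assuming one predicted viseme per video frame and a delay of exactly one frame, which is what allows the filter to look at the subsequent frame before emitting the current one. The function name `suppress_single_frame_visemes` and the viseme labels are hypothetical.

```python
def suppress_single_frame_visemes(frames):
    """Replace any viseme that lasts exactly one frame with the previous output.

    A frame counts as isolated when neither its predecessor nor its successor
    in the raw prediction carries the same viseme. The replacement is taken
    from the already-filtered output, so two isolated frames in a row do not
    reintroduce a one-frame flicker (a design choice of this sketch).
    """
    out = list(frames)
    for t in range(1, len(frames) - 1):
        cur = frames[t]
        if frames[t - 1] != cur and frames[t + 1] != cur:
            out[t] = out[t - 1]
    return out


# One predicted viseme per frame; "M" appears for a single frame only.
predicted = ["Aa", "Aa", "M", "Aa", "Oh", "Oh"]
smoothed = suppress_single_frame_visemes(predicted)
```

The first and last frames are left untouched because they lack one of the two neighbors the condition requires. A streaming implementation would hold back one frame of output rather than scan a complete list.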

Assignees

Adobe Inc

Inventors

Classifications

  • Recurrent networks, e.g. Hopfield networks · CPC title

  • Combinations of networks · CPC title

  • G06N3/08 · Primary

    Learning methods · CPC title

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title

  • Supervised learning · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10699705B2 cover?
Disclosed systems and methods predict visemes from an audio sequence. A viseme-generation application accesses a first set of training data that includes a first audio sequence representing a sentence spoken by a first speaker and a sequence of visemes. Each viseme is mapped to a respective audio sample of the first audio sequence. The viseme-generation application creates a second set of train…
Who is the assignee on this patent?
Adobe Inc
What technology area does this patent fall under?
Primary CPC classification G06N3/08. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jun 30, 2020 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).