Who is the assignee on this patent?

Wang Lijuan, Soong Frank, Microsoft Technology Licensing LLC

What technology area does this patent fall under?

Primary CPC classification G10L21/10. Mapped technology areas include Physics.

When was this patent published?

Publication date Aug 8, 2017 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Photo-realistic synthesis of image sequences with lip movements synchronized with speech

US9728203B2 · US · B2

Patent metadata
Field	Value
Publication number	US-9728203-B2
Application number	US-201113098488-A
Country	US
Kind code	B2
Filing date	May 2, 2011
Priority date	May 2, 2011
Publication date	Aug 8, 2017
Grant date	Aug 8, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Audiovisual data of an individual reading a known script is obtained and stored in an audio library and an image library. The audiovisual data is processed to extract feature vectors used to train a statistical model. An input audio feature vector corresponding to desired speech with which a synthesized image sequence will be synchronized is provided. The statistical model is used to generate a trajectory of visual feature vectors that corresponds to the input audio feature vector. These visual feature vectors are used to identify a matching image sequence from the image library. The resulting sequence of images, concatenated from the image library, provides a photorealistic image sequence with lip movements synchronized with the desired speech.
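The flow in the abstract (train a model on paired audio and visual features, predict a visual trajectory for new audio, then look up matching library images) can be illustrated with a toy sketch. This is not the patent's method: the page does not name the statistical model, so a scalar least-squares line stands in for it, and all function names, image ids, and feature values below are hypothetical.

```python
from statistics import mean

def fit_linear(audio, visual):
    """Least-squares fit visual ~ a * audio + b for scalar features.
    Stand-in (assumption) for the patent's unspecified statistical model."""
    mean_a, mean_v = mean(audio), mean(visual)
    var = sum((x - mean_a) ** 2 for x in audio)
    cov = sum((x - mean_a) * (y - mean_v) for x, y in zip(audio, visual))
    a = cov / var
    return a, mean_v - a * mean_a

def predict_trajectory(model, audio_in):
    """Map each input acoustic feature to a predicted visual feature."""
    a, b = model
    return [a * x + b for x in audio_in]

def nearest_images(trajectory, library):
    """For each predicted visual feature, pick the library image whose
    stored visual feature is closest. library: list of (image_id, feature)."""
    return [min(library, key=lambda item: abs(item[1] - v))[0]
            for v in trajectory]

# Hypothetical training pairs: one acoustic and one visual feature per frame.
train_audio = [0.0, 1.0, 2.0, 3.0]
train_visual = [0.0, 2.0, 4.0, 6.0]

# Hypothetical image library: each real sample image keeps its visual feature.
library = [("img_closed", 0.0), ("img_half", 2.0),
           ("img_open", 4.0), ("img_wide", 6.0)]

model = fit_linear(train_audio, train_visual)
trajectory = predict_trajectory(model, [0.2, 1.9, 3.0])
frames = nearest_images(trajectory, library)
print(frames)
```

The lookup here matches each frame independently, which ignores how smoothly consecutive images join; the claims address that with a separate concatenation cost.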

First claim

Opening claim text (preview).

What is claimed is:
1. A computer-implemented method for generating photo-realistic facial animation synchronized with speech, comprising: storing, in a computer memory or computer storage device, a statistical model of audiovisual data over time, based on acoustic feature vectors obtained from actual audio data and visual feature vectors obtained from real sample images of an individual's articulators during speech; storing, in an image library, the real sample images of the individual's articulators during speech, including storing for each of the stored real sample images the visual feature vectors obtained from the real sample image as used to generate the statistical model; receiving an input set of acoustic feature vectors for the speech with which the facial animation is to be synchronized; using a computer processor, applying the received input set of acoustic feature vectors to the statistical model, the statistical model thereby generating a visual feature vector sequence; selecting, using a computer processor, a sequence of real sample images from the image library, such that the selected sequence matches the visual feature vector sequence generated by the statistical model by comparing visual feature vectors in the visual feature vector sequence with visual feature vectors associated with the real sample images in the image library; and using the computer processor, concatenating the selected sequence of real sample images to provide a photo-realistic image sequence of a talking head with lips movements synchronized with the speech.
2. The computer-implemented method of claim 1, further comprising generating the statistical model, the generating comprising: obtaining actual audiovisual data including real sample images of the individual's articulators for a set of utterances; extracting the acoustic feature vectors and the visual feature vectors for each sample of the audiovisual data; and training the statistical model using the acoustic feature vectors and the visual feature vectors.
3. The computer-implemented method of claim 1, wherein generating the visual feature vector sequence comprises maximizing a likelihood function with respect to the input acoustic feature vectors and the statistical model.
4. The computer-implemented method of claim 1, wherein selecting the sequence of real sample images comprises selecting a set of real sample images that minimizes a cost function.
5. The computer-implemented method of claim 4, wherein the cost function comprises a target cost indicative of a difference between a visual feature vector in the generated visual feature vector sequence and a visual feature vector related to a real sample image.
6. The computer-implemented method of claim 5, wherein the cost function comprises a concatenation cost indicative of a difference between adjacent real sample images in the selected sequence of real sample images.
7. The computer-implemented method of claim 1, wherein selecting the sequence of real sample images from the image library comprises identifying a sequence of real sample images from the image library having visual feature vectors that matches the generated visual feature vector sequence based on both a target cost and a concatenation cost.
8. A computer system for generating photo-realistic facial animation with speech, comprising: a computer memory or computer storage device storing a statistical model of audiovisual data over time, based on acoustic feature vectors obtained from actual audio data and visual feature vectors obtained from real sample images of an individual's articulators during a set of utterances; an image library storing the real sample images of the individual's articulators during the set of utterances, the image library further storing for each of the stored real sample images the visual feature vectors obtained from the real sample image as used to generate the statistical model; a synthesis module having an input for receiving an input set of feature vectors for speech with which the facial animation is to be synchronized, and providing as an output a visual feature vector sequence corresponding to the input set of feature vectors according to the statistical model; an image selection module having an input for receiving the visual feature vector sequence from the output of the synthesis module, and accessing the image library using the received visual feature vector sequence to generate an output providing a sequence of real sample images from the image library having visual feature vectors that match the visual feature vectors in the visual feature vector sequence received from the synthesis module by comparing visual feature vectors in the visual feature vector sequence with visual feature vectors associated with the real sample images in the image library; and a synthesis module having an input for receiving the sequence of real sample images from the image selection module, and concatenating the real sample images to provide a photo-realistic image sequence of a talking head with lips movements synchronized with the speech.
9. The computer system of claim 8, further comprising: a training module having an input receiving acoustic feature vectors and visual feature vectors from the audiovisual data of an individual's articulators during a set of utterances and providing as an output a statistical model of the audiovisual data over time.
10. The computer system of claim 9, wherein the training module comprises: a feature extraction module having an input for receiving the audiovisual data and providing an output including the acoustic feature vectors and the visual feature vectors corresponding to each sample of the audiovisual data; and a statistical model training module having an input for receiving the acoustic feature vectors and the visual feature vectors and providing as an output the statistical model.
11. The computer system of claim 8, wherein the synthesis module implements a maximum likelihood function with respect to the input acoustic feature vectors and the statistical model.
12. The computer system of claim 8, wherein the image selection module implements a cost function and identifies a set of real sample images that minimizes the cost function.
13. The computer system of claim 12, wherein the cost function comprises a target cost indicative of a difference between a visual feature vector in the visual feature vector sequence and a visual feature vector related to a real sample image.
14. The computer system of claim 13, wherein the cost function comprises a concatenation cost indicative of a difference between adjacent real sample images in the sequence of real sample images.
15. The computer system of claim 8, wherein the image selection module accesses the image library using the visual feature vector sequence to identify a sequence of real sample images from the image library having visual feature vectors that matches the visual feature vector sequence based on both a target cost and a concatenation cost.
16. A computer program product comprising: a computer memory or computer storage device; computer program instructions stored on the computer storage medium that, when processed by a computing device, instruct the computing device to perform a method for generating photo-realistic facial animation with speech, comprising: storing in a computer storage medium a statistical model of audiovisual data over time, based on acoustic feature vectors obtained from actual audio data and visual feature vectors obtained from real sample images of an individual's articulators d
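Claims 4 to 7 and 12 to 15 describe choosing the image sequence by minimizing a cost function made of a target cost and a concatenation cost. One common way to minimize such a sum over a whole sequence is dynamic programming (a Viterbi-style search). The sketch below takes that approach as an assumption: the claims shown here do not specify the search procedure, the distance measure, or any weighting, so the Euclidean distances, the `w_concat` weight, and all names are hypothetical.

```python
from math import dist

def select_sequence(targets, library, w_concat=1.0):
    """Choose one library image per target frame so that the sum of
    target costs plus w_concat times the sum of concatenation costs
    is minimal. library: list of (image_id, feature_tuple)."""
    ids = [image_id for image_id, _ in library]
    feats = [feat for _, feat in library]
    n = len(feats)
    # best[j]: lowest total cost of any path ending in image j at this frame
    best = [dist(feats[j], targets[0]) for j in range(n)]
    backptrs = []
    for target in targets[1:]:
        new_best, ptrs = [], []
        for j in range(n):
            # cheapest predecessor, counting the join (concatenation) cost
            i = min(range(n),
                    key=lambda k: best[k] + w_concat * dist(feats[k], feats[j]))
            ptrs.append(i)
            new_best.append(best[i]
                            + w_concat * dist(feats[i], feats[j])  # concatenation cost
                            + dist(feats[j], target))              # target cost
        best = new_best
        backptrs.append(ptrs)
    # trace the cheapest path back from the last frame
    j = min(range(n), key=best.__getitem__)
    path = [j]
    for ptrs in reversed(backptrs):
        j = ptrs[j]
        path.append(j)
    path.reverse()
    return [ids[j] for j in path]

# Hypothetical library and generated visual feature trajectory.
library = [("A", (0.0, 0.0)), ("B", (1.0, 0.0)), ("C", (2.0, 0.0))]
targets = [(0.0, 0.0), (1.6, 0.0), (0.0, 0.0)]

smooth = select_sequence(targets, library, w_concat=0.2)
greedy = select_sequence(targets, library, w_concat=0.0)
print(smooth, greedy)
```

With the concatenation weight set to zero the search reduces to per-frame nearest matching and picks `C` for the middle frame; with a nonzero weight it prefers `B`, which is slightly further from the target but joins more smoothly with its neighbours.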

Assignees

Inventors

Classifications

G10L2021/105
Synthesis of the lips movements from speech, e.g. for talking heads · CPC title
G10L21/10 · Primary
Transforming into visible information · CPC title

Patent family

Related publications grouped by family.

View patent family 47090831
