System and method for rendering three dimensional face model based on audio stream and image data

US11113859B1 · US · B1

Patent metadata
Field: Value
Publication number: US-11113859-B1
Application number: US-201916507862-A
Country: US
Kind code: B1
Filing date: Jul 10, 2019
Priority date: Jul 10, 2019
Publication date: Sep 7, 2021
Grant date: Sep 7, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed herein includes a system, a method, and a non-transitory computer readable medium for rendering a three-dimensional (3D) model of an avatar according to an audio stream including a vocal output of a person and image data capturing a face of the person. In one aspect, phonemes of the vocal output are predicted according to the audio stream, and the predicted phonemes of the vocal output are translated into visemes. In one aspect, a plurality of blendshapes and corresponding weights are determined, according to the corresponding image data of the face, to form the 3D model of the avatar of the person. The visemes may be combined with the 3D model of the avatar to form a 3D representation of the avatar, by synchronizing the visemes with the 3D model of the avatar in time.
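The audio half of the abstract (predict phonemes, translate them to visemes, synchronize in time with the face model) can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration only: the `PHONEME_TO_VISEME` table, the `Frame` record, the nearest-timestamp matching rule, and the sample probabilities are hypothetical and are not taken from the patent, which does not specify a phoneme set, a mapping table, or a synchronization rule.

```python
from dataclasses import dataclass

# Hypothetical phoneme-to-viseme table; the patent does not list one.
PHONEME_TO_VISEME = {
    "p": "PP", "b": "PP", "m": "PP",   # lips closed
    "f": "FF", "v": "FF",              # lower lip against teeth
    "aa": "AA",                        # open mouth
    "sil": "SIL",                      # silence / neutral mouth
}


@dataclass
class Frame:
    time: float      # time instance in seconds
    weights: dict    # blendshape weights estimated from the image at this time
    viseme: str      # mouth shape chosen from the audio at this time


def predict_phoneme(probabilities: dict) -> str:
    """Pick the most probable phoneme for one audio window."""
    return max(probabilities, key=probabilities.get)


def synchronize(audio_windows, face_frames):
    """Attach to each face frame the viseme of the audio window nearest in time.

    audio_windows: list of (time, {phoneme: probability})
    face_frames:   list of (time, {blendshape name: weight})
    """
    frames = []
    for time, weights in face_frames:
        _, probs = min(audio_windows, key=lambda w: abs(w[0] - time))
        phoneme = predict_phoneme(probs)
        # Unknown phonemes fall back to the neutral mouth (an assumption).
        frames.append(Frame(time, weights, PHONEME_TO_VISEME.get(phoneme, "SIL")))
    return frames


# Made-up sample data: three audio windows and three video frames.
audio = [
    (0.00, {"sil": 0.9, "p": 0.1}),
    (0.04, {"p": 0.7, "aa": 0.3}),
    (0.08, {"aa": 0.8, "p": 0.2}),
]
faces = [(0.00, {"smile": 0.2}), (0.05, {"smile": 0.3}), (0.08, {"smile": 0.4})]

for frame in synchronize(audio, faces):
    print(frame.time, frame.viseme, frame.weights)
```

Nearest-timestamp matching is only one possible way to align the two streams; interpolation between neighbouring visemes would be an equally plausible choice under the same description.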

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: receiving, by one or more processors through a microphone, a vocal output of a person over a plurality of time instances, as an audio stream; acquiring, by the one or more processors through an imaging device, images of facial expressions of the person over the plurality of time instances, as corresponding image data; predicting, by the one or more processors, phonemes of the vocal output according to the audio stream; translating, by the one or more processors, the predicted phonemes of the vocal output into visemes; determining, by the one or more processors, a plurality of blendshapes and corresponding weights, according to the corresponding image data, to form a three-dimensional (3D) model of an avatar of the person incorporating the facial expressions of the person, the plurality of blendshapes comprising 3D structures, the corresponding weights indicating an amount of transformation applied to the 3D structures of the 3D model of the avatar; combining, by the one or more processors, the visemes with the 3D model of the avatar to form a 3D representation of the avatar, by synchronizing the visemes with the 3D model of the avatar over the plurality of time instances; and rendering, by the one or more processors, the 3D representation of the avatar with the incorporated facial expressions of the person in synchronization with rendering of the audio stream.

2. The method of claim 1, wherein predicting the phonemes of the vocal output comprises determining, by the one or more processors, probabilities of the phonemes of the vocal output according to the audio stream.

3. The method of claim 1, comprising determining, by the one or more processors, the plurality of blendshapes and corresponding weights according to a facial action coding system.

4. The method of claim 1, comprising determining the corresponding weights by: determining, by the one or more processors, a base set of landmarks of a face of the person from the corresponding image data; determining, by the one or more processors, a first set of weights; determining, by the one or more processors, a first candidate set of landmarks using a 3D model of the face formed according to the first set of weights; comparing, by the one or more processors, the base set of landmarks and the first candidate set of landmarks; and determining, by the one or more processors, a second set of weights that results in a second candidate set of landmarks, and a difference between the second candidate set of landmarks and the base set of landmarks that is less than a difference between the first candidate set of landmarks and the base set of landmarks.

5. The method of claim 1, comprising generating the 3D model of the avatar by combining, by the one or more processors, the plurality of blendshapes according to the corresponding weights.

6. The method of claim 5, wherein combining the visemes with the 3D model of the avatar includes morphing or replacing, by the one or more processors, at least a portion of a mouth of the 3D model corresponding to a first time instance, according to one of the visemes corresponding to the first time instance.

7. A system comprising: a microphone; an imaging device; one or more processors coupled to the microphone and the imaging device; and a non-transitory computer readable medium storing instructions when executed by the one or more processors cause the one or more processors to: receive, through the microphone, a vocal output of a person over a plurality of time instances, as an audio stream; acquire, through the imaging device, images of facial expressions of the person over the plurality of time instances, as corresponding image data; predict phonemes of the vocal output according to the audio stream; translate the predicted phonemes of the vocal output into visemes; determine a plurality of blendshapes and corresponding weights, according to the corresponding image data of the face, to form a three-dimensional (3D) model of an avatar of the person incorporating the facial expressions of the person, the plurality of blendshapes comprising 3D structures, the corresponding weights indicating an amount of transformation applied to the 3D structures of the 3D model of the avatar; combine the visemes with the 3D model of the avatar to form a 3D representation of the avatar, by synchronizing the visemes with the 3D model of the avatar over the plurality of time instances; and render the 3D representation of the avatar with the incorporated facial expressions of the person in synchronization with rendering of the audio stream.

8. The system of claim 7, wherein the one or more processors predict the phonemes of the vocal output by determining probabilities of the phonemes of the vocal output according to the audio stream.

9. The system of claim 7, wherein the one or more processors determine the plurality of blendshapes and corresponding weights according to a facial action coding system.

10. The system of claim 7, wherein the one or more processors determine the corresponding weights by: determining a base set of landmarks of a face of the person from the corresponding image data; determining a first set of weights; determining a first candidate set of landmarks using a 3D model of the face formed according to the first set of weights; comparing the base set of landmarks and the first candidate set of landmarks; and determining a second set of weights that results in a second candidate set of landmarks, and a difference between the second candidate set of landmarks and the base set of landmarks that is less than a difference between the first candidate set of landmarks and the base set of landmarks.

11. The system of claim 7, wherein the one or more processors generate the 3D model of the avatar by combining the plurality of blendshapes according to the corresponding weights.

12. The system of claim 11, wherein the one or more processors combine the visemes with the 3D model of the avatar by morphing or replacing at least a portion of a mouth of the 3D model corresponding to a first time instance, according to one of the visemes corresponding to the first time instance.

13. A non-transitory computer readable medium storing instructions when executed by one or more processors cause the one or more processors to: receive, through a microphone, a vocal output of a person over a plurality of time instances, as an audio stream; acquire, through an imaging device, images of facial expressions of the person over the plurality of time instances, as corresponding image data; predict phonemes of the vocal output according to the audio stream; translate the predicted phonemes of the vocal output into visemes; determine a plurality of blendshapes and corresponding weights, according to the corresponding image data of the face, to form a three-dimensional (3D) model of an avatar of the person incorporating the facial expressions of the person, the plurality of blendshapes comprising 3D structures, the corresponding weights indicating an amount of transformation applied to the 3D structures of the 3D model of the avatar; combine the visemes with the 3D model of the avatar to form a 3D representation of the avatar, by synchronizing the visemes with the 3D model of the avatar over the plurality of time instances; and render the 3D representation of the avatar with the incorporated facial expressions of the person in synchronization with rendering of the audio stream.

14. The non-transitory computer readable medium of claim 13, wherein the instructions that cause the one or more process
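The weight-fitting steps recited in claims 4 and 10, together with the blendshape combination in claims 5 and 11, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the claims only require that a second set of weights yield candidate landmarks closer to the base landmarks than the first set did, and they do not name an optimizer. The coordinate search used here, the linear blend formula, the squared-distance measure, the clamping of weights to the range 0 to 1, and all function names and sample data are hypothetical choices made for this sketch.

```python
def landmarks_from_weights(neutral, deltas, weights):
    """Linear blend (assumed form): landmark = neutral + sum_i weight_i * delta_i.

    neutral: list of (x, y, z) landmark positions of the neutral face
    deltas:  one list per blendshape of (dx, dy, dz) offsets per landmark
    """
    return [
        tuple(
            n + sum(w * d[k][axis] for w, d in zip(weights, deltas))
            for axis, n in enumerate(point)
        )
        for k, point in enumerate(neutral)
    ]


def difference(a, b):
    """Sum of squared coordinate differences between two landmark sets."""
    return sum((p - q) ** 2 for pa, pb in zip(a, b) for p, q in zip(pa, pb))


def fit_weights(base, neutral, deltas, rounds=50, step=0.25):
    """Search for weights whose candidate landmarks approach the base landmarks.

    A candidate set of weights is accepted only when its difference from the
    base landmarks is smaller than that of the current set.
    """
    weights = [0.0] * len(deltas)  # first set of weights (assumed start)
    best = difference(landmarks_from_weights(neutral, deltas, weights), base)
    for _ in range(rounds):
        improved = False
        for i in range(len(weights)):
            for change in (step, -step):
                candidate = list(weights)
                candidate[i] = min(1.0, max(0.0, candidate[i] + change))
                err = difference(
                    landmarks_from_weights(neutral, deltas, candidate), base
                )
                if err < best:  # keep only strictly better candidates
                    weights, best, improved = candidate, err, True
        if not improved:
            step /= 2  # refine the search once no move helps
    return weights, best


# Made-up example: two landmarks, two blendshapes.
neutral = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
deltas = [
    [(0.0, 1.0, 0.0), (0.0, 0.0, 0.0)],  # blendshape 0 lifts landmark 0
    [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)],  # blendshape 1 pushes landmark 1 forward
]
# Base landmarks, standing in for landmarks detected in an image.
base = landmarks_from_weights(neutral, deltas, [0.5, 0.25])

weights, error = fit_weights(base, neutral, deltas)
print(weights, error)
```

Accepting a candidate only when it lowers the difference mirrors the comparison described in the claims; a gradient-based or least-squares solver would satisfy the same description.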

Assignees

Inventors

Classifications

  • Phonemes, fenemes or fenones being the recognition units · CPC title

  • Transforming into visible information · CPC title

  • Morphing · CPC title

  • G06T13/205 (Primary)

    driven by audio data · CPC title

  • of characters, e.g. humans, animals or virtual beings · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11113859B1 cover?
Disclosed herein includes a system, a method, and a non-transitory computer readable medium for rendering a three-dimensional (3D) model of an avatar according to an audio stream including a vocal output of a person and image data capturing a face of the person. In one aspect, phonemes of the vocal output are predicted according to the audio stream, and the predicted phonemes of the vocal outpu…
Who is the assignee on this patent?
Facebook Tech Llc
What technology area does this patent fall under?
Primary CPC classification G06T13/205. Mapped technology areas include Physics.
When was this patent published?
Publication date: Sep 7, 2021 (kind code B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).