US11113859B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-11113859-B1 |
| Application number | US-201916507862-A |
| Country | US |
| Kind code | B1 |
| Filing date | Jul 10, 2019 |
| Priority date | Jul 10, 2019 |
| Publication date | Sep 7, 2021 |
| Grant date | Sep 7, 2021 |
Official abstract text for this publication.
Disclosed herein includes a system, a method, and a non-transitory computer readable medium for rendering a three-dimensional (3D) model of an avatar according to an audio stream including a vocal output of a person and image data capturing a face of the person. In one aspect, phonemes of the vocal output are predicted according to the audio stream, and the predicted phonemes of the vocal output are translated into visemes. In one aspect, a plurality of blendshapes and corresponding weights are determined, according to the corresponding image data of the face, to form the 3D model of the avatar of the person. The visemes may be combined with the 3D model of the avatar to form a 3D representation of the avatar, by synchronizing the visemes with the 3D model of the avatar in time.
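The audio path in the abstract (phoneme prediction, phoneme-to-viseme translation, and synchronization in time) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `PHONEME_TO_VISEME` table, the viseme names, the nearest-timestamp pairing rule, and the sample frames are all assumptions made for the example, and the per-frame phoneme probabilities stand in for the output of a speech model the abstract does not specify.

```python
# Hypothetical phoneme-to-viseme table; the patent text does not list one.
PHONEME_TO_VISEME = {
    "p": "lips_closed", "b": "lips_closed", "m": "lips_closed",
    "f": "lip_to_teeth", "v": "lip_to_teeth",
    "aa": "jaw_open", "iy": "lips_spread", "uw": "lips_rounded",
    "sil": "neutral",
}


def predict_phoneme(probabilities):
    """Pick the most probable phoneme from a {phoneme: probability} dict."""
    return max(probabilities, key=probabilities.get)


def visemes_for_stream(audio_frames):
    """Translate (time, phoneme-probabilities) pairs into (time, viseme) pairs."""
    return [
        (t, PHONEME_TO_VISEME.get(predict_phoneme(probs), "neutral"))
        for t, probs in audio_frames
    ]


def synchronize(viseme_track, face_track):
    """Pair each face-model frame with the viseme nearest to it in time."""
    combined = []
    for t, blend_weights in face_track:
        _, viseme = min(viseme_track, key=lambda tv: abs(tv[0] - t))
        combined.append((t, blend_weights, viseme))
    return combined


# Made-up sample data: audio frames and face frames on slightly different clocks.
audio_frames = [
    (0.00, {"sil": 0.9, "m": 0.1}),
    (0.04, {"m": 0.7, "aa": 0.3}),
    (0.08, {"aa": 0.8, "m": 0.2}),
]
face_frames = [
    (0.00, {"smile": 0.1}),
    (0.05, {"smile": 0.4}),
    (0.08, {"smile": 0.6}),
]

viseme_track = visemes_for_stream(audio_frames)
avatar_frames = synchronize(viseme_track, face_frames)
```

Each entry of `avatar_frames` carries a time instance, the blendshape weights for that instance, and the viseme chosen for it, which is the kind of per-frame record a renderer would need to keep the mouth shape aligned with audio playback.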
Opening claim text (preview).
What is claimed is:
1. A method comprising: receiving, by one or more processors through a microphone, a vocal output of a person over a plurality of time instances, as an audio stream; acquiring, by the one or more processors through an imaging device, images of facial expressions of the person over the plurality of time instances, as corresponding image data; predicting, by the one or more processors, phonemes of the vocal output according to the audio stream; translating, by the one or more processors, the predicted phonemes of the vocal output into visemes; determining, by the one or more processors, a plurality of blendshapes and corresponding weights, according to the corresponding image data, to form a three-dimensional (3D) model of an avatar of the person incorporating the facial expressions of the person, the plurality of blendshapes comprising 3D structures, the corresponding weights indicating an amount of transformation applied to the 3D structures of the 3D model of the avatar; combining, by the one or more processors, the visemes with the 3D model of the avatar to form a 3D representation of the avatar, by synchronizing the visemes with the 3D model of the avatar over the plurality of time instances; and rendering, by the one or more processors, the 3D representation of the avatar with the incorporated facial expressions of the person in synchronization with rendering of the audio stream.
2. The method of claim 1, wherein predicting the phonemes of the vocal output comprises determining, by the one or more processors, probabilities of the phonemes of the vocal output according to the audio stream.
3. The method of claim 1, comprising determining, by the one or more processors, the plurality of blendshapes and corresponding weights according to a facial action coding system.
4. The method of claim 1, comprising determining the corresponding weights by: determining, by the one or more processors, a base set of landmarks of a face of the person from the corresponding image data; determining, by the one or more processors, a first set of weights; determining, by the one or more processors, a first candidate set of landmarks using a 3D model of the face formed according to the first set of weights; comparing, by the one or more processors, the base set of landmarks and the first candidate set of landmarks; and determining, by the one or more processors, a second set of weights that results in a second candidate set of landmarks, and a difference between the second candidate set of landmarks and the base set of landmarks that is less than a difference between the first candidate set of landmarks and the base set of landmarks.
5. The method of claim 1, comprising generating the 3D model of the avatar by combining, by the one or more processors, the plurality of blendshapes according to the corresponding weights.
6. The method of claim 5, wherein combining the visemes with the 3D model of the avatar includes morphing or replacing, by the one or more processors, at least a portion of a mouth of the 3D model corresponding to a first time instance, according to one of the visemes corresponding to the first time instance.
7. A system comprising: a microphone; an imaging device; one or more processors coupled to the microphone and the imaging device; and a non-transitory computer readable medium storing instructions when executed by the one or more processors cause the one or more processors to: receive, through the microphone, a vocal output of a person over a plurality of time instances, as an audio stream; acquire, through the imaging device, images of facial expressions of the person over the plurality of time instances, as corresponding image data; predict phonemes of the vocal output according to the audio stream; translate the predicted phonemes of the vocal output into visemes; determine a plurality of blendshapes and corresponding weights, according to the corresponding image data of the face, to form a three-dimensional (3D) model of an avatar of the person incorporating the facial expressions of the person, the plurality of blendshapes comprising 3D structures, the corresponding weights indicating an amount of transformation applied to the 3D structures of the 3D model of the avatar; combine the visemes with the 3D model of the avatar to form a 3D representation of the avatar, by synchronizing the visemes with the 3D model of the avatar over the plurality of time instances; and render the 3D representation of the avatar with the incorporated facial expressions of the person in synchronization with rendering of the audio stream.
8. The system of claim 7, wherein the one or more processors predict the phonemes of the vocal output by determining probabilities of the phonemes of the vocal output according to the audio stream.
9. The system of claim 7, wherein the one or more processors determine the plurality of blendshapes and corresponding weights according to a facial action coding system.
10. The system of claim 7, wherein the one or more processors determine the corresponding weights by: determining a base set of landmarks of a face of the person from the corresponding image data; determining a first set of weights; determining a first candidate set of landmarks using a 3D model of the face formed according to the first set of weights; comparing the base set of landmarks and the first candidate set of landmarks; and determining a second set of weights that results in a second candidate set of landmarks, and a difference between the second candidate set of landmarks and the base set of landmarks that is less than a difference between the first candidate set of landmarks and the base set of landmarks.
11. The system of claim 7, wherein the one or more processors generate the 3D model of the avatar by combining the plurality of blendshapes according to the corresponding weights.
12. The system of claim 11, wherein the one or more processors combine the visemes with the 3D model of the avatar by morphing or replacing at least a portion of a mouth of the 3D model corresponding to a first time instance, according to one of the visemes corresponding to the first time instance.
13. A non-transitory computer readable medium storing instructions when executed by one or more processors cause the one or more processors to: receive, through a microphone, a vocal output of a person over a plurality of time instances, as an audio stream; acquire, through an imaging device, images of facial expressions of the person over the plurality of time instances, as corresponding image data; predict phonemes of the vocal output according to the audio stream; translate the predicted phonemes of the vocal output into visemes; determine a plurality of blendshapes and corresponding weights, according to the corresponding image data of the face, to form a three-dimensional (3D) model of an avatar of the person incorporating the facial expressions of the person, the plurality of blendshapes comprising 3D structures, the corresponding weights indicating an amount of transformation applied to the 3D structures of the 3D model of the avatar; combine the visemes with the 3D model of the avatar to form a 3D representation of the avatar, by synchronizing the visemes with the 3D model of the avatar over the plurality of time instances; and render the 3D representation of the avatar with the incorporated facial expressions of the person in synchronization with rendering of the audio stream.
14. The non-transitory computer readable medium of claim 13, wherein the instructions that cause the one or more process
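The blendshape steps recited in claims 4 and 5 (combining blendshapes according to weights, then choosing a second set of weights whose candidate landmarks lie closer to the base landmarks than the first set's) can be illustrated with a small sketch. Everything concrete here is an assumption for the example: the three-vertex mesh, the two blendshape names, the use of every vertex as a landmark, the squared-distance error, and the coordinate-search update rule. The claims only require that the second set of weights reduces the landmark difference, not any particular optimizer.

```python
# Toy neutral face: three 3D vertices, all treated as landmarks (assumption).
NEUTRAL = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.5, -1.0, 0.0)]

# Hypothetical blendshapes stored as per-vertex offsets from the neutral face.
BLENDSHAPES = {
    "jaw_open": [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, -0.5, 0.0)],
    "smile": [(-0.2, 0.1, 0.0), (0.2, 0.1, 0.0), (0.0, 0.0, 0.0)],
}


def combine(weights):
    """Form a face by adding each blendshape offset scaled by its weight."""
    mesh = []
    for i, (x, y, z) in enumerate(NEUTRAL):
        for name, w in weights.items():
            dx, dy, dz = BLENDSHAPES[name][i]
            x, y, z = x + w * dx, y + w * dy, z + w * dz
        mesh.append((x, y, z))
    return mesh


def landmark_error(candidate, base):
    """Sum of squared coordinate differences between two landmark sets."""
    return sum((a - b) ** 2 for p, q in zip(candidate, base) for a, b in zip(p, q))


def refine(weights, base_landmarks, step=0.05):
    """Nudge each weight by +/- step and keep a change only if the error shrinks."""
    best = dict(weights)
    best_err = landmark_error(combine(best), base_landmarks)
    for name in weights:
        for delta in (step, -step):
            trial = dict(best)
            trial[name] = min(1.0, max(0.0, trial[name] + delta))
            err = landmark_error(combine(trial), base_landmarks)
            if err < best_err:
                best, best_err = trial, err
    return best, best_err


# Stand-in for landmarks detected in the image data (made-up target expression).
base_landmarks = combine({"jaw_open": 0.6, "smile": 0.3})

first_weights = {"jaw_open": 0.0, "smile": 0.0}
first_err = landmark_error(combine(first_weights), base_landmarks)
second_weights, second_err = refine(first_weights, base_landmarks)

# Repeating the refinement walks the weights toward the target expression.
fitted, fitted_err = second_weights, second_err
for _ in range(20):
    fitted, fitted_err = refine(fitted, base_landmarks)
```

In this sketch `second_weights` plays the role of the claimed second set of weights, because it is accepted only when its landmark error is below that of `first_weights`. A production system would more likely use a least-squares or gradient-based solver over 2D-projected landmarks, but the accept-if-closer structure is the same.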
Phonemes, fenemes or fenones being the recognition units · CPC title
Transforming into visible information · CPC title
Morphing · CPC title
driven by audio data · CPC title
of characters, e.g. humans, animals or virtual beings · CPC title