Synthetic audio-driven body animation using voice tempo

US12333639B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12333639-B2
Application number  US-202118007867-A
Country             US
Kind code           B2
Filing date         Nov 8, 2021
Priority date       Nov 8, 2021
Publication date    Jun 17, 2025
Grant date          Jun 17, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In various examples, animations may be generated using audio-driven body animation synthesized with voice tempo. For example, full body animation may be driven from an audio input representative of recorded speech, where voice tempo (e.g., a number of phonemes per unit time) may be used to generate a 1D audio signal for comparing to datasets including data samples that each include an animation and a corresponding 1D audio signal. One or more loss functions may be used to compare the 1D audio signal from the input audio to the audio signals of the datasets, as well as to compare joint information of joints of an actor between animations of two or more data samples, in order to identify optimal transition points between the animations. The animations may then be stitched together—e.g., using interpolation and/or a neural network trained to seamlessly stitch sequences together—using the transition points.
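The first half of the abstract (a 1D tempo signal compared against dataset signals with a loss function) can be illustrated with a minimal sketch. This is not the patented implementation: the patent's claims describe a neural network producing the signal, whereas this sketch assumes phoneme onset times are already available and simply counts them in a sliding window. The function names, the window and hop sizes, and the use of mean squared difference for this step are all assumptions made for illustration.

```python
import numpy as np

def tempo_signal(phoneme_onsets, duration, window=1.0, hop=0.1):
    """Hypothetical helper: phonemes per second over time.

    Counts phoneme onsets (in seconds) inside a sliding window of
    `window` seconds that advances by `hop` seconds.
    """
    onsets = np.asarray(phoneme_onsets, dtype=float)
    starts = np.arange(0.0, max(duration - window, 0.0) + 1e-9, hop)
    counts = [np.sum((onsets >= s) & (onsets < s + window)) for s in starts]
    return np.asarray(counts, dtype=float) / window

def audio_loss(query, candidate):
    """Mean squared difference over the overlapping part of two 1D signals."""
    n = min(len(query), len(candidate))
    if n == 0:
        raise ValueError("signals must not be empty")
    q = np.asarray(query[:n], dtype=float)
    c = np.asarray(candidate[:n], dtype=float)
    return float(np.mean((q - c) ** 2))

def closest_samples(query, dataset_signals, k=2):
    """Indices of the k dataset signals with the smallest audio loss."""
    losses = [audio_loss(query, s) for s in dataset_signals]
    return [int(i) for i in np.argsort(losses)[:k]]

# Example: three quick phonemes, a pause, then one more phoneme.
query = tempo_signal([0.1, 0.2, 0.3, 1.5], duration=2.0, window=1.0, hop=0.5)
dataset = [np.array([3.0, 0.0, 1.0]), np.zeros(3), np.array([3.0, 1.0, 1.0])]
best = closest_samples(query, dataset, k=2)
```

In this toy dataset each entry would be paired with an animation clip; the indices in `best` identify the clips whose tempo profile is closest to the input speech.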

First claim

Opening claim text (preview).

What is claimed is:

1. A processor comprising: one or more circuits to: generate an audio signal from input audio data; compute, using a first loss function, one or more differences between the audio signal and audio signals of a plurality of data samples; determine, based at least on the one or more differences, at least a first data sample and a second data sample from the plurality of data samples, the first data sample including a first audio signal corresponding to a first animation and the second data sample including a second audio signal corresponding to a second animation; determine, using the first loss function and a second loss function that compares the first animation and the second animation, a transition point between the first audio signal and the second audio signal; and based at least on the transition point, generate an animation based at least on combining at least a portion of the first animation and at least a portion of the second animation.

2. The processor of claim 1, wherein the animation is generated using the one or more circuits by stitching at least the portion of the first animation with at least an initial portion of the second animation using interpolation between one or more angles corresponding to one or more joints of an animated actor in the first animation and one or more joints of the animated actor in at least the initial portion of the second animation.

3. The processor of claim 1, wherein the animation is generated using the one or more circuits by stitching at least the portion of the first animation with at least the portion of the second animation using a deep neural network trained to generate intermediate animation frames between animations.

4. The processor of claim 1, wherein the audio signal includes a one-dimensional audio signal representative of a tempo of the input audio data.

5. The processor of claim 1, wherein the second loss function is based on one or more second differences between at least one of: one or more first locations of one or more joints of an actor in the first animation and one or more second locations of the one or more joints of the actor in the second animation, or one or more first velocities of the one or more joints of the actor in the first animation and one or more second velocities of the one or more joints of the actor in the second animation.

6. The processor of claim 5, wherein at least one of the one or more differences or the one or more second differences are computed using a mean squared difference.

7. The processor of claim 1, wherein the audio signal is generated using a neural network that includes one or more first layers to compute a latent space feature representation of the input audio data and one or more second layers to compute the audio signal using the latent space feature representation.

8. The processor of claim 1, further comprising processing circuitry to cause display of the animation on at least one of: a heads up display of a machine, a display of a dashboard or instrument panel of a machine, a display of a center console of a machine, a display of a computing device, a display of a smart-home device, a display of a mobile device, a display of a virtual reality (VR), augmented reality (AR), or mixed reality (MR) device, or a display of a wearable device.

9. The processor of claim 1, wherein the animation corresponds to an animated actor associated with at least one of: an intelligent virtual assistant, a character in a gaming application, an assistant in a chat or video conferencing application, or a translator in a sign language application.

10. The processor of claim 1, wherein the processor is comprised in at least one of: a control system for an autonomous or semi-autonomous machine; a system for performing simulation operations; a system for performing deep learning operations; a system implemented using an edge device; a system implemented using a robot; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.

11. The processor of claim 1, the one or more circuits further to: compute, using the second loss function, one or more scores indicative of one or more magnitudes of one or more differences between one or more first poses of an actor in the first animation and one or more second poses of the actor in the second animation, wherein the determination of the transition point is further based at least on the one or more scores.

12. A system comprising: one or more microphones; one or more memory units; and one or more processing units comprising processing circuity to: generate, using a neural network, an audio signal representative of a tempo associated with input audio data obtained using the one or more microphones; determine, based at least on a first computed difference between the audio signal and each of a plurality of audio signals associated with a dataset, at least a first data sample and a second data sample; determine, based at least on a second computed difference between one or more joints of an actor in a first animation associated with the first data sample and the one or more joints of the actor in a second animation associated with the second data sample, a transition point between the first animation and the second animation; and generate an animation based at least on combining at least a portion of the first animation with at least a portion of the second animation based at least on the transition point.

13. The system of claim 12, wherein the tempo corresponds to a number of phonetic units pronounced in a given time unit.

14. The system of claim 12, wherein the first computed difference is computed using a first loss function and the second computed difference is computed using a second loss function.

15. The system of claim 12, wherein the neural network includes one or more first layers to compute a latent space feature representation of the input audio data and one or more second layers to compute the audio signal using the latent space feature representation.

16. The system of claim 12, wherein the second computed difference corresponds to differences between at least one of: locations of the one or more joints of an actor in the first animation and locations of the one or more joints of the actor in the second animation, or velocities of the one or more joints of the actor in the first animation and velocities of the one or more joints of the actor in the second animation.

17. The system of claim 12, wherein the system is comprised in at least one of: a control system for an autonomous or semi-autonomous machine; a system for performing simulation operations; a system for performing deep learning operations; a system implemented using an edge device; a system implemented using a robot; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.

18. The system of claim 12, wherein: the first data sample includes a first audio signal, the second data sample includes a second audio signal, the first computed difference between the audio signal and each of the first audio signal and the second audio signal is less than a threshold, and the determination of the at least the first data sample and the second data sample is based at least on the first computed dif
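Claims 1, 2, 5, and 6 together describe choosing a transition point with an audio loss plus a joint-based loss, then stitching by interpolation. The following is a rough sketch of that idea under stated assumptions, not the claimed implementation: animations are assumed to be arrays of shape (frames, joints, 3), each audio signal is assumed to have one value per animation frame, velocities are taken as simple frame-to-frame differences, and the weights `w_vel` and `w_audio` are hypothetical. The blend is plain linear interpolation of the stored joint values; blending true joint rotations would more commonly use a rotation-aware method such as quaternion slerp.

```python
import numpy as np

def transition_costs(anim_a, sig_a, anim_b, sig_b, w_vel=1.0, w_audio=1.0):
    """Cost of jumping from frame i of clip A to frame j of clip B.

    Combines mean squared differences of joint locations, joint
    velocities, and per-frame audio signal values.
    """
    sig_a, sig_b = np.asarray(sig_a, float), np.asarray(sig_b, float)
    vel_a, vel_b = np.diff(anim_a, axis=0), np.diff(anim_b, axis=0)
    pos_a, pos_b = anim_a[:-1], anim_b[:-1]  # frames that have a velocity
    pos = ((pos_a[:, None] - pos_b[None]) ** 2).mean(axis=(2, 3))
    vel = ((vel_a[:, None] - vel_b[None]) ** 2).mean(axis=(2, 3))
    audio = (sig_a[:-1, None] - sig_b[None, :-1]) ** 2
    return pos + w_vel * vel + w_audio * audio

def find_transition(anim_a, sig_a, anim_b, sig_b, **weights):
    """Frame pair (i, j) with the lowest combined cost."""
    costs = transition_costs(anim_a, sig_a, anim_b, sig_b, **weights)
    i, j = np.unravel_index(np.argmin(costs), costs.shape)
    return int(i), int(j)

def stitch(anim_a, anim_b, i, j, blend_frames=4):
    """Clip A up to frame i, interpolated in-between frames, clip B from frame j."""
    w = np.linspace(0.0, 1.0, blend_frames + 2)[1:-1][:, None, None]
    blend = (1.0 - w) * anim_a[i] + w * anim_b[j]
    return np.concatenate([anim_a[: i + 1], blend, anim_b[j:]], axis=0)

# Example: one joint moving along x; clip B starts where clip A's frame 2 is.
a = np.zeros((4, 1, 3)); a[:, 0, 0] = [0, 1, 2, 3]
b = np.zeros((4, 1, 3)); b[:, 0, 0] = [2, 3, 4, 5]
i, j = find_transition(a, np.zeros(4), b, np.zeros(4))
combined = stitch(a, b, i, j, blend_frames=2)
```

Searching every frame pair is quadratic in clip length, which is acceptable for short clips; the deep-neural-network stitching of claim 3 is not covered by this sketch.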

Assignees

Nvidia Corp

Inventors

Classifications

  • G06T13/40 · Primary

    of characters, e.g. humans, animals or virtual beings · CPC title

  • Non-supervised learning, e.g. competitive learning · CPC title

  • Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound · CPC title

  • Ensemble learning · CPC title

  • using kernel methods, e.g. support vector machines [SVM] · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12333639B2 cover?
In various examples, animations may be generated using audio-driven body animation synthesized with voice tempo. For example, full body animation may be driven from an audio input representative of recorded speech, where voice tempo (e.g., a number of phonemes per unit time) may be used to generate a 1D audio signal for comparing to datasets including data samples that each include an animation…
Who is the assignee on this patent?
Nvidia Corp
What technology area does this patent fall under?
Primary CPC classification G06T13/40. Mapped technology areas include Physics.
When was this patent published?
Publication date Jun 17, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).