Harmony-aware human motion synthesis with music
US-2023005201-A1 · Jan 5, 2023 · US
US12333639B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12333639-B2 |
| Application number | US-202118007867-A |
| Country | US |
| Kind code | B2 |
| Filing date | Nov 8, 2021 |
| Priority date | Nov 8, 2021 |
| Publication date | Jun 17, 2025 |
| Grant date | Jun 17, 2025 |
Official abstract text for this publication.
In various examples, animations may be generated using audio-driven body animation synthesized with voice tempo. For example, full body animation may be driven from an audio input representative of recorded speech, where voice tempo (e.g., a number of phonemes per unit time) may be used to generate a 1D audio signal for comparing to datasets including data samples that each include an animation and a corresponding 1D audio signal. One or more loss functions may be used to compare the 1D audio signal from the input audio to the audio signals of the datasets, as well as to compare joint information of joints of an actor between animations of two or more data samples, in order to identify optimal transition points between the animations. The animations may then be stitched together—e.g., using interpolation and/or a neural network trained to seamlessly stitch sequences together—using the transition points.
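The matching step in the abstract can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the function names (`tempo_signal`, `rank_samples`, `find_transition`), the fixed-window phoneme count, the `pose_weight` parameter, and the choice of mean squared difference for both losses are all assumptions made for the example. The abstract only says that a 1D tempo signal is compared to dataset signals with one loss and that joint information is compared with another.

```python
from typing import List, Sequence, Tuple


def tempo_signal(phoneme_times: Sequence[float], duration: float,
                 window: float = 1.0) -> List[float]:
    """Toy 1D tempo signal: phonemes per second in consecutive windows.

    Assumes phoneme onset times (seconds) are already available; the
    abstract does not say how they are obtained.
    """
    n = int(duration / window)
    counts = [0] * n
    for t in phoneme_times:
        i = int(t / window)
        if t >= 0 and i < n:
            counts[i] += 1
    return [c / window for c in counts]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def rank_samples(query: Sequence[float],
                 dataset_signals: List[Sequence[float]]) -> List[int]:
    """Dataset indices ordered from best to worst audio match (first loss)."""
    return sorted(range(len(dataset_signals)),
                  key=lambda i: mse(query, dataset_signals[i]))


def find_transition(query: Sequence[float],
                    audio_a: Sequence[float], audio_b: Sequence[float],
                    poses_a: Sequence[Sequence[float]],
                    poses_b: Sequence[Sequence[float]],
                    pose_weight: float = 1.0) -> Tuple[int, float]:
    """Pick the frame t at which to switch from sample A to sample B.

    Cost = audio loss of the spliced signal against the query
         + pose_weight * pose loss between A and B at frame t.
    Each pose is a flat list of joint coordinates (assumed layout).
    """
    n = len(query)
    if n < 2:
        raise ValueError("need at least two frames")
    best_t, best_cost = -1, float("inf")
    for t in range(1, n):
        spliced = list(audio_a[:t]) + list(audio_b[t:])
        cost = mse(query, spliced) + pose_weight * mse(poses_a[t], poses_b[t])
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost
```

With a query signal of `[0, 0, 1, 1]`, a slow sample `[0, 0, 0, 0]`, and a fast sample `[1, 1, 1, 1]`, the audio term alone favors switching at frame 2. A large pose mismatch at that frame can push the chosen transition elsewhere, which is the trade-off the two loss functions are meant to balance.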
Opening claim text (preview).
What is claimed is:
1. A processor comprising: one or more circuits to: generate an audio signal from input audio data; compute, using a first loss function, one or more differences between the audio signal and audio signals of a plurality of data samples; determine, based at least on the one or more differences, at least a first data sample and a second data sample from the plurality of data samples, the first data sample including a first audio signal corresponding to a first animation and the second data sample including a second audio signal corresponding to a second animation; determine, using the first loss function and a second loss function that compares the first animation and the second animation, a transition point between the first audio signal and the second audio signal; and based at least on the transition point, generate an animation based at least on combining at least a portion of the first animation and at least a portion of the second animation.
2. The processor of claim 1, wherein the animation is generated using the one or more circuits by stitching at least the portion of the first animation with at least an initial portion of the second animation using interpolation between one or more angles corresponding to one or more joints of an animated actor in the first animation and one or more joints of the animated actor in at least the initial portion of the second animation.
3. The processor of claim 1, wherein the animation is generated using the one or more circuits by stitching at least the portion of the first animation with at least the portion of the second animation using a deep neural network trained to generate intermediate animation frames between animations.
4. The processor of claim 1, wherein the audio signal includes a one-dimensional audio signal representative of a tempo of the input audio data.
5. The processor of claim 1, wherein the second loss function is based on one or more second differences between at least one of: one or more first locations of one or more joints of an actor in the first animation and one or more second locations of the one or more joints of the actor in the second animation, or one or more first velocities of the one or more joints of the actor in the first animation and one or more second velocities of the one or more joints of the actor in the second animation.
6. The processor of claim 5, wherein at least one of the one or more differences or the one or more second differences are computed using a mean squared difference.
7. The processor of claim 1, wherein the audio signal is generated using a neural network that includes one or more first layers to compute a latent space feature representation of the input audio data and one or more second layers to compute the audio signal using the latent space feature representation.
8. The processor of claim 1, further comprising processing circuitry to cause display of the animation on at least one of: a heads up display of a machine, a display of a dashboard or instrument panel of a machine, a display of a center console of a machine, a display of a computing device, a display of a smart-home device, a display of a mobile device, a display of a virtual reality (VR), augmented reality (AR), or mixed reality (MR) device, or a display of a wearable device.
9. The processor of claim 1, wherein the animation corresponds to an animated actor associated with at least one of: an intelligent virtual assistant, a character in a gaming application, an assistant in a chat or video conferencing application, or a translator in a sign language application.
10. The processor of claim 1, wherein the processor is comprised in at least one of: a control system for an autonomous or semi-autonomous machine; a system for performing simulation operations; a system for performing deep learning operations; a system implemented using an edge device; a system implemented using a robot; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.
11. The processor of claim 1, the one or more circuits further to: compute, using the second loss function, one or more scores indicative of one or more magnitudes of one or more differences between one or more first poses of an actor in the first animation and one or more second poses of the actor in the second animation, wherein the determination of the transition point is further based at least on the one or more scores.
12. A system comprising: one or more microphones; one or more memory units; and one or more processing units comprising processing circuity to: generate, using a neural network, an audio signal representative of a tempo associated with input audio data obtained using the one or more microphones; determine, based at least on a first computed difference between the audio signal and each of a plurality of audio signals associated with a dataset, at least a first data sample and a second data sample; determine, based at least on a second computed difference between one or more joints of an actor in a first animation associated with the first data sample and the one or more joints of the actor in a second animation associated with the second data sample, a transition point between the first animation and the second animation; and generate an animation based at least on combining at least a portion of the first animation with at least a portion of the second animation based at least on the transition point.
13. The system of claim 12, wherein the tempo corresponds to a number of phonetic units pronounced in a given time unit.
14. The system of claim 12, wherein the first computed difference is computed using a first loss function and the second computed difference is computed using a second loss function.
15. The system of claim 12, wherein the neural network includes one or more first layers to compute a latent space feature representation of the input audio data and one or more second layers to compute the audio signal using the latent space feature representation.
16. The system of claim 12, wherein the second computed difference corresponds to differences between at least one of: locations of the one or more joints of an actor in the first animation and locations of the one or more joints of the actor in the second animation, or velocities of the one or more joints of the actor in the first animation and velocities of the one or more joints of the actor in the second animation.
17. The system of claim 12, wherein the system is comprised in at least one of: a control system for an autonomous or semi-autonomous machine; a system for performing simulation operations; a system for performing deep learning operations; a system implemented using an edge device; a system implemented using a robot; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.
18. The system of claim 12, wherein: the first data sample includes a first audio signal, the second data sample includes a second audio signal, the first computed difference between the audio signal and each of the first audio signal and the second audio signal is less than a threshold, and the determination of the at least the first data sample and the second data sample is based at least on the first computed dif
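Claim 2 describes stitching two animations by interpolating joint angles around the transition point. The sketch below is one possible reading of that idea, not the claimed implementation: it assumes each frame is a flat list of joint angles in radians, cross-fades over a hypothetical `blend_frames` window, and uses shortest-arc linear interpolation. A production system would more likely interpolate rotations as quaternions.

```python
import math
from typing import List, Sequence

Frame = List[float]  # one joint angle (radians) per joint; assumed layout


def lerp_angle(a: float, b: float, w: float) -> float:
    """Interpolate from angle a to angle b along the shortest arc."""
    delta = (b - a + math.pi) % (2 * math.pi) - math.pi
    return a + delta * w


def stitch(anim_a: Sequence[Frame], anim_b: Sequence[Frame],
           transition: int, blend_frames: int) -> List[Frame]:
    """Play A up to `transition`, cross-fade for `blend_frames`, then play B.

    Both animations are assumed to be time-aligned, so frame t of A and
    frame t of B refer to the same instant of the driving audio.
    """
    end = transition + blend_frames
    if transition < 0 or end > min(len(anim_a), len(anim_b)):
        raise ValueError("blend window falls outside one of the animations")
    out: List[Frame] = [list(f) for f in anim_a[:transition]]
    for k in range(blend_frames):
        w = (k + 1) / (blend_frames + 1)  # weight rises strictly between 0 and 1
        fa, fb = anim_a[transition + k], anim_b[transition + k]
        out.append([lerp_angle(x, y, w) for x, y in zip(fa, fb)])
    out.extend(list(f) for f in anim_b[end:])
    return out
```

The blend weight never reaches 0 or 1 inside the window, so the frames just before and just after the cross-fade come unmodified from the first and second animation respectively. Claim 3 covers the alternative of generating the intermediate frames with a trained neural network, which this sketch does not attempt.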
of characters, e.g. humans, animals or virtual beings · CPC title
Non-supervised learning, e.g. competitive learning · CPC title
Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound · CPC title
Ensemble learning · CPC title
using kernel methods, e.g. support vector machines [SVM] · CPC title