Speech-driven animation method and apparatus based on artificial intelligence

US12002138B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12002138-B2
Application number  US-202117497622-A
Country             US
Kind code           B2
Filing date         Oct 8, 2021
Priority date       Aug 29, 2019
Publication date    Jun 4, 2024
Grant date          Jun 4, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments of this application disclose a speech-driven animation method and apparatus based on artificial intelligence (AI). The method includes obtaining a first speech, the first speech comprising a plurality of speech frames; determining linguistics information corresponding to a speech frame in the first speech, the linguistics information being used for identifying a distribution possibility that the speech frame in the first speech pertains to phonemes; determining an expression parameter corresponding to the speech frame in the first speech according to the linguistics information; and enabling, according to the expression parameter, an animation character to make an expression corresponding to the first speech.
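
The four steps in the abstract can be illustrated with a small, hedged sketch. It is not the patented implementation: the toy phoneme inventory, the per-frame scores (standing in for the output of an acoustic model), and the `MOUTH_OPEN` lookup are all hypothetical, and a fixed weighted lookup is swapped in where the patent describes a neural network mapping model. The sketch only shows how a per-frame distribution over phonemes could be turned into one expression parameter per frame.

```python
import math
from typing import List, Sequence

# Hypothetical toy phoneme inventory (assumption, not from the patent).
PHONEMES = ["sil", "a", "m"]

# Hypothetical mouth-openness value per phoneme, in [0, 1] (assumption).
MOUTH_OPEN = {"sil": 0.0, "a": 1.0, "m": 0.1}


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw per-phoneme scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linguistics_info(frame_scores: Sequence[float]) -> List[float]:
    """Distribution of how likely one speech frame pertains to each phoneme.

    The scores are assumed to come from some acoustic model; here they are
    simply passed in by the caller.
    """
    return softmax(frame_scores)


def expression_parameter(posterior: Sequence[float]) -> float:
    """Map a phoneme distribution to a single mouth-openness value.

    A probability-weighted lookup stands in for the learned mapping model.
    """
    return sum(p * MOUTH_OPEN[ph] for p, ph in zip(posterior, PHONEMES))


def drive_animation(speech_scores: Sequence[Sequence[float]]) -> List[float]:
    """Return one expression parameter per speech frame."""
    return [expression_parameter(linguistics_info(f)) for f in speech_scores]
```

Because the expression parameter is computed from the phoneme distribution rather than from the raw audio, two speakers whose frames yield the same distribution would drive the character identically in this sketch.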

First claim

Opening claim text (preview).

What is claimed is:

1. A speech-driven animation method, performed by an audio and video processing device, the method comprising: obtaining a first speech with an acoustic feature, the first speech comprising a plurality of speech frames; determining linguistics information corresponding to a speech frame in the first speech by applying a neural network mapping model to extract the acoustic feature, the linguistics information being used for identifying a distribution possibility that the speech frame in the first speech pertains to phonemes; determining an expression parameter corresponding to the speech frame in the first speech according to the linguistics information, wherein the expression parameters do not reflect pronunciation habits of different speakers; and enabling, according to the expression parameter, an animation character to make an expression corresponding to the first speech.

2. The method according to claim 1, wherein a target speech frame is a speech frame in the first speech, and the determining an expression parameter corresponding to the speech frame in the first speech according to the linguistics information comprises: determining a speech frame set in which the target speech frame is located, the speech frame set comprising the target speech frame and speech frames preceding and succeeding the target speech frame; and determining an expression parameter corresponding to the target speech frame according to linguistics information corresponding to each speech frame in the speech frame set.

3. The method according to claim 2, wherein a quantity of speech frames in the speech frame set is determined according to a neural network mapping model, or a quantity of speech frames in the speech frame set is determined according to a speech segmentation result of the first speech.

4. The method according to claim 2, wherein the speech frames preceding and succeeding the target speech frame are consecutive preceding and succeeding speech frames, or the speech frames preceding and succeeding the target speech frame are inconsecutive speech frames.

5. The method according to claim 2, wherein the determining an expression parameter corresponding to the target speech frame according to linguistics information corresponding to each speech frame in the speech frame set comprises: determining an undetermined expression parameter corresponding to each speech frame in the speech frame set according to the linguistics information corresponding to each speech frame in the speech frame set; and calculating the expression parameter corresponding to the target speech frame according to undetermined expression parameters of the target speech frame that are respectively determined in different speech frame sets.

6. The method according to claim 1, wherein the linguistics information comprises any one or a combination of two or more of a phonetic posterior gram (PPG), a bottleneck feature, and an embedding feature.

7. The method according to claim 1, wherein the determining an expression parameter corresponding to the speech frame in the first speech according to the linguistics information comprises: determining the expression parameter corresponding to the speech frame in the first speech according to the linguistics information by using the neural network mapping model, the neural network mapping model comprising a deep neural network (DNN) model, a long short-term memory (LSTM) model, or a bidirectional long short-term memory (BLSTM) model.

8. The method according to claim 1, wherein the determining an expression parameter corresponding to the speech frame in the first speech according to the linguistics information comprises: determining the expression parameter corresponding to the speech frame in the first speech according to the linguistics information and a sentiment vector corresponding to the first speech.

9. The method according to claim 1, wherein the determining linguistics information corresponding to a speech frame in the first speech comprises: determining an acoustic feature corresponding to the speech frame in the first speech; and determining linguistics information corresponding to the acoustic feature by using an automatic speech recognition (ASR) model.

10. The method according to claim 9, wherein the ASR model is obtained through training according to training samples that comprise correspondences between speech segments and phonemes.

11. A speech-driven animation apparatus, deployed on an audio and video processing device, the apparatus comprising a processor and a memory, the memory being configured to store program code and transmit the program code to the processor; and when executing the program code, the processor being configured to: obtain a first speech with an acoustic feature, the first speech comprising a plurality of speech frames; determine linguistics information corresponding to a speech frame in the first speech by applying a neural network mapping model to extract the acoustic feature, the linguistics information being used for identifying a distribution possibility that the speech frame in the first speech pertains to phonemes; determine an expression parameter corresponding to the speech frame in the first speech according to the linguistics information, wherein the expression parameters do not reflect pronunciation habits of different speakers; and enable according to the expression parameter, an animation character to make an expression corresponding to the first speech.

12. The apparatus according to claim 11, wherein a target speech frame is a speech frame in the first speech, and for the target speech frame, the processor is further configured to: determine a speech frame set in which the target speech frame is located, the speech frame set comprising the target speech frame and preceding and succeeding speech frames of the target speech frame; and determine an expression parameter corresponding to the target speech frame according to linguistics information corresponding to each speech frame in the speech frame set.

13. The method according to claim 1, wherein the expression includes a mouth shape, a facial action, or a head posture.

14. A non-transitory computer-readable storage medium, configured to store program code, the program code, when being executed by a processor, causing the processor to perform: obtaining a first speech with an acoustic feature, the first speech comprising a plurality of speech frames; determining linguistics information corresponding to a speech frame in the first speech by applying a neural network mapping model to extract the acoustic feature, the linguistics information being used for identifying a distribution possibility that the speech frame in the first speech pertains to phonemes; determining an expression parameter corresponding to the speech frame in the first speech according to the linguistics information, wherein the expression parameters do not reflect pronunciation habits of different speakers; and enabling, according to the expression parameter, an animation character to make an expression corresponding to the first speech.

15. The computer-readable storage medium according to claim 14, wherein a target speech frame is a speech frame in the first speech, and the determining an expression parameter corresponding to the speech frame in the first speech according to the linguistics information comprises: determining a speech frame set in which the target speech frame is located, the speech frame set comprising the target speech frame and the speech frames preceding and succeeding the target spe…
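
Claims 2 and 5 describe placing each target frame in a set with its neighbouring frames, producing an undetermined expression parameter for every frame in each set, and then combining the values a frame receives from the different sets it belongs to. The sketch below is one possible reading under stated assumptions: the sets are consecutive windows of fixed half-width `context`, the combination is a plain average (the claims do not name a specific calculation), and `window_model` is a hypothetical stand-in for the mapping model. The function names are illustrative and do not come from the patent.

```python
from typing import Callable, List, Sequence

Posterior = Sequence[float]
# Hypothetical model type: takes the linguistics information of every frame in
# one speech frame set and returns one undetermined parameter per frame.
WindowModel = Callable[[List[Posterior]], List[float]]


def frame_sets(n_frames: int, context: int) -> List[List[int]]:
    """Indices of the frame set centred on each target frame.

    Each set holds the target frame plus up to `context` preceding and
    succeeding frames, clipped at the start and end of the speech.
    """
    return [
        list(range(max(0, t - context), min(n_frames, t + context + 1)))
        for t in range(n_frames)
    ]


def combine_over_frame_sets(
    posteriors: Sequence[Posterior],
    window_model: WindowModel,
    context: int = 1,
) -> List[float]:
    """Average each frame's undetermined parameters over all sets containing it.

    Averaging is an assumption made for this sketch.
    """
    n = len(posteriors)
    sums = [0.0] * n
    counts = [0] * n
    for indices in frame_sets(n, context):
        window = [posteriors[i] for i in indices]
        undetermined = window_model(window)
        for i, value in zip(indices, undetermined):
            sums[i] += value
            counts[i] += 1
    # Every frame is in at least its own set, so counts are never zero.
    return [s / c for s, c in zip(sums, counts)]
```

With `context=1`, an interior frame appears in three sets, so its final value blends three separate estimates, which tends to smooth frame-to-frame jitter in the animation.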

Assignees

Tencent Tech Shenzhen Co Ltd

Inventors

Classifications

  • Feedforward networks · CPC title

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title

  • Supervised learning · CPC title

  • G06T13/205 · Primary

    driven by audio data · CPC title

  • Combinations of networks · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12002138B2 cover?
Embodiments of this application disclose a speech-driven animation method and apparatus based on artificial intelligence (AI). The method includes obtaining a first speech, the first speech comprising a plurality of speech frames; determining linguistics information corresponding to a speech frame in the first speech, the linguistics information being used for identifying a distribution possibi…
Who is the assignee on this patent?
Tencent Tech Shenzhen Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06T13/205. Mapped technology areas include Physics.
When was this patent published?
Publication date Jun 4, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).