Speech-driven animation method and apparatus based on artificial intelligence
US-2022044463-A1 · Feb 10, 2022 · US
US12002138B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12002138-B2 |
| Application number | US-202117497622-A |
| Country | US |
| Kind code | B2 |
| Filing date | Oct 8, 2021 |
| Priority date | Aug 29, 2019 |
| Publication date | Jun 4, 2024 |
| Grant date | Jun 4, 2024 |
Official abstract text for this publication.
Embodiments of this application disclose a speech-driven animation method and apparatus based on artificial intelligence (AI). The method includes obtaining a first speech, the first speech comprising a plurality of speech frames; determining linguistics information corresponding to a speech frame in the first speech, the linguistics information being used for identifying a distribution possibility that the speech frame in the first speech pertains to phonemes; determining an expression parameter corresponding to the speech frame in the first speech according to the linguistics information; and enabling, according to the expression parameter, an animation character to make an expression corresponding to the first speech.
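The flow in the abstract (speech frames, then a per-frame distribution over phonemes, then expression parameters) can be illustrated with a minimal sketch. Everything named here is a hypothetical stand-in: the three-phoneme inventory, the `VISEME_TABLE` weights, and the per-frame scores are invented for illustration, and the fixed linear table replaces the trained neural network mapping model that the patent describes.

```python
import math
from typing import List

# Hypothetical phoneme inventory; a real system would use a full phoneme set.
PHONEMES = ["sil", "a", "m"]

# Hypothetical expression targets per phoneme: [mouth_open, lip_press].
VISEME_TABLE = [
    [0.0, 0.0],  # sil
    [1.0, 0.0],  # a
    [0.0, 1.0],  # m
]


def softmax(scores: List[float]) -> List[float]:
    """Turn raw per-phoneme scores into a probability distribution."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def expression_from_posterior(posterior: List[float]) -> List[float]:
    """Blend the per-phoneme targets, weighted by the phoneme distribution."""
    n_params = len(VISEME_TABLE[0])
    return [
        sum(p * row[k] for p, row in zip(posterior, VISEME_TABLE))
        for k in range(n_params)
    ]


def animate(frame_scores: List[List[float]]) -> List[List[float]]:
    """Map each frame's phoneme scores to one expression-parameter vector."""
    return [expression_from_posterior(softmax(s)) for s in frame_scores]
```

Because the expression parameters depend only on the phoneme distribution, two speakers who produce the same distribution yield the same parameters, which is one way to read the claim language about not reflecting speaker-specific pronunciation habits.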
Opening claim text (preview).
What is claimed is:
1. A speech-driven animation method, performed by an audio and video processing device, the method comprising: obtaining a first speech with an acoustic feature, the first speech comprising a plurality of speech frames; determining linguistics information corresponding to a speech frame in the first speech by applying a neural network mapping model to extract the acoustic feature, the linguistics information being used for identifying a distribution possibility that the speech frame in the first speech pertains to phonemes; determining an expression parameter corresponding to the speech frame in the first speech according to the linguistics information, wherein the expression parameters do not reflect pronunciation habits of different speakers; and enabling, according to the expression parameter, an animation character to make an expression corresponding to the first speech.
2. The method according to claim 1, wherein a target speech frame is a speech frame in the first speech, and the determining an expression parameter corresponding to the speech frame in the first speech according to the linguistics information comprises: determining a speech frame set in which the target speech frame is located, the speech frame set comprising the target speech frame and speech frames preceding and succeeding the target speech frame; and determining an expression parameter corresponding to the target speech frame according to linguistics information corresponding to each speech frame in the speech frame set.
3. The method according to claim 2, wherein a quantity of speech frames in the speech frame set is determined according to a neural network mapping model, or a quantity of speech frames in the speech frame set is determined according to a speech segmentation result of the first speech.
4. The method according to claim 2, wherein the speech frames preceding and succeeding the target speech frame are consecutive preceding and succeeding speech frames, or the speech frames preceding and succeeding the target speech frame are inconsecutive speech frames.
5. The method according to claim 2, wherein the determining an expression parameter corresponding to the target speech frame according to linguistics information corresponding to each speech frame in the speech frame set comprises: determining an undetermined expression parameter corresponding to each speech frame in the speech frame set according to the linguistics information corresponding to each speech frame in the speech frame set; and calculating the expression parameter corresponding to the target speech frame according to undetermined expression parameters of the target speech frame that are respectively determined in different speech frame sets.
6. The method according to claim 1, wherein the linguistics information comprises any one or a combination of two or more of a phonetic posterior gram (PPG), a bottleneck feature, and an embedding feature.
7. The method according to claim 1, wherein the determining an expression parameter corresponding to the speech frame in the first speech according to the linguistics information comprises: determining the expression parameter corresponding to the speech frame in the first speech according to the linguistics information by using the neural network mapping model, the neural network mapping model comprising a deep neural network (DNN) model, a long short-term memory (LSTM) model, or a bidirectional long short-term memory (BLSTM) model.
8. The method according to claim 1, wherein the determining an expression parameter corresponding to the speech frame in the first speech according to the linguistics information comprises: determining the expression parameter corresponding to the speech frame in the first speech according to the linguistics information and a sentiment vector corresponding to the first speech.
9. The method according to claim 1, wherein the determining linguistics information corresponding to a speech frame in the first speech comprises: determining an acoustic feature corresponding to the speech frame in the first speech; and determining linguistics information corresponding to the acoustic feature by using an automatic speech recognition (ASR) model.
10. The method according to claim 9, wherein the ASR model is obtained through training according to training samples that comprise correspondences between speech segments and phonemes.
11. A speech-driven animation apparatus, deployed on an audio and video processing device, the apparatus comprising a processor and a memory, the memory being configured to store program code and transmit the program code to the processor; and when executing the program code, the processor being configured to: obtain a first speech with an acoustic feature, the first speech comprising a plurality of speech frames; determine linguistics information corresponding to a speech frame in the first speech by applying a neural network mapping model to extract the acoustic feature, the linguistics information being used for identifying a distribution possibility that the speech frame in the first speech pertains to phonemes; determine an expression parameter corresponding to the speech frame in the first speech according to the linguistics information, wherein the expression parameters do not reflect pronunciation habits of different speakers; and enable according to the expression parameter, an animation character to make an expression corresponding to the first speech.
12. The apparatus according to claim 11, wherein a target speech frame is a speech frame in the first speech, and for the target speech frame, the processor is further configured to: determine a speech frame set in which the target speech frame is located, the speech frame set comprising the target speech frame and preceding and succeeding speech frames of the target speech frame; and determine an expression parameter corresponding to the target speech frame according to linguistics information corresponding to each speech frame in the speech frame set.
13. The method according to claim 1, wherein the expression includes a mouth shape, a facial action, or a head posture.
14. A non-transitory computer-readable storage medium, configured to store program code, the program code, when being executed by a processor, causing the processor to perform: obtaining a first speech with an acoustic feature, the first speech comprising a plurality of speech frames; determining linguistics information corresponding to a speech frame in the first speech by applying a neural network mapping model to extract the acoustic feature, the linguistics information being used for identifying a distribution possibility that the speech frame in the first speech pertains to phonemes; determining an expression parameter corresponding to the speech frame in the first speech according to the linguistics information, wherein the expression parameters do not reflect pronunciation habits of different speakers; and enabling, according to the expression parameter, an animation character to make an expression corresponding to the first speech.
15. The computer-readable storage medium according to claim 14, wherein a target speech frame is a speech frame in the first speech, and the determining an expression parameter corresponding to the speech frame in the first speech according to the linguistics information comprises: determining a speech frame set in which the target speech frame is located, the speech frame set comprising the target speech frame and the speech frames preceding and succeeding the target spe
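Claims 2 and 5 describe placing each target frame in a set with its neighbouring frames, predicting an undetermined expression parameter for every frame in each set, and then combining the values that different sets produce for the same frame. The sketch below is one possible reading, not the patented implementation: the function names, the symmetric `context` window, the scalar parameter, and the use of a plain average as the combining step are all assumptions made for illustration.

```python
from typing import Callable, Dict, List


def frame_sets(num_frames: int, context: int) -> List[List[int]]:
    """For each target frame, list the indices of that frame plus up to
    `context` consecutive frames on each side (clipped at the edges)."""
    return [
        list(range(max(0, t - context), min(num_frames, t + context + 1)))
        for t in range(num_frames)
    ]


def average_over_sets(
    sets: List[List[int]],
    predict: Callable[[List[int]], List[float]],
) -> List[float]:
    """Average the undetermined parameters each frame receives across sets.

    `predict` is a hypothetical model call: given the frame indices of one
    set, it returns one undetermined parameter per frame in that set.
    """
    sums: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    for frame_set in sets:
        for idx, value in zip(frame_set, predict(frame_set)):
            sums[idx] = sums.get(idx, 0.0) + value
            counts[idx] = counts.get(idx, 0) + 1
    return [sums[i] / counts[i] for i in sorted(sums)]
```

A frame near the middle of the utterance appears in more sets than a frame at either end, so its final value draws on more overlapping predictions, which tends to smooth transitions between neighbouring frames.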
Feedforward networks · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Supervised learning · CPC title
driven by audio data · CPC title
Combinations of networks · CPC title