US11308671B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11308671-B2 |
| Application number | US-201916721772-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 19, 2019 |
| Priority date | Jun 28, 2019 |
| Publication date | Apr 19, 2022 |
| Grant date | Apr 19, 2022 |
Official abstract text for this publication.
Embodiments of the present disclosure relate to a method and apparatus for controlling mouth shape changes of a three-dimensional virtual portrait, relating to the field of cloud computing. The method may include: acquiring a to-be-played speech; sliding a preset time window at a preset step length in the to-be-played speech to obtain at least one speech segment; generating, based on the at least one speech segment, a mouth shape control parameter sequence for the to-be-played speech; and controlling, in response to playing the to-be-played speech, a preset mouth shape of the three-dimensional virtual portrait to change based on the mouth shape control parameter sequence.
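The sliding-window step in the abstract can be pictured with a minimal sketch. The function name `slide_window`, the sample-count units, and the choice to drop a trailing partial window are all assumptions made for illustration; the abstract only states that a preset time window is moved at a preset step length.

```python
from typing import List, Sequence


def slide_window(samples: Sequence[float], window: int, step: int) -> List[Sequence[float]]:
    """Cut `samples` into segments of length `window`, advancing by `step`.

    `window` and `step` are given in samples here (an assumption; the
    abstract speaks of a time window and a step length without units).
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    segments = []
    # Only full windows are kept; handling of a shorter tail is not
    # specified in the abstract, so dropping it is an illustrative choice.
    for start in range(0, len(samples) - window + 1, step):
        segments.append(samples[start:start + window])
    return segments


speech = list(range(10))  # stand-in for decoded audio samples
segments = slide_window(speech, window=4, step=2)
print(segments)
```

With a step smaller than the window, consecutive segments overlap, which is consistent with claim 8's statement that part of the first speech segment is identical to part of the second.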
Opening claim text (preview).
What is claimed is:

1. A method for controlling mouth shape changes of a three-dimensional virtual portrait, comprising: acquiring a to-be-played speech; sliding a preset time window at a preset step length in the to-be-played speech to obtain at least one speech segment; generating, based on the at least one speech segment, a mouth shape control parameter sequence for the to-be-played speech; and controlling, in response to playing the to-be-played speech, a preset mouth shape of the three-dimensional virtual portrait to change based on the mouth shape control parameter sequence, wherein the generating, based on the at least one speech segment, the mouth shape control parameter sequence for the to-be-played speech comprises: generating, for a speech segment of the at least one speech segment, a phoneme information sequence of the speech segment; inputting the phoneme information sequence composed of a plurality of pieces of phoneme information into a pre-established mouth shape key point predicting model to obtain a mouth shape key point information sequence composed of a plurality of pieces of mouth shape key point information, wherein the pre-established mouth shape key point predicting model is used to characterize a corresponding relationship between the phoneme information sequence and the mouth shape key point information sequence, wherein the mouth shape key point information indicates position information of a preset number of face key points related to a mouth shape, wherein inputting the phoneme information sequence composed of the plurality of pieces of phoneme information into the pre-established mouth shape key point predicting model to obtain the mouth shape key point information sequence composed of the plurality of pieces of mouth shape key point information comprises: outputting, by the pre-established mouth shape key point predicting model, a first piece of mouth shape key point information by using a first piece of phoneme information as a first input, and outputting, by the pre-established mouth shape key point predicting model, a second piece of mouth shape key point information by using a second piece of phoneme information and the first piece of mouth shape key point information as a second input; and generating, based on the mouth shape key point information sequence, the mouth shape control parameter sequence.

2. The method according to claim 1, wherein the generating, based on the at least one speech segment, the mouth shape control parameter sequence for the to-be-played speech comprises: generating, based on the at least one speech segment, a two-dimensional feature matrix sequence, and inputting the two-dimensional feature matrix sequence into a pre-established convolutional neural network to obtain the mouth shape control parameter sequence, wherein the pre-established convolutional neural network is used to characterize corresponding relationships between two-dimensional feature matrices and mouth shape control parameters, wherein the generating, based on the at least one speech segment, the two-dimensional feature matrix sequence comprises: generating, for the speech segment of the at least one speech segment, at least one two-dimensional feature matrix for the speech segment; and splicing, based on an order of the at least one speech segment in the to-be-played speech, the generated at least one two-dimensional feature matrix into the two-dimensional feature matrix sequence.

3. The method according to claim 2, wherein the generating, for the speech segment of the at least one speech segment, the at least one two-dimensional feature matrix for the speech segment comprises: dividing the speech segment into a preset number of speech sub-segments, wherein two adjacent speech sub-segments partially overlap; extracting, for a speech sub-segment in the preset number of speech sub-segments, a feature of the speech sub-segment to obtain a speech feature vector for the speech sub-segment; and generating, based on obtained preset number of speech feature vectors, the at least one two-dimensional feature matrix for the speech segment.

4. The method according to claim 1, wherein the generating, based on the mouth shape key point information sequence, the mouth shape control parameter sequence comprises: obtaining, for mouth shape key point information in the mouth shape key point information sequence, at least one mouth shape control parameter corresponding to the mouth shape key point information based on a pre-established corresponding relationship between sample mouth shape key point information and a sample mouth shape control parameter; and generating the mouth shape control parameter sequence based on the obtained at least one mouth shape control parameter.

5. The method according to claim 1, wherein the pre-established mouth shape key point predicting model is a recurrent neural network, and a loop body of the recurrent neural network is a long short-term memory.

6. The method according to claim 1, wherein the pre-established mouth shape key point predicting model is a table storing a plurality of corresponding relationship between phoneme information sequences and mouth shape key point information sequences, wherein the table is determined based on statistics of a large number of the phoneme information sequences and the mouth shape key point information sequences.

7. The method according to claim 1, wherein the pre-established mouth shape key point predicting model comprises a first sub-model and a second sub-model, wherein outputting, by the pre-established mouth shape key point predicting model, the first piece of mouth shape key point information by using the first piece of phoneme information as the first input, and outputting, by the pre-established mouth shape key point predicting model, the second piece of mouth shape key point information by using the second piece of phoneme information and the first piece of mouth shape key point information as the second input comprises: outputting, by the first sub-model, the first piece of mouth shape key point information by inputting the first piece of phoneme information into the first sub-model; and outputting, by the second sub-model, the second piece of mouth shape key point information by inputting the second piece of phoneme information and the first piece of mouth shape key point information into the second sub-model.

8. The method according to claim 1, wherein the first piece of phoneme information is generated from a first speech segment of the speech segment, and the second piece of phoneme information is generated from a second speech segment of the speech segment, wherein the first speech segment is acquired before a second speech segment is acquired, and a part of the first speech segment is identical to a part of the second speech segment.

9. An apparatus for controlling mouth shape changes of a three-dimensional virtual portrait, comprising: at least one processor; and a memory storing instructions, the instructions when executed by the at least one processor, cause the at least one processor to perform operations, the operations comprising: acquiring a to-be-played speech; sliding a preset time window at a preset step length in the to-be-played speech to obtain at least one speech segment; generating, based on the at least one speech segment, a mouth shape control parameter sequence for the to-be-played speech; and controlling, in response to playing the to-be-played speech, a preset mouth shape of the three-dimensional virtual portrait to change based on the mouth shape control parameter sequence, wherein the generating, based on the at least one speech segment, the mouth shape control parameter sequence for the to-be-
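
The chained prediction recited in claim 1, where the second piece of mouth shape key point information is produced from the second piece of phoneme information together with the first output, can be sketched roughly as follows. Everything here is hypothetical and for illustration only: `predict_key_points`, `toy_model`, the single-value key point representation, and the smoothing rule are assumptions, not the model described in the patent (claim 5 mentions a recurrent neural network with a long short-term memory loop body, claim 6 a lookup table).

```python
from typing import Callable, List, Optional, Sequence

KeyPoints = List[float]  # assumed representation of face key point positions
Model = Callable[[str, Optional[KeyPoints]], KeyPoints]


def predict_key_points(phonemes: Sequence[str], model: Model) -> List[KeyPoints]:
    """Feed each phoneme plus the previous output into `model`, in order."""
    outputs: List[KeyPoints] = []
    previous: Optional[KeyPoints] = None  # the first input has no prior output
    for phoneme in phonemes:
        current = model(phoneme, previous)
        outputs.append(current)
        previous = current  # becomes part of the next input
    return outputs


def toy_model(phoneme: str, previous: Optional[KeyPoints]) -> KeyPoints:
    """Hypothetical stand-in model: one 'mouth opening' value per phoneme,
    averaged with the previous output to mimic dependence on history."""
    target = {"a": 1.0, "m": 0.0}.get(phoneme, 0.5)
    if previous is None:
        return [target]
    return [0.5 * target + 0.5 * previous[0]]


sequence = predict_key_points(["m", "a", "a"], toy_model)
print(sequence)  # [[0.0], [0.5], [0.75]]
```

Passing the previous output forward is what lets each predicted mouth shape depend on the one before it, so a repeated phoneme does not necessarily yield an identical key point result.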