US11705105B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11705105-B2 |
| Application number | US-201916500021-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 15, 2019 |
| Priority date | May 15, 2019 |
| Publication date | Jul 18, 2023 |
| Grant date | Jul 18, 2023 |
Official abstract text for this publication.
A speech synthesizer for evaluating quality of a synthesized speech using artificial intelligence includes a database configured to store a synthesized speech corresponding to text, a correct speech corresponding to the text and a speech quality evaluation model for evaluating the quality of the synthesized speech, and a processor configured to compare a first speech feature set indicating a feature of the synthesized speech and a second speech feature set indicating a feature of the correct speech, acquire a quality evaluation index set including indices used to evaluate the quality of the synthesized speech according to a result of comparison, and determine weights as model parameters of the speech quality evaluation model using the acquired quality evaluation index set and the speech quality evaluation model.
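The comparison step in the abstract can be illustrated with three of the pitch-related indices named in the claims (GPE, VDE, FFE). The sketch below uses the definitions commonly given for these metrics in the pitch-tracking literature; the text on this page does not spell them out, so the 20% tolerance, the convention that an F0 of 0 marks an unvoiced frame, and the function name `pitch_errors` are all assumptions made for illustration.

```python
def pitch_errors(ref_f0, syn_f0, tol=0.2):
    """Compare two frame-aligned F0 contours (Hz, 0 = unvoiced frame).

    Assumed definitions (not taken from the patent text):
      GPE: share of frames voiced in both contours whose F0 differs
           from the reference by more than `tol` (relative).
      VDE: share of all frames where the voiced/unvoiced decision differs.
      FFE: share of all frames with either kind of error.
    """
    if len(ref_f0) != len(syn_f0) or len(ref_f0) == 0:
        raise ValueError("contours must be non-empty and frame-aligned")

    total = len(ref_f0)
    both_voiced = gross = voicing = 0
    for ref, syn in zip(ref_f0, syn_f0):
        ref_voiced, syn_voiced = ref > 0, syn > 0
        if ref_voiced != syn_voiced:
            voicing += 1                      # voicing decision mismatch
        elif ref_voiced:                      # voiced in both contours
            both_voiced += 1
            if abs(syn - ref) > tol * ref:
                gross += 1                    # gross pitch deviation

    return {
        "GPE": gross / both_voiced if both_voiced else 0.0,
        "VDE": voicing / total,
        "FFE": (gross + voicing) / total,
    }
```

Here `ref_f0` would come from the correct speech and `syn_f0` from the synthesized speech. A real system would also need a pitch tracker and time alignment between the two recordings, which this sketch leaves out.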
Opening claim text (preview).
The invention claimed is:
1. A speech synthesizer for evaluating quality of a synthesized speech using artificial intelligence, the speech synthesizer comprising: a data base configured to store a synthesized speech corresponding to text, a correct speech corresponding to the text and a speech quality evaluation model for evaluating the quality of the synthesized speech; and a processor configured to: compare a first speech feature set indicating a feature of the synthesized speech and a second speech feature set indicating a feature of the correct speech, wherein each of the first speech feature set and the second speech feature set includes a pitch of voiceless sound of a speech, a pitch of voiced sound of the speech, a frequency band of the speech, a break index of a word configuring the speech, a pitch of the speech, an utterance speed of the speech, or a pitch contour of the speech, acquire a quality evaluation index set including indices used to evaluate the quality of the synthesized speech according to a result of the comparing, wherein the quality evaluation index set includes an F0 Frame Error (FFE), a Gross Pitch Error (GPE), a Voicing Decision Error (VDE), a Mel Cepstral Distortion (MCD), a Formant Distance (FD), a Speaker Verification Error (SVE), a Break Index Error (BIE) and a Word Error (WE), and determine weights as model parameters of the speech quality evaluation model using the acquired quality evaluation index set and the speech quality evaluation model, wherein the processor differently determines the weights according to a synthesis purpose of the synthesized speech and updates the speech quality evaluation model based on the weights to generate an updated speech quality evaluation model, wherein a weight of the GPE and a weight of the FD are set to be learned to have greater values than weights of other quality evaluation indexes when the synthesis purpose is a normal synthesis for maintaining a tone, wherein a weight of the VDE and a weight of the FD are set to be learned to have greater values than weights of other quality evaluation indexes when the synthesis purpose is an emotional synthesis for outputting an emotional synthesis speech, wherein a weight of the FFE and a weight of the MCD are set to be learned to have greater values than weights of other quality evaluation indexes when the synthesis purpose is a personalization synthesis for outputting the synthesized speech suiting a tone of a specific speaker, and wherein the updated speech quality evaluation model is applied to recognize a wake-up word for activating speech recognition or to generate the synthesized speech from the text.
2. The speech synthesizer according to claim 1, wherein the speech quality evaluation model is an artificial neural network based model learned using a machine learning algorithm or a deep learning algorithm.
3. The speech synthesizer according to claim 2, wherein the speech quality evaluation model is a model supervised-learned using the quality evaluation index set and user's satisfaction labeled with the quality evaluation index set.
4. The speech synthesizer according to claim 3, wherein the processor extracts an input feature vector from the quality evaluation index set, inputs the extracted input feature vector to the speech quality evaluation model, and learns the speech quality evaluation model to minimize a cost function corresponding to a difference between output user's satisfaction and the labeled user's satisfaction when a result of inferring the labeled user's satisfaction is output as a target feature vector.
5. The speech synthesizer according to claim 1, wherein, when a new synthesized speech is input to the speech quality evaluation model, the processor outputs user's satisfaction using a determined weight set and evaluates a quality level of the synthesized speech based on the output user's satisfaction.
6. A method of operating a speech synthesizer for evaluating quality of a synthesized speech using artificial intelligence, the method comprising: comparing, by a processor in the speech synthesizer, a first speech feature set indicating a feature of a synthesized speech stored in a database and a second speech feature set indicating a feature of a correct speech stored in the database, wherein each of the first speech feature set and the second speech feature set includes a pitch of voiceless sound of a speech, a pitch of voiced sound of the speech, a frequency band of the speech, a break index of a word configuring the speech, a pitch of the speech, an utterance speed of the speech or a pitch contour of the speech; acquiring, by the processor, a quality evaluation index set including indices used to evaluate the quality of the synthesized speech according to a result of the comparing, wherein the quality evaluation index set includes an F0 Frame Error (FFE), a Gross Pitch Error (GPE), a Voicing Decision Error (VDE), a Mel Cepstral Distortion (MCD), a Formant Distance (FD), a Speaker Verification Error (SVE), a Break Index Error (BIE) and a Word Error (WE); and determining, by the processor, weights as model parameters of a speech quality evaluation model using the acquired quality evaluation index set and the speech quality evaluation model, wherein the weights are differently determined, by the processor, according to a synthesis purpose of the synthesized speech and the processor updates the speech quality evaluation model based on the weights to generate an updated speech quality evaluation model, wherein a weight of the GPE and a weight of the FD are set to be learned to have greater values than weights of other quality evaluation indexes when the synthesis purpose is a normal synthesis for maintaining a tone, wherein a weight of the VDE and a weight of the FD are set to be learned to have greater values than weights of other quality evaluation indexes when the synthesis purpose is an emotional synthesis for outputting an emotional synthesis speech, wherein a weight of the FFE and a weight of the MCD are set to be learned to have greater values than weights of other quality evaluation indexes when the synthesis purpose is a personalization synthesis for outputting the synthesized speech suiting a tone of a specific speaker, and wherein the updated speech quality evaluation model is applied to recognize a wake-up word for activating speech recognition or to generate the synthesized speech from text.
7. The method according to claim 6, wherein the speech quality evaluation model is an artificial neural network based model learned using a machine learning algorithm or a deep learning algorithm, and wherein the speech quality evaluation model is a model supervised-learned using the quality evaluation index set and user's satisfaction labeled with the quality evaluation index set.
8. The method according to claim 7, further comprising: extracting an input feature vector from the quality evaluation index set; inputting the extracted input feature vector to the speech quality evaluation model; outputting a result of inferring the labeled user's satisfaction as a target feature vector; and learning the speech quality evaluation model to minimize a cost function corresponding to a difference between output user's satisfaction and the labeled user's satisfaction.
9. The method according to claim 6, further comprising, when a new synthesized speech is input to the speech quality evaluation model, outputting user's satisfaction using a determined weight set and evaluating a quality level of the synthesized speech based on the output user's satisfaction.
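The supervised-learning steps in claims 3, 4, 8 and 9 (feed the index set in, compare predicted with labeled user satisfaction, minimize a cost) can be sketched with a single linear layer trained by gradient descent on mean squared error. This is a deliberately simplified stand-in for the artificial neural network the claims mention, not the patented model. The helper names, the toy data generator, the learning rate, and the choice to compare weights by absolute magnitude are assumptions for illustration. Only the index names and the purpose-to-index pairing come from the claims.

```python
import random

# Order of the eight indices listed in claims 1 and 6.
INDEX_NAMES = ["FFE", "GPE", "VDE", "MCD", "FD", "SVE", "BIE", "WE"]

# Index pairs that the claims say should carry the largest weights per purpose.
EMPHASIS = {
    "normal": ("GPE", "FD"),
    "emotional": ("VDE", "FD"),
    "personalization": ("FFE", "MCD"),
}

def predict(w, b, x):
    """Predicted user satisfaction for one index vector x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_weights(samples, labels, lr=0.1, epochs=2000):
    """Fit satisfaction ~ w.x + b by batch gradient descent on MSE."""
    n, d = len(samples), len(samples[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(samples, labels):
            err = predict(w, b, x) - y        # predicted minus labeled satisfaction
            for j in range(d):
                grad_w[j] += 2.0 * err * x[j] / n
            grad_b += 2.0 * err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def emphasis_holds(w, purpose):
    """True if the purpose's two indices have the largest |weight| (assumed reading)."""
    top = set(EMPHASIS[purpose])
    mags = {name: abs(wi) for name, wi in zip(INDEX_NAMES, w)}
    floor = min(mags[name] for name in top)
    return all(mags[name] < floor for name in INDEX_NAMES if name not in top)

def make_toy_data(true_w, true_b, n=100, seed=0):
    """Hypothetical noise-free training set; stands in for labeled user ratings."""
    rng = random.Random(seed)
    samples = [[rng.random() for _ in true_w] for _ in range(n)]
    labels = [predict(true_w, true_b, x) for x in samples]
    return samples, labels
```

A possible use is to train one weight set per synthesis purpose on ratings collected for that purpose, then call `emphasis_holds(w, "normal")` to check whether GPE and FD dominate as the claims describe. How the claimed system actually steers those weights during learning is not stated in the text shown here.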
Feedforward networks · CPC title
Supervised learning · CPC title
Methods for producing synthetic speech; Speech synthesisers · CPC title
Learning methods · CPC title
Voice editing, e.g. manipulating the voice of the synthesiser · CPC title