Speech synthesizer for evaluating quality of synthesized speech using artificial intelligence and method of operating the same

US11705105B2 · US · B2

Patent metadata
Publication number: US-11705105-B2
Application number: US-201916500021-A
Country: US
Kind code: B2
Filing date: May 15, 2019
Priority date: May 15, 2019
Publication date: Jul 18, 2023
Grant date: Jul 18, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A speech synthesizer for evaluating quality of a synthesized speech using artificial intelligence includes a database configured to store a synthesized speech corresponding to text, a correct speech corresponding to the text and a speech quality evaluation model for evaluating the quality of the synthesized speech, and a processor configured to compare a first speech feature set indicating a feature of the synthesized speech and a second speech feature set indicating a feature of the correct speech, acquire a quality evaluation index set including indices used to evaluate the quality of the synthesized speech according to a result of comparison, and determine weights as model parameters of the speech quality evaluation model using the acquired quality evaluation index set and the speech quality evaluation model.
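
The comparison step in the abstract can be illustrated with the three pitch-related indices named in the claims (GPE, VDE, FFE). The sketch below uses the commonly cited frame-level definitions of these measures, which this page does not spell out, so they are an assumption here. The function name `pitch_error_indices`, the 20% tolerance, the convention that 0.0 marks an unvoiced frame, and the sample F0 tracks are all hypothetical.

```python
from typing import Dict, Sequence


def pitch_error_indices(
    ref_f0: Sequence[float], syn_f0: Sequence[float], tol: float = 0.2
) -> Dict[str, float]:
    """Frame-level GPE, VDE and FFE between two time-aligned F0 tracks.

    Assumption: a value of 0.0 marks an unvoiced frame, and a "gross" pitch
    error is a deviation of more than `tol` (20%) from the reference pitch.
    """
    if len(ref_f0) != len(syn_f0) or not ref_f0:
        raise ValueError("F0 tracks must be non-empty and time-aligned")

    n = len(ref_f0)
    both_voiced = gross = voicing_mismatch = 0
    for r, s in zip(ref_f0, syn_f0):
        r_voiced, s_voiced = r > 0, s > 0
        if r_voiced != s_voiced:
            # One track says voiced, the other says unvoiced.
            voicing_mismatch += 1
        elif r_voiced:
            both_voiced += 1
            if abs(s - r) > tol * r:
                gross += 1

    return {
        # Share of jointly voiced frames with a large pitch deviation.
        "GPE": gross / both_voiced if both_voiced else 0.0,
        # Share of all frames with a wrong voiced/unvoiced decision.
        "VDE": voicing_mismatch / n,
        # Share of all frames with either kind of error.
        "FFE": (gross + voicing_mismatch) / n,
    }


# Hypothetical F0 tracks in Hz for the correct and the synthesized speech.
correct = [0.0, 100.0, 110.0, 120.0, 0.0, 130.0]
synth = [0.0, 102.0, 150.0, 0.0, 90.0, 131.0]
indices = pitch_error_indices(correct, synth)
print(indices)
```

In this toy input, one of the three jointly voiced frames deviates by more than 20% and two of the six frames disagree on voicing, so half of all frames carry some error.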

First claim

Opening claim text (preview).

The invention claimed is:

1. A speech synthesizer for evaluating quality of a synthesized speech using artificial intelligence, the speech synthesizer comprising: a data base configured to store a synthesized speech corresponding to text, a correct speech corresponding to the text and a speech quality evaluation model for evaluating the quality of the synthesized speech; and a processor configured to: compare a first speech feature set indicating a feature of the synthesized speech and a second speech feature set indicating a feature of the correct speech, wherein each of the first speech feature set and the second speech feature set includes a pitch of voiceless sound of a speech, a pitch of voiced sound of the speech, a frequency band of the speech, a break index of a word configuring the speech, a pitch of the speech, an utterance speed of the speech, or a pitch contour of the speech, acquire a quality evaluation index set including indices used to evaluate the quality of the synthesized speech according to a result of the comparing, wherein the quality evaluation index set includes an F0 Frame Error (FFE), a Gross Pitch Error (GPE), a Voicing Decision Error (VDE), a Mel Cepstral Distortion (MCD), a Formant Distance (FD), a Speaker Verification Error (SVE), a Break Index Error (BIE) and a Word Error (WE), and determine weights as model parameters of the speech quality evaluation model using the acquired quality evaluation index set and the speech quality evaluation model, wherein the processor differently determines the weights according to a synthesis purpose of the synthesized speech and updates the speech quality evaluation model based on the weights to generate an updated speech quality evaluation model, wherein a weight of the GPE and a weight of the FD are set to be learned to have greater values than weights of other quality evaluation indexes when the synthesis purpose is a normal synthesis for maintaining a tone, wherein a weight of the VDE and a weight of the FD are set to be learned to have greater values than weights of other quality evaluation indexes when the synthesis purpose is an emotional synthesis for outputting an emotional synthesis speech, wherein a weight of the FFE and a weight of the MCD are set to be learned to have greater values than weights of other quality evaluation indexes when the synthesis purpose is a personalization synthesis for outputting the synthesized speech suiting a tone of a specific speaker, and wherein the updated speech quality evaluation model is applied to recognize a wake-up word for activating speech recognition or to generate the synthesized speech from the text.

2. The speech synthesizer according to claim 1, wherein the speech quality evaluation model is an artificial neural network based model learned using a machine learning algorithm or a deep learning algorithm.

3. The speech synthesizer according to claim 2, wherein the speech quality evaluation model is a model supervised-learned using the quality evaluation index set and user's satisfaction labeled with the quality evaluation index set.

4. The speech synthesizer according to claim 3, wherein the processor extracts an input feature vector from the quality evaluation index set, inputs the extracted input feature vector to the speech quality evaluation model, and learns the speech quality evaluation model to minimize a cost function corresponding to a difference between output user's satisfaction and the labeled user's satisfaction when a result of inferring the labeled user's satisfaction is output as a target feature vector.

5. The speech synthesizer according to claim 1, wherein, when a new synthesized speech is input to the speech quality evaluation model, the processor outputs user's satisfaction using a determined weight set and evaluates a quality level of the synthesized speech based on the output user's satisfaction.

6. A method of operating a speech synthesizer for evaluating quality of a synthesized speech using artificial intelligence, the method comprising: comparing, by a processor in the speech synthesizer, a first speech feature set indicating a feature of a synthesized speech stored in a database and a second speech feature set indicating a feature of a correct speech stored in the database, wherein each of the first speech feature set and the second speech feature set includes a pitch of voiceless sound of a speech, a pitch of voiced sound of the speech, a frequency band of the speech, a break index of a word configuring the speech, a pitch of the speech, an utterance speed of the speech or a pitch contour of the speech; acquiring, by the processor, a quality evaluation index set including indices used to evaluate the quality of the synthesized speech according to a result of the comparing, wherein the quality evaluation index set includes an F0 Frame Error (FFE), a Gross Pitch Error (GPE), a Voicing Decision Error (VDE), a Mel Cepstral Distortion (MCD), a Formant Distance (FD), a Speaker Verification Error (SVE), a Break Index Error (BIE) and a Word Error (WE); and determining, by the processor, weights as model parameters of a speech quality evaluation model using the acquired quality evaluation index set and the speech quality evaluation model, wherein the weights are differently determined, by the processor, according to a synthesis purpose of the synthesized speech and the processor updates the speech quality evaluation model based on the weights to generate an updated speech quality evaluation model, wherein a weight of the GPE and a weight of the FD are set to be learned to have greater values than weights of other quality evaluation indexes when the synthesis purpose is a normal synthesis for maintaining a tone, wherein a weight of the VDE and a weight of the FD are set to be learned to have greater values than weights of other quality evaluation indexes when the synthesis purpose is an emotional synthesis for outputting an emotional synthesis speech, wherein a weight of the FFE and a weight of the MCD are set to be learned to have greater values than weights of other quality evaluation indexes when the synthesis purpose is a personalization synthesis for outputting the synthesized speech suiting a tone of a specific speaker, and wherein the updated speech quality evaluation model is applied to recognize a wake-up word for activating speech recognition or to generate the synthesized speech from text.

7. The method according to claim 6, wherein the speech quality evaluation model is an artificial neural network based model learned using a machine learning algorithm or a deep learning algorithm, and wherein the speech quality evaluation model is a model supervised-learned using the quality evaluation index set and user's satisfaction labeled with the quality evaluation index set.

8. The method according to claim 7, further comprising: extracting an input feature vector from the quality evaluation index set; inputting the extracted input feature vector to the speech quality evaluation model; outputting a result of inferring the labeled user's satisfaction as a target feature vector; and learning the speech quality evaluation model to minimize a cost function corresponding to a difference between output user's satisfaction and the labeled user's satisfaction.

9. The method according to claim 9's parent claim 6, further comprising, when a new synthesized speech is input to the speech quality evaluation model, outputting user's satisfaction using a determined weight set and evaluating a quality level of the synthesized speech based on the output user's satisfaction.
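
The supervised learning described in claims 3, 4 and 8 (index set in, user satisfaction out, cost minimized against labeled satisfaction) can be sketched with a much simpler stand-in than the artificial neural network the claims name. The example below is a minimal sketch that assumes a linear model trained by full-batch gradient descent on mean squared error. The sample index values, the satisfaction labels, the learning rate and every function name are hypothetical. The `EMPHASIZED` table only restates the per-purpose index pairs from claims 1 and 6, and comparing weights by magnitude is an assumed reading of "greater values".

```python
from typing import List, Sequence, Tuple

INDEX_NAMES = ("FFE", "GPE", "VDE", "MCD", "FD", "SVE", "BIE", "WE")

# Index pairs the claims say should end up with larger weights, per purpose.
EMPHASIZED = {
    "normal": ("GPE", "FD"),
    "emotional": ("VDE", "FD"),
    "personalization": ("FFE", "MCD"),
}


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """Predicted user satisfaction for one quality evaluation index vector."""
    return bias + sum(w * v for w, v in zip(weights, x))


def mse(weights, bias, samples, labels) -> float:
    """Cost: mean squared gap between predicted and labeled satisfaction."""
    errors = [predict(weights, bias, x) - y for x, y in zip(samples, labels)]
    return sum(e * e for e in errors) / len(errors)


def train(samples, labels, lr: float = 0.05, epochs: int = 2000) -> Tuple[List[float], float]:
    """Full-batch gradient descent on the MSE cost, starting from zeros."""
    weights, bias, n = [0.0] * len(samples[0]), 0.0, len(samples)
    for _ in range(epochs):
        errors = [predict(weights, bias, x) - y for x, y in zip(samples, labels)]
        for j in range(len(weights)):
            grad = 2.0 / n * sum(e * x[j] for e, x in zip(errors, samples))
            weights[j] -= lr * grad
        bias -= lr * 2.0 / n * sum(errors)
    return weights, bias


def emphasis_holds(weights: Sequence[float], purpose: str) -> bool:
    """True if the emphasized indices have the largest weight magnitudes."""
    chosen = EMPHASIZED[purpose]
    magnitude = {name: abs(w) for name, w in zip(INDEX_NAMES, weights)}
    weakest_chosen = min(magnitude[name] for name in chosen)
    return all(weakest_chosen > m for name, m in magnitude.items() if name not in chosen)


# Hypothetical index vectors (scaled to [0, 1]) with labeled satisfaction.
samples = [
    [0.05, 0.02, 0.04, 0.10, 0.08, 0.05, 0.10, 0.03],
    [0.20, 0.15, 0.10, 0.30, 0.35, 0.10, 0.20, 0.08],
    [0.40, 0.35, 0.20, 0.50, 0.60, 0.25, 0.30, 0.15],
    [0.10, 0.30, 0.05, 0.20, 0.50, 0.05, 0.10, 0.05],
]
labels = [0.95, 0.70, 0.30, 0.45]

initial_loss = mse([0.0] * len(INDEX_NAMES), 0.0, samples, labels)
weights, bias = train(samples, labels)
final_loss = mse(weights, bias, samples, labels)
print(f"loss {initial_loss:.4f} -> {final_loss:.4f}")
print("normal-purpose emphasis holds:", emphasis_holds(weights, "normal"))
```

One way to obtain purpose-specific weights under this sketch would be to train a separate weight set on satisfaction labels collected for each synthesis purpose. The claims do not state how the emphasis is enforced, so `emphasis_holds` only checks the outcome.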

Assignees

LG Electronics Inc.

Inventors

Classifications

  • Feedforward networks

  • Supervised learning

  • G10L13/02 (Primary): Methods for producing synthetic speech; Speech synthesisers

  • Learning methods

  • G10L13/033 (Primary): Voice editing, e.g. manipulating the voice of the synthesiser

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
LG Electronics Inc.
What technology area does this patent fall under?
Primary CPC classification G10L13/02. Mapped technology areas include Physics.
When was this patent published?
Publication date: July 18, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).