US11620980B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11620980-B2 |
| Application number | US-202117178823-A |
| Country | US |
| Kind code | B2 |
| Filing date | Feb 18, 2021 |
| Priority date | Jan 17, 2019 |
| Publication date | Apr 4, 2023 |
| Grant date | Apr 4, 2023 |
Official abstract text for this publication.
A text-based speech synthesis method, a computer device, and a non-transitory computer-readable storage medium are provided. The text-based speech synthesis method includes: a target text to be recognized is obtained; each character in the target text is discretely characterized to generate a feature vector corresponding to each character; the feature vector is input into a pre-trained spectrum conversion model, to obtain a Mel-spectrum corresponding to each character in the target text output by the spectrum conversion model; and the Mel-spectrum is converted to speech to obtain speech corresponding to the target text.
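The abstract's first step, discretely characterizing each character into a feature vector, can be illustrated with one-hot encoding. This is a minimal sketch only: the text shown here does not say which discrete encoding is used, so one-hot encoding is an assumption, and the helper names `build_vocab` and `one_hot_encode` are hypothetical.

```python
def build_vocab(texts):
    """Map each distinct character to an integer index (sorted for determinism)."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: i for i, ch in enumerate(chars)}


def one_hot_encode(text, vocab):
    """Return one feature vector per character in `text`.

    Assumption: "discrete characterization" is realized as one-hot encoding.
    A character missing from `vocab` raises KeyError in this sketch.
    """
    vectors = []
    for ch in text:
        vec = [0] * len(vocab)
        vec[vocab[ch]] = 1  # exactly one position is set per character
        vectors.append(vec)
    return vectors


vocab = build_vocab(["hello"])
features = one_hot_encode("hello", vocab)
```

Each character yields exactly one vector, so the sequence length equals the character count. In a full pipeline these vectors would be the input to the spectrum conversion model.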
Opening claim text (preview).
What is claimed is:

1. A text-based speech synthesis method, comprising: obtaining target text to be recognized; discretely characterizing each character in the target text to generate a feature vector corresponding to each character; obtaining a preset number of training text and matching speech corresponding to the training text; discretely characterizing the training text to obtain a feature vector corresponding to each character in the training text; inputting the feature vector corresponding to each character in the training text into a spectrum conversion model to be trained to obtain a Mel-spectrum output by the spectrum conversion model to be trained, wherein inputting the feature vector corresponding to each character in the training text into the spectrum conversion model to be trained to obtain the Mel-spectrum output by the spectrum conversion model to be trained comprises: coding the training text through the spectrum conversion model to be trained to obtain a hidden state sequence corresponding to the training text, wherein the hidden state sequence comprises at least two hidden nodes and is obtained by mapping the feature vectors of each character in the training text one by one; according to a weight of a hidden node corresponding to each character, weighting the hidden node to obtain a semantic vector corresponding to each character in the training text; and decoding the semantic vector corresponding to each character, and outputting the Mel-spectrum corresponding to each character; when an error between the Mel-spectrum output by the spectrum conversion model to be trained and a Mel-spectrum corresponding to the matching speech is less than or equal to a preset threshold, obtaining the trained spectrum conversion model; inputting the feature vector into a pre-trained spectrum conversion model to obtain a Mel-spectrum corresponding to each character in the target text output by the spectrum conversion model; and converting the Mel-spectrum into speech to obtain speech corresponding to the target text.

2. The method as claimed in claim 1, further comprising after inputting the feature vector corresponding to each character in the training text into the spectrum conversion model to be trained to obtain the Mel-spectrum output by the spectrum conversion model to be trained: when the error between the Mel-spectrum output by the spectrum conversion model to be trained and the Mel-spectrum corresponding to the matching speech is greater than the preset threshold, updating the weight of each hidden node; weighting the hidden node whose weight is updated to obtain a semantic vector corresponding to each character in the training text; decoding the semantic vector corresponding to each character, and outputting the Mel-spectrum corresponding to each character; and when the error between the Mel-spectrum corresponding to each character and the Mel-spectrum corresponding to the matching speech is less than or equal to the preset threshold, stopping the updating the weight of each hidden node, and obtaining the trained spectrum conversion model.

3. The method as claimed in claim 1, wherein converting the Mel-spectrum into speech to obtain the speech corresponding to the target text comprises: performing an inverse Fourier transform on the Mel-spectrum through a vocoder to convert the Mel-spectrum into a speech waveform signal in a time domain to obtain the speech.

4. The method as claimed in claim 1, wherein a number of characters in the training text corresponds to a number of hidden nodes.

5. A computer device, comprising: a memory, a processor, and a computer program stored in the memory and capable of running on the processor, wherein the computer program, when executed by the processor, causes the processor to implement: obtaining target text to be recognized; discretely characterizing each character in the target text to generate a feature vector corresponding to each character; obtaining a preset number of training text and matching speech corresponding to the training text; discretely characterizing the training text to obtain a feature vector corresponding to each character in the training text; inputting the feature vector corresponding to each character in the training text into a spectrum conversion model to be trained to obtain a Mel-spectrum output by the spectrum conversion model to be trained, wherein inputting the feature vector corresponding to each character in the training text into the spectrum conversion model to be trained to obtain the Mel-spectrum output by the spectrum conversion model to be trained comprises: coding the training text through the spectrum conversion model to be trained to obtain a hidden state sequence corresponding to the training text, wherein the hidden state sequence comprises at least two hidden nodes and is obtained by mapping the feature vectors of each character in the training text one by one; according to a weight of a hidden node corresponding to each character, weighting the hidden node to obtain a semantic vector corresponding to each character in the training text; and decoding the semantic vector corresponding to each character, and outputting the Mel-spectrum corresponding to each character; when an error between the Mel-spectrum output by the spectrum conversion model to be trained and a Mel-spectrum corresponding to the matching speech is less than or equal to a preset threshold, obtaining the trained spectrum conversion model; inputting the feature vector into a pre-trained spectrum conversion model to obtain a Mel-spectrum corresponding to each character in the target text output by the spectrum conversion model; and converting the Mel-spectrum into speech to obtain speech corresponding to the target text.

6. The computer device as claimed in claim 5, wherein the computer program, when executed by the processor, further causes the processor to implement: after inputting the feature vector corresponding to each character in the training text into the spectrum conversion model to be trained to obtain the Mel-spectrum output by the spectrum conversion model to be trained: when the error between the Mel-spectrum output by the spectrum conversion model to be trained and the Mel-spectrum corresponding to the matching speech is greater than the preset threshold, updating the weight of each hidden node; weighting the hidden node whose weight is updated to obtain a semantic vector corresponding to each character in the training text; decoding the semantic vector corresponding to each character, and outputting the Mel-spectrum corresponding to each character; and when the error between the Mel-spectrum corresponding to each character and the Mel-spectrum corresponding to the matching speech is less than or equal to the preset threshold, stopping the updating the weight of each hidden node, and obtaining the trained spectrum conversion model.

7. The computer device as claimed in claim 5, wherein to implement converting the Mel-spectrum into speech to obtain the speech corresponding to the target text, the computer program, when executed by the processor, causes the processor to implement: performing an inverse Fourier transform on the Mel-spectrum through a vocoder to convert the Mel-spectrum into a speech waveform signal in a time domain to obtain the speech.

8. The computer device as claimed in claim 5, wherein a number of characters in the training text corresponds to a number of hidden nodes.

9. A non-transitory computer-readable storage medium that stores a computer program, wherein the computer program, when executed by a processor, causes the processor to implement: obtaining target text to be recognized; discretely charac
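
The weighting and stopping steps recited in the claims can be illustrated with a small sketch. It shows a semantic vector computed as a weighted sum of hidden nodes, plus the threshold check that decides whether training stops. Several details are assumptions not stated in the claim text: the weights are normalized with a softmax, the error is a mean squared error, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1 (assumed normalization)."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def semantic_vector(hidden_nodes, scores):
    """Weight each hidden node and sum them into one semantic vector."""
    weights = softmax(scores)
    dim = len(hidden_nodes[0])
    return [
        sum(w * node[d] for w, node in zip(weights, hidden_nodes))
        for d in range(dim)
    ]


def mel_error(predicted, target):
    """Mean squared error between two Mel-spectra (assumed error metric)."""
    diffs = [
        (p - t) ** 2
        for p_frame, t_frame in zip(predicted, target)
        for p, t in zip(p_frame, t_frame)
    ]
    return sum(diffs) / len(diffs)


def training_done(predicted, target, threshold):
    """True when the error is at or below the preset threshold."""
    return mel_error(predicted, target) <= threshold


# Two hidden nodes with equal scores contribute equally to the semantic vector.
context = semantic_vector([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
done = training_done([[0.1, 0.2]], [[0.1, 0.2]], threshold=0.01)
```

When `training_done` returns `False`, the claimed procedure updates the hidden-node weights and repeats the weighting and decoding steps; the update rule itself is not specified in the claim text and is left out here.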
Methods for producing synthetic speech; Speech synthesisers · CPC title
the extracted parameters being the cepstrum · CPC title
the extracted parameters being spectral information of each sub-band · CPC title
Architecture of speech synthesisers · CPC title
Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination · CPC title