Text-based speech synthesis method, computer device, and non-transitory computer-readable storage medium

US11620980B2 · US · B2

Patent metadata
Publication number: US-11620980-B2
Application number: US-202117178823-A
Country: US
Kind code: B2
Filing date: Feb 18, 2021
Priority date: Jan 17, 2019
Publication date: Apr 4, 2023
Grant date: Apr 4, 2023

Abstract

Official abstract text for this publication.

A text-based speech synthesis method, a computer device, and a non-transitory computer-readable storage medium are provided. The text-based speech synthesis method includes: a target text to be recognized is obtained; each character in the target text is discretely characterized to generate a feature vector corresponding to each character; the feature vector is input into a pre-trained spectrum conversion model, to obtain a Mel-spectrum corresponding to each character in the target text output by the spectrum conversion model; and the Mel-spectrum is converted to speech to obtain speech corresponding to the target text.
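The first step of the abstract, discretely characterizing each character into a feature vector, can be illustrated with a one-hot encoding. This is only a sketch under an assumption: the abstract does not name the encoding, and one-hot is simply a common way to produce a discrete per-character vector. The helper names `build_vocab` and `one_hot_encode` are hypothetical and do not come from the patent.

```python
def build_vocab(texts):
    """Map every distinct character to an integer index (sorted for determinism)."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: i for i, ch in enumerate(chars)}

def one_hot_encode(text, vocab):
    """Return one feature vector per character.

    Each vector is all zeros except for a single 1 at the character's index.
    One-hot encoding is an assumption here, not something the abstract states.
    """
    vectors = []
    for ch in text:
        vec = [0] * len(vocab)
        vec[vocab[ch]] = 1
        vectors.append(vec)
    return vectors

# Toy usage: the vocabulary is built from the target text itself.
vocab = build_vocab(["hello"])
features = one_hot_encode("hello", vocab)
```

In a full system these per-character vectors would be the input to the spectrum conversion model; the sketch stops at the encoding step.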

First claim

Opening claim text (preview).

What is claimed is:

1. A text-based speech synthesis method, comprising: obtaining target text to be recognized; discretely characterizing each character in the target text to generate a feature vector corresponding to each character; obtaining a preset number of training text and matching speech corresponding to the training text; discretely characterizing the training text to obtain a feature vector corresponding to each character in the training text; inputting the feature vector corresponding to each character in the training text into a spectrum conversion model to be trained to obtain a Mel-spectrum output by the spectrum conversion model to be trained, wherein inputting the feature vector corresponding to each character in the training text into the spectrum conversion model to be trained to obtain the Mel-spectrum output by the spectrum conversion model to be trained comprises: coding the training text through the spectrum conversion model to be trained to obtain a hidden state sequence corresponding to the training text, wherein the hidden state sequence comprises at least two hidden nodes and is obtained by mapping the feature vectors of each character in the training text one by one; according to a weight of a hidden node corresponding to each character, weighting the hidden node to obtain a semantic vector corresponding to each character in the training text; and decoding the semantic vector corresponding to each character, and outputting the Mel-spectrum corresponding to each character; when an error between the Mel-spectrum output by the spectrum conversion model to be trained and a Mel-spectrum corresponding to the matching speech is less than or equal to a preset threshold, obtaining the trained spectrum conversion model; inputting the feature vector into a pre-trained spectrum conversion model to obtain a Mel-spectrum corresponding to each character in the target text output by the spectrum conversion model; and converting the Mel-spectrum into speech to obtain speech corresponding to the target text.

2. The method as claimed in claim 1, further comprising after inputting the feature vector corresponding to each character in the training text into the spectrum conversion model to be trained to obtain the Mel-spectrum output by the spectrum conversion model to be trained: when the error between the Mel-spectrum output by the spectrum conversion model to be trained and the Mel-spectrum corresponding to the matching speech is greater than the preset threshold, updating the weight of each hidden node; weighting the hidden node whose weight is updated to obtain a semantic vector corresponding to each character in the training text; decoding the semantic vector corresponding to each character, and outputting the Mel-spectrum corresponding to each character; and when the error between the Mel-spectrum corresponding to each character and the Mel-spectrum corresponding to the matching speech is less than or equal to the preset threshold, stopping the updating the weight of each hidden node, and obtaining the trained spectrum conversion model.

3. The method as claimed in claim 1, wherein converting the Mel-spectrum into speech to obtain the speech corresponding to the target text comprises: performing an inverse Fourier transform on the Mel-spectrum through a vocoder to convert the Mel-spectrum into a speech waveform signal in a time domain to obtain the speech.

4. The method as claimed in claim 1, wherein a number of characters in the training text corresponds to a number of hidden nodes.

5. A computer device, comprising: a memory, a processor, and a computer program stored in the memory and capable of running on the processor, wherein the computer program, when executed by the processor, causes the processor to implement: obtaining target text to be recognized; discretely characterizing each character in the target text to generate a feature vector corresponding to each character; obtaining a preset number of training text and matching speech corresponding to the training text; discretely characterizing the training text to obtain a feature vector corresponding to each character in the training text; inputting the feature vector corresponding to each character in the training text into a spectrum conversion model to be trained to obtain a Mel-spectrum output by the spectrum conversion model to be trained, wherein inputting the feature vector corresponding to each character in the training text into the spectrum conversion model to be trained to obtain the Mel-spectrum output by the spectrum conversion model to be trained comprises: coding the training text through the spectrum conversion model to be trained to obtain a hidden state sequence corresponding to the training text, wherein the hidden state sequence comprises at least two hidden nodes and is obtained by mapping the feature vectors of each character in the training text one by one; according to a weight of a hidden node corresponding to each character, weighting the hidden node to obtain a semantic vector corresponding to each character in the training text; and decoding the semantic vector corresponding to each character, and outputting the Mel-spectrum corresponding to each character; when an error between the Mel-spectrum output by the spectrum conversion model to be trained and a Mel-spectrum corresponding to the matching speech is less than or equal to a preset threshold, obtaining the trained spectrum conversion model; inputting the feature vector into a pre-trained spectrum conversion model to obtain a Mel-spectrum corresponding to each character in the target text output by the spectrum conversion model; and converting the Mel-spectrum into speech to obtain speech corresponding to the target text.

6. The computer device as claimed in claim 5, wherein the computer program, when executed by the processor, further causes the processor to implement: after inputting the feature vector corresponding to each character in the training text into the spectrum conversion model to be trained to obtain the Mel-spectrum output by the spectrum conversion model to be trained: when the error between the Mel-spectrum output by the spectrum conversion model to be trained and the Mel-spectrum corresponding to the matching speech is greater than the preset threshold, updating the weight of each hidden node; weighting the hidden node whose weight is updated to obtain a semantic vector corresponding to each character in the training text; decoding the semantic vector corresponding to each character, and outputting the Mel-spectrum corresponding to each character; and when the error between the Mel-spectrum corresponding to each character and the Mel-spectrum corresponding to the matching speech is less than or equal to the preset threshold, stopping the updating the weight of each hidden node, and obtaining the trained spectrum conversion model.

7. The computer device as claimed in claim 5, wherein to implement converting the Mel-spectrum into speech to obtain the speech corresponding to the target text, the computer program, when executed by the processor, causes the processor to implement: performing an inverse Fourier transform on the Mel-spectrum through a vocoder to convert the Mel-spectrum into a speech waveform signal in a time domain to obtain the speech.

8. The computer device as claimed in claim 5, wherein a number of characters in the training text corresponds to a number of hidden nodes.

9. A non-transitory computer-readable storage medium that stores a computer program, wherein the computer program, when executed by a processor, causes the processor to implement: obtaining target text to be recognized; discretely charac
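The training logic recited in claims 1 and 2 (weight the hidden nodes into a semantic vector, decode, compare against the reference Mel-spectrum, update the weights while the error exceeds a preset threshold, stop once it does not) can be sketched on a toy problem. Everything concrete below is an assumption made for illustration and is not taken from the claims: the softmax normalization of the weights, the identity "decoder" (the semantic vector is used directly as a two-bin Mel frame), the mean squared error, the numerical-gradient update, and all function names and values.

```python
import math

def softmax(scores):
    """Turn raw per-node scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def semantic_vector(hidden_nodes, scores):
    """Weight each hidden node and sum the results into one semantic vector."""
    weights = softmax(scores)
    dim = len(hidden_nodes[0])
    return [sum(w * node[d] for w, node in zip(weights, hidden_nodes))
            for d in range(dim)]

def error(predicted, target):
    """Mean squared error between a predicted and a reference frame."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def train(hidden_nodes, target_frame, threshold=1e-4, lr=1.0, max_steps=1000):
    """Update node scores until the error is <= threshold or steps run out."""
    scores = [0.0] * len(hidden_nodes)
    eps = 1e-6
    for step in range(max_steps):
        current = error(semantic_vector(hidden_nodes, scores), target_frame)
        if current <= threshold:
            # Error within the preset threshold: stop updating the weights.
            return scores, current, step
        # Error above the threshold: estimate gradients by central differences.
        grads = []
        for i in range(len(scores)):
            up = scores[:i] + [scores[i] + eps] + scores[i + 1:]
            down = scores[:i] + [scores[i] - eps] + scores[i + 1:]
            loss_up = error(semantic_vector(hidden_nodes, up), target_frame)
            loss_down = error(semantic_vector(hidden_nodes, down), target_frame)
            grads.append((loss_up - loss_down) / (2 * eps))
        scores = [s - lr * g for s, g in zip(scores, grads)]
    final = error(semantic_vector(hidden_nodes, scores), target_frame)
    return scores, final, max_steps

# Toy data (hypothetical): two hidden nodes and a two-bin reference frame.
hidden_nodes = [[1.0, 0.0], [0.0, 1.0]]
target_frame = [0.2, 0.8]
scores, final_error, steps = train(hidden_nodes, target_frame)
weights = softmax(scores)
```

A real spectrum conversion model would use a learned encoder and decoder trained by backpropagation over many text and speech pairs; the sketch keeps only the threshold-controlled loop so the stopping condition is easy to see.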

Assignees

Ping An Tech Shenzhen Co Ltd

Inventors

Classifications

  • G10L13/02 · Primary

    Methods for producing synthetic speech; Speech synthesisers · CPC title

  • the extracted parameters being the cepstrum · CPC title

  • the extracted parameters being spectral information of each sub-band · CPC title

  • Architecture of speech synthesisers · CPC title

  • G10L13/08 · Primary

    Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11620980B2 cover?
A text-based speech synthesis method, a computer device, and a non-transitory computer-readable storage medium are provided. The text-based speech synthesis method includes: a target text to be recognized is obtained; each character in the target text is discretely characterized to generate a feature vector corresponding to each character; the feature vector is input into a pre-trained spectrum…
Who is the assignee on this patent?
Ping An Tech Shenzhen Co Ltd
What technology area does this patent fall under?
Primary CPC classification G10L13/02. Mapped technology areas include Physics.
When was this patent published?
Publication date: Apr 4, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).