US10140972B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10140972-B2 |
| Application number | US-201414466340-A |
| Country | US |
| Kind code | B2 |
| Filing date | Aug 22, 2014 |
| Priority date | Aug 23, 2013 |
| Publication date | Nov 27, 2018 |
| Grant date | Nov 27, 2018 |
Official abstract text for this publication.
A method of training an acoustic model for a text-to-speech system, the method comprising: receiving speech data; said speech data comprising data corresponding to different values of a first speech factor, and wherein said speech data is unlabeled, such that for a given item of speech data, the value of said first speech factor is unknown; clustering said speech data according to the value of said first speech factor into a first set of clusters; and estimating a first set of parameters to enable the acoustic model to accommodate speech for the different values of the first speech factor, wherein said clustering and said first parameter estimation are jointly performed according to a common maximum likelihood criterion.
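The phrase "jointly performed according to a common maximum likelihood criterion" can be illustrated with a toy example. The sketch below is not the patented training procedure: it assumes one-dimensional data and a two-component Gaussian mixture fitted by expectation-maximization, and every name in it (`em_fit`, `log_likelihood`, the synthetic `data`) is hypothetical. It shows only the general idea that soft cluster assignment of unlabeled data and parameter estimation can both be driven by one likelihood objective.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    """Total log-likelihood of the data under the mixture (the shared criterion)."""
    return sum(
        math.log(sum(w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in data
    )


def em_fit(data, means, n_iter=30, var_floor=1e-3):
    """Fit a 1-D Gaussian mixture by EM, starting from the given initial means."""
    k = len(means)
    means = list(means)
    weights = [1.0 / k] * k
    variances = [1.0] * k
    history = [log_likelihood(data, weights, means, variances)]
    for _ in range(n_iter):
        # E-step: soft-assign each unlabeled item to the clusters.
        resp = []
        for x in data:
            joint = [w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate parameters from those soft assignments.
        for c in range(k):
            n_c = sum(r[c] for r in resp)
            weights[c] = n_c / len(data)
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / n_c
            spread = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / n_c
            variances[c] = max(var_floor, spread)  # floor is a numerical safeguard
        history.append(log_likelihood(data, weights, means, variances))
    return weights, means, variances, history


# Synthetic unlabeled data drawn from two hidden "factor values" (assumed setup).
rng = random.Random(0)
data = ([rng.gauss(-2.0, 0.5) for _ in range(200)]
        + [rng.gauss(3.0, 0.5) for _ in range(200)])
weights, means, variances, history = em_fit(data, means=[-1.0, 1.0])
```

In this toy setting, `history` records the log-likelihood after each iteration; both the assignment step and the estimation step work on that single quantity, which is the sense in which clustering and estimation share one criterion here.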
Opening claim text (preview).
The invention claimed is:

1. A text-to-speech method configured to output speech having a target value of a speech factor, said method comprising: inputting audio data with said target value of a speech factor; adapting an acoustic model to said target value of a speech factor; inputting text; dividing said inputted text into a sequence of acoustic units; converting said sequence of acoustic units into a sequence of speech vectors using said acoustic model; and outputting said sequence of speech vectors as audio with said target value of a speech factor, wherein said acoustic model comprises a set of speech factor parameters that enable the acoustic model to accommodate speech for the different values of a speech factor, and wherein said set of speech factor parameters are unlabeled, such that for a given one or more parameters, the value of said speech factor to which they relate is unknown, and wherein adapting the acoustic model comprises adjusting the speech factor parameters to substantially match the target value of a speech factor, wherein said text to speech method includes training said acoustic model using a method comprising: receiving speech data, said speech data comprising data corresponding to different values of the speech factor, and wherein said speech data is unlabeled, such that for a given item of speech data, the value of said speech factor is unknown; clustering said speech data according to the value of said speech factor into a first set of clusters; and estimating a first set of parameters to enable the acoustic model to accommodate speech for the different values of the speech factor, wherein said clustering and said first parameter estimation are jointly performed according to a common maximum likelihood criterion.

2. The text to speech method of claim 1, wherein said speech factor is expression and the acoustic model further comprises a set of expression parameters relating to speaker and a set of clusters relating to speaker; and wherein said set of expression parameters and said set of speaker parameters and said set of expression clusters and said set of speaker clusters do not overlap, and wherein the method is configured to transplant an expression from a first speaker to a second speaker, by employing expression parameters obtained from the speech of a first speaker with that of a second speaker.

3. A text to speech method, the method comprising: receiving input text; dividing said inputted text into a sequence of acoustic units; converting said sequence of acoustic units to a sequence of speech vectors using an acoustic model, wherein said acoustic model comprises a set of speaker parameters and a set of speaker clusters relating to speaker voice and a set of expression parameters and a set of expression clusters relating to expression, and wherein the sets of speaker and expression parameters and the sets of speaker and expression clusters do not overlap; and outputting said sequence of speech vectors as audio, the method further comprising determining at least some of said parameters relating to expression by: extracting expressive features from said input text to form an expressive linguistic feature vector constructed in a first space; and mapping said expressive linguistic feature vector to an expressive synthesis feature vector which is constructed in a second space, wherein said text to speech method includes training said acoustic model using a method comprising: receiving speech data, said speech data further comprising speech data from one or more speakers speaking with neutral speech, said speech data comprising data corresponding to different values of a first speech factor, wherein the first speech factor is speaker and speech data corresponding to different values of a second speech factor, wherein the second speech factor is expression, and wherein said speech data is unlabeled, such that for a given item of speech data, the value of said first speech factor is unknown; clustering said speech data according to the value of said first speech factor into a first set of clusters and clustering said speech data according to the value of said second speech factor into a second set of clusters; and estimating a first set and a second set of parameters to enable the acoustic model to accommodate speech for the different values of the first speech factor and the second speech factor respectively, wherein said clustering and the parameter estimation are jointly performed according to a common maximum likelihood criterion which is common to both parameter estimation and said clustering.

4. The method of claim 3 wherein said second space is the acoustic space of a first speaker and the method is configured to transplant the expressive synthesis feature vector to the acoustic space of a second speaker.

5. A text to speech method, the method comprising: receiving input text; dividing said inputted text into a sequence of acoustic units; converting said sequence of acoustic units to a sequence of speech vectors using an acoustic model; and outputting said sequence of speech vectors as audio, the method further comprising determining at least some of said second set of parameters by: extracting expressive features from said input text to form an expressive linguistic feature vector constructed in a first space; and mapping said expressive linguistic feature vector to an expressive synthesis feature vector which is constructed in a second space, wherein said text to speech method includes training said acoustic model using a method comprising: receiving speech data, said speech data further comprising speech data from one or more speakers speaking with neutral speech, said speech data comprising data corresponding to different values of a first speech factor, wherein the first speech factor is speaker and speech data corresponding to different values of a second speech factor, wherein the second speech factor is expression, and wherein said speech data is unlabeled, such that for a given item of speech data, the value of said first speech factor is unknown; clustering said speech data according to the value of said first speech factor into a first set of clusters and clustering said speech data according to the value of said second speech factor into a second set of clusters; and estimating a first set and a second set of parameters to enable the acoustic model to accommodate speech for the different values of the first speech factor and the second speech factor respectively, wherein said clustering and the parameter estimation are jointly performed according to a common maximum likelihood criterion which is common to both parameter estimation and said clustering, and wherein said first and second set of parameters and said first and second set of clusters do not overlap.

6. A system configured to output speech having a target value of a speech factor, said system comprising: an audio input for receiving audio data with said target value of a speech factor; a text input for receiving text; and a processor configured to adapt an acoustic model to said target value of a speech factor; divide said inputted text into a sequence of acoustic units; convert said sequence of acoustic units into a sequence of speech vectors using said acoustic model; and output said sequence of speech vectors as audio with said target value of a speech factor, wherein said acoustic model comprises a first set of parameters that enable to acoustic model to accommodate speech for the different values of the speech factor; and wherein said first set of parameters are unlabeled, such that for a given one or more parameters, the value of said first speech factor is unknown, and wherein adapting the acoustic mod
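
The expression transplant recited in claims 2 and 4 relies on speaker and expression parameters that do not overlap. A toy picture of why that separation matters is sketched below. It assumes, purely for illustration, that a model mean vector is an additive combination of weighted speaker cluster means and weighted expression cluster means; the claims as shown here do not state this form, and all names and numbers (`combine_mean`, the cluster vectors, the weights) are hypothetical.

```python
def combine_mean(speaker_clusters, speaker_weights, expr_clusters, expr_weights):
    """Assumed additive model: mean = sum(w_s * speaker cluster) + sum(w_e * expression cluster)."""
    dim = len(speaker_clusters[0])
    mean = [0.0] * dim
    for cluster, w in zip(speaker_clusters, speaker_weights):
        for i in range(dim):
            mean[i] += w * cluster[i]
    for cluster, w in zip(expr_clusters, expr_weights):
        for i in range(dim):
            mean[i] += w * cluster[i]
    return mean


# Hypothetical 2-D cluster means; real acoustic models use far larger vectors.
speaker_clusters = [[1.0, 0.0], [0.0, 1.0]]
expr_clusters = [[0.5, 0.5], [-0.5, 0.5]]

speaker_a = [1.0, 0.0]   # speaker weights for a first speaker (assumed)
speaker_b = [0.0, 1.0]   # speaker weights for a second speaker (assumed)
neutral = [0.0, 0.0]     # expression weights for neutral speech (assumed)
happy_from_a = [1.0, 0.0]  # expression weights obtained from the first speaker's speech

a_neutral = combine_mean(speaker_clusters, speaker_a, expr_clusters, neutral)
a_happy = combine_mean(speaker_clusters, speaker_a, expr_clusters, happy_from_a)
b_neutral = combine_mean(speaker_clusters, speaker_b, expr_clusters, neutral)

# Transplant: reuse the first speaker's expression weights with the second speaker's weights.
b_happy = combine_mean(speaker_clusters, speaker_b, expr_clusters, happy_from_a)
```

Under this assumed additive form, the expression weights shift the mean by the same offset whichever speaker weights are used, so an expression estimated from one speaker can be applied to another without re-estimating the speaker parameters.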
Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination · CPC title
Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice (G10L15/14 takes precedence) · CPC title
Methods for producing synthetic speech; Speech synthesisers · CPC title
Prosody rules derived from text; Stress or intonation · CPC title
Training · CPC title