US9824681B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9824681-B2 |
| Application number | US-201414483153-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 11, 2014 |
| Priority date | Sep 11, 2014 |
| Publication date | Nov 21, 2017 |
| Grant date | Nov 21, 2017 |
Official abstract text for this publication.
Techniques for converting text to speech having emotional content. In an aspect, an emotionally neutral acoustic trajectory is predicted for a script using a neutral model, and an emotion-specific acoustic trajectory adjustment is independently predicted using an emotion-specific model. The neutral trajectory and emotion-specific adjustments are combined to generate a transformed speech output having emotional content. In another aspect, state parameters of a statistical parametric model for neutral voice are transformed by emotion-specific factors that vary across contexts and states. The emotion-dependent adjustment factors may be clustered and stored using an emotion-specific decision tree or other clustering scheme distinct from a decision tree used for the neutral voice model.
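The first aspect of the abstract can be illustrated with a minimal sketch: a neutral F0 value and an emotion-specific adjustment are predicted independently, then combined. Everything named here is a hypothetical stand-in rather than the patent's implementation: the lookup tables `NEUTRAL_F0` and `EMOTION_F0_OFFSET`, the phoneme labels, and the Hz values are invented for illustration, and additive combination is assumed as one possible way of combining the two predictions.

```python
from typing import Dict, List, Tuple

# Toy "neutral model": phoneme label -> neutral F0 in Hz (hypothetical values).
NEUTRAL_F0: Dict[str, float] = {"h": 110.0, "ax": 120.0, "l": 118.0, "ow": 115.0}

# Toy "emotion-specific model": (emotion, phoneme) -> additive F0 offset in Hz.
# It is stored separately from the neutral table and only covers some contexts.
EMOTION_F0_OFFSET: Dict[Tuple[str, str], float] = {
    ("happy", "ax"): 25.0,
    ("happy", "ow"): 40.0,
    ("sad", "ow"): -15.0,
}


def predict_neutral(phonemes: List[str]) -> List[float]:
    """Predict the emotionally neutral F0 value for each phoneme."""
    return [NEUTRAL_F0[p] for p in phonemes]


def predict_adjustment(phonemes: List[str], emotion: str) -> List[float]:
    """Independently predict the emotion-specific adjustment (0.0 if unknown)."""
    return [EMOTION_F0_OFFSET.get((emotion, p), 0.0) for p in phonemes]


def synthesize_trajectory(phonemes: List[str], emotion: str) -> List[float]:
    """Combine the neutral prediction and the adjustment (additive assumption)."""
    neutral = predict_neutral(phonemes)
    adjustment = predict_adjustment(phonemes, emotion)
    return [n + a for n, a in zip(neutral, adjustment)]


if __name__ == "__main__":
    print(synthesize_trajectory(["h", "ax", "l", "ow"], "happy"))
```

Keeping the two tables separate mirrors the idea that the neutral model can stay fixed while emotion-specific adjustments are added or swapped per emotion type.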
Opening claim text (preview).
The invention claimed is:

1. An apparatus for text-to-speech conversion comprising: a neutral duration prediction block comprising computer hardware configured to generate an emotionally neutral representation of a script, the emotionally neutral representation comprising a neutral duration associated with each of a plurality of phonemes; and a duration adjustment block comprising computer hardware configured to apply a duration adjustment factor to each neutral duration to generate a transformed duration sequence, the duration adjustment factor being dependent on an emotion type and a linguistic-contextual identity of the corresponding phoneme; a neutral trajectory prediction block comprising computer hardware configured to generate a neutral fundamental frequency (F0) prediction and a neutral spectrum prediction for each adjusted duration of the transformed duration sequence; and a trajectory adjustment block comprising computer hardware configured to apply an F0 adjustment factor to each neutral F0 prediction and a spectrum adjustment factor to each neutral spectrum prediction to generate a transformed representation, each of the F0 adjustment factor and the spectrum adjustment factor being dependent on the emotion type and the linguistic-contextual identity of the corresponding phoneme.

2. The apparatus of claim 1, further comprising a vocoder configured to synthesize a speech waveform from the transformed representation.

3. The apparatus of claim 1, further comprising a memory storing a neutral decision tree and an emotion-specific decision tree distinct from the neutral decision tree, the neutral duration prediction block further configured to retrieve the duration of each phoneme from the neutral decision tree, and the duration adjustment block configured to retrieve an emotion-specific adjustment factor for adjusting each duration of each phoneme from the emotion-specific decision tree.

4. The apparatus of claim 1, further comprising: a build block configured to build a phoneme sequence based on a text script; an extract block configured to modify the built phoneme sequence to generate a linguistic-contextual feature sequence based on extracted contextual features of the text script; wherein the plurality of phonemes of the neutral duration prediction block corresponds to the linguistic-contextual feature sequence.

5. The apparatus of claim 1, each of the plurality of phonemes comprising a plurality of states, each of the adjustment factors applied on a per-state basis.

6. The apparatus of claim 5, each of the plurality of phonemes comprising three states.

7. The apparatus of claim 1, each of the plurality of phonemes comprising a plurality of states, each of the adjustment factors applied on a per-frame basis.

8. The apparatus of claim 1, each of the duration adjustment factor, the F0 adjustment factor, and the spectrum adjustment factor being applied additively.

9. The apparatus of claim 1, each of the duration adjustment factor, the F0 adjustment factor, and the spectrum adjustment factor being applied as a linear transformation.

10. The apparatus of claim 1, each of the duration adjustment factor, the F0 adjustment factor, and the spectrum adjustment factor being applied as an affine transformation.

11. A computing device including a memory holding instructions executable by a processor to: generate an emotionally neutral representation of a script, the emotionally neutral representation comprising a neutral duration associated with each of a plurality of phonemes; and apply a duration adjustment factor to each neutral duration to generate a transformed duration sequence, the duration adjustment factor being dependent on an emotion type and a linguistic-contextual identity of the corresponding phoneme; generate a neutral fundamental frequency (F0) prediction and a neutral spectrum prediction for each adjusted duration of the transformed duration sequence; and apply an F0 adjustment factor to each neutral F0 prediction and a spectrum adjustment factor to each neutral spectrum prediction to generate a transformed representation, each of the F0 adjustment factor and the spectrum adjustment factor being dependent on the emotion type and the linguistic-contextual identity of the corresponding phoneme.

12. The device of claim 11, further comprising a vocoder configured to synthesize a speech waveform from the transformed representation.

13. The device of claim 11, further comprising a memory storing a neutral decision tree and an emotion-specific decision tree distinct from the neutral decision tree, the neutral duration prediction block further configured to retrieve the duration of each phoneme from the neutral decision tree, and the duration adjustment block configured to retrieve an emotion-specific adjustment factor for adjusting each duration of each phoneme from the emotion-specific decision tree.

14. The device of claim 11, the memory further holding instructions executable by the processor to: build a phoneme sequence based on a text script; modify the built phoneme sequence to generate a linguistic-contextual feature sequence based on extracted contextual features of the text script; wherein the plurality of phonemes of the neutral duration prediction block corresponds to the linguistic-contextual feature sequence.

15. The device of claim 11, each of the plurality of phonemes comprising a plurality of states, each of the adjustment factors applied on a per-state basis.

16. A method comprising: generating an emotionally neutral representation of a script, the emotionally neutral representation comprising a neutral duration associated with each of a plurality of phonemes; and applying a duration adjustment factor to each neutral duration to generate a transformed duration sequence, the duration adjustment factor being dependent on an emotion type and a linguistic-contextual identity of the corresponding phoneme; generating a neutral fundamental frequency (F0) prediction and a neutral spectrum prediction for each adjusted duration of the transformed duration sequence; and applying an F0 adjustment factor to each neutral F0 prediction and a spectrum adjustment factor to each neutral spectrum prediction to generate a transformed representation, each of the F0 adjustment factor and the spectrum adjustment factor being dependent on the emotion type and the linguistic-contextual identity of the corresponding phoneme.

17. The method of claim 16, further comprising synthesizing a speech waveform from the transformed representation.

18. The method of claim 16, further comprising: storing a neutral decision tree and an emotion-specific decision tree distinct from the neutral decision tree; retrieving the duration of each phoneme from the neutral decision tree, and the duration adjustment block configured to retrieve an emotion-specific adjustment factor for adjusting each duration of each phoneme from the emotion-specific decision tree.

19. The method of claim 16, further comprising: building a phoneme sequence based on a text script; and modifying the built phoneme sequence to generate a linguistic-contextual feature sequence based on extracted contextual features of the text script; wherein the plurality of phonemes of the neutral duration prediction block corresponds to the linguistic-contextual feature sequence.

20. The method of claim 16, each of the plurality of phonemes comprising a plurality of states, each of the adjustment factors applied
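As a reading aid for the independent claims, the following is a minimal sketch of the claimed order of operations: adjust each neutral duration, then adjust the neutral F0 and spectrum predictions, with every factor looked up by emotion type and phoneme context. It is not the patented implementation. The `Affine` and `Adjustment` types, the flat dictionaries standing in for the neutral and emotion-specific decision trees, the labels, and all numeric values are assumptions made for illustration; the sketch also works at one value per phoneme rather than per state or per frame.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Affine:
    """y = scale * x + offset. scale=1 is purely additive; offset=0 is purely linear."""
    scale: float = 1.0
    offset: float = 0.0

    def apply(self, x: float) -> float:
        return self.scale * x + self.offset


@dataclass(frozen=True)
class Adjustment:
    """Emotion-specific factors for one phoneme context (identity by default)."""
    duration: Affine = Affine()
    f0: Affine = Affine()
    spectrum: Affine = Affine()


# Stand-in for the neutral store: label -> (duration in frames, F0 in Hz, spectrum).
NEUTRAL: Dict[str, Tuple[float, float, List[float]]] = {
    "h": (8.0, 110.0, [1.0, 0.5]),
    "ay": (20.0, 125.0, [0.75, 0.25]),
}

# Stand-in for the distinct emotion-specific store: (emotion, label) -> factors.
EMOTION: Dict[Tuple[str, str], Adjustment] = {
    ("happy", "ay"): Adjustment(
        duration=Affine(scale=0.5),            # linear: shorten the phoneme
        f0=Affine(scale=1.0, offset=30.0),     # additive: raise pitch
        spectrum=Affine(offset=0.25),          # additive shift per coefficient
    ),
}


def transform(labels: List[str], emotion: str) -> List[Tuple[float, float, List[float]]]:
    """Return (duration, F0, spectrum) per phoneme after emotion adjustment."""
    result = []
    for label in labels:
        duration, f0, spectrum = NEUTRAL[label]
        adj = EMOTION.get((emotion, label), Adjustment())
        new_duration = adj.duration.apply(duration)   # duration adjustment first
        new_f0 = adj.f0.apply(f0)                     # then trajectory adjustment
        new_spectrum = [adj.spectrum.apply(c) for c in spectrum]
        result.append((new_duration, new_f0, new_spectrum))
    return result


if __name__ == "__main__":
    print(transform(["h", "ay"], "happy"))
```

A single affine form covers the additive, linear, and affine variants recited in the dependent claims, which is why the sketch uses one `Affine` type for all three kinds of factor.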
Voice editing, e.g. manipulating the voice of the synthesiser · CPC title
Concept to speech synthesisers; Generation of natural phrases from machine-based concepts (generation of parameters for speech synthesis out of text G10L13/08) · CPC title