Text-to-speech with emotional content

US9824681B2 · US · B2

Patent metadata
Publication number: US-9824681-B2
Application number: US-201414483153-A
Country: US
Kind code: B2
Filing date: Sep 11, 2014
Priority date: Sep 11, 2014
Publication date: Nov 21, 2017
Grant date: Nov 21, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques for converting text to speech having emotional content. In an aspect, an emotionally neutral acoustic trajectory is predicted for a script using a neutral model, and an emotion-specific acoustic trajectory adjustment is independently predicted using an emotion-specific model. The neutral trajectory and emotion-specific adjustments are combined to generate a transformed speech output having emotional content. In another aspect, state parameters of a statistical parametric model for neutral voice are transformed by emotion-specific factors that vary across contexts and states. The emotion-dependent adjustment factors may be clustered and stored using an emotion-specific decision tree or other clustering scheme distinct from a decision tree used for the neutral voice model.
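The idea in the abstract, a neutral prediction plus an independently predicted emotion-specific adjustment, can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the patent: the lookup tables stand in for the two separately clustered models, the numeric values are invented, and the combination is assumed to be a simple addition.

```python
# Toy neutral model: phoneme -> neutral F0 in Hz (all values invented).
NEUTRAL_F0_HZ = {"a": 120.0, "i": 130.0, "o": 110.0}

# Toy emotion-specific model. It is clustered on stress rather than on
# phoneme identity, to mimic a clustering scheme distinct from the neutral one.
HAPPY_F0_OFFSET_HZ = {"stressed": 25.0, "unstressed": 5.0}


def emotion_cluster(stressed: bool) -> str:
    """Hypothetical emotion-specific clustering of a phoneme's context."""
    return "stressed" if stressed else "unstressed"


def predict_f0(script, emotion_offsets=None):
    """Return one F0 value per (phoneme, stressed) pair in the script."""
    trajectory = []
    for phoneme, stressed in script:
        f0 = NEUTRAL_F0_HZ[phoneme]  # neutral prediction
        if emotion_offsets is not None:
            # Assumed additive combination with the emotion-specific adjustment.
            f0 += emotion_offsets[emotion_cluster(stressed)]
        trajectory.append(f0)
    return trajectory


script = [("a", True), ("i", False), ("o", False)]
neutral = predict_f0(script)
happy = predict_f0(script, HAPPY_F0_OFFSET_HZ)
```

Because the two tables are keyed differently, the emotion-specific adjustments could be retrained or swapped for another emotion without touching the neutral model.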

First claim

Opening claim text (preview).

The invention claimed is:

1. An apparatus for text-to-speech conversion comprising: a neutral duration prediction block comprising computer hardware configured to generate an emotionally neutral representation of a script, the emotionally neutral representation comprising a neutral duration associated with each of a plurality of phonemes; and a duration adjustment block comprising computer hardware configured to apply a duration adjustment factor to each neutral duration to generate a transformed duration sequence, the duration adjustment factor being dependent on an emotion type and a linguistic-contextual identity of the corresponding phoneme; a neutral trajectory prediction block comprising computer hardware configured to generate a neutral fundamental frequency (F0) prediction and a neutral spectrum prediction for each adjusted duration of the transformed duration sequence; and a trajectory adjustment block comprising computer hardware configured to apply an F0 adjustment factor to each neutral F0 prediction and a spectrum adjustment factor to each neutral spectrum prediction to generate a transformed representation, each of the F0 adjustment factor and the spectrum adjustment factor being dependent on the emotion type and the linguistic-contextual identity of the corresponding phoneme.

2. The apparatus of claim 1, further comprising a vocoder configured to synthesize a speech waveform from the transformed representation.

3. The apparatus of claim 1, further comprising a memory storing a neutral decision tree and an emotion-specific decision tree distinct from the neutral decision tree, the neutral duration prediction block further configured to retrieve the duration of each phoneme from the neutral decision tree, and the duration adjustment block configured to retrieve an emotion-specific adjustment factor for adjusting each duration of each phoneme from the emotion-specific decision tree.

4. The apparatus of claim 1, further comprising: a build block configured to build a phoneme sequence based on a text script; an extract block configured to modify the built phoneme sequence to generate a linguistic-contextual feature sequence based on extracted contextual features of the text script; wherein the plurality of phonemes of the neutral duration prediction block corresponds to the linguistic-contextual feature sequence.

5. The apparatus of claim 1, each of the plurality of phonemes comprising a plurality of states, each of the adjustment factors applied on a per-state basis.

6. The apparatus of claim 5, each of the plurality of phonemes comprising three states.

7. The apparatus of claim 1, each of the plurality of phonemes comprising a plurality of states, each of the adjustment factors applied on a per-frame basis.

8. The apparatus of claim 1, each of the duration adjustment factor, the F0 adjustment factor, and the spectrum adjustment factor being applied additively.

9. The apparatus of claim 1, each of the duration adjustment factor, the F0 adjustment factor, and the spectrum adjustment factor being applied as a linear transformation.

10. The apparatus of claim 1, each of the duration adjustment factor, the F0 adjustment factor, and the spectrum adjustment factor being applied as an affine transformation.

11. A computing device including a memory holding instructions executable by a processor to: generate an emotionally neutral representation of a script, the emotionally neutral representation comprising a neutral duration associated with each of a plurality of phonemes; and apply a duration adjustment factor to each neutral duration to generate a transformed duration sequence, the duration adjustment factor being dependent on an emotion type and a linguistic-contextual identity of the corresponding phoneme; generate a neutral fundamental frequency (F0) prediction and a neutral spectrum prediction for each adjusted duration of the transformed duration sequence; and apply an F0 adjustment factor to each neutral F0 prediction and a spectrum adjustment factor to each neutral spectrum prediction to generate a transformed representation, each of the F0 adjustment factor and the spectrum adjustment factor being dependent on the emotion type and the linguistic-contextual identity of the corresponding phoneme.

12. The device of claim 11, further comprising a vocoder configured to synthesize a speech waveform from the transformed representation.

13. The device of claim 11, further comprising a memory storing a neutral decision tree and an emotion-specific decision tree distinct from the neutral decision tree, the neutral duration prediction block further configured to retrieve the duration of each phoneme from the neutral decision tree, and the duration adjustment block configured to retrieve an emotion-specific adjustment factor for adjusting each duration of each phoneme from the emotion-specific decision tree.

14. The device of claim 11, the memory further holding instructions executable by the processor to: build a phoneme sequence based on a text script; modify the built phoneme sequence to generate a linguistic-contextual feature sequence based on extracted contextual features of the text script; wherein the plurality of phonemes of the neutral duration prediction block corresponds to the linguistic-contextual feature sequence.

15. The device of claim 11, each of the plurality of phonemes comprising a plurality of states, each of the adjustment factors applied on a per-state basis.

16. A method comprising: generating an emotionally neutral representation of a script, the emotionally neutral representation comprising a neutral duration associated with each of a plurality of phonemes; and applying a duration adjustment factor to each neutral duration to generate a transformed duration sequence, the duration adjustment factor being dependent on an emotion type and a linguistic-contextual identity of the corresponding phoneme; generating a neutral fundamental frequency (F0) prediction and a neutral spectrum prediction for each adjusted duration of the transformed duration sequence; and applying an F0 adjustment factor to each neutral F0 prediction and a spectrum adjustment factor to each neutral spectrum prediction to generate a transformed representation, each of the F0 adjustment factor and the spectrum adjustment factor being dependent on the emotion type and the linguistic-contextual identity of the corresponding phoneme.

17. The method of claim 16, further comprising synthesizing a speech waveform from the transformed representation.

18. The method of claim 16, further comprising: storing a neutral decision tree and an emotion-specific decision tree distinct from the neutral decision tree; retrieving the duration of each phoneme from the neutral decision tree, and the duration adjustment block configured to retrieve an emotion-specific adjustment factor for adjusting each duration of each phoneme from the emotion-specific decision tree.

19. The method of claim 16, further comprising: building a phoneme sequence based on a text script; and modifying the built phoneme sequence to generate a linguistic-contextual feature sequence based on extracted contextual features of the text script; wherein the plurality of phonemes of the neutral duration prediction block corresponds to the linguistic-contextual feature sequence.

20. The method of claim 16, each of the plurality of phonemes comprising a plurality of states, each of the adjustment factors applied
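As a rough illustration of the order of operations in the independent claims, the sketch below adjusts each neutral duration first, then produces neutral F0 and spectrum values over the adjusted duration, then adjusts those. It is a toy under stated assumptions, not the patented implementation: the `Affine`, `Segment`, `NEUTRAL`, `ADJUST`, and `transform` names are hypothetical, all numbers are invented, durations are counted in frames, the phoneme symbol alone stands in for the linguistic-contextual identity, and the spectrum adjustment is applied per coefficient. An affine factor is used because it covers the additive case (scale of 1) and the linear case (offset of 0) named in the dependent claims.

```python
from typing import NamedTuple


class Affine(NamedTuple):
    """Adjustment factor y = scale * x + offset (hypothetical helper)."""
    scale: float = 1.0
    offset: float = 0.0

    def apply(self, x: float) -> float:
        return self.scale * x + self.offset


class Segment(NamedTuple):
    phoneme: str
    duration: float  # adjusted duration, in frames
    f0: list         # one adjusted F0 value per frame
    spectrum: list   # one adjusted coefficient vector per frame


# Toy neutral predictions per phoneme (values invented).
NEUTRAL = {
    "a": {"duration": 10.0, "f0": 120.0, "spectrum": [1.0, 0.5]},
    "m": {"duration": 4.0, "f0": 100.0, "spectrum": [0.2, 0.1]},
}

# Toy adjustment factors keyed by (emotion type, phoneme context).
ADJUST = {
    ("happy", "a"): {"duration": Affine(scale=0.8),
                     "f0": Affine(offset=20.0),
                     "spectrum": Affine(scale=1.5)},
    ("happy", "m"): {"duration": Affine(offset=-1.0),
                     "f0": Affine(),
                     "spectrum": Affine()},
}


def transform(phonemes, emotion):
    """Apply duration, then F0 and spectrum adjustments, per phoneme."""
    segments = []
    for p in phonemes:
        neutral, adj = NEUTRAL[p], ADJUST[(emotion, p)]
        # Step 1: adjust the neutral duration.
        duration = adj["duration"].apply(neutral["duration"])
        frames = round(duration)
        # Step 2: neutral F0 and spectrum over the adjusted duration,
        # each adjusted by its own emotion-dependent factor.
        f0 = [adj["f0"].apply(neutral["f0"]) for _ in range(frames)]
        spectrum = [[adj["spectrum"].apply(c) for c in neutral["spectrum"]]
                    for _ in range(frames)]
        segments.append(Segment(p, duration, f0, spectrum))
    return segments


segments = transform(["a", "m"], "happy")
```

In this toy run, the "a" segment shrinks from 10 to 8 frames and its F0 rises by 20 Hz on every frame, while the "m" segment loses one frame and keeps its neutral F0 and spectrum. A vocoder stage, as in the dependent claims, would consume the resulting per-frame values; it is omitted here.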

Assignees

Microsoft Corp, Microsoft Technology Licensing LLC

Inventors

Classifications

  • G10L13/033 (Primary)

    Voice editing, e.g. manipulating the voice of the synthesiser

  • G10L13/027 (Primary)

    Concept to speech synthesisers; Generation of natural phrases from machine-based concepts (generation of parameters for speech synthesis out of text G10L13/08)

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9824681B2 cover?
Techniques for converting text to speech having emotional content. In an aspect, an emotionally neutral acoustic trajectory is predicted for a script using a neutral model, and an emotion-specific acoustic trajectory adjustment is independently predicted using an emotion-specific model. The neutral trajectory and emotion-specific adjustments are combined to generate a transformed speech output …
Who is the assignee on this patent?
Microsoft Corp, Microsoft Technology Licensing LLC
What technology area does this patent fall under?
Primary CPC classification G10L13/033. Mapped technology areas include Physics.
When was this patent published?
Publication date: Nov 21, 2017 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).