Method and system for achieving emotional text to speech utilizing emotion tags assigned to text data

US9117446B2 · US · B2

Patent metadata
Field: Value
Publication number: US-9117446-B2
Application number: US-201113221953-A
Country: US
Kind code: B2
Filing date: Aug 31, 2011
Priority date: Aug 31, 2010
Publication date: Aug 25, 2015
Grant date: Aug 25, 2015

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method and system for achieving emotional text to speech. The method includes: receiving text data; generating emotion tag for the text data by a rhythm piece; and achieving TTS to the text data corresponding to the emotion tag, where the emotion tags are expressed as a set of emotion vectors; where each emotion vector includes a plurality of emotion scores given based on a plurality of emotion categories. A system for the same includes: a text data receiving module; an emotion tag generating module; and a TTS module for achieving TTS, wherein the emotion tag is expressed as a set of emotion vectors; and wherein emotion vector includes a plurality of emotion scores given based on a plurality of emotion categories.
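
As an illustration of the data structure the abstract describes, the following minimal sketch represents the emotion tag of each rhythm piece as a vector of scores, one per emotion category, and then picks the highest-scoring category. The category names, the scores, the sample text, and the helper `final_emotion` are all hypothetical; the abstract does not fix a category set, and taking the maximum is only one way to arrive at a final score.

```python
from typing import Dict, List, Tuple

# Hypothetical category set; the abstract does not enumerate one.
CATEGORIES = ("neutral", "happy", "sad", "angry")

# An emotion vector maps each emotion category to its score.
EmotionVector = Dict[str, float]


def final_emotion(vector: EmotionVector) -> Tuple[str, float]:
    """Return the (category, score) pair with the highest score."""
    category = max(vector, key=vector.get)
    return category, vector[category]


# One emotion vector per rhythm piece (illustrative values only).
rhythm_pieces: List[Tuple[str, EmotionVector]] = [
    ("What a", {"neutral": 0.20, "happy": 0.60, "sad": 0.10, "angry": 0.10}),
    ("terrible day", {"neutral": 0.10, "happy": 0.05, "sad": 0.70, "angry": 0.15}),
]

# Each tag is (text of the rhythm piece, final category, final score).
tags = [(text, *final_emotion(vec)) for text, vec in rhythm_pieces]

for text, category, score in tags:
    print(f"{text!r}: {category} ({score:.2f})")
```

Keeping the full vector for every rhythm piece, rather than a single label, is what would let later steps such as context adjustment or smoothing revisit the lower-scoring categories.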

First claim

Opening claim text (preview).

The invention claimed is:

1. A method for achieving emotional Text To Speech (TTS), the method comprising: receiving a set of text data; organizing each of a plurality of words in the set of text data into a plurality of rhythm pieces; generating an emotion tag for each of the plurality of rhythm pieces, wherein each emotion tag is expressed as a set of emotion vectors, each emotion vector comprising a plurality of emotion scores, where each of the plurality of emotion scores is assigned to a different emotion category in a plurality of emotion categories; determining, for each of the plurality of rhythm pieces, a final emotion score for the rhythm piece based on at least each of the plurality of emotion scores; determining, for each of the plurality of rhythm pieces, a final emotional category for the rhythm piece based on at least each of the plurality of emotion categories; and performing, by at least one processor of at least one computing device, TTS of the set of text data utilizing each of the emotion tags, where performing TTS comprises decomposing at least one rhythm piece in the plurality of rhythm pieces into a set of phones; and determining for each of the set of phones a speech feature based on: $F_i = (1 - P_{\text{emotion}}) \cdot F_{i\text{-neutral}} + P_{\text{emotion}} \cdot F_{i\text{-emotion}}$ wherein: $F_i$ is a value of an $i$-th speech feature of one of the plurality of phones, $P_{\text{emotion}}$ is the final emotion score of the rhythm piece where one of the plurality of phones lies, $F_{i\text{-neutral}}$ is a first speech feature value of an $i$-th speech feature in a neutral emotion category, and $F_{i\text{-emotion}}$ is a second speech feature value of an $i$-th speech feature in the final emotion category.

2. The method according to claim 1, wherein determining the final emotion score comprises: designating the final emotion score as an emotion score in the plurality of emotion scores comprising.

3. The method according to claim 1, further comprising: adjusting, for at least one of the plurality of rhythm pieces, at least one emotion score in the plurality of emotion scores according to a context of the rhythm piece; and determining the final emotion score and the final emotion category of the rhythm piece based on the plurality of emotion scores comprising the at least one emotion score that has been adjusted.

4. The method according to claim 3, wherein adjusting the at least one emotion score further comprises: adjusting the at least one emotion score based on an emotion vector adjustment decision tree, wherein the emotion vector adjustment decision tree is established based on emotion vector adjustment training data.

5. The method according to claim 1, further comprising: applying emotion smoothing to the set of text data based on the emotion tags generated for the plurality of rhythm pieces.

6. The method according to claim 5, wherein applying emotion smoothing comprises: obtaining an adjacent probability that a first emotion category associated with a first of the plurality of rhythm pieces is connected to a second emotion category of a second of the plurality of rhythm pieces that is adjacent to the first of the plurality of rhythm pieces; determining a final emotion path of the set of text data based on the adjacent probability and a plurality of emotion scores of corresponding emotion categories; and determining the final emotion category of each of the plurality of rhythm pieces based on the final emotion path.

7. The method according to claim 6, further comprising: determining the final emotion score from the final emotion category, wherein the final emotion score has a highest value in the plurality of emotion scores.

8. The method according to claim 6, wherein obtaining an adjacent probability further comprises: performing a statistical analysis on emotion adjacent training data, wherein the statistical analysis records a number of times where at least two of the plurality of emotion categories had been adjacent in the emotion adjacent training data.

9. The method according to claim 8, further comprising: expanding the emotion adjacent training data based on the formed final emotion path.

10. The method according to claim 8, further comprising: expanding the emotion adjacent training data by connecting at least one of the plurality of emotion categories with a highest value in the plurality of emotion scores.

11. The method according to claim 1, wherein determining for each of the set of phones a speech feature further comprises: determining if the final emotion score of the rhythm piece where the phone lies is greater than a certain threshold, based on: $F_i = F_{i\text{-emotion}}$.

12. The method according to claim 1, wherein determining for each of the set of phones a speech feature further comprises: determining if the final emotion score of the rhythm piece where one the phone lies is smaller than a certain threshold, based on: $F_i = F_{i\text{-neutral}}$.

13. The method according to claim 1, wherein the speech feature comprises at least one of: a basic frequency feature, a frequency spectrum feature, a time length feature, and a combination thereof.

14. A system for achieving emotional Text To Speech (TTS), comprising: at least one memory; and at least one processor communicatively coupled to the at least one memory, the at least one processor configured to perform a method comprising: receiving a set of text data; organizing the set of text data into a plurality of rhythm pieces; generating an emotion tag for each of the plurality of rhythm pieces, wherein each emotion tag is expressed as a set of emotion vectors, each emotion vector comprising a plurality of emotion scores, where each of the plurality of emotion scores is assigned to a different emotion category in a plurality of emotion categories; determining, for each of the plurality of rhythm pieces, a final emotion score for the rhythm piece based on at least each of the plurality of emotion scores; determining, for each of the plurality of rhythm pieces, a final emotional category for the rhythm piece based on at least each of the plurality of emotion categories; and performing, TTS of the set of text data utilizing each of the emotion tags, where performing TTS comprises decomposing at least one rhythm piece in the plurality of rhythm pieces into a set of phones; and determining for each of the set of phones a speech feature based on: $F_i = (1 - P_{\text{emotion}}) \cdot F_{i\text{-neutral}} + P_{\text{emotion}} \cdot F_{i\text{-emotion}}$ wherein: $F_i$ is a value of an $i$-th speech feature of one of the plurality of phones, $P_{\text{emotion}}$ is the final emotion score of the rhythm piece where one of the plurality of phones lies, $F_{i\text{-neutral}}$ is a first speech feature value of an $i$-th speech feature in a neutral emotion category, and $F_{i\text{-emotion}}$ is a second speech feature value of an $i$-th speech feature in the final emotion category.

15. The system of claim 14, wherein determining the final emotion score comprises: designating the final emotion score as an emotion score in the plurality of emotion scores comprising a highest value.

16. The system of claim 14, wherein the method further comprises: adjusting, for at least one of the plurality of rhythm pieces, at least one emotion score in the plurality of emotion scores according to a context of the rhythm piece; and determining the final emotion score and the final emotion category of the rhythm piece based on the plurality of emotion scores comprising the at least one emotion score that has been adjusted.
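
The speech-feature formula recited in claims 1 and 14, together with the threshold cases of claims 11 and 12, can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `blend_feature`, the optional `upper` and `lower` threshold parameters, and the sample pitch values are hypothetical, and the claims only refer to "a certain threshold" without giving a value.

```python
from typing import Optional


def blend_feature(
    f_neutral: float,
    f_emotion: float,
    p_emotion: float,
    upper: Optional[float] = None,
    lower: Optional[float] = None,
) -> float:
    """Blend the neutral and emotional values of one speech feature.

    f_neutral: feature value in the neutral emotion category
    f_emotion: feature value in the final emotion category
    p_emotion: final emotion score of the rhythm piece containing the phone
    upper, lower: assumed optional thresholds (values are not in the claims)
    """
    # Above the upper threshold, use the emotional value as-is (claim 11).
    if upper is not None and p_emotion > upper:
        return f_emotion
    # Below the lower threshold, use the neutral value as-is (claim 12).
    if lower is not None and p_emotion < lower:
        return f_neutral
    # Otherwise interpolate linearly (claims 1 and 14).
    return (1.0 - p_emotion) * f_neutral + p_emotion * f_emotion


# Illustrative fundamental-frequency values in Hz for a single phone.
print(blend_feature(200.0, 260.0, 0.5))             # plain interpolation
print(blend_feature(200.0, 260.0, 0.9, upper=0.8))  # emotional value only
print(blend_feature(200.0, 260.0, 0.1, lower=0.2))  # neutral value only
```

With a score of 0.5 the result sits halfway between the two feature values, so a weakly scored emotion only shifts the synthesized feature part of the way from its neutral value.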

Assignees

Inventors

Classifications

  • G10L13/10 (Primary)

    Prosody rules derived from text; Stress or intonation · CPC title

  • Methods for producing synthetic speech; Speech synthesisers · CPC title

  • Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9117446B2 cover?
A method and system for achieving emotional text to speech. The method includes: receiving text data; generating emotion tag for the text data by a rhythm piece; and achieving TTS to the text data corresponding to the emotion tag, where the emotion tags are expressed as a set of emotion vectors; where each emotion vector includes a plurality of emotion scores given based on a plurality of emoti…
Who is the assignee on this patent?
Bao Shenghua, Chen Jian, Qin Yong, and 6 more
What technology area does this patent fall under?
Primary CPC classification G10L13/10. Mapped technology areas include Physics.
When was this patent published?
Publication date Aug 25, 2015 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).