Who is the assignee on this patent?

Microsoft Technology Licensing, LLC

What technology area does this patent fall under?

Primary CPC classification G10L13/08. Mapped technology areas include Physics.

When was this patent published?

Publication date Oct 18, 2016 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Voice font speaker and prosody interpolation

US9472182B2 · US · B2

Patent metadata
Field	Value
Publication number	US-9472182-B2
Application number	US-201414190875-A
Country	US
Kind code	B2
Filing date	Feb 26, 2014
Priority date	Feb 26, 2014
Publication date	Oct 18, 2016
Grant date	Oct 18, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Multi-voice font interpolation is provided. A multi-voice font interpolation engine allows the production of computer generated speech with a wide variety of speaker characteristics and/or prosody by interpolating speaker characteristics and prosody from existing fonts. Using prediction models from multiple voice fonts, the multi-voice font interpolation engine predicts values for the parameters that influence speaker characteristics and/or prosody for the phoneme sequence obtained from the text to spoken. For each parameter, additional parameter values are generated by a weighted interpolation from the predicted values. Modifying an existing voice font with the interpolated parameters changes the style and/or emotion of the speech while retaining the base sound qualities of the original voice. The multi-voice font interpolation engine allows the speaker characteristics and/or prosody to be transplanted from one voice font to another or entirely new speaker characteristics and/or prosody to be generated for an existing voice font.
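
The weighted interpolation described in the abstract can be illustrated with a minimal sketch. It assumes each source voice font has already predicted one value per position (for example, an f0 value per frame) and that all fonts predicted the same number of positions. The function name `interpolate` and the sample numbers are hypothetical and do not come from the patent.

```python
from typing import List, Sequence


def interpolate(predicted: Sequence[Sequence[float]],
                weights: Sequence[float]) -> List[float]:
    """Weighted sum across fonts.

    predicted[i][t] is the value font i predicted at position t;
    weights[i] is the weight assigned to font i for this parameter.
    """
    if not predicted or len(predicted) != len(weights):
        raise ValueError("need exactly one weight per source voice font")
    length = len(predicted[0])
    if any(len(p) != length for p in predicted):
        raise ValueError("all fonts must predict the same number of values")
    # Multiply each font's prediction by that font's weight, then sum per position.
    return [sum(w * p[t] for w, p in zip(weights, predicted))
            for t in range(length)]


# Hypothetical f0 predictions (Hz) from two fonts for two frames.
font_a = [100.0, 120.0]
font_b = [200.0, 220.0]
blended = interpolate([font_a, font_b], [0.75, 0.25])
print(blended)  # [125.0, 145.0]
```

With weights 0.75 and 0.25 the result stays closer to the first font. Whether the weights must sum to 1 is not stated in the text shown on this page, so the sketch does not enforce it.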

First claim

Opening claim text (preview).

What is claimed is:

1. A method allowing computer-generated speech to be rendered with a multi-voice font that is different than source voice fonts used to generate the multi-voice font, the method comprising the acts of: loading the source voice fonts; assigning weights to characteristics of each source voice font; obtaining text to be rendered as the computer-generated speech; predicting characteristic values for the text for each source voice font using at least one characteristic prediction model associated with each source voice font; merging the predicted characteristic values with the corresponding weights to produce interpolated characteristic values; and rendering the text as computer-generated speech having the interpolated characteristic values.

2. The method of claim 1 wherein the act of merging the predicted characteristic values with the corresponding weights further comprises the acts of: multiplying the predicted characteristic values by the weight for the characteristic given to the source voice font used to predict the predicted characteristic values; and summing the weighted characteristic values to produce the interpolated characteristic values.

3. The method of claim 1 wherein the act of assigning weights to characteristics of each source voice font further comprises the acts of: assigning a duration weight to each source voice font; assigning a f0 weight to each source voice font; and assigning a spectrum weight to each source voice font.

4. The method of claim 3 wherein the act of predicting characteristic values for the text using each source voice font further comprises the act of predicting duration values, voiced/unvoiced probability values, f0 values, and spectral trajectory values for the text using each source voice font.

5. The method of claim 3 wherein the act of obtaining text to be rendered as the computer-generated speech further comprises the act of parsing the text into a sequence of phonemes dividable into frames.

6. The method of claim 5 wherein the act of predicting characteristic values for the text using each source voice font further comprises the acts of: predicting a duration value and a voiced/unvoiced probability value for each phoneme using each source voice font; and predicting a f0 value and a spectral trajectory value for each frame of each phoneme using each source voice font.

7. The method of claim 5 wherein the act of merging the predicted characteristic values with the corresponding weights further comprises the acts of: interpolating a duration value for each phoneme using the duration weights; interpolating an voiced/unvoiced probability value for each phoneme using the spectrum weights; making a voiced/unvoiced decision for each phoneme based on the voiced/unvoiced probability value for that phoneme; interpolating a f0 value for each phoneme using the f0 weight; normalizing the f0 values; and interpolating a spectral trajectory value for each phoneme using the spectrum weights.

8. The method of claim 7 wherein: the act of interpolating a duration value for each phoneme using the duration weights further comprises the acts of: multiplying the predicted duration values predicted using each source voice font by the corresponding duration weight to produce weighted duration values; and summing the weighted duration values from each source voice font for each phoneme to produce an interpolated duration value for that phoneme; the act of interpolating a voiced/unvoiced probability value for each phoneme using the spectrum weights further comprises the acts of: multiplying the predicted voiced/unvoiced probability values predicted using each source voice font by the corresponding spectrum weight to produce weighted voiced/unvoiced probability values; and summing the weighted voiced/unvoiced probability values from each source voice font for each phoneme to produce an interpolated voiced/unvoiced probability value for that phoneme; the act of interpolating a f0 value for each phoneme using the f0 weights further comprises the acts of: multiplying the predicted f0 values predicted using each source voice font by the corresponding f0 weight to produce weighted f0 values; and summing the corresponding weighted f0 values from each source voice font for each frame to produce an interpolated f0 value for that frame; and the act of interpolating spectral trajectory value for each phoneme using the spectrum weights further comprises the acts of: multiplying the predicted spectral trajectory values predicted using each source voice font by the corresponding spectrum weight to produce weighted spectral trajectory values; and summing the corresponding weighted spectral trajectory values from each source voice font for each frame to produce an interpolated spectral trajectory value for that frame.

9. The method of claim 1 further comprising the act of synthesizing the text as computer-generated speech using the interpolated characteristic values.

10. The method of claim 1 further comprising the act of storing a multi-voice font definition specifying the source voice fonts used to generate the multi-voice font and linking each source voice fonts with the characteristic weights assigned to that source voice font.

11. A system for generating a multi-voice font from a plurality of source voice fonts, the system comprising: a phoneme sequencer for parsing input text into a sequence of phonemes; a predictor operable to predict values of voice font characteristics for the phonemes for each source voice font of the plurality of source voice fonts using at least one characteristic model associated with each source voice font; a weight selector operable to assign a duration weight, a f0 weight, and a spectrum weight to each source voice font, the duration weight, the f0 weight, and the spectrum weight determining the relative contribution of the voice font characteristics predicted for the corresponding source voice font to the multi-voice font; an interpolator operable to merge the predicted voice font characteristics with the weights to produce the multi-voice font having voice font characteristics derived from the source voice fonts; and a voice encoder operable to render the input text as computer-generated speech using the multi-voice font, the computer-generated speech having the voice font characteristics derived from the source voice fonts.

12. The system of claim 11 wherein the predictor further comprises: a duration predictor operable to predict duration values for each phoneme using a duration prediction model provided by each source voice font; a f0 predictor operable to predict f0 values for each phoneme using a f0 prediction model provided by each source voice font; a spectral trajectory predictor operable to predict spectral trajectory values for each phoneme using a spectrum prediction model provided by each source voice font; and a voiced/unvoiced probability predictor operable to predict voiced/unvoiced probability values for each phoneme using a voiced/unvoiced decision model provided by each source voice font.

13. The system of claim 11 wherein the interpolator further comprises: a duration interpolator operable to merge the predicted duration values for each phoneme with the duration weight for the predicting source voice font to produce an interpolated duration value for each phoneme; a f0 interpolator operable to merge the predicted f0 values for each phoneme with the f0 weight for the predicting source voice font to produce an interpolated f0 value for each frame of the phoneme; a spectral trajectory interpolator operable to merge
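
The merging steps recited in claims 3, 7, and 8 can be sketched for a single phoneme as follows. This is an illustration under stated assumptions, not the patented implementation: the class and function names are hypothetical, the 0.5 voiced/unvoiced threshold is an assumption, all fonts are assumed to predict the same number of frames for the phoneme, and the f0 normalization step of claim 7 is omitted because the preview text does not say how it is performed.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class FontWeights:
    """Per-font weights, one per characteristic (claim 3)."""
    duration: float
    f0: float
    spectrum: float


@dataclass
class PhonemePrediction:
    """One font's predictions for one phoneme (hypothetical layout)."""
    duration: float               # one value per phoneme
    voiced_prob: float            # one value per phoneme
    f0: List[float]               # one value per frame
    spectrum: List[List[float]]   # one vector per frame


def weighted_sum(values: Sequence[float], weights: Sequence[float]) -> float:
    # Multiply each font's value by that font's weight, then sum (claim 2).
    return sum(w * v for w, v in zip(weights, values))


def interpolate_phoneme(preds: List[PhonemePrediction],
                        weights: List[FontWeights],
                        voiced_threshold: float = 0.5) -> Dict[str, object]:
    if not preds or len(preds) != len(weights):
        raise ValueError("need exactly one FontWeights per source voice font")
    dur_w = [w.duration for w in weights]
    f0_w = [w.f0 for w in weights]
    spec_w = [w.spectrum for w in weights]
    n_frames = len(preds[0].f0)
    n_dims = len(preds[0].spectrum[0])
    # The voiced/unvoiced probability uses the spectrum weights (claims 7 and 8).
    voiced_prob = weighted_sum([p.voiced_prob for p in preds], spec_w)
    return {
        "duration": weighted_sum([p.duration for p in preds], dur_w),
        "voiced_prob": voiced_prob,
        "voiced": voiced_prob >= voiced_threshold,  # threshold is an assumption
        "f0": [weighted_sum([p.f0[t] for p in preds], f0_w)
               for t in range(n_frames)],
        "spectrum": [[weighted_sum([p.spectrum[t][d] for p in preds], spec_w)
                      for d in range(n_dims)]
                     for t in range(n_frames)],
    }


# Hypothetical example: take duration entirely from font A, blend f0 evenly,
# and take most of the spectrum from font B.
font_a = PhonemePrediction(80.0, 1.0, [100.0, 110.0], [[1.0, 2.0], [3.0, 4.0]])
font_b = PhonemePrediction(120.0, 0.5, [200.0, 210.0], [[5.0, 6.0], [7.0, 8.0]])
result = interpolate_phoneme(
    [font_a, font_b],
    [FontWeights(duration=1.0, f0=0.5, spectrum=0.25),
     FontWeights(duration=0.0, f0=0.5, spectrum=0.75)],
)
print(result["duration"], result["voiced"], result["f0"])
```

Keeping separate duration, f0, and spectrum weights per font is what lets prosody come from one font while the spectral character comes from another. A list of `FontWeights` paired with font identifiers would also resemble the stored multi-voice font definition of claim 10, though the storage format is not described in the text shown here.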

Assignees

Microsoft Technology Licensing, LLC

Inventors

Classifications

G10L13/08 (Primary)
Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
G06F3/04847
Interaction techniques to control parameter settings, e.g. interaction with sliders or dials
G10L13/0335 (Primary)
Pitch control
G10L13/033
Voice editing, e.g. manipulating the voice of the synthesiser
G06F3/0482
Interaction with lists of selectable items, e.g. menus

Patent family

Related publications grouped by family.

Patent family 52596637
