System and method for automatic prediction of speech suitability for statistical modeling

US9484045B2 · US · B2

Patent metadata
Field: Value
Publication number: US-9484045-B2
Application number: US-201213606618-A
Country: US
Kind code: B2
Filing date: Sep 7, 2012
Priority date: Sep 7, 2012
Publication date: Nov 1, 2016
Grant date: Nov 1, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An embodiment according to the invention provides a capability of automatically predicting how favorable a given speech signal is for statistical modeling, which is advantageous in a variety of different contexts. In Multi-Form Segment (MFS) synthesis, for example, an embodiment according to the invention uses prediction capability to provide an automatic acoustic driven template versus model decision maker with an output quality that is high, stable and depends gradually on the system footprint. In speaker selection for a statistical Text-to-Speech synthesis (TTS) system build, as another example context, an embodiment according to the invention enables a fast selection of the most appropriate speaker among several available ones for the full voice dataset recording and preparation, based on a small amount of recorded speech material.
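As a rough illustration of the two decisions the abstract describes, the following Python sketch assumes that a modelability score between 0 and 1 is already available for each utterance or segment. The function names, the threshold value, and the example scores are hypothetical and not taken from the patent; averaging the per-utterance scores of each speaker is one plausible aggregation, not necessarily the one the inventors use.

```python
from statistics import mean


def select_speaker(scores_by_speaker):
    """Pick the speaker whose sample utterances have the highest mean score.

    scores_by_speaker maps a speaker name to a list of modelability scores,
    one per short recorded utterance (assumed to lie in [0, 1]).
    """
    if not scores_by_speaker:
        raise ValueError("no speakers to choose from")
    return max(scores_by_speaker, key=lambda name: mean(scores_by_speaker[name]))


def choose_representation(score, threshold=0.5):
    """Template-versus-model decision for one segment in MFS synthesis.

    Assumption: segments that score at or above the threshold are considered
    well suited to a statistical model; the rest stay as stored templates.
    """
    return "model" if score >= threshold else "template"


if __name__ == "__main__":
    # Hypothetical scores from a small amount of recorded material per speaker.
    candidates = {
        "speaker_a": [0.62, 0.71, 0.66],
        "speaker_b": [0.81, 0.77, 0.79],
        "speaker_c": [0.55, 0.90, 0.60],
    }
    print(select_speaker(candidates))
    print([choose_representation(s) for s in (0.2, 0.5, 0.9)])
```

Under these assumptions, raising the threshold keeps more segments as templates, which is one way the template share of a voice, and with it the system footprint, could be adjusted gradually.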

First claim

Opening claim text (preview).

What is claimed is:

1. A computer server system for automatically determining suitability of at least a portion of a speech signal, comprising voice data, for statistical modeling, the system comprising: a memory storing computer code instructions thereon; and a processor, the memory, with the computer code instructions, and the processor being configured to cause the computer server system to implement: a modelability estimator configured to: determine a statistical modelability score of the at least a portion of the speech signal comprising voice data, the statistical modelability score indicating favorability of the at least a portion of the speech signal for statistical modeling in terms of human perception and based at least in part on determining a temporal stationarity of the at least a portion of the speech signal comprising voice data; and forward the statistical modelability score to a speech synthesis system executed by the processor, wherein the speech synthesis system is configured to utilize the modelability score in converting text to speech; and a decision maker configured to determine a preferred speaker selection for use by the speech synthesis system in building a statistical text-to-speech system based on the statistical modelability score determined for speech provided by each of a plurality of speakers.

2. The computer server system according to claim 1, wherein the modelability estimator is further configured to determine the temporal stationarity based on variability of an instantaneous spectrum of the at least a portion of the speech signal.

3. The computer server system according to claim 2, wherein the modelability estimator is still further configured to determine the variability of the instantaneous spectrum based on (i) a first moment of an instantaneous spectrum component distribution and (ii) a second moment of the instantaneous spectrum component distribution.

4. The computer server system according to claim 1, wherein the decision maker is further configured to: determine a segment representation type to be used by the speech synthesis system in a multi-form segment speech synthesis based on the statistical modelability score.

5. The computer server system according to claim 4, wherein the modelability estimator is further configured to determine the statistical modelability score for at least one segment comprising at least a portion of an output speech signal being synthesized, and wherein the decision maker is further configured to determine the segment representation type, for the at least one segment, based on at least the statistical modelability score for the at least one segment.

6. The computer server system according to claim 4, wherein the modelability estimator is further configured to determine for at least one segment comprising at least a portion of an output speech signal being synthesized, the statistical modelability score for a segment cluster that includes the at least one segment, and wherein the decision maker is further configured to determine the segment representation type, for the at least one segment, based on at least the statistical modelability score of the segment cluster that includes the at least one segment.

7. The computer server system according to claim 4, further comprising a templates pruner configured to remove from a voice dataset at least one segment relative to its statistical modelability score.

8. The computer server system according to claim 4, wherein the statistical modelability score is further based at least in part on a loudness score.

9. A computerized method of automatically determining, by a server, suitability of at least a portion of a speech signal, comprising voice data, for statistical modeling, the computerized method comprising: determining a statistical modelability score of the at least a portion of the speech signal comprising voice data, the statistical modelability score indicating favorability of the at least a portion of the speech signal for statistical modeling in terms of human perception and based at least in part on a temporal stationarity of the at least a portion of the speech signal comprising voice data; forwarding the statistical modelability score to a speech synthesis system implemented by the server, wherein the speech synthesis system is configured to utilize the modelability score in converting text to speech; and determining a preferred speaker selection for use by the speech synthesis system in building a statistical text-to-speech system based on the statistical modelability score determined for speech provided by each of a plurality of speakers.

10. The computerized method according to claim 9, wherein the temporal stationarity is determined based on variability of an instantaneous spectrum of the at least a portion of the speech signal.

11. The computerized method according to claim 10, wherein the variability of the instantaneous spectrum is determined based on (i) a first moment of an instantaneous spectrum component distribution and (ii) a second moment of the instantaneous spectrum component distribution.

12. The computerized method according to claim 9, wherein the method comprises determining a segment representation type to be used by the speech synthesis system in a multi-form segment speech synthesis system based on the statistical modelability score.

13. The computerized method according to claim 12, further comprising: determining the statistical modelability score for at least one segment comprising at least a portion of an output speech signal being synthesized; and determining the segment representation type, for the at least one segment, based on at least the statistical modelability score for the at least one segment.

14. The computerized method according to claim 12, further comprising: determining, for at least one segment comprising at least a portion of an output speech signal being synthesized, the statistical modelability score for a segment cluster that includes the at least one segment; and determining the segment representation type, for the at least one segment based on at least the statistical modelability score of the segment cluster that includes the at least one segment.

15. The computerized method according to claim 14, further comprising removing from a voice dataset at least one segment relative to its statistical modelability score.

16. The computerized method according to claim 12, further comprising determining the statistical modelability score based at least in part on a loudness score.

17. A non-transitory computer-readable storage medium having computer-readable code stored thereon, which, when executed by a computer processor, causes the computer processor to automatically determine suitability of at least a portion of a speech signal, comprising voice data, for statistical modeling, by causing the processor to: determine a statistical modelability score of the at least a portion of the speech signal comprising voice data, the statistical modelability score indicating favorability of the at least a portion of the speech signal for statistical modeling in terms of human perception and the statistical modelability score being based at least in part on a temporal stationarity of the at least a portion of the speech signal comprising voice data; forward the statistical modelability score to a speech synthesis system executed by the processor, wherein the speech synthesis system is configured to utilize the modelability score in converting text to speech; and determine a preferred speaker selection
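The claims tie temporal stationarity to the variability of the instantaneous spectrum, measured through the first and second moments of a spectrum component distribution. The claim text shown here does not give a formula, so the Python sketch below is only one assumed reading: for each frequency bin it takes the mean (first moment) and mean square (second moment) of the magnitude across frames, derives a standard deviation from them, and maps the level-normalized spread to a score in (0, 1]. The frame length, hop size, normalization, and both function names are hypothetical choices, and a naive DFT stands in for whatever spectral analysis a real system would use.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitude of a naive DFT of one frame, for bins 0..N/2."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def stationarity_score(signal, frame_len=64, hop=32, eps=1e-9):
    """Assumed stationarity measure: near 1.0 when the spectrum barely changes."""
    starts = range(0, len(signal) - frame_len + 1, hop)
    spectra = [magnitude_spectrum(signal[s:s + frame_len]) for s in starts]
    if len(spectra) < 2:
        raise ValueError("need at least two frames")
    n_frames = len(spectra)
    spread = 0.0  # summed per-bin standard deviation across frames
    level = 0.0   # summed per-bin mean magnitude across frames
    for k in range(len(spectra[0])):
        values = [spec[k] for spec in spectra]
        m1 = sum(values) / n_frames                 # first moment
        m2 = sum(v * v for v in values) / n_frames  # second moment
        spread += math.sqrt(max(m2 - m1 * m1, 0.0))
        level += m1
    variability = spread / (level + eps)
    return 1.0 / (1.0 + variability)


if __name__ == "__main__":
    # A steady tone versus a tone that jumps in frequency halfway through.
    steady = [math.sin(2 * math.pi * 8 * t / 64) for t in range(512)]
    changing = [math.sin(2 * math.pi * (4 if t < 256 else 16) * t / 64)
                for t in range(512)]
    print(stationarity_score(steady), stationarity_score(changing))
```

With this assumed measure the steady tone scores higher than the tone that changes frequency, which matches the intuition that temporally stationary speech is easier to represent with a statistical model. The claims also mention a loudness score as a further input, which this sketch leaves out.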

Assignees

Inventors

Classifications

  • Details of speech synthesis systems, e.g. synthesiser structure or memory management · CPC title

  • G10L25/48 · Primary

    specially adapted for particular use · CPC title

  • the extracted parameters being spectral information of each sub-band · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9484045B2 cover?
An embodiment according to the invention provides a capability of automatically predicting how favorable a given speech signal is for statistical modeling, which is advantageous in a variety of different contexts. In Multi-Form Segment (MFS) synthesis, for example, an embodiment according to the invention uses prediction capability to provide an automatic acoustic driven template versus model d…
Who is the assignee on this patent?
Listed names: Sorin Alexander, Shechtman Slava, Pollet Vincent, and 1 more. These are names of individuals; no corporate assignee is shown on this page.
What technology area does this patent fall under?
Primary CPC classification G10L25/48. Mapped technology areas include Physics.
When was this patent published?
Publication date Nov 1, 2016 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).