Text to speech processing system and method, and an acoustic model training system and method

US10140972B2 · US · B2

Patent metadata
Field: Value
Publication number: US-10140972-B2
Application number: US-201414466340-A
Country: US
Kind code: B2
Filing date: Aug 22, 2014
Priority date: Aug 23, 2013
Publication date: Nov 27, 2018
Grant date: Nov 27, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method of training an acoustic model for a text-to-speech system, the method comprising: receiving speech data; said speech data comprising data corresponding to different values of a first speech factor, and wherein said speech data is unlabeled, such that for a given item of speech data, the value of said first speech factor is unknown; clustering said speech data according to the value of said first speech factor into a first set of clusters; and estimating a first set of parameters to enable the acoustic model to accommodate speech for the different values of the first speech factor, wherein said clustering and said first parameter estimation are jointly performed according to a common maximum likelihood criterion.
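
The abstract's central idea, that clustering unlabeled data and estimating parameters can both be driven by one maximum likelihood criterion, can be illustrated with a toy example. The sketch below is not the patented training method: it assumes a one-dimensional Gaussian mixture fitted by expectation-maximization on synthetic numbers, and every name in it (`em_fit`, `log_likelihood`, the sample data) is hypothetical. It only shows the general pattern in which the soft cluster assignment and the parameter update both serve the same log-likelihood.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    """The single criterion that both steps in em_fit try to increase."""
    return sum(
        math.log(sum(w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in data
    )


def em_fit(data, n_clusters=2, n_iters=30):
    """Fit a 1-D Gaussian mixture to unlabeled data (needs n_clusters >= 2)."""
    lo, hi = min(data), max(data)
    # Assumption: spread the initial means evenly across the data range.
    means = [lo + (hi - lo) * k / (n_clusters - 1) for k in range(n_clusters)]
    grand_mean = sum(data) / len(data)
    grand_var = sum((x - grand_mean) ** 2 for x in data) / len(data)
    variances = [grand_var] * n_clusters
    weights = [1.0 / n_clusters] * n_clusters
    history = []
    for _ in range(n_iters):
        # E-step (clustering): soft-assign every item to the clusters.
        resp = []
        for x in data:
            joint = [w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step (parameter estimation): re-estimate each cluster's parameters.
        for k in range(n_clusters):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var_k, 1e-6)  # floor guards against a collapsed cluster
        history.append(log_likelihood(data, weights, means, variances))
    return weights, means, variances, history


# Synthetic, unlabeled data: two hidden "factor values" centred at -2 and 3.
rng = random.Random(0)
data = [rng.gauss(-2.0, 0.5) for _ in range(200)] + [rng.gauss(3.0, 0.5) for _ in range(200)]
weights, means, variances, history = em_fit(data)
print("cluster means:", [round(m, 2) for m in means])
print("final log-likelihood:", round(history[-1], 2))
```

No labels are passed to `em_fit`; the cluster memberships emerge from the same objective that sets the means and variances, which is the property the abstract emphasizes.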

First claim

Opening claim text (preview).

The invention claimed is:

1. A text-to-speech method configured to output speech having a target value of a speech factor, said method comprising: inputting audio data with said target value of a speech factor; adapting an acoustic model to said target value of a speech factor; inputting text; dividing said inputted text into a sequence of acoustic units; converting said sequence of acoustic units into a sequence of speech vectors using said acoustic model; and outputting said sequence of speech vectors as audio with said target value of a speech factor, wherein said acoustic model comprises a set of speech factor parameters that enable the acoustic model to accommodate speech for the different values of a speech factor, and wherein said set of speech factor parameters are unlabeled, such that for a given one or more parameters, the value of said speech factor to which they relate is unknown, and wherein adapting the acoustic model comprises adjusting the speech factor parameters to substantially match the target value of a speech factor, wherein said text to speech method includes training said acoustic model using a method comprising: receiving speech data, said speech data comprising data corresponding to different values of the speech factor, and wherein said speech data is unlabeled, such that for a given item of speech data, the value of said speech factor is unknown; clustering said speech data according to the value of said speech factor into a first set of clusters; and estimating a first set of parameters to enable the acoustic model to accommodate speech for the different values of the speech factor, wherein said clustering and said first parameter estimation are jointly performed according to a common maximum likelihood criterion.

2. The text to speech method of claim 1, wherein said speech factor is expression and the acoustic model further comprises a set of expression parameters relating to speaker and a set of clusters relating to speaker; and wherein said set of expression parameters and said set of speaker parameters and said set of expression clusters and said set of speaker clusters do not overlap, and wherein the method is configured to transplant an expression from a first speaker to a second speaker, by employing expression parameters obtained from the speech of a first speaker with that of a second speaker.

3. A text to speech method, the method comprising: receiving input text; dividing said inputted text into a sequence of acoustic units; converting said sequence of acoustic units to a sequence of speech vectors using an acoustic model, wherein said acoustic model comprises a set of speaker parameters and a set of speaker clusters relating to speaker voice and a set of expression parameters and a set of expression clusters relating to expression, and wherein the sets of speaker and expression parameters and the sets of speaker and expression clusters do not overlap; and outputting said sequence of speech vectors as audio, the method further comprising determining at least some of said parameters relating to expression by: extracting expressive features from said input text to form an expressive linguistic feature vector constructed in a first space; and mapping said expressive linguistic feature vector to an expressive synthesis feature vector which is constructed in a second space, wherein said text to speech method includes training said acoustic model using a method comprising: receiving speech data, said speech data further comprising speech data from one or more speakers speaking with neutral speech, said speech data comprising data corresponding to different values of a first speech factor, wherein the first speech factor is speaker and speech data corresponding to different values of a second speech factor, wherein the second speech factor is expression, and wherein said speech data is unlabeled, such that for a given item of speech data, the value of said first speech factor is unknown; clustering said speech data according to the value of said first speech factor into a first set of clusters and clustering said speech data according to the value of said second speech factor into a second set of clusters; and estimating a first set and a second set of parameters to enable the acoustic model to accommodate speech for the different values of the first speech factor and the second speech factor respectively, wherein said clustering and the parameter estimation are jointly performed according to a common maximum likelihood criterion which is common to both parameter estimation and said clustering.

4. The method of claim 3 wherein said second space is the acoustic space of a first speaker and the method is configured to transplant the expressive synthesis feature vector to the acoustic space of a second speaker.

5. A text to speech method, the method comprising: receiving input text; dividing said inputted text into a sequence of acoustic units; converting said sequence of acoustic units to a sequence of speech vectors using an acoustic model; and outputting said sequence of speech vectors as audio, the method further comprising determining at least some of said second set of parameters by: extracting expressive features from said input text to form an expressive linguistic feature vector constructed in a first space; and mapping said expressive linguistic feature vector to an expressive synthesis feature vector which is constructed in a second space, wherein said text to speech method includes training said acoustic model using a method comprising: receiving speech data, said speech data further comprising speech data from one or more speakers speaking with neutral speech, said speech data comprising data corresponding to different values of a first speech factor, wherein the first speech factor is speaker and speech data corresponding to different values of a second speech factor, wherein the second speech factor is expression, and wherein said speech data is unlabeled, such that for a given item of speech data, the value of said first speech factor is unknown; clustering said speech data according to the value of said first speech factor into a first set of clusters and clustering said speech data according to the value of said second speech factor into a second set of clusters; and estimating a first set and a second set of parameters to enable the acoustic model to accommodate speech for the different values of the first speech factor and the second speech factor respectively, wherein said clustering and the parameter estimation are jointly performed according to a common maximum likelihood criterion which is common to both parameter estimation and said clustering, and wherein said first and second set of parameters and said first and second set of clusters do not overlap.

6. A system configured to output speech having a target value of a speech factor, said system comprising: an audio input for receiving audio data with said target value of a speech factor; a text input for receiving text; and a processor configured to adapt an acoustic model to said target value of a speech factor; divide said inputted text into a sequence of acoustic units; convert said sequence of acoustic units into a sequence of speech vectors using said acoustic model; and output said sequence of speech vectors as audio with said target value of a speech factor, wherein said acoustic model comprises a first set of parameters that enable to acoustic model to accommodate speech for the different values of the speech factor; and wherein said first set of parameters are unlabeled, such that for a given one or more parameters, the value of said first speech factor is unknown, and wherein adapting the acoustic mod…
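
The claims describe a pipeline (text, then acoustic units, then speech vectors) in which speaker parameters and expression parameters are kept in non-overlapping sets, so that an expression obtained from one speaker can be combined with another speaker's voice. The sketch below is a toy illustration of that separation only. It assumes, purely for demonstration, that each speech vector is a per-unit base vector plus a weighted sum of speaker-cluster vectors plus a weighted sum of expression-cluster vectors, and that letters stand in for acoustic units. The class `ToyAcousticModel`, the functions, and all numbers are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import Dict, List

Vector = List[float]


@dataclass(frozen=True)
class ToyAcousticModel:
    base_means: Dict[str, Vector]       # one base vector per acoustic unit
    speaker_clusters: List[Vector]      # parameters relating to speaker only
    expression_clusters: List[Vector]   # parameters relating to expression only


def split_into_units(text: str) -> List[str]:
    """Hypothetical stand-in for text analysis: one unit per letter."""
    return [ch for ch in text.lower() if ch.isalpha()]


def weighted_sum(weights: List[float], vectors: List[Vector], dim: int) -> Vector:
    out = [0.0] * dim
    for w, vec in zip(weights, vectors):
        for d in range(dim):
            out[d] += w * vec[d]
    return out


def synthesize(model: ToyAcousticModel, text: str,
               speaker_weights: List[float],
               expression_weights: List[float]) -> List[Vector]:
    """Convert text to a sequence of toy speech vectors."""
    vectors = []
    for unit in split_into_units(text):
        base = model.base_means[unit]  # raises KeyError for units the toy model lacks
        dim = len(base)
        spk = weighted_sum(speaker_weights, model.speaker_clusters, dim)
        exp = weighted_sum(expression_weights, model.expression_clusters, dim)
        vectors.append([base[d] + spk[d] + exp[d] for d in range(dim)])
    return vectors


model = ToyAcousticModel(
    base_means={"h": [1.0, 0.0], "i": [0.0, 1.0]},
    speaker_clusters=[[1.0, 0.0], [0.0, 1.0]],
    expression_clusters=[[0.5, 0.5], [-0.5, 0.5]],
)
speaker_a, speaker_b = [1.0, 0.0], [0.0, 1.0]
happy_from_a = [1.0, 0.0]  # pretend these weights were adapted on speaker A's audio

# Transplant: speaker B's voice weights combined with speaker A's expression weights.
print(synthesize(model, "hi", speaker_b, happy_from_a))
```

Because the two weight sets never share parameters in this sketch, swapping `speaker_a` for `speaker_b` leaves the expression contribution untouched, which is the kind of independence the non-overlap language in the claims points to.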

Assignees

Toshiba Kk

Inventors

Classifications

  • G10L13/08 (Primary)

    Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination · CPC title

  • Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice (G10L15/14 takes precedence) · CPC title

  • Methods for producing synthetic speech; Speech synthesisers · CPC title

  • G10L13/10 (Primary)

    Prosody rules derived from text; Stress or intonation · CPC title

  • Training · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10140972B2 cover?
A method of training an acoustic model for a text-to-speech system, the method comprising: receiving speech data; said speech data comprising data corresponding to different values of a first speech factor, and wherein said speech data is unlabeled, such that for a given item of speech data, the value of said first speech factor is unknown; clustering said speech data according…
Who is the assignee on this patent?
Toshiba Kk
What technology area does this patent fall under?
Primary CPC classification G10L13/08. Mapped technology areas include Physics.
When was this patent published?
Publication date Nov 27, 2018 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).