Speech recognition assisted evaluation on text-to-speech pronunciation issue detection

US9293129B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9293129-B2
Application number  US-201313785573-A
Country             US
Kind code           B2
Filing date         Mar 5, 2013
Priority date       Mar 5, 2013
Publication date    Mar 22, 2016
Grant date          Mar 22, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Pronunciation issues for synthesized speech are automatically detected using human recordings as a reference within a Speech Recognition Assisted Evaluation (SRAE) framework including a Text-To-Speech (TTS) flow and a Speech Recognition (SR) flow. A pronunciation issue detector evaluates results obtained at multiple levels of the TTS flow and the SR flow (e.g. phone, word, and signal level) by using the corresponding human recordings as the reference for the synthesized speech, and outputs possible pronunciation issues. A signal level may be used to determine similarities/differences between the recordings and the TTS output. A model level checker may provide results to the pronunciation issue detector to check the similarities of the TTS and the SR phone set including mapping relations. Results from a comparison of the SR output and the recordings may also be evaluated by the pronunciation issue detector. The pronunciation issue detector outputs a list that lists potential pronunciation issue candidates.
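
One way to picture the role of the human recording as a reference is the minimal sketch below. It is an illustrative assumption, not the method disclosed in the patent: it treats a word as a candidate only when the recognizer misses it in the synthesized speech but gets it right in the recording, on the reasoning that a miss on both inputs is more likely a recognizer error. The function names `mismatched_words` and `issue_candidates` are hypothetical, the alignment uses Python's standard `difflib` rather than any SR toolkit, and the output is unranked.

```python
from difflib import SequenceMatcher


def mismatched_words(text_words, recognized_words):
    """Return indices of text words that the recognizer output does not match."""
    matcher = SequenceMatcher(a=text_words, b=recognized_words, autojunk=False)
    mismatched = set()
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "replace" and "delete" both mean the text word is missing from the output.
        if tag in ("replace", "delete"):
            mismatched.update(range(i1, i2))
    return mismatched


def issue_candidates(text_words, sr_of_tts, sr_of_recording):
    """Words missed on the synthesized speech but recognized on the recording.

    Assumption for illustration: a word missed on both inputs is attributed
    to the recognizer and is dropped from the candidate list.
    """
    tts_misses = mismatched_words(text_words, sr_of_tts)
    recording_misses = mismatched_words(text_words, sr_of_recording)
    return [text_words[i] for i in sorted(tts_misses - recording_misses)]


text = "the bass player caught a bass".split()
sr_tts = "the base player caught a bass".split()        # SR output for synthesized speech
sr_recording = "the bass player court a bass".split()   # SR output for the human recording
print(issue_candidates(text, sr_tts, sr_recording))     # ['bass']
```

In this toy input the recognizer's error on "caught" occurs only on the recording, so it does not surface as a TTS pronunciation candidate.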

First claim

Opening claim text (preview).

What is claimed is:

1. A method for determining pronunciation issues, comprising: receiving text comprising sentences for a Text-To-Speech (TTS) component and a recording of the text that is used as a reference for the text; receiving synthesized speech generated by the TTS component using the text as input to the TTS component; evaluating results received by an evaluation performed at a text level by determining a similarity of the synthesized speech to the recording, wherein the evaluation at the text level comprises performing a similarity measurement of a phone sequence of a sentence in the text and a corresponding phone sequence of a sentence in the recording; evaluating results obtained from a Speech Recognition (SR) component related to different inputs to the SR component comprising the synthesized speech and the recording; and generating a list that includes a ranking of pronunciation issue candidates based on the evaluations.

2. The method of claim 1, further comprising evaluating results from a signal level evaluation of phone sequences of the text using a phone sequence determined from the TTS component and an SR phone sequence of the recording.

3. The method of claim 1, wherein the evaluation at the text level further comprises performing evaluations for a word sequence and a phone sequence of each sentence within the text.

4. The method of claim 1, further comprising performing a model level check for an acoustic model that determines a similarity of a TTS phone set and an SR phone set including determining a mapping relation between the TTS acoustic model and the SR acoustic model.

5. The method of claim 1, wherein the evaluation performed at the text level comprises determining a similarity using an equation as defined by:

$$s = 1 - \frac{C_{Sub} + C_{Ins}}{C_{Corr} + C_{Sub} + C_{Del}}$$

where $s$ is a similarity score; $C_{Corr}$, $C_{Sub}$, $C_{Ins}$ and $C_{Del}$ denote counts of correct components, substitution errors, insertion errors, and deletion errors in a sentence.

6. The method of claim 1, wherein generating the list that includes the ranking of pronunciation issue candidates comprises filtering out mismatched words for judgment labels based on at least one of the evaluations using the synthesized speech and the recording.

7. The method of claim 1, wherein the results received by the evaluation performed at the text level and the results obtained from the SR component are received by a pronunciation issue detector that is configured to perform the evaluations and to generate the list.

8. A tangible computer-readable storage device storing computer-executable instructions for determining pronunciation issues, comprising: receiving text comprising sentences for a Text-To-Speech (TTS) component and a recording of the text that is used as a reference for the text; receiving synthesized speech generated by the TTS component using the text as input to the TTS component; evaluating results received by an evaluation performed at a text level by determining a similarity of the synthesized speech to the recording; evaluating results obtained from a Speech Recognition (SR) component related to different inputs to the SR component comprising the synthesized speech and the recording; evaluating results from a signal level evaluation of the text and the recording; and generating a list that includes a ranking of pronunciation issue candidates based on the evaluations.

9. The tangible computer-readable storage device of claim 8, wherein the signal level evaluation of the text comprises evaluating a similarity of the recording of phone sequences of the text using a phone sequence determined from the TTS component and an SR phone sequence of the recording.

10. The tangible computer-readable storage device of claim 8, wherein the evaluation at the text level comprises performing a similarity measurement of a phone sequence of each sentence in the text and a corresponding phone sequence of each sentence in the recording.

11. The tangible computer-readable storage device of claim 8, further comprising performing a model level check for an acoustic model that determines a similarity of a TTS phone set and an SR phone set including determining a mapping relation between the TTS acoustic model and the SR acoustic model.

12. The tangible computer-readable storage device of claim 8, wherein the evaluation performed at the text level comprises determining a similarity using an equation as defined by:

$$s = 1 - \frac{C_{Sub} + C_{Ins}}{C_{Corr} + C_{Sub} + C_{Del}}$$

where $s$ is a similarity score; $C_{Corr}$, $C_{Sub}$, $C_{Ins}$ and $C_{Del}$ denote counts of correct components, substitution errors, insertion errors, and deletion errors in a sentence.

13. The tangible computer-readable storage device of claim 8, wherein generating the list that includes the ranking of pronunciation issue candidates comprises filtering out mismatched words for judgment labels based on at least one of the evaluations using the synthesized speech and the recording.

14. A system for determining pronunciation issues, comprising: a processor and memory; an operating environment executing using the processor; text comprising sentences and a recording that corresponds to the text; a Text-To-Speech (TTS) component configured to generate synthesized speech using the text; a Speech Recognition (SR) component configured to recognize speech; and a pronunciation issue detector that is configured to perform actions comprising: receiving the synthesized speech generated by the TTS component; evaluating results received by an evaluation performed at a text level by determining a similarity of the synthesized speech to the recording; evaluating results obtained from the SR component related to different inputs to the SR component comprising the synthesized speech and the recording; evaluating results from a signal level evaluation of the text and the recording; and generating a list that includes a ranking of pronunciation issue candidates based on the evaluations.
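
The similarity score in claims 5 and 12 can be illustrated with a short sketch. The claims define only the score in terms of the four counts; how the counts are obtained is not stated on this page, so the minimum edit-distance alignment below is an assumption, as are the function names `alignment_counts` and `similarity_score`, the sample phone labels, and the handling of an empty reference. The numerator follows the claim text as printed here (substitutions plus insertions).

```python
def alignment_counts(reference, hypothesis):
    """Count (correct, substitutions, insertions, deletions) along one
    minimum edit-distance alignment of hypothesis against reference."""
    n, m = len(reference), len(hypothesis)
    # Each cell holds (distance, correct, substitutions, insertions, deletions).
    table = [[None] * (m + 1) for _ in range(n + 1)]
    table[0][0] = (0, 0, 0, 0, 0)
    for i in range(1, n + 1):
        table[i][0] = (i, 0, 0, 0, i)   # only deletions
    for j in range(1, m + 1):
        table[0][j] = (j, 0, 0, j, 0)   # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d, c, s, a, x = table[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diagonal = (d, c + 1, s, a, x)
            else:
                diagonal = (d + 1, c, s + 1, a, x)
            d, c, s, a, x = table[i - 1][j]
            deletion = (d + 1, c, s, a, x + 1)
            d, c, s, a, x = table[i][j - 1]
            insertion = (d + 1, c, s, a + 1, x)
            # Ties are broken in favour of the diagonal move.
            table[i][j] = min(diagonal, deletion, insertion, key=lambda t: t[0])
    return table[n][m][1:]


def similarity_score(correct, substitutions, insertions, deletions):
    """s = 1 - (C_Sub + C_Ins) / (C_Corr + C_Sub + C_Del), as printed in the claim."""
    denominator = correct + substitutions + deletions
    if denominator == 0:
        # Assumption: an empty reference scores 1.0 only if nothing was inserted.
        return 1.0 if insertions == 0 else 0.0
    return 1.0 - (substitutions + insertions) / denominator


recording_phones = ["HH", "AH", "L", "OW"]     # hypothetical phones from the recording
tts_phones = ["HH", "EH", "L", "OW", "Z"]      # hypothetical phones from the TTS flow
counts = alignment_counts(recording_phones, tts_phones)
print(counts)                     # (3, 1, 1, 0)
print(similarity_score(*counts))  # 0.5
```

Because insertions appear only in the numerator, the score can fall below zero when the compared sequence is much longer than the reference.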

Assignees

Microsoft Technology Licensing LLC

Inventors

Classifications

  • G10L13/08 · Primary

    Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination · CPC title

  • G10L13/086 · Primary

    Detection of language · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9293129B2 cover?
Pronunciation issues for synthesized speech are automatically detected using human recordings as a reference within a Speech Recognition Assisted Evaluation (SRAE) framework including a Text-To-Speech flow and a Speech Recognition (SR) flow. A pronunciation issue detector evaluates results obtained at multiple levels of the TTS flow and the SR flow (e.g. phone, word, and signal level) by using …
Who is the assignee on this patent?
Microsoft Technology Licensing LLC
What technology area does this patent fall under?
Primary CPC classification G10L13/08. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 22, 2016 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).