Speech recognition and text-to-speech learning system

US10089974B2 · US · B2

Patent metadata
Publication number: US-10089974-B2
Application number: US-201615087696-A
Country: US
Kind code: B2
Filing date: Mar 31, 2016
Priority date: Mar 31, 2016
Publication date: Oct 2, 2018
Grant date: Oct 2, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An example text-to-speech learning system performs a method for generating a pronunciation sequence conversion model. The method includes generating a first pronunciation sequence from a speech input of a training pair and generating a second pronunciation sequence from a text input of the training pair. The method also includes determining a pronunciation sequence difference between the first pronunciation sequence and the second pronunciation sequence; and generating a pronunciation sequence conversion model based on the pronunciation sequence difference. An example speech recognition learning system performs a method for generating a pronunciation sequence conversion model. The method includes extracting an audio signal vector from a speech input and applying an audio signal conversion model to the audio signal vector to generate a converted audio signal vector. The method also includes adapting an acoustic model based on the converted audio signal vector to generate an adapted acoustic model.
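The text-to-speech half of the abstract (two pronunciation sequences per training pair, a difference between them, and a conversion model built from many differences) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the function names, the phoneme symbols, and the use of Python's `difflib.SequenceMatcher` for alignment are not taken from the patent. The lookup table below is a deliberately simple stand-in for the conversion model, which claim 3 says may comprise a recursive neural network.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def sequence_difference(from_speech, from_text):
    """Align two phoneme sequences and return the spans where they disagree.

    Each difference is a (text_span, speech_span) pair of tuples.
    """
    matcher = SequenceMatcher(a=from_text, b=from_speech, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((tuple(from_text[i1:i2]), tuple(from_speech[j1:j2])))
    return diffs


def build_conversion_model(training_pairs):
    """Pool differences over all pairs; keep the most common rewrite per span.

    training_pairs is an iterable of (from_speech, from_text) sequences.
    Hypothetical stand-in: a lookup table, not a neural network.
    """
    counts = defaultdict(Counter)
    for from_speech, from_text in training_pairs:
        for text_span, speech_span in sequence_difference(from_speech, from_text):
            if text_span:  # pure insertions have no source span to key on
                counts[text_span][speech_span] += 1
    return {span: c.most_common(1)[0][0] for span, c in counts.items()}


def convert(sequence, model):
    """Rewrite a sequence using the longest matching span at each position."""
    out, i = [], 0
    max_len = max((len(key) for key in model), default=0)
    while i < len(sequence):
        for n in range(min(max_len, len(sequence) - i), 0, -1):
            key = tuple(sequence[i:i + n])
            if key in model:
                out.extend(model[key])
                i += n
                break
        else:  # no span matched here: copy the phoneme unchanged
            out.append(sequence[i])
            i += 1
    return out


# One illustrative training pair: (sequence from speech, sequence from text).
pairs = [(["T", "AH", "M", "AA", "T", "OW"], ["T", "AH", "M", "EY", "T", "OW"])]
model = build_conversion_model(pairs)
print(convert(["P", "AH", "T", "EY", "T", "OW"], model))
```

The table maps spans seen in text-derived sequences to the spans observed in the matching speech-derived sequences, so converted output moves toward how the speaker actually pronounced the training material. A real system would need context-sensitive rewrites, which a table keyed on isolated spans cannot express.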

First claim

Opening claim text (preview).

We claim:

1. A text-to-speech learning system, the system comprising: at least one processor; and at least one storage device, operatively connected to the at least one processor and storing: at least one training corpus comprising a plurality of training pairs that represent a varied vocabulary from one or more speakers, each training pair comprising a speech input and a text input corresponding to the speech input; and instructions that, when executed by the at least processor, cause the at least one processor to perform a method for generating a pronunciation sequence conversion model, the method comprising: for each training pair: selecting a training pair from the at least one training corpus; generating a first pronunciation sequence from the speech input of the training pair; and generating a second pronunciation sequence from the text input of the training pair; determining a pronunciation sequence difference between the first pronunciation sequence and the second pronunciation sequence; and generating a pronunciation sequence conversion model based on a plurality of pronunciation sequence differences, wherein the pronunciation sequence conversion model is configured to synthesize speech by converting a pronunciation sequence generated in response to a received speech input to a target pronunciation sequence that more closely matches a pronunciation sequence extracted from the received speech input.

2. The text-to-speech learning system of claim 1, wherein the method further comprises extracting an audio signal vector from the speech input of the training pair, and wherein the first pronunciation sequence is generated based on the extracted audio signal vector.

3. The text-to-speech learning system of claim 1, wherein the pronunciation sequence conversion model comprises a recursive neural network.

4. The text-to-speech learning system of claim 1, wherein determining a pronunciation sequence difference between the first pronunciation sequence and the second pronunciation sequence comprises aligning the first pronunciation sequence with the second pronunciation sequence.

5. The text-to-speech learning system of claim 1, wherein the first pronunciation sequence comprises a sequence of pronunciation signals.

6. The text-to-speech learning system of claim 1, wherein: the at least one training corpus comprises a text-to-speech training corpus comprising training pairs from a particular speaker and a speech-recognition training corpus comprising training pairs from different speakers; and the plurality of pronunciation sequence differences comprises at least one pronunciation sequence difference generated from a training pair selected from the text-to-speech training corpus and at least one pronunciation sequence difference generated from a training pair selected from the speech-recognition training corpus.

7. The text-to-speech learning system of claim 1, wherein a pronunciation sequence generator model is configured to be used by a text-to-speech system to synthesize speech.

8. A speech recognition learning system, the system comprising: at least one processor; and at least one storage device, operatively connected to the at least one processor and storing: at least one training corpus comprising a plurality of training pairs that represent a varied vocabulary from one or more speakers, each training pair comprising a speech input and a text input corresponding to the speech input; and instructions that, when executed by the at least processor, cause the at least one processor to perform a method for generating a pronunciation sequence conversion model, the method comprising: for each training pair, receiving a training pair from the at least one training corpus; extracting an audio signal vector from the speech input of the training pair; and applying an audio signal conversion model to the audio signal vector to generate a converted audio signal vector; and adapting an acoustic model based on a plurality of converted audio signal vectors to generate an adapted acoustic model, wherein the adapted acoustic model is used to generate a pronunciation sequence during a speech recognition operation.

9. The speech recognition learning system of claim 8, wherein the adapted acoustic model is configured to be used by a speech-recognition system to recognize speech from a user.

10. The speech recognition learning system of claim 8, wherein the method further comprises generating an audio vector conversion model based on the plurality of training pairs.

11. The speech recognition learning system of claim 10, wherein the method further comprises comparing an audio signal vector extracted from a speech input of a respective training pair of the plurality of training pairs to a second audio signal vector generated from the text input of the respective training pair.

12. The speech recognition learning system of claim 11, wherein the method further comprises determining a difference between the extracted audio signal vector and the second audio signal vector.

13. The speech recognition learning system of claim 12, wherein determining the difference between the extracted audio signal vector and the second audio signal vector comprises aligning the extracted audio signal vector with the second audio signal vector.

14. The speech recognition learning system of claim 11, wherein the second audio signal vector is generated by extracting an audio signal vector from synthesized speech based on the text input of the respective training pair.

15. The speech recognition learning system of claim 10, wherein the audio signal vector conversion model is configured to be used by a speech recognition system to recognize speech from a user.

16. The speech recognition learning system of claim 8, wherein the adapted acoustic model is generated by adapting a plurality of extracted audio signal vectors from a plurality of speech inputs.

17. The speech recognition learning system of claim 16, wherein: the at least one training corpus comprises a text-to-speech training corpus comprising training pairs from a particular speaker and a speech-recognition training corpus comprising training pairs from different speakers; and the plurality of speech inputs comprise at least one speech input from a training pair in the text-to-speech training corpus and at least one speech input from a training pair in the speech-recognition training corpus.

18. A method for generating a text-to-speech model and a speech-recognition model, the method comprising: generating a pronunciation sequence conversion model based on a plurality of pronunciation sequence differences, wherein each of the pronunciation sequence differences is associated with a training pair from a plurality of training pairs stored in at least one training corpus and representing a varied vocabulary from one or more speakers, and each of the pronunciation sequence differences is generated by comparing a first pronunciation sequence generated from a speech input of a training pair associated with the pronunciation sequence difference to a second pronunciation sequence generated from a text input of the training pair associated with the pronunciation sequence difference; and adapting an acoustic model based on a plurality of converted audio signal vectors, wherein each of the converted audio signal vectors is associated with a speech input from a plurality of speech inputs and the each of the converted audio signal vectors is generated by extracting an audio signal vector from the speech input associa
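The speech-recognition claims (8 and 10 to 14) describe fitting an audio signal conversion model from paired vectors, applying it to extracted vectors, and adapting an acoustic model from the converted results. The following is a rough sketch of that flow under strong simplifying assumptions that the patent does not state: the conversion model is a single per-dimension offset, frames are already aligned one-to-one (claim 13's alignment step is skipped), and the "acoustic model" is just a mean vector per phonetic unit. All function names and the `weight` parameter are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def fit_conversion_model(extracted, reference):
    """Return a per-dimension offset that moves extracted vectors toward
    reference vectors (for example, vectors taken from synthesized speech).

    Assumption: the two lists are already aligned frame by frame.
    """
    extracted_mean = mean_vector(extracted)
    reference_mean = mean_vector(reference)
    return [r - e for e, r in zip(extracted_mean, reference_mean)]


def apply_conversion(vector, offset):
    """Apply the offset-only conversion model to one audio signal vector."""
    return [v + o for v, o in zip(vector, offset)]


def adapt_acoustic_model(model_means, labelled_vectors, weight=0.5):
    """Interpolate each unit's mean toward the mean of its converted vectors.

    model_means: dict mapping a phonetic unit to its current mean vector.
    labelled_vectors: iterable of (unit, converted_vector) pairs.
    weight: hypothetical interpolation factor; 0 keeps the old model.
    """
    grouped = {}
    for unit, vector in labelled_vectors:
        grouped.setdefault(unit, []).append(vector)
    adapted = dict(model_means)
    for unit, vectors in grouped.items():
        target = mean_vector(vectors)
        old = model_means.get(unit, target)
        adapted[unit] = [(1 - weight) * o + weight * t for o, t in zip(old, target)]
    return adapted


# Toy data: two extracted frames and their aligned reference frames.
extracted = [[1.0, 2.0], [3.0, 4.0]]
reference = [[2.0, 2.0], [4.0, 4.0]]
offset = fit_conversion_model(extracted, reference)
converted = [("AA", apply_conversion(v, offset)) for v in extracted]
print(adapt_acoustic_model({"AA": [0.0, 0.0]}, converted))
```

An offset-only model keeps the sketch short. Unaligned input would first need a time-alignment step such as dynamic time warping, and a production acoustic model would adapt far more than per-unit means.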

Assignees

Microsoft Technology Licensing, LLC

Inventors

Classifications

  • G10L13/10 (Primary)

    Prosody rules derived from text; Stress or intonation · CPC title

  • Detection of language · CPC title

  • Training · CPC title

  • G10L13/08 (Primary)

    Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination · CPC title

  • to the speaker · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10089974B2 cover?
An example text-to-speech learning system performs a method for generating a pronunciation sequence conversion model. The method includes generating a first pronunciation sequence from a speech input of a training pair and generating a second pronunciation sequence from a text input of the training pair. The method also includes determining a pronunciation sequence difference between the first …
Who is the assignee on this patent?
Microsoft Technology Licensing, LLC
What technology area does this patent fall under?
Primary CPC classification G10L13/10. Mapped technology areas include Physics.
When was this patent published?
Publication date: Oct 2, 2018 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).