Speech recognition and text-to-speech learning system

US10089974B2 · US · B2

Patent metadata
Field	Value
Publication number	US-10089974-B2
Application number	US-201615087696-A
Country	US
Kind code	B2
Filing date	Mar 31, 2016
Priority date	Mar 31, 2016
Publication date	Oct 2, 2018
Grant date	Oct 2, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An example text-to-speech learning system performs a method for generating a pronunciation sequence conversion model. The method includes generating a first pronunciation sequence from a speech input of a training pair and generating a second pronunciation sequence from a text input of the training pair. The method also includes determining a pronunciation sequence difference between the first pronunciation sequence and the second pronunciation sequence; and generating a pronunciation sequence conversion model based on the pronunciation sequence difference. An example speech recognition learning system performs a method for generating a pronunciation sequence conversion model. The method includes extracting an audio signal vector from a speech input and applying an audio signal conversion model to the audio signal vector to generate a converted audio signal vector. The method also includes adapting an acoustic model based on the converted audio signal vector to generate an adapted acoustic model.
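As a rough illustration of the difference step described in the abstract, the sketch below aligns a text-derived pronunciation sequence with a speech-derived one and collects the spans where they disagree. Everything in it is an assumption made for illustration and is not taken from the patent: the function names, the ARPAbet-style phoneme symbols, the use of Python's `difflib.SequenceMatcher` for alignment, and the count table that stands in for the conversion model (the claims mention that the model may comprise a recursive neural network, which this toy table does not attempt to reproduce).

```python
from collections import Counter
from difflib import SequenceMatcher


def pronunciation_difference(from_speech, from_text):
    """Align two phoneme sequences and return the spans that do not match.

    Each result is (edit_tag, text_span, speech_span), where edit_tag is one
    of "replace", "delete", or "insert" as reported by SequenceMatcher.
    """
    matcher = SequenceMatcher(a=from_text, b=from_speech, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append(
                (tag, tuple(from_text[i1:i2]), tuple(from_speech[j1:j2]))
            )
    return differences


def build_conversion_table(training_pairs):
    """Count how often each text-derived span maps to a speech-derived span.

    training_pairs is an iterable of (speech_sequence, text_sequence) tuples.
    The resulting Counter is a hypothetical stand-in for a learned model.
    """
    counts = Counter()
    for from_speech, from_text in training_pairs:
        for _tag, text_span, speech_span in pronunciation_difference(
            from_speech, from_text
        ):
            counts[(text_span, speech_span)] += 1
    return counts


# Hypothetical training pair: the text front end predicts "EY",
# while the speaker actually said "AA".
speech_seq = ["T", "AH", "M", "AA", "T", "OW"]
text_seq = ["T", "AH", "M", "EY", "T", "OW"]
table = build_conversion_table([(speech_seq, text_seq)])
```

With the single hypothetical pair above, the table records one substitution from `("EY",)` to `("AA",)`. A real system would accumulate differences over many training pairs before fitting a model.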

First claim

Opening claim text (preview).

We claim:
1. A text-to-speech learning system, the system comprising: at least one processor; and at least one storage device, operatively connected to the at least one processor and storing: at least one training corpus comprising a plurality of training pairs that represent a varied vocabulary from one or more speakers, each training pair comprising a speech input and a text input corresponding to the speech input; and instructions that, when executed by the at least processor, cause the at least one processor to perform a method for generating a pronunciation sequence conversion model, the method comprising: for each training pair: selecting a training pair from the at least one training corpus; generating a first pronunciation sequence from the speech input of the training pair; and generating a second pronunciation sequence from the text input of the training pair; determining a pronunciation sequence difference between the first pronunciation sequence and the second pronunciation sequence; and generating a pronunciation sequence conversion model based on a plurality of pronunciation sequence differences, wherein the pronunciation sequence conversion model is configured to synthesize speech by converting a pronunciation sequence generated in response to a received speech input to a target pronunciation sequence that more closely matches a pronunciation sequence extracted from the received speech input.
2. The text-to-speech learning system of claim 1, wherein the method further comprises extracting an audio signal vector from the speech input of the training pair, and wherein the first pronunciation sequence is generated based on the extracted audio signal vector.
3. The text-to-speech learning system of claim 1, wherein the pronunciation sequence conversion model comprises a recursive neural network.
4. The text-to-speech learning system of claim 1, wherein determining a pronunciation sequence difference between the first pronunciation sequence and the second pronunciation sequence comprises aligning the first pronunciation sequence with the second pronunciation sequence.
5. The text-to-speech learning system of claim 1, wherein the first pronunciation sequence comprises a sequence of pronunciation signals.
6. The text-to-speech learning system of claim 1, wherein: the at least one training corpus comprises a text-to-speech training corpus comprising training pairs from a particular speaker and a speech-recognition training corpus comprising training pairs from different speakers; and the plurality of pronunciation sequence differences comprises at least one pronunciation sequence difference generated from a training pair selected from the text-to-speech training corpus and at least one pronunciation sequence difference generated from a training pair selected from the speech-recognition training corpus.
7. The text-to-speech learning system of claim 1, wherein a pronunciation sequence generator model is configured to be used by a text-to-speech system to synthesize speech.
8. A speech recognition learning system, the system comprising: at least one processor; and at least one storage device, operatively connected to the at least one processor and storing: at least one training corpus comprising a plurality of training pairs that represent a varied vocabulary from one or more speakers, each training pair comprising a speech input and a text input corresponding to the speech input; and instructions that, when executed by the at least processor, cause the at least one processor to perform a method for generating a pronunciation sequence conversion model, the method comprising: for each training pair, receiving a training pair from the at least one training corpus; extracting an audio signal vector from the speech input of the training pair; and applying an audio signal conversion model to the audio signal vector to generate a converted audio signal vector; and adapting an acoustic model based on a plurality of converted audio signal vectors to generate an adapted acoustic model, wherein the adapted acoustic model is used to generate a pronunciation sequence during a speech recognition operation.
9. The speech recognition learning system of claim 8, wherein the adapted acoustic model is configured to be used by a speech-recognition system to recognize speech from a user.
10. The speech recognition learning system of claim 8, wherein the method further comprises generating an audio vector conversion model based on the plurality of training pairs.
11. The speech recognition learning system of claim 10, wherein the method further comprises comparing an audio signal vector extracted from a speech input of a respective training pair of the plurality of training pairs to a second audio signal vector generated from the text input of the respective training pair.
12. The speech recognition learning system of claim 11, wherein the method further comprises determining a difference between the extracted audio signal vector and the second audio signal vector.
13. The speech recognition learning system of claim 12, wherein determining the difference between the extracted audio signal vector and the second audio signal vector comprises aligning the extracted audio signal vector with the second audio signal vector.
14. The speech recognition learning system of claim 11, wherein the second audio signal vector is generated by extracting an audio signal vector from synthesized speech based on the text input of the respective training pair.
15. The speech recognition learning system of claim 10, wherein the audio signal vector conversion model is configured to be used by a speech recognition system to recognize speech from a user.
16. The speech recognition learning system of claim 8, wherein the adapted acoustic model is generated by adapting a plurality of extracted audio signal vectors from a plurality of speech inputs.
17. The speech recognition learning system of claim 16, wherein: the at least one training corpus comprises a text-to-speech training corpus comprising training pairs from a particular speaker and a speech-recognition training corpus comprising training pairs from different speakers; and the plurality of speech inputs comprise at least one speech input from a training pair in the text-to-speech training corpus and at least one speech input from a training pair in the speech-recognition training corpus.
18. A method for generating a text-to-speech model and a speech-recognition model, the method comprising: generating a pronunciation sequence conversion model based on a plurality of pronunciation sequence differences, wherein each of the pronunciation sequence differences is associated with a training pair from a plurality of training pairs stored in at least one training corpus and representing a varied vocabulary from one or more speakers, and each of the pronunciation sequence differences is generated by comparing a first pronunciation sequence generated from a speech input of a training pair associated with the pronunciation sequence difference to a second pronunciation sequence generated from a text input of the training pair associated with the pronunciation sequence difference; and adapting an acoustic model based on a plurality of converted audio signal vectors, wherein each of the converted audio signal vectors is associated with a speech input from a plurality of speech inputs and the each of the converted audio signal vectors is generated by extracting an audio signal vector from the speech input associa
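
The speech-recognition side of the claims (extract an audio signal vector, apply a conversion model, then adapt an acoustic model) can be pictured with a deliberately simple sketch. It assumes, purely for illustration, that the conversion model is a per-dimension mean offset between vectors extracted from recorded speech and vectors extracted from synthesized speech, and that the acoustic model is a dictionary of per-phoneme mean vectors. The function names, the direction of the conversion, the `weight` parameter, and the numbers are all hypothetical and are not described in the patent text shown on this page.

```python
def mean_vector(vectors):
    """Return the per-dimension mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def fit_offset_conversion(extracted, synthesized):
    """Fit a toy conversion model: the offset between the two mean vectors.

    extracted: vectors taken from recorded speech inputs.
    synthesized: vectors taken from speech synthesized from the paired text.
    """
    mean_extracted = mean_vector(extracted)
    mean_synthesized = mean_vector(synthesized)
    return [s - e for e, s in zip(mean_extracted, mean_synthesized)]


def convert(vector, offset):
    """Apply the toy conversion model to a single audio signal vector."""
    return [x + o for x, o in zip(vector, offset)]


def adapt_acoustic_means(model, converted_by_label, weight=0.5):
    """Interpolate each stored mean toward the mean of its converted vectors.

    model maps a phoneme label to a mean vector. Labels without new data are
    copied unchanged. weight=0.5 is an arbitrary illustrative choice.
    """
    adapted = dict(model)
    for label, vectors in converted_by_label.items():
        target = mean_vector(vectors)
        adapted[label] = [
            (1 - weight) * old + weight * new
            for old, new in zip(model[label], target)
        ]
    return adapted


# Hypothetical two-dimensional feature vectors.
extracted = [[1.0, 2.0], [3.0, 4.0]]
synthesized = [[2.0, 2.0], [4.0, 6.0]]
offset = fit_offset_conversion(extracted, synthesized)
converted = [convert(v, offset) for v in extracted]
adapted_model = adapt_acoustic_means({"AA": [0.0, 0.0]}, {"AA": converted})
```

The sketch treats the two sets of vectors as already aligned; dependent claim 13 names alignment as part of determining the difference, which a fuller example would need to handle.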

Assignees

Microsoft Technology Licensing, LLC

Inventors

Classifications

G10L13/10 (Primary): Prosody rules derived from text; Stress or intonation
G10L13/086: Detection of language
G10L15/063: Training
G10L13/08 (Primary): Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
G10L15/07: Adaptation to the speaker

Patent family

Related publications grouped by family.

View patent family 58503724

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10089974B2 cover? An example text-to-speech learning system performs a method for generating a pronunciation sequence conversion model. The method includes generating a first pronunciation sequence from a speech input of a training pair and generating a second pronunciation sequence from a text input of the training pair. The method also includes determining a pronunciation sequence difference between the first …
Who is the assignee on this patent? Microsoft Technology Licensing, LLC
What technology area does this patent fall under? Primary CPC classification G10L13/10. Mapped technology areas include Physics.
When was this patent published? Publication date Oct 2, 2018 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb? We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Related patents

Speech Recognition and Text-to-Speech Learning System

Vpa with integrated object recognition and facial expression recognition

Online incremental adaptation of deep neural networks using auxiliary Gaussian mixture models in speech recognition

Dynamic speech system tuning

Speech recognizer with multi-directional decoding
