Language modeling based on spoken and unspeakable corpuses

US9761220B2 · US · B2

Patent metadata
Publication number: US-9761220-B2
Application number: US-201514711447-A
Country: US
Kind code: B2
Filing date: May 13, 2015
Priority date: May 13, 2015
Publication date: Sep 12, 2017
Grant date: Sep 12, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer system for language modeling may collect training data from one or more information sources, generate a spoken corpus containing text of transcribed speech, and generate a typed corpus containing typed text. The computer system may derive feature vectors from the spoken corpus, analyze the typed corpus to determine feature vectors representing items of typed text, and generate an unspeakable corpus by filtering the typed corpus to remove each item of typed text represented by a feature vector that is within a similarity threshold of a feature vector derived from the spoken corpus. The computer system may derive feature vectors from the unspeakable corpus and train a classifier to perform discriminative data selection for language modeling based on the feature vectors derived from the spoken corpus and the feature vectors derived from the unspeakable corpus.
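The filtering step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the two features (item length and percentage of vowels) follow claim 4, while the Euclidean distance, the threshold value of 5.0, the toy data, and the function names `features` and `build_unspeakable_corpus` are hypothetical choices made only for this example.

```python
import math

VOWELS = set("aeiou")

def features(item):
    """Map a text item to (length, percentage of vowels among its letters)."""
    letters = [c for c in item.lower() if c.isalpha()]
    vowel_pct = 100.0 * sum(c in VOWELS for c in letters) / len(letters) if letters else 0.0
    return (float(len(item)), vowel_pct)

def build_unspeakable_corpus(spoken, typed, threshold=5.0):
    """Drop every typed item whose feature vector lies within `threshold`
    (Euclidean distance, an assumed similarity measure) of any spoken
    feature vector; what remains is the unspeakable corpus."""
    spoken_vecs = [features(s) for s in spoken]
    unspeakable = []
    for item in typed:
        vec = features(item)
        nearest = min((math.dist(vec, sv) for sv in spoken_vecs), default=math.inf)
        if nearest > threshold:
            unspeakable.append(item)
    return unspeakable

# Toy corpora (hypothetical data).
spoken = ["call me later", "what time is it"]
typed = ["call me maybe", "xkcd://zzqrt#9981", "what time is up"]
print(build_unspeakable_corpus(spoken, typed))
```

With this toy data, the two typed items that resemble transcribed speech are removed and only `xkcd://zzqrt#9981` remains. A real system would likely use a richer feature set and a tuned threshold.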

First claim

Opening claim text (preview).

What is claimed is:

1. A computer system for language modeling, the computer system comprising: a processor configured to execute computer-executable instructions; and memory storing computer-executable instructions configured to: collect training data from one or more information sources; generate a spoken corpus containing text of transcribed speech; generate a typed corpus containing typed text; derive feature vectors from the spoken corpus; analyze the typed corpus to determine feature vectors representing items of typed text; generate an unspeakable corpus by filtering the typed corpus to remove each item of typed text represented by a feature vector that is within a similarity threshold of a feature vector derived from the spoken corpus; derive feature vectors from the unspeakable corpus; train a classifier based on the feature vectors derived from the spoken corpus and the feature vectors derived from the unspeakable corpus; generate, using the classifier, a new corpus based on new typed text from one or more of the information sources; train a language model using the new corpus; and perform speech recognition on audio data using the language model.

2. The computer system of claim 1, wherein common features are expressed by the feature vectors derived from the spoken corpus, the feature vectors representing items of typed text, and the feature vectors derived from the unspeakable corpus.

3. The computer system of claim 1, wherein the typed corpus contains typed text generated by users of a social networking service.

4. The computer system of claim 1, wherein the feature vector derived from the spoken corpus presents features including item length and percentage of vowels.

5. The computer system of claim 1, wherein the classifier is trained to predict whether an item of text is speakable enough to be used as training data for language modeling.

6. The computer system of claim 1, wherein generating the new corpus comprises: collecting the new typed text from one or more of the information sources; determining feature vectors representing items of the new typed text; employing the classifier to predict whether each item of the new typed text is speakable based on a feature vector representing the item of the new typed text; and generating the new corpus with items of the new typed text that are predicted to be speakable.

7. The computer system of claim 1, wherein the language model is trained using the spoken corpus.

8. The computer system of claim 1, wherein the language model is a statistical language model for determining a conditional probability of an item given one or more previous items.

9. The computer system of claim 1, wherein the feature vectors of the unspeakable corpus provide a negative example when training the classifier.

10. A computer-implemented method for language modeling performed by a computer system including one or more computing devices, the computer implemented method comprising: collecting training data from one or more information sources; generating a spoken corpus containing text of transcribed speech; generating a typed corpus containing typed text; deriving feature vectors from the spoken corpus; generating an unspeakable corpus by filtering the typed corpus to remove each item of typed text that is within a similarity threshold of one or more items in the spoken corpus; deriving feature vectors from the unspeakable corpus; training a classifier based on the feature vectors derived from the spoken corpus and the feature vectors derived from the unspeakable corpus; generating, using the classifier, a new corpus based on new typed text from one or more of the information sources; training a language model using the new corpus; performing speech recognition on audio data using the language model.

11. The computer-implemented method of claim 10, wherein the unspeakable corpus is generated by filtering the typed corpus to remove each item of typed text represented by a feature vector that is within a threshold distance of a feature vector derived from the spoken corpus.

12. The computer-implemented method of claim 11, wherein common features are expressed by the feature vectors derived from the spoken corpus, feature vectors representing items of typed text, and the feature vectors derived from the unspeakable corpus.

13. The computer-implemented method of claim 10, wherein the typed corpus contains typed text generated by users of a social networking service.

14. The computer-implemented method of claim 10, wherein generating the new corpus comprises: collecting the new typed text from one or more of the information sources; determining a feature vector representing an item of the new typed text; and employing the classifier to predict whether each item of the new typed text is speakable based on the feature vector representing the item of the new typed text.

15. The computer-implemented method of claim 10, wherein the language model is a statistical language model for determining a conditional probability of an item given one or more previous items.

16. The computer-implemented method of claim 10, further comprising training the language model based on the spoken corpus.

17. The computer-implemented method of claim 10, wherein the feature vectors of the unspeakable corpus provide a negative example when training the classifier.

18. A computer-readable storage medium storing computer-executable instructions that, when executed by a computing device, cause the computing device to implement: a training data collection component configured to generate a spoken corpus containing text of transcribed speech and a typed corpus containing typed text; a filtering component configured to generate an unspeakable corpus by filtering the typed corpus to remove each item of typed text represented by a feature vector that is within a similarity threshold of a feature vector derived from the spoken corpus; a classifier training component configured to train a classifier based on feature vectors derived from the spoken corpus and feature vectors derived from the unspeakable corpus; and a speech recognition component configured to recognize speech from audio data input using a language model trained based on one or more corpuses.

19. The computer-readable storage medium of claim 18, further storing computer-executable instructions that, when executed by a computing device, cause the computing device to implement: a feature extraction component configured to determine feature vectors representing items of new typed text and employ the classifier to predict whether each item of new typed text is speakable based on a feature vector representing the item of new typed text.

20. The computer-readable storage medium of claim 19, further storing computer executable instructions that, when executed by a computing device, cause the computing device to implement: a language model training component configured to train a language model based on a speakable corpus containing only items of new typed text that are predicted to be speakable.
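The later steps of the claims (training a classifier with spoken items as positive examples and unspeakable items as negative examples, selecting speakable new text, and training a statistical language model on the result) can be sketched as below. This is a hedged illustration only: the claims do not name a classifier type, so a nearest-centroid rule is swapped in as a simple stand-in, and a maximum-likelihood bigram model stands in for the "conditional probability of an item given one or more previous items" of claim 8. All function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict
import math

VOWELS = set("aeiou")

def features(item):
    """(length, percentage of vowels among letters), the features named in claim 4."""
    letters = [c for c in item.lower() if c.isalpha()]
    pct = 100.0 * sum(c in VOWELS for c in letters) / len(letters) if letters else 0.0
    return (float(len(item)), pct)

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))

def train_classifier(spoken, unspeakable):
    """Nearest-centroid stand-in: spoken items are positive examples,
    unspeakable items are negative examples."""
    pos = centroid([features(s) for s in spoken])
    neg = centroid([features(u) for u in unspeakable])

    def is_speakable(item):
        vec = features(item)
        return math.dist(vec, pos) < math.dist(vec, neg)

    return is_speakable

def train_bigram_model(corpus):
    """Maximum-likelihood estimate of P(word | previous word), no smoothing."""
    pair_counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split()
        for prev, word in zip(tokens, tokens[1:]):
            pair_counts[prev][word] += 1

    def prob(word, prev):
        counts = pair_counts.get(prev, Counter())
        total = sum(counts.values())
        return counts[word] / total if total else 0.0

    return prob

# Toy corpora (hypothetical data).
spoken = ["call me later", "what time is it"]
unspeakable = ["xkcd://zzqrt#9981", "$$$ >>> h77p:/ qwrtp <<<"]
new_typed = ["call me tomorrow", "zzkq#4411::pprt", "what time is lunch"]

is_speakable = train_classifier(spoken, unspeakable)
new_corpus = [t for t in new_typed if is_speakable(t)]   # discriminative data selection
prob = train_bigram_model(spoken + new_corpus)           # spoken corpus included, as in claim 7
print(new_corpus)
print(prob("later", "me"))
```

With this toy data, the selection keeps `call me tomorrow` and `what time is lunch`, and the bigram estimate of P(later | me) is 0.5. The speech recognition step itself is outside the scope of this sketch.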

Assignees

Microsoft Technology Licensing, LLC

Inventors

Classifications

  • G10L15/19 (Primary)

    Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules · CPC title

  • using natural language modelling · CPC title

  • using distance or distortion measures between unknown speech and reference templates · CPC title

  • G10L15/063 (Primary)

    Training · CPC title

  • using lexical or orthographic knowledge sources · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9761220B2 cover?
A computer system for language modeling may collect training data from one or more information sources, generate a spoken corpus containing text of transcribed speech, and generate a typed corpus containing typed text. The computer system may derive feature vectors from the spoken corpus, analyze the typed corpus to determine feature vectors representing items of typed text, and generate an uns…
Who is the assignee on this patent?
Microsoft Technology Licensing, LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/19. Mapped technology areas include Physics.
When was this patent published?
Publication date: Sep 12, 2017 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).