Language modeling based on spoken and unspeakable corpuses

US10192545B2 · US · B2

Patent metadata
Publication number: US-10192545-B2
Application number: US-201715614283-A
Country: US
Kind code: B2
Filing date: Jun 5, 2017
Priority date: May 13, 2015
Publication date: Jan 29, 2019
Grant date: Jan 29, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer system for language modeling may collect training data from one or more information sources, generate a spoken corpus containing text of transcribed speech, and generate a typed corpus containing typed text. The computer system may derive feature vectors from the spoken corpus, analyze the typed corpus to determine feature vectors representing items of typed text, and generate an unspeakable corpus by filtering the typed corpus to remove each item of typed text represented by a feature vector that is within a similarity threshold of a feature vector derived from the spoken corpus. The computer system may derive feature vectors from the unspeakable corpus and train a classifier to perform discriminative data selection for language modeling based on the feature vectors derived from the spoken corpus and the feature vectors derived from the unspeakable corpus.
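The filtering step in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the abstract does not say how feature vectors are built or how similarity is measured, so the character-trigram features, the cosine similarity, the 0.5 threshold, and the names `featurize`, `cosine`, and `build_unspeakable_corpus` are all assumptions made for this example.

```python
import math
from collections import Counter


def featurize(text, n=3):
    """Hypothetical feature vector: counts of character n-grams."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_unspeakable_corpus(spoken, typed, threshold=0.5):
    """Keep only typed items that are not similar to any spoken item.

    An item is removed when its vector is within the (assumed) similarity
    threshold of at least one vector derived from the spoken corpus.
    """
    spoken_vectors = [featurize(s) for s in spoken]
    unspeakable = []
    for item in typed:
        vec = featurize(item)
        if all(cosine(vec, sv) < threshold for sv in spoken_vectors):
            unspeakable.append(item)
    return unspeakable


# Toy data, invented for illustration only.
spoken_corpus = ["what is the weather today", "call mom on her cell phone"]
typed_corpus = ["what is the weather tomorrow", "<div class=nav>", "lol :-) #tbt"]

unspeakable_corpus = build_unspeakable_corpus(spoken_corpus, typed_corpus)
print(unspeakable_corpus)
```

In this toy run the first typed item closely resembles a spoken item and is filtered out, while the markup fragment and the emoticon string share no trigrams with the spoken corpus and remain as unspeakable examples.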

First claim

Opening claim text (preview).

What is claimed is:

1. A system comprising: at least one processor; and a memory storing instructions that when executed by the at least one processor perform a set of operations comprising: evaluating training data from a first information source; determining unspeakable portions of the training data; training a classifier based on the unspeakable portions of the training data, wherein the unspeakable portions of the training data are used as negative examples; generating, using the classifier, a corpus based on typed text from a second information source, wherein the classifier is used to filter unspeakable portions from the second information source; training a language model using the corpus; and performing speech recognition on audio data using the language model.

2. The system of claim 1, wherein determining unspeakable portions of the training data comprises: identifying, based on one or more items of transcribed speech, speakable portions of the training data; and filtering the identified speakable portions from the training data to determine the unspeakable portions of the training data.

3. The system of claim 2, wherein the unspeakable portions of the training data are used as negative examples when training the classifier, and wherein the classifier is trained based on the speakable portions of the training data and the unspeakable portions of the training data.

4. The system of claim 1, wherein training the classifier based on the unspeakable portions of the training data comprises: generating one or more feature vectors for the unspeakable portions of the training data; and training the classifier based on the one or more feature vectors.

5. The system of claim 1, wherein training the classifier comprises: training the classifier using a first subset of the unspeakable portions of the training data; and testing the classifier using a second subset of the unspeakable portions of the training data, wherein the second subset comprises different unspeakable portions than the first subset.

6. The system of claim 1, wherein filtering unspeakable portions from the second information source comprises: classifying, using the classifier, one or more portions of the second information source as one of speakable and unspeakable based on a determination of how speakable each portion is; and filtering the portions that are classified as unspeakable from the second information source.

7. The system of claim 1, wherein the second information source comprises typed text generated by users of a social networking service.

8. A computer-implemented method, the method comprising: evaluating training data from a first information source; determining speakable portions of the training data; training a classifier based on the training data, wherein the speakable portions of the training data are used as positive examples; filtering, using the classifier, a second information source to remove unspeakable portions from the second information source; generating a corpus based on the filtered second information source; training a language model using the corpus; and performing speech recognition on audio data using the language model.

9. The computer-implemented method of claim 8, wherein determining the speakable portions of the training data comprises: generating one or more feature vectors for the training data; and determining the speakable portions based on a comparison of the one or more feature vectors of the training data to feature vectors for typed text, wherein the comparison is based on a similarity threshold.

10. The computer-implemented method of claim 8, wherein determining speakable portions of the training data further comprises determining unspeakable portions of the training data, and wherein training the classifier comprises using the unspeakable portions of the training data as negative examples.

11. The computer-implemented method of claim 8, wherein filtering the second information source to remove unspeakable portions from the second information source comprises: identifying, using the classifier, speakable portions of the second information source; and removing portions of the second information source that were not identified to be speakable portions.

12. The computer-implemented method of claim 8, wherein training the classifier comprises: training the classifier using a first subset of the speakable portions of the training data; and testing the classifier using a second subset of the speakable portions of the training data, wherein the second subset comprises different speakable portions than the first subset.

13. The computer-implemented method of claim 8, wherein the second information source comprises typed text generated by users of a social networking service.

14. A computer-implemented method, the method comprising: evaluating training data from a first information source; determining unspeakable portions of the training data; training a classifier based on the unspeakable portions of the training data; generating, using the classifier, a corpus based on typed text from a second information source, wherein the classifier is used to filter unspeakable portions from the second information source; training a language model using the corpus; and performing speech recognition on audio data using the language model.

15. The computer-implemented method of claim 14, wherein determining unspeakable portions of the training data comprises: identifying, based on one or more items of transcribed speech, speakable portions of the training data; and filtering the identified speakable portions from the training data to determine the unspeakable portions of the training data.

16. The computer-implemented method of claim 15, wherein the unspeakable portions of the training data are used as negative examples when training the classifier, and wherein the classifier is trained based on the speakable portions of the training data and the unspeakable portions of the training data.

17. The computer-implemented method of claim 14, wherein training the classifier based on the unspeakable portions of the training data comprises: generating one or more feature vectors for the unspeakable portions of the training data; and training the classifier based on the one or more feature vectors.

18. The computer-implemented method of claim 14, wherein training the classifier comprises: training the classifier using a first subset of the unspeakable portions of the training data; and testing the classifier using a second subset of the unspeakable portions of the training data, wherein the second subset comprises different unspeakable portions than the first subset.

19. The computer-implemented method of claim 14, wherein filtering unspeakable portions from the second information source comprises: classifying, using the classifier, one or more portions of the second information source as one of speakable and unspeakable based on a determination of how speakable each portion is; and filtering the portions that are classified as unspeakable from the second information source.

20. The computer-implemented method of claim 14, wherein the second information source comprises typed text generated by users of a social networking service.
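The classifier and language-model steps recited in the independent claims can be illustrated with a small sketch. Everything concrete here is an assumption for the sake of the example: the claims do not name a classifier type or a language-model form, so a naive Bayes classifier over character trigrams and a table of word-bigram counts stand in for them, and the class `NaiveBayesSpeakability`, the function `train_bigram_counts`, and the toy data are hypothetical. The final speech recognition step is not shown.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesSpeakability:
    """Hypothetical stand-in classifier: naive Bayes over character trigrams."""

    def fit(self, speakable, unspeakable):
        # Label 1 = speakable (positive examples), 0 = unspeakable (negative).
        self.counts = {1: Counter(), 0: Counter()}
        for label, items in ((1, speakable), (0, unspeakable)):
            for text in items:
                self.counts[label].update(char_ngrams(text))
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        total = len(speakable) + len(unspeakable)
        self.log_prior = {
            1: math.log(len(speakable) / total),
            0: math.log(len(unspeakable) / total),
        }
        return self

    def _log_score(self, text, label):
        counts = self.counts[label]
        # Add-one smoothing, with one extra slot for unseen trigrams.
        denom = sum(counts.values()) + len(self.vocab) + 1
        return self.log_prior[label] + sum(
            math.log((counts[gram] + 1) / denom) for gram in char_ngrams(text)
        )

    def is_speakable(self, text):
        return self._log_score(text, 1) > self._log_score(text, 0)


def train_bigram_counts(corpus):
    """Toy language model: counts of each word following the previous word."""
    model = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            model[prev][nxt] += 1
    return model


# Toy first information source, already split into speakable and unspeakable.
speakable = [
    "what is the weather today",
    "call mom on her cell phone",
    "play some music please",
]
unspeakable = ["<div class=nav>", "lol :-) #tbt", "id=4471&ref=x"]

# Toy second information source (typed text) to be filtered.
second_source = [
    "what time is the game today",
    "<span id=x9>",
    "please call me later",
]

classifier = NaiveBayesSpeakability().fit(speakable, unspeakable)
filtered_corpus = [t for t in second_source if classifier.is_speakable(t)]
language_model = train_bigram_counts(filtered_corpus)
print(filtered_corpus)
```

The design point the sketch tries to show is the separation of concerns in the claims: the classifier only decides what enters the corpus, and the language model is trained afterwards on whatever survives the filter.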

Assignees

Microsoft Technology Licensing LLC

Inventors

Classifications

  • G10L15/063 (Primary)

    Training · CPC title

  • G10L15/19 (Primary)

    Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules · CPC title

  • using natural language modelling · CPC title

  • using lexical or orthographic knowledge sources · CPC title

  • using distance or distortion measures between unknown speech and reference templates · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10192545B2 cover?
A computer system for language modeling may collect training data from one or more information sources, generate a spoken corpus containing text of transcribed speech, and generate a typed corpus containing typed text. The computer system may derive feature vectors from the spoken corpus, analyze the typed corpus to determine feature vectors representing items of typed text, and generate an uns…
Who is the assignee on this patent?
Microsoft Technology Licensing LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/063. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 29, 2019 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).