Discriminative data selection for language modeling

US2016336006A1 · US · A1

Patent metadata
Publication number: US-2016336006-A1
Application number: US-201514711447-A
Country: US
Kind code: A1
Filing date: May 13, 2015
Priority date: May 13, 2015
Publication date: Nov 17, 2016
Grant date: (not listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer system for language modeling may collect training data from one or more information sources, generate a spoken corpus containing text of transcribed speech, and generate a typed corpus containing typed text. The computer system may derive feature vectors from the spoken corpus, analyze the typed corpus to determine feature vectors representing items of typed text, and generate an unspeakable corpus by filtering the typed corpus to remove each item of typed text represented by a feature vector that is within a similarity threshold of a feature vector derived from the spoken corpus. The computer system may derive feature vectors from the unspeakable corpus and train a classifier to perform discriminative data selection for language modeling based on the feature vectors derived from the spoken corpus and the feature vectors derived from the unspeakable corpus.
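
The filtering step in the abstract can be sketched as follows. This is a minimal illustration, not the patent's implementation: the two features (item length and percentage of vowels) are the ones named in claim 4, but the Euclidean distance, the threshold value `8.0`, the function names `features` and `build_unspeakable`, and the sample strings are all assumptions made for the example.

```python
import math
from typing import List, Tuple

VOWELS = set("aeiouAEIOU")


def features(item: str) -> Tuple[float, float]:
    """Map a text item to (length in characters, percentage of vowels among its letters)."""
    letters = [c for c in item if c.isalpha()]
    pct_vowels = 100.0 * sum(c in VOWELS for c in letters) / len(letters) if letters else 0.0
    return (float(len(item)), pct_vowels)


def build_unspeakable(spoken: List[str], typed: List[str], threshold: float) -> List[str]:
    """Drop every typed item whose feature vector lies within `threshold`
    of any spoken feature vector; what remains is the unspeakable corpus.
    Euclidean distance is an assumed similarity measure."""
    spoken_vecs = [features(s) for s in spoken]
    return [
        t for t in typed
        if all(math.dist(features(t), v) > threshold for v in spoken_vecs)
    ]


# Hypothetical sample data, invented for illustration.
spoken = ["call me when you get home", "what time is the meeting"]
typed = ["see you at the game tonight", "xkcd zzz brb ttyl", "$$$ >>> http://t.co/x1z"]

unspeakable = build_unspeakable(spoken, typed, threshold=8.0)
print(unspeakable)
```

With this sample data the first typed item sits close to the spoken vectors and is removed, while the other two are kept. The two features are on different scales, so a real system would probably normalize them before measuring distance; the sketch omits that for brevity.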

First claim

Full claim text as published. Claim 1 is the first independent claim; claims 10 and 18 are also independent.

What is claimed is:

1. A computer system for language modeling, the computer system comprising: a processor configured to execute computer-executable instructions; and memory storing computer-executable instructions configured to: collect training data from one or more information sources; generate a spoken corpus containing text of transcribed speech; generate a typed corpus containing typed text; derive feature vectors from the spoken corpus; analyze the typed corpus to determine feature vectors representing items of typed text; generate an unspeakable corpus by filtering the typed corpus to remove each item of typed text represented by a feature vector that is within a similarity threshold of a feature vector derived from the spoken corpus; derive feature vectors from the unspeakable corpus; and train a classifier based on the feature vectors derived from the spoken corpus and the feature vectors derived from the unspeakable corpus.

2. The computer system of claim 1, wherein common features are expressed by the feature vectors derived from the spoken corpus, the feature vectors representing items of typed text, and the feature vectors derived from the unspeakable corpus.

3. The computer system of claim 1, wherein the typed corpus contains typed text generated by users of a social networking service.

4. The computer system of claim 1, wherein the feature vector derived from the spoken corpus presents features including item length and percentage of vowels.

5. The computer system of claim 1, wherein the classifier is trained to predict whether an item of text is speakable enough to be used as training data for language modeling.

6. The computer system of claim 1, wherein the memory further stores computer-executable instructions configured to: collect new typed text from one or more of the information sources; determine feature vectors representing items of new typed text; employ the classifier to predict whether each item of new typed text is speakable based on a feature vector representing the item of new typed text; generate a speakable corpus containing only items of new typed text that are predicted to be speakable; and train a language model based on the speakable corpus.

7. The computer system of claim 6, wherein the memory further stores computer-executable instructions configured to: train the language model based on the spoken corpus.

8. The computer system of claim 6, wherein the language model is a statistical language model for determining a conditional probability of an item given one or more previous items.

9. The computer system of claim 6, wherein the memory further stores computer-executable instructions configured to: perform speech recognition based on the language model.

10. A computer-implemented method for language modeling performed by a computer system including one or more computing devices, the computer-implemented method comprising: collecting training data from one or more information sources; generating a spoken corpus containing text of transcribed speech; generating a typed corpus containing typed text; deriving feature vectors from the spoken corpus; generating an unspeakable corpus by filtering the typed corpus to remove each item of typed text that is within a similarity threshold of one or more items in the spoken corpus; deriving feature vectors from the unspeakable corpus; and training a classifier based on the feature vectors derived from the spoken corpus and the feature vectors derived from the unspeakable corpus.

11. The computer-implemented method of claim 10, wherein the unspeakable corpus is generated by filtering the typed corpus to remove each item of typed text represented by a feature vector that is within a threshold distance of a feature vector derived from the spoken corpus.

12. The computer-implemented method of claim 11, wherein common features are expressed by the feature vectors derived from the spoken corpus, feature vectors representing items of typed text, and the feature vectors derived from the unspeakable corpus.

13. The computer-implemented method of claim 10, wherein the typed corpus contains typed text generated by users of a social networking service.

14. The computer-implemented method of claim 10, further comprising: collecting new typed text from one or more of the information sources; determining a feature vector representing an item of new typed text; and employing the classifier to predict whether the item of new typed text is speakable based on the feature vector representing the item of new typed text.

15. The computer-implemented method of claim 14, further comprising: generating a speakable corpus containing only items of new typed text that are predicted to be speakable; and training a language model based on the speakable corpus.

16. The computer-implemented method of claim 15, further comprising training the language model based on the spoken corpus.

17. The computer-implemented method of claim 15, further comprising performing speech recognition based on the language model.

18. A computer-readable storage medium storing computer-executable instructions that, when executed by a computing device, cause the computing device to implement: a training data collection component configured to generate a spoken corpus containing text of transcribed speech and a typed corpus containing typed text; a filtering component configured to generate an unspeakable corpus by filtering the typed corpus to remove each item of typed text represented by a feature vector that is within a similarity threshold of a feature vector derived from the spoken corpus; and a classifier training component configured to train a classifier based on feature vectors derived from the spoken corpus and feature vectors derived from the unspeakable corpus.

19. The computer-readable storage medium of claim 18, further storing computer-executable instructions that, when executed by a computing device, cause the computing device to implement: a feature extraction component configured to determine feature vectors representing items of new typed text and employ the classifier to predict whether each item of new typed text is speakable based on a feature vector representing the item of new typed text.

20. The computer-readable storage medium of claim 19, further storing computer-executable instructions that, when executed by a computing device, cause the computing device to implement: a language model training component configured to train a language model based on a speakable corpus containing only items of new typed text that are predicted to be speakable.
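
The downstream steps in claims 6 to 8 (classify new typed text, keep the speakable items, train a language model on them together with the spoken corpus) might look roughly like the sketch below. The claims do not name a classifier type, so a nearest-centroid classifier is swapped in here purely as a simple stand-in, and an unsmoothed bigram model stands in for the "statistical language model" of claim 8. All function names and sample strings are hypothetical.

```python
import math
from collections import Counter, defaultdict

VOWELS = set("aeiouAEIOU")


def features(item):
    """(length in characters, percentage of vowels among letters), per claim 4."""
    letters = [c for c in item if c.isalpha()]
    pct = 100.0 * sum(c in VOWELS for c in letters) / len(letters) if letters else 0.0
    return (float(len(item)), pct)


def centroid(items):
    """Mean feature vector of a list of text items."""
    vecs = [features(i) for i in items]
    return tuple(sum(col) / len(vecs) for col in zip(*vecs))


def train_classifier(spoken, unspeakable):
    """Assumed stand-in classifier: an item is 'speakable' when its feature
    vector is at least as close to the spoken centroid as to the unspeakable one."""
    c_spoken, c_unspeakable = centroid(spoken), centroid(unspeakable)

    def is_speakable(item):
        v = features(item)
        return math.dist(v, c_spoken) <= math.dist(v, c_unspeakable)

    return is_speakable


def train_bigram_lm(corpus):
    """Unsmoothed bigram model: P(cur | prev) from whitespace-token counts."""
    pair_counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            pair_counts[prev][cur] += 1

    def prob(cur, prev):
        total = sum(pair_counts[prev].values())
        return pair_counts[prev][cur] / total if total else 0.0

    return prob


# Hypothetical sample data, invented for illustration.
spoken = ["call me when you get home", "what time is the meeting"]
unspeakable = ["xkcd zzz brb ttyl", "$$$ >>> http://t.co/x1z"]
new_typed = ["call me after the meeting", "lol!!! #tbt :-) <3 <3"]

is_speakable = train_classifier(spoken, unspeakable)
speakable_corpus = [t for t in new_typed if is_speakable(t)]
lm = train_bigram_lm(spoken + speakable_corpus)  # claim 7: spoken corpus is also used
print(speakable_corpus, lm("when", "me"))
```

Any binary classifier over the same feature vectors could replace the centroid rule, and a production language model would add smoothing so that unseen word pairs do not receive zero probability.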

Assignees

Microsoft Technology Licensing LLC

Inventors

Classifications

  • G10L15/19 (Primary)

    Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules · CPC title

  • using distance or distortion measures between unknown speech and reference templates · CPC title

  • using natural language modelling · CPC title

  • updating or merging of old and new templates; Mean values; Weighting · CPC title

  • using lexical or orthographic knowledge sources · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2016336006A1 cover?
A computer system for language modeling may collect training data from one or more information sources, generate a spoken corpus containing text of transcribed speech, and generate a typed corpus containing typed text. The computer system may derive feature vectors from the spoken corpus, analyze the typed corpus to determine feature vectors representing items of typed text, and generate an uns…
Who is the assignee on this patent?
Microsoft Technology Licensing LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/19. Mapped technology areas include Physics.
When was this patent published?
Publication date Nov 17, 2016 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).