Language modeling based on spoken and unspeakable corpuses
US-2017270912-A1 · Sep 21, 2017 · US
US10192545B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10192545-B2 |
| Application number | US-201715614283-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 5, 2017 |
| Priority date | May 13, 2015 |
| Publication date | Jan 29, 2019 |
| Grant date | Jan 29, 2019 |
Official abstract text for this publication.
A computer system for language modeling may collect training data from one or more information sources, generate a spoken corpus containing text of transcribed speech, and generate a typed corpus containing typed text. The computer system may derive feature vectors from the spoken corpus, analyze the typed corpus to determine feature vectors representing items of typed text, and generate an unspeakable corpus by filtering the typed corpus to remove each item of typed text represented by a feature vector that is within a similarity threshold of a feature vector derived from the spoken corpus. The computer system may derive feature vectors from the unspeakable corpus and train a classifier to perform discriminative data selection for language modeling based on the feature vectors derived from the spoken corpus and the feature vectors derived from the unspeakable corpus.
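The filtering step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: bag-of-words counts stand in for the feature vectors, cosine similarity stands in for the similarity measure, and "within a similarity threshold" is read as a similarity at or above the threshold. The function names and the 0.5 default are hypothetical.

```python
import math
from collections import Counter


def feature_vector(text):
    """Bag-of-words counts over lowercase whitespace tokens (an assumed feature choice)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_unspeakable_corpus(spoken_corpus, typed_corpus, similarity_threshold=0.5):
    """Keep only typed items that are NOT similar to any item of transcribed speech.

    An item is dropped when its vector reaches the (assumed) threshold against
    at least one spoken-corpus vector; what remains is the unspeakable corpus.
    """
    spoken_vectors = [feature_vector(s) for s in spoken_corpus]
    unspeakable = []
    for item in typed_corpus:
        vec = feature_vector(item)
        is_close_to_speech = any(
            cosine_similarity(vec, sv) >= similarity_threshold for sv in spoken_vectors
        )
        if not is_close_to_speech:
            unspeakable.append(item)
    return unspeakable


spoken = ["what is the weather today", "call my mom"]
typed = ["what is the weather like today", "lol :-) brb #yolo", "call mom"]
print(build_unspeakable_corpus(spoken, typed))
```

With these toy inputs, the two typed items that closely resemble transcribed speech are filtered out and only the emoticon-and-hashtag item remains. A real system would likely use richer features and an indexed nearest-neighbor search rather than the pairwise comparison shown here.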
Opening claim text (preview).
What is claimed is:
1. A system comprising: at least one processor; and a memory storing instructions that when executed by the at least one processor perform a set of operations comprising: evaluating training data from a first information source; determining unspeakable portions of the training data; training a classifier based on the unspeakable portions of the training data, wherein the unspeakable portions of the training data are used as negative examples; generating, using the classifier, a corpus based on typed text from a second information source, wherein the classifier is used to filter unspeakable portions from the second information source; training a language model using the corpus; and performing speech recognition on audio data using the language model.
2. The system of claim 1, wherein determining unspeakable portions of the training data comprises: identifying, based on one or more items of transcribed speech, speakable portions of the training data; and filtering the identified speakable portions from the training data to determine the unspeakable portions of the training data.
3. The system of claim 2, wherein the unspeakable portions of the training data are used as negative examples when training the classifier, and wherein the classifier is trained based on the speakable portions of the training data and the unspeakable portions of the training data.
4. The system of claim 1, wherein training the classifier based on the unspeakable portions of the training data comprises: generating one or more feature vectors for the unspeakable portions of the training data; and training the classifier based on the one or more feature vectors.
5. The system of claim 1, wherein training the classifier comprises: training the classifier using a first subset of the unspeakable portions of the training data; and testing the classifier using a second subset of the unspeakable portions of the training data, wherein the second subset comprises different unspeakable portions than the first subset.
6. The system of claim 1, wherein filtering unspeakable portions from the second information source comprises: classifying, using the classifier, one or more portions of the second information source as one of speakable and unspeakable based on a determination of how speakable each portion is; and filtering the portions that are classified as unspeakable from the second information source.
7. The system of claim 1, wherein the second information source comprises typed text generated by users of a social networking service.
8. A computer-implemented method, the method comprising: evaluating training data from a first information source; determining speakable portions of the training data; training a classifier based on the training data, wherein the speakable portions of the training data are used as positive examples; filtering, using the classifier, a second information source to remove unspeakable portions from the second information source; generating a corpus based on the filtered second information source; training a language model using the corpus; and performing speech recognition on audio data using the language model.
9. The computer-implemented method of claim 8, wherein determining the speakable portions of the training data comprises: generating one or more feature vectors for the training data; and determining the speakable portions based on a comparison of the one or more feature vectors of the training data to feature vectors for typed text, wherein the comparison is based on a similarity threshold.
10. The computer-implemented method of claim 8, wherein determining speakable portions of the training data further comprises determining unspeakable portions of the training data, and wherein training the classifier comprises using the unspeakable portions of the training data as negative examples.
11. The computer-implemented method of claim 8, wherein filtering the second information source to remove unspeakable portions from the second information source comprises: identifying, using the classifier, speakable portions of the second information source; and removing portions of the second information source that were not identified to be speakable portions.
12. The computer-implemented method of claim 8, wherein training the classifier comprises: training the classifier using a first subset of the speakable portions of the training data; and testing the classifier using a second subset of the speakable portions of the training data, wherein the second subset comprises different speakable portions than the first subset.
13. The computer-implemented method of claim 8, wherein the second information source comprises typed text generated by users of a social networking service.
14. A computer-implemented method, the method comprising: evaluating training data from a first information source; determining unspeakable portions of the training data; training a classifier based on the unspeakable portions of the training data; generating, using the classifier, a corpus based on typed text from a second information source, wherein the classifier is used to filter unspeakable portions from the second information source; training a language model using the corpus; and performing speech recognition on audio data using the language model.
15. The computer-implemented method of claim 14, wherein determining unspeakable portions of the training data comprises: identifying, based on one or more items of transcribed speech, speakable portions of the training data; and filtering the identified speakable portions from the training data to determine the unspeakable portions of the training data.
16. The computer-implemented method of claim 15, wherein the unspeakable portions of the training data are used as negative examples when training the classifier, and wherein the classifier is trained based on the speakable portions of the training data and the unspeakable portions of the training data.
17. The computer-implemented method of claim 14, wherein training the classifier based on the unspeakable portions of the training data comprises: generating one or more feature vectors for the unspeakable portions of the training data; and training the classifier based on the one or more feature vectors.
18. The computer-implemented method of claim 14, wherein training the classifier comprises: training the classifier using a first subset of the unspeakable portions of the training data; and testing the classifier using a second subset of the unspeakable portions of the training data, wherein the second subset comprises different unspeakable portions than the first subset.
19. The computer-implemented method of claim 14, wherein filtering unspeakable portions from the second information source comprises: classifying, using the classifier, one or more portions of the second information source as one of speakable and unspeakable based on a determination of how speakable each portion is; and filtering the portions that are classified as unspeakable from the second information source.
20. The computer-implemented method of claim 14, wherein the second information source comprises typed text generated by users of a social networking service.
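The classifier steps recited in the claims (speakable portions as positive examples, unspeakable portions as negative examples, testing on a separate subset, then filtering a second information source) can be sketched as below. The claims do not name a classifier type, so the small naive Bayes model here is an assumed stand-in, and every class, function, and example string is hypothetical. Training the language model on the filtered corpus and running speech recognition are out of scope for this sketch.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (an assumed, deliberately simple choice)."""
    return text.lower().split()


class SpeakabilityClassifier:
    """Tiny multinomial naive Bayes with add-one smoothing (illustrative stand-in)."""

    def fit(self, speakable, unspeakable):
        # Speakable items are positive examples (True), unspeakable are negative (False).
        self.counts = {True: Counter(), False: Counter()}
        for label, items in ((True, speakable), (False, unspeakable)):
            for item in items:
                self.counts[label].update(tokenize(item))
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        total = len(speakable) + len(unspeakable)
        self.log_prior = {
            True: math.log(len(speakable) / total),
            False: math.log(len(unspeakable) / total),
        }
        return self

    def _log_score(self, text, label):
        counts = self.counts[label]
        # The extra +1 reserves probability mass for tokens never seen in training.
        denom = sum(counts.values()) + len(self.vocab) + 1
        score = self.log_prior[label]
        for token in tokenize(text):
            score += math.log((counts[token] + 1) / denom)
        return score

    def is_speakable(self, text):
        return self._log_score(text, True) >= self._log_score(text, False)


def accuracy(classifier, held_out_speakable, held_out_unspeakable):
    """Fraction of held-out items (a second, disjoint subset) classified correctly."""
    correct = sum(classifier.is_speakable(t) for t in held_out_speakable)
    correct += sum(not classifier.is_speakable(t) for t in held_out_unspeakable)
    return correct / (len(held_out_speakable) + len(held_out_unspeakable))


def filter_speakable(classifier, second_source):
    """Drop items classified as unspeakable; the remainder forms the corpus."""
    return [item for item in second_source if classifier.is_speakable(item)]


clf = SpeakabilityClassifier().fit(
    speakable=["call my mom", "what is the weather today", "play some music"],
    unspeakable=["lol brb #yolo", "http://example.com/a?b=1", ":-) xoxo <3"],
)
print(accuracy(clf, ["call mom today"], ["xoxo lol"]))
print(filter_speakable(clf, ["play my music", "brb lol", "what is the time"]))
```

In this toy run the second-source item made of chat abbreviations is removed and the two sentence-like items are kept for the corpus. The held-out subsets passed to `accuracy` are kept disjoint from the training subsets, mirroring the train/test split recited in claims 5, 12, and 18.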
Training · CPC title
Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules · CPC title
using natural language modelling · CPC title
using lexical or orthographic knowledge sources · CPC title
using distance or distortion measures between unknown speech and reference templates · CPC title