System and method for building diverse language models

US9081760B2 · US · B2

Patent metadata
Publication number: US-9081760-B2
Application number: US-201113042890-A
Country: US
Kind code: B2
Filing date: Mar 8, 2011
Priority date: Mar 8, 2011
Publication date: Jul 14, 2015
Grant date: Jul 14, 2015

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed herein are systems, methods, and non-transitory computer-readable storage media for collecting web data in order to create diverse language models. A system configured to practice the method first crawls, such as via a crawler operating on a computing device, a set of documents in a network of interconnected devices according to a visitation policy, wherein the visitation policy is configured to focus on novelty regions for a current language model built from previous crawling cycles by crawling documents whose vocabulary is considered likely to fill gaps in the current language model. A language model from a previous cycle can be used to guide the creation of a language model in the following cycle. The novelty regions can include documents with high perplexity values over the current language model.
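
As an illustration of what "high perplexity values over the current language model" can mean in practice, the following minimal sketch scores two toy documents against a small model and ranks them by novelty. The add-one-smoothed unigram model, the toy corpus, and the names `perplexity`, `current_model`, and `documents` are assumptions made for this example; the abstract does not specify how perplexity is computed.

```python
import math
from collections import Counter

def perplexity(tokens, counts):
    """Perplexity of `tokens` under an add-one-smoothed unigram model.

    `counts` maps each known word to its training count. One extra
    vocabulary slot is reserved for words the model has never seen.
    """
    total = sum(counts.values())
    vocab = len(counts) + 1
    log_prob = sum(math.log((counts.get(t, 0) + 1) / (total + vocab)) for t in tokens)
    return math.exp(-log_prob / max(len(tokens), 1))

# Hypothetical "current language model" built from an earlier crawling cycle.
current_model = Counter("the cat sat on the mat the dog sat on the rug".split())

# Two hypothetical candidate documents.
documents = {
    "familiar": "the cat sat on the rug".split(),
    "novel": "quantum chromodynamics predicts gluon confinement".split(),
}

# Higher perplexity means the model finds the document more surprising,
# so it is a better candidate for filling vocabulary gaps.
scores = {name: perplexity(toks, current_model) for name, toks in documents.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy setup the document made entirely of unseen words ranks first, which is the behavior a novelty-focused visitation policy would want to exploit.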

First claim

Opening claim text (preview).

We claim:

1. A method comprising: identifying vocabulary gaps in a current language model; establishing a visitation policy based on a previous crawling cycle and the vocabulary gaps, wherein the visitation policy identifies web pages likely to have information capable of filling the vocabulary gaps in the current language model, and wherein the visitation policy comprises a crawling schedule based on predicted perplexity of the web pages with respect to the current language model; crawling, via a crawler operating on a computing device, the web-pages according to the crawling schedule, to yield new vocabulary words; and generating a diverse language model based on the current language model and the new vocabulary words.

2. The method of claim 1, further comprising recognizing received speech with the diverse language model.

3. The method of claim 1, wherein the diverse language model is generated by modifying the current language model.

4. The method of claim 1, wherein the web pages are identified using an information theoretic measure.

5. The method of claim 4, wherein the web pages have high perplexity values over the current language model from a previous cycle.

6. The method of claim 1, further comprising updating the visitation policy for the crawler once a specified number of pages is crawled.

7. The method of claim 6, wherein updating the visitation policy is based on an expected perplexity value of the novelty regions.

8. The method of claim 7, wherein the expected perplexity value of a web page is determined by evaluating links to the web page.

9. The method of claim 1, further comprising merging a set of language models.

10. A system comprising: a processor; and a computer-readable storage medium having instructions stored which, when executed by the processor, cause the processor to perform operations comprising: identifying vocabulary gaps in a current language model; establishing a visitation policy based on a previous crawling cycle and the vocabulary gaps, wherein the visitation policy identifies web pages likely to have information capable of filling the vocabulary gaps in the current language model, and wherein the visitation policy comprises a crawling schedule based on predicted perplexity of the web pages with respect to the current language model; crawling, via a crawler operating on a computing device, web-pages according to the crawling schedule, to yield new vocabulary words; and generating a diverse language model based on the current language model and the new vocabulary words.

11. The system of claim 10, wherein the web pages are identified using an information theoretic measure.

12. The system of claim 10, wherein the web pages have high perplexity values over the current language model from a previous cycle.

13. The system of claim 10, wherein the language model is further generated by updating the visitation policy for the crawler once a specified number of web pages is crawled.

14. The system of claim 13, wherein updating the visitation policy is based on an expected perplexity value of the web pages.

15. A computer-readable storage device having instructions stored which, when executed by a computing device, cause the computing device to perform operations comprising: identifying vocabulary gaps in a current language model; establishing a visitation policy based on a previous crawling cycle and the vocabulary gaps, wherein the visitation policy identifies web pages likely to have information capable of filling the vocabulary gaps in the current language model, and wherein the visitation policy comprises a crawling schedule based on predicted perplexity of the web pages with respect to the current language model; crawling, via a crawler operating on a computing device, web-pages according to the crawling schedule, to yield new vocabulary words; and generating a diverse language model based on the current language model and the new vocabulary words.

16. The computer-readable storage device of claim 15, the computer-readable storage device having additional instructions stored which result in operations comprising recognizing received speech with the diverse language model.

17. The computer-readable storage device of claim 15, the computer-readable storage device having additional instructions stored which result in operations comprising updating the visitation policy for the crawler once a crawling threshold is reached, wherein updating the visitation policy is based on an expected perplexity value of the web pages, and wherein the expected perplexity value of a web page is determined by evaluating links to the page.

18. The computer-readable storage device of claim 15, the computer-readable storage device having additional instructions stored which result in operations comprising merging a set of language models.

19. The computer-readable storage device of claim 15, wherein the new language model comprises a trigram model built using a language modeling toolkit.
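
The steps recited in the independent claims (schedule pages by predicted perplexity, crawl, collect new vocabulary, refresh the policy after a set number of pages, and build an extended model) can be sketched roughly as follows. This is a toy illustration, not the patented implementation: the in-memory `WEB` dictionary stands in for real pages, the unigram model stands in for the language model, and predicting a page's perplexity by averaging the perplexity of the anchor texts of links pointing to it is an assumed heuristic for "evaluating links to the web page". All function and variable names are hypothetical.

```python
import math
from collections import Counter

def perplexity(tokens, counts):
    """Add-one-smoothed unigram perplexity (a simple stand-in for the real model)."""
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen words
    log_prob = sum(math.log((counts.get(t, 0) + 1) / (total + vocab)) for t in tokens)
    return math.exp(-log_prob / max(len(tokens), 1))

def predicted_perplexity(anchors, model):
    """Assumed heuristic: mean perplexity of the anchor texts linking to a page."""
    return sum(perplexity(a, model) for a in anchors) / len(anchors)

def crawl(web, seed, model, max_pages, update_every):
    """Crawl the toy `web` ({url: (text, [(anchor, target), ...])}), novelty first."""
    policy_model = Counter(model)  # model the visitation policy scores against
    full_model = Counter(model)    # current model plus everything crawled so far
    frontier = {}                  # uncrawled url -> anchor texts pointing at it
    visited, order, new_words = set(), [], set()
    url = seed
    while url is not None and len(order) < max_pages:
        text, links = web[url]
        tokens = text.split()
        visited.add(url)
        order.append(url)
        frontier.pop(url, None)
        new_words.update(t for t in tokens if t not in model)  # fills vocabulary gaps
        full_model.update(tokens)
        for anchor, target in links:
            if target in web and target not in visited:
                frontier.setdefault(target, []).append(anchor.split())
        if len(order) % update_every == 0:
            policy_model = Counter(full_model)  # update the policy after a batch
        # Schedule: visit the frontier page with the highest predicted perplexity.
        url = max(frontier,
                  key=lambda u: predicted_perplexity(frontier[u], policy_model),
                  default=None)
    return order, new_words, full_model

# Hypothetical toy web: url -> (page text, [(anchor text, target url), ...]).
WEB = {
    "hub": ("the cat sat on the mat",
            [("the dog on the rug", "pets"), ("quark gluon plasma", "physics")]),
    "pets": ("the dog sat on the rug", []),
    "physics": ("quark gluon plasma forms at extreme temperature",
                [("lattice quark simulations", "lattice")]),
    "lattice": ("lattice simulations predict the quark transition", []),
}

current_model = Counter("the cat sat on the mat the dog sat on the rug".split())
order, new_words, diverse_model = crawl(WEB, "hub", current_model,
                                        max_pages=3, update_every=2)
print(order)
```

With a budget of three pages, the sketch visits the two unfamiliar pages before the page whose anchor text the model already covers well. A production crawler would need real fetching, politeness rules, and a stronger model (claim 19 mentions a trigram model), none of which are shown here.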

Assignees

Inventors

Classifications

  • Parsing

  • Dictionaries

  • Recognition of textual entities

  • Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30)

  • using lexical or orthographic knowledge sources

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Barbosa Luciano De Andrade, Bangalore Srinivas, AT&T IP I LP
What technology area does this patent fall under?
Primary CPC classification G06F17/2715. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jul 14, 2015 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).