Language model optimization for in-domain application

US9972311B2 · US · B2

Patent metadata
Publication number: US-9972311-B2
Application number: US-201414271962-A
Country: US
Kind code: B2
Filing date: May 7, 2014
Priority date: May 7, 2014
Publication date: May 15, 2018
Grant date: May 15, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods are provided for optimizing language models for in-domain applications through an iterative, joint-modeling approach that expresses training material as alternative representations of higher-level tokens, such as named entities and carrier phrases. From a first language model, an in-domain training corpus may be represented as a set of alternative parses of tokens. Statistical information determined from these parsed representations may be used to produce a second (or updated) language model, which is further optimized for the domain. The second language model may be used to determine another alternative parsed representation of the corpus for a next iteration, and the statistical information determined from this representation may be used to produce a third (or further updated) language model. Through each iteration, a language model may be determined that is further optimized for the domain.
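
The loop the abstract describes (parse the corpus with the current model, gather statistics from the alternative parses, build an updated model, repeat) can be illustrated with a small sketch. This is not code from the patent. It assumes a unigram token model as a stand-in for a full n-gram language model, enumerates every segmentation exhaustively (practical only for short sentences), and uses a log-likelihood change below `tol` as the convergence test. The names `parses`, `reestimate`, `train`, `floor`, and `tol` are hypothetical.

```python
import math
from collections import defaultdict

def parses(words, phrases, max_len=3):
    """Yield every segmentation of `words` into single words or known phrases."""
    if not words:
        yield []
        return
    for n in range(1, min(max_len, len(words)) + 1):
        head = tuple(words[:n])
        # Single words are always allowed; longer spans must be listed phrases.
        if n == 1 or head in phrases:
            for rest in parses(words[n:], phrases, max_len):
                yield [head] + rest

def parse_prob(parse, model, floor=1e-6):
    """Probability of one parse under a unigram token model (an assumption)."""
    p = 1.0
    for tok in parse:
        p *= model.get(tok, floor)
    return p

def reestimate(corpus, phrases, model):
    """One iteration: weight each alternative parse, then rebuild the model."""
    counts = defaultdict(float)
    log_lik = 0.0
    for sentence in corpus:
        alts = list(parses(sentence, phrases))
        scores = [parse_prob(a, model) for a in alts]
        total = sum(scores)
        log_lik += math.log(total)
        for alt, score in zip(alts, scores):
            for tok in alt:
                counts[tok] += score / total  # fractional count per parse
    norm = sum(counts.values())
    return {tok: c / norm for tok, c in counts.items()}, log_lik

def train(corpus, phrases, max_iters=20, tol=1e-4):
    """Iterate until the corpus log-likelihood stops improving."""
    vocab = {(w,) for s in corpus for w in s} | set(phrases)
    model = {tok: 1.0 / len(vocab) for tok in vocab}  # first model: uniform
    prev = -math.inf
    for _ in range(max_iters):
        model, log_lik = reestimate(corpus, phrases, model)
        if log_lik - prev < tol:
            break
        prev = log_lik
    return model
```

For example, calling `train([["play", "lady", "gaga"], ["call", "lady", "gaga"], ["play", "music"]], {("lady", "gaga")})` moves probability mass toward the single token `("lady", "gaga")` and away from the separate words, because under this unigram assumption the parse with fewer tokens scores higher at each iteration.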

First claim

Opening claim text (preview).

What is claimed is:

1. One or more computer storage media having computer-executable instructions embodied thereon, that, when executed by a computing system having a processor and memory, cause the computing system to perform a method for generating a language model optimized for an in-domain application through a joint modeling approach, comprising: (a) receiving training material for training the language model, the training material including a corpus of one or more words and a predefined set of entity definitions including a weighted list of entities, wherein the predefined set of entity definitions is determined independently of the corpus of one or more words; (b) identifying entities that occur within the corpus of one or more words, based on the weighted list of entities; (c) determining the language model based on the corpus of one or more words and the weighted list of entities; (d) utilizing the language model to determine a set of at least two alternative parses of the corpus, each alternative parse comprising a sequence of one or more words, phrases, or entities, each sequence including the complete corpus; (e) determining updated weights of entities in the set of entity definitions for each of the alternative parses determined in step (d); (f) based on the updated weights of entities for each of the alternative parses determined in step (e), updating the language model; (g) determining whether the language model is satisfactory; (h) based on the determination of whether the language model is satisfactory: (i) if the language model is determined to be satisfactory, utilizing the language model for the in-domain application; and (ii) if the language model is determined not to be satisfactory, repeating steps (d) through (h), wherein the updated language model is utilized to determine the set of at least two alternative parses of the corpus.

2. The one or more computer storage media of claim 1, wherein the language model is determined as satisfactory where it has achieved convergence, and wherein the in-domain application comprises a computer-performed machine translation, contextual understanding, or automatic speech recognition.

3. The one or more computer storage media of claim 1, wherein the training material is determined from one or more user query logs, SMS messages, web documents, electronic libraries, books, user input libraries, or created samples.

4. The one or more computer storage media of claim 1, wherein one or more thresholds is applied to control a size of the updated language model.

5. The one or more computer storage media of claim 1, wherein determining statistical data based on the set of alternative parses includes determining a maximum-likelihood solution associated with the alternative parses.

6. A language-model training system for generating a language model optimized for a domain using a joint modeling approach comprising: one or more processors; a parsing component configured for determining a set of alternative parses comprising sequences of tokens representing a corpus of text; one or more computer storage media having computer-executable instructions stored thereon which, when executed by the processor, implement a method comprising: accessing a training corpus, the training corpus comprising a plurality of words; applying a first language model to determine, using the parsing component, a first set of alternative parses of the training corpus, each parse including a sequence of one or more tokens, each token comprising one or more words, phrases, or entities, at least one parse comprising a phrase token, wherein the phrase token comprises a sequence of words; determining a first set of statistical data associated with each alternative parse in the first set of alternative parses, wherein the first set of statistical data includes weights associated with the one or more words, phrases, or entities, and includes a weight associated with the phrase token; and generating a second language model based on the first set of statistical data associated with each alternative parse including the weight associated with the phrase token, wherein the sequence of words in the phrase token is encoded as an independent pseudo-word.

7. The system of claim 6, wherein generating a second language model includes determining a maximum-likelihood solution associated with the alternative parses.

8. The system of claim 6, wherein the first language model is determined from an initial set of statistical data associated with the training corpus.

9. The system of claim 6, wherein the first set of statistical data is determined based on frequency, occurrence, proximity or word distance between words or elements in the corpus or token.

10. The system of claim 6, wherein the first set of statistical data comprises probabilities associated with entity definitions.

11. The system of claim 6 wherein the method further comprises: applying the second language model to determine a second set of alternative parses of the training corpus; determining a second set of statistical data associated with the second set of alternative statistical parses; and generating a third language model based on the second set of statistical data.

12. The system of claim 6, wherein a token comprises a plurality of words, phrases, or entities, or a combination from two or more words, phrases, or entities.

13. The system of claim 6 further comprising accessing a set of entity definitions comprising a weighted list of one or more entities, and wherein determining a first set of statistical data associated with the first set of alternative parses includes determining updated weights for the set of entity definitions.

14. A method for determining, by a language model trainer implemented on one or more computing devices having a processor and a computer memory, a language model optimized for an in-domain application, the method comprising: receiving a training corpus comprising one or more words; receiving a predefined set of entity definitions that identifies particular entity types, and for each entity type includes explicitly enumerated entity instances of the entity type, wherein each entity instance is associated with a weight; determining a first language model based on the training corpus, the set of entity definitions, and one or more weights; for a number of iterations, each iteration using an iteration language model: (a) utilizing the iteration language model to determine a set of two or more alternative parses of the corpus, each alternative parse comprising a sequence of tokens, each token comprising one or more words, phrases, or entities; (b) determining updated weights for the set of entity definitions based on each alternative parse in the set of two or more alternative parses; (c) generating an updated language model from each alternative parse in the set of alternative parses determined in step (a) and from the updated weights determined in step (b); and (d) determining a language model evaluation; storing the updated language model in the computer memory accessible to the in-domain application; wherein the iteration language model is the first language model for the first iteration, and wherein the iteration language model is the updated language model determined in step (c) for each subsequent iteration; and wherein the number of iterations is determined based on the language model evaluation.

15. The method of claim 14, wherein evaluating the language model comprises determining that the language model has achieved convergence, and wherein the number of iterations is t
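
Two details in the claims lend themselves to a short illustration: claim 6 encodes the word sequence of a phrase token as an independent pseudo-word, and claim 4 applies a threshold to control the size of the updated model. The sketch below is an assumption-laden example, not the patented implementation. It joins the words of a token with an underscore, accumulates bigram counts weighted by each parse's weight, and drops bigrams whose weighted count falls below a cutoff. The names `encode_parse`, `weighted_bigrams`, `prune`, the `_` joiner, and the `<s>` / `</s>` boundary markers are all hypothetical.

```python
from collections import Counter

def encode_parse(parse, joiner="_"):
    """Turn each token (a tuple of words) into a single pseudo-word string."""
    return [joiner.join(tok) for tok in parse]

def weighted_bigrams(weighted_parses):
    """Accumulate bigram counts, scaling each parse by its weight."""
    counts = Counter()
    for parse, weight in weighted_parses:
        seq = ["<s>"] + encode_parse(parse) + ["</s>"]
        for left, right in zip(seq, seq[1:]):
            counts[(left, right)] += weight
    return counts

def prune(counts, threshold):
    """Keep only bigrams whose weighted count reaches the threshold."""
    return {bigram: c for bigram, c in counts.items() if c >= threshold}
```

Given two alternative parses of "play lady gaga", one with the phrase token `("lady", "gaga")` at weight 0.9 and one with separate words at weight 0.1, the bigram `("play", "lady_gaga")` receives a count of 0.9 and survives a threshold of 0.5, while `("play", "lady")` at 0.1 is pruned.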

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9972311B2 cover?
Systems and methods are provided for optimizing language models for in-domain applications through an iterative, joint-modeling approach that expresses training material as alternative representations of higher-level tokens, such as named entities and carrier phrases. From a first language model, an in-domain training corpus may be represented as a set of alternative parses of tokens. Statistic…
Who is the assignee on this patent?
Microsoft Corp, Microsoft Technology Licensing LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/18. Mapped technology areas include Physics.
When was this patent published?
Publication date May 15, 2018 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).