US9886950B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9886950-B2 |
| Application number | US-201414479949-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 8, 2014 |
| Priority date | Sep 8, 2013 |
| Publication date | Feb 6, 2018 |
| Grant date | Feb 6, 2018 |
Official abstract text for this publication.
Technologies for automatic domain model generation include a computing device that accesses an n-gram index of a web corpus. The computing device generates a semantic graph of the web corpus for a relevant domain using the n-gram index. The semantic graph includes one or more related entities that are related to a seed entity. The computing device performs similarity discovery to identify and rank contextual synonyms within the domain. The computing device maintains a domain model including intents representing actions in the domain and slots representing parameters of actions or entities in the domain. The computing device performs intent discovery to discover intents and intent patterns by analyzing the web corpus using the semantic graph. The computing device performs slot discovery to discover slots, slot patterns, and slot values by analyzing the web corpus using the semantic graph. Other embodiments are described and claimed.
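As a rough illustration of the first two steps in the abstract, the sketch below builds a toy n-gram index and collects the entities that co-occur with a seed entity. The function names (`build_ngram_index`, `related_entities`), the whitespace tokenization, and the three-sentence corpus are assumptions made for illustration and are not taken from the patent. The disclosed system also tags parts of speech and identifies grammatical relationships, which this sketch omits.

```python
from collections import Counter


def build_ngram_index(sentences, n):
    """Map each n-gram (a tuple of n consecutive tokens) to its frequency."""
    index = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # assumption: whitespace tokenization
        for i in range(len(tokens) - n + 1):
            index[tuple(tokens[i:i + n])] += 1
    return index


def related_entities(index, seed):
    """For each entity sharing an n-gram with the seed, sum those n-gram frequencies."""
    related = Counter()
    for ngram, freq in index.items():
        if seed in ngram:
            for entity in set(ngram) - {seed}:
                related[entity] += freq
    return related


# Hypothetical toy corpus standing in for a web corpus.
corpus = [
    "watch a movie tonight",
    "watch a movie online",
    "rent a movie online",
]
index = build_ngram_index(corpus, n=3)
related = related_entities(index, "movie")
```

Here `related` plays the role of the first level of a semantic graph rooted at the seed entity "movie"; a fuller version would keep only grammatically related entities rather than every co-occurring token.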
Opening claim text (preview).
The invention claimed is:
1. A computing device for domain model creation, the computing device comprising: a web corpus module to access an n-gram index of a web corpus, wherein the web corpus includes a plurality of entities, wherein the n-gram index is indicative of a plurality of n-grams, wherein each n-gram comprises a predetermined number n of consecutive entities in the web corpus, and wherein the n-gram index is further indicative of a plurality of entities of each n-gram and a frequency of each n-gram; a semantic graph module to generate a semantic graph of the web corpus using the n-gram index of the web corpus, wherein the semantic graph is rooted by a predefined seed entity and includes a first plurality of related entities, wherein each of the first plurality of related entities is grammatically related to the seed entity and each of the first plurality of related entities is included in a corresponding n-gram of the web corpus that also includes the seed entity, and wherein to generate the semantic graph comprises to: retrieve a first plurality of n-grams from the web corpus using the n-gram index, wherein each of the first plurality of n-grams includes the seed entity; tag each entity of the first plurality of n-grams for part-of-speech; and identify a grammatical relationship between the seed entity and each of the first plurality of related entities in response to tagging of each entity, wherein each of the first plurality of related entities is included in the first plurality of n-grams; a similarity discovery module to analyze the web corpus using the semantic graph to identify and rank contextual synonyms for entities within a domain, wherein the semantic graph is further expanded using the ranked contextual synonyms; an intent discovery module to analyze the web corpus using the semantic graph to identify intents and intent patterns in the domain, wherein each intent is associated with a domain action, and each intent pattern matches query features and a corresponding intent; and a slot discovery module to analyze the web corpus using the semantic graph to identify slots, slot patterns, and slot values in the domain, wherein each slot is associated with a parameter of an intent or an entity, each slot pattern matches query features and a corresponding slot, and each slot value is associated with an entity.
2. The computing device of claim 1, wherein to generate the semantic graph comprises to: score each of the first plurality of related entities.
3. The computing device of claim 2, wherein to score each of the first plurality of related entities comprises to: determine a first number of n-grams in the first plurality of n-grams; determine a second number of n-grams in the first plurality of n-grams that each include a related entity of the first plurality of related entities; and determine a web relation frequency as a function of a frequency of the second number of n-grams in the first number of n-grams.
4. The computing device of claim 2, wherein to score each of the first plurality of related entities comprises to calculate an indicative segment frequency in the web corpus and a normalized indicative segment frequency in the web corpus for the corresponding related entity.
5. The computing device of claim 4, wherein to calculate the indicative segment frequency and the normalized indicative segment frequency comprises to: identify a plurality of segments including the corresponding related entity, wherein each segment comprises a shortest part of an n-gram of the first plurality of n-grams that includes the seed entity and the corresponding related entity; and identify a most common segment of the plurality of segments as the indicative segment of the corresponding related entity.
6. The computing device of claim 5, wherein to calculate the normalized indicative segment frequency comprises to: determine a probable frequency of occurrence in the web corpus of the entities of the indicative segment of the corresponding related entity; and divide the indicative segment frequency of the corresponding related entity by the probable frequency of occurrence.
7. The computing device of claim 1, wherein to analyze the web corpus using the semantic graph to identify and rank contextual synonyms for entities within the domain comprises to: select related entities of the first plurality of related entities having a highest indicative segment normalized frequency as anchor entities; retrieve anchor n-grams from the web corpus, wherein each anchor n-gram includes the seed entity and an anchor entity; replace the seed entity of each anchor n-gram with a placeholder; retrieve candidate n-grams from the web corpus, wherein each candidate n-gram matches an anchor n-gram; identify entities of the candidate n-grams matching the placeholder of the corresponding anchor n-gram as similarity candidates; and score each of the similarity candidates based on similarity to the seed entity.
8. The computing device of claim 7, wherein to score each of the similarity candidates comprises to: generate a contextual similarity score for the corresponding similarity candidate based on contextual features; generate a linguistic similarity score for the corresponding similarity candidate based on linguistic features; and determine a similarity score for the corresponding similarity candidate as a function of the corresponding contextual similarity score and the corresponding linguistic similarity score.
9. The computing device of claim 1, further comprising a domain model module to add the intents, intent patterns, slots, slot patterns, and slot values to a domain model, wherein the domain model includes known intents, intent patterns, slots, and slot patterns associated with the domain and an ontology including known slot values associated with the domain.
10. The computing device of claim 9, wherein to analyze the web corpus using the semantic graph to identify the intents and the intent patterns in the domain comprises to: score a first plurality of verbs of the first plurality of related entities of the semantic graph by a number of group unique n-grams and an indicative segment normalized frequency of the corresponding verb; identify one or more unknown verbs of the first plurality of verbs, wherein each of the unknown verbs does not match an intent pattern of the domain model; determine a similarity score for each pair of an unknown verb and a verb of the intent patterns of the domain model; identify one or more similar verbs of the unknown verbs as a function of the corresponding similarity score for the unknown verb and the verb of the intent patterns of the domain model; generate, for each similar verb of the one or more similar verbs, a new intent pattern for the intent of the corresponding intent pattern of the domain model; cluster one or more remaining verbs of the unknown verbs to generate clusters of remaining verbs, wherein each of the remaining verbs is not a similar verb; generate, for each cluster of remaining verbs, an intent; and generate, for each remaining verb of the clusters of remaining verbs, an intent pattern associated with the intent for the corresponding cluster of remaining verbs.
11. The computing device of claim 9, wherein to analyze the web corpus using the semantic graph to identify the slot values in the domain comprises to: score a first plurality of modifiers of the first plurality of related entities of the semantic graph by a number of group unique n-grams and an indicative segment normalized frequency; identify one or more known modifiers of the first plurality of modifiers, wherein each of the known modifiers matc
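Two of the claimed steps lend themselves to a small worked example: the web relation frequency of claim 3 and the placeholder-based candidate discovery of claim 7. The sketch below is one possible reading, not the patented implementation. The claims say only that the web relation frequency is "a function of" the two counts, so the simple ratio used here is an assumption. The `*` placeholder, the function names, the hand-picked anchor entity, the toy index, and the ranking of candidates by raw frequency (standing in for the contextual and linguistic scores of claim 8) are also hypothetical.

```python
from collections import Counter

PLACEHOLDER = "*"  # hypothetical wildcard marker for the replaced seed entity


def web_relation_frequency(seed_ngrams, related_entity):
    """Assumed ratio: seed n-gram occurrences containing the entity / all seed n-gram occurrences."""
    total = sum(seed_ngrams.values())
    with_related = sum(f for ng, f in seed_ngrams.items() if related_entity in ng)
    return with_related / total if total else 0.0


def fill_of(template, ngram):
    """Return the token at the placeholder position if ngram matches template, else None."""
    if len(template) != len(ngram):
        return None
    fill = None
    for t, g in zip(template, ngram):
        if t == PLACEHOLDER:
            fill = g
        elif t != g:
            return None
    return fill


def similarity_candidates(index, seed, anchors):
    """Blank the seed out of anchor n-grams, then count what else fills the blank."""
    templates = {
        tuple(PLACEHOLDER if tok == seed else tok for tok in ngram)
        for ngram in index
        if seed in ngram and any(a in ngram for a in anchors)
    }
    candidates = Counter()
    for ngram, freq in index.items():
        for template in templates:
            fill = fill_of(template, ngram)
            if fill is not None and fill != seed:
                candidates[fill] += freq
    return candidates


# Hypothetical toy index: n-gram -> frequency.
index = Counter({
    ("watch", "a", "movie"): 5,
    ("watch", "a", "film"): 3,
    ("watch", "a", "video"): 2,
    ("rent", "a", "movie"): 1,
    ("eat", "a", "sandwich"): 4,
})
seed_ngrams = Counter({ng: f for ng, f in index.items() if "movie" in ng})
wrf_watch = web_relation_frequency(seed_ngrams, "watch")
candidates = similarity_candidates(index, "movie", anchors=["watch"])
```

In this toy data the anchor n-gram "watch a movie" becomes the template "watch a *", which "watch a film" and "watch a video" match, so "film" and "video" surface as similarity candidates while "sandwich" does not. Claim 7 selects anchors by highest normalized indicative segment frequency; passing them in by hand keeps the sketch short.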
Lexical analysis, e.g. tokenisation or collocates · CPC title
Semantic analysis · CPC title
Phrasal analysis, e.g. finite state techniques or chunking · CPC title
Creation of semantic tools, e.g. ontology or thesauri · CPC title
Probabilistic grammars, e.g. word n-grams · CPC title