Automatic disambiguation based on a reference resource

US9772992B2 · US · B2

Patent metadata
Field: Value
Publication number: US-9772992-B2
Application number: US-201213342285-A
Country: US
Kind code: B2
Filing date: Jan 3, 2012
Priority date: Feb 26, 2007
Publication date: Sep 26, 2017
Grant date: Sep 26, 2017

Abstract

Official abstract text for this publication.

A novel system for automatically indicating the specific identity of ambiguous named entities is provided. An automatic disambiguation data collection is created using a reference resource. Explicit named entities are catalogued from the reference resource, together with various abbreviated, alternative, and casual ways of referring to the named entities. Entity indicators, such as labels and context indicators associated with the named entities in the reference resource, are also catalogued. The automatic disambiguation collection can then be used as a basis for evaluating ambiguous references to named entities in text content provided in different applications. The content surrounding the ambiguous reference may be compared with the entity indicators to find a good match, indicating that the named entity associated with the matching entity indicators is the intended identity of the ambiguous reference, which can be automatically provided to a user.
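The cataloguing step described in the abstract can be illustrated with a minimal sketch. It assumes a toy reference resource in which each entry carries a name, some alternative references, labels, and context indicators. The class `ReferenceEntry`, the function `build_collection`, and the sample entries are hypothetical and are not taken from the patent.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class ReferenceEntry:
    """One hypothetical entry in a reference resource (for example, an article)."""
    name: str            # explicit named entity
    aliases: List[str]   # abbreviated, alternative, or casual references
    labels: List[str]    # labels associated with the entity
    context: List[str]   # context indicators associated with the entity


def build_collection(
    entries: List[ReferenceEntry],
) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Catalogue surface forms and entity indicators from the entries.

    Returns (surface form -> entity names, entity name -> indicators).
    """
    surface_forms: Dict[str, Set[str]] = defaultdict(set)
    indicators: Dict[str, Set[str]] = {}
    for entry in entries:
        # The full name and every alias are ways of referring to the entity.
        for form in [entry.name, *entry.aliases]:
            surface_forms[form.lower()].add(entry.name)
        # Labels and context indicators are stored together per entity.
        indicators[entry.name] = {w.lower() for w in (*entry.labels, *entry.context)}
    return dict(surface_forms), indicators


# Illustrative toy data only (an assumption, not content of the patent).
ENTRIES = [
    ReferenceEntry("Space Shuttle Columbia", ["Columbia"],
                   ["spacecraft"], ["NASA", "orbit", "launch"]),
    ReferenceEntry("Columbia University", ["Columbia"],
                   ["universities"], ["students", "campus", "Manhattan"]),
]

surface_forms, indicators = build_collection(ENTRIES)
```

In this sketch the surface form "columbia" maps to both entities, which is what makes it ambiguous and a candidate for the matching step the abstract describes.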

First claim

Opening claim text (preview).

What is claimed is:

1. A computing system comprising: a processor; and memory storing instructions which, when executed by the processor, configure the computing system to: identify a source text having a plurality of words; analyze the source text to identify a surface form in the source text, the surface form being an ambiguous orthographic representation of a proper name for an entity; based on the identification of the surface form in the source text, access a surface form record representing the surface form, the surface form record identifying at least a first named entity and a second named entity that are different from one another, and are each associated with the surface form and denoted by a proper name, wherein the surface form record comprises a first pointer to a first named entity record that is separate from the surface form record, the first named entity record corresponding to the first named entity and including a first set of context indicators that represents a context of the first named entity, and wherein the surface form record comprises a second pointer to a second named entity record that is separate from the surface form record, the second named entity record corresponding to the second named entity and including a second set of context indicators that represents a context of the second named entity; use the first pointer to retrieve the first set of context indicators from the first named entity record; generate a first correlation measure based on a number of occurrences in the source text of the first set of context indicators; use the second pointer to retrieve the second set of context indicators from the second named entity record; generate a second correlation measure based on a number of occurrences in the source text of the second set of context indicators; based on a comparison of the first and second correlation measures, select one of the first or second named entities as corresponding to the surface form in the source text; and generate a representation of a user interface display that displays the source text and visually associates the surface form and the selected named entity.

2. The computing system of claim 1, wherein the instructions configure the computing system to generate the user interface display to highlight the surface form and include a textual representation of the selected named entity displayed proximate the highlighted surface form.

3. The computing system of claim 1, wherein the surface form record is accessed in a surface form data store that stores a plurality of surface form records, each of the surface form records corresponding to a surface form that is an ambiguous orthographic representation of a proper name for an entity, and each of the surface form records having an indication of the corresponding surface form and indications of named entities that are associated with the surface form.

4. The computing system of claim 1, wherein the surface form comprises a first surface form in the source text, and wherein the instructions configure the computing system to: identify a second surface form that overlaps the first surface form within the source text; analyze the second surface form to select a third named entity corresponding to the second surface form; determine that the named entity selected as corresponding to the first surface form has a higher correlation to the first surface form than a correlation between the third named entity and the second surface form; and generate the representation of the user interface display based on the determination.

5. The computing system of claim 1, wherein the instructions configure the computing system to: select one of the first named entity or the second named entity that is considered most correlated with the surface form in the source text based on the first and second correlation measures.

6. The computing system of claim 1, wherein the number of occurrences in the source text of the first set of context indicators is determined based on a comparison of the first set of context indicators to words in the source text, other than the surface form, and wherein the number of occurrences in the source text of the second set of context indicators is determined based on a comparison of the second set of context indicators to words in the source text, other than the surface form.

7. The computing system of claim 6 wherein the first correlation measure is based on a proximity of each occurrence of the first set of context indicators to the surface form in the source text, and the second correlation measure is based on a proximity of each occurrence of the second set of context indicators to the surface form in the source text.

8. The computing system of claim 1, wherein each of the first and second named entities comprise only one word that have different spellings from one another.

9. The computing system of claim 1, wherein at least one of the first or second named entities comprise multiple words, and wherein the first and second named entities have different spellings from one another.

10. The computing system of claim 1, wherein the first correlation measure is generated based on a weighting factor that applies different weights to at least one of: different types of the context indicators; proximity of the context indicators to the surface form of the named entity within the text; a number of common labels; or a number of links among each other of context indicators extracted from documents that link to or are linked from a document about an associated named entity.

11. The computing system of claim 1, wherein the surface form record includes a first entry that stores the first pointer, and a second entry that stores the second pointer.

12. The computing system of claim 11, wherein the first and second named entity records are stored in a named entity data store and accessed based on the first and second pointers.

13. The computing system of claim 12, wherein the surface form record is stored in a surface form data store that is distinct from the named entity data store.

14. A computer-implemented method comprising: identifying a portion of text having a plurality of words; generating a representation of a user interface display that displays the portion of text; identifying a polysemic word in the portion of text; based on the identification of the polysemic word, accessing a surface form record representing the polysemic word, the surface form record identifying at least a first named entity and a second named entity that are different from one another and each associated with the polysemic word, wherein the surface form record comprises a first pointer to a first named entity record that is separate from the surface form record, the first named entity record corresponding to the first named entity and including a first set of context indicators that represents a context of the first named entity, and wherein the surface form record comprises a second pointer to a second named entity record that is separate from the surface form record, the second named entity record corresponding to the second named entity and including a second set of context indicators that represents a context of the second named entity; based on the first pointer, retrieving the first set of context indicators from the first named entity record; generating a first correlation measure based on a number of occurrences in the source text of the first set of context indicators; based on the second pointer, retrieving the second set of context indicators from the second named entity record; generating a second correlation measu
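The record layout and selection step recited in claims 1, 6, and 11 to 13 can be sketched roughly as follows. This is an illustrative reading, not the patented implementation: the "pointers" are modeled as string keys into a separate named entity store, the correlation measure is a plain count of context-indicator occurrences that skips the surface form itself, and ties go to the first listed entity. All identifiers and the sample data are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class NamedEntityRecord:
    name: str
    context_indicators: Set[str]


@dataclass
class SurfaceFormRecord:
    surface_form: str
    entity_pointers: List[str]  # keys into the separate named entity store


# Two distinct stores, loosely following claims 12 and 13 (toy data, assumed).
NAMED_ENTITY_STORE: Dict[str, NamedEntityRecord] = {
    "ne:shuttle": NamedEntityRecord(
        "Space Shuttle Columbia", {"nasa", "orbit", "launch", "astronauts"}),
    "ne:university": NamedEntityRecord(
        "Columbia University", {"students", "campus", "professor", "manhattan"}),
}
SURFACE_FORM_STORE: Dict[str, SurfaceFormRecord] = {
    "columbia": SurfaceFormRecord("columbia", ["ne:shuttle", "ne:university"]),
}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def correlation(tokens: List[str], surface_form: str, indicators: Set[str]) -> int:
    """Count tokens that are context indicators, ignoring the surface form itself."""
    return sum(1 for t in tokens if t != surface_form and t in indicators)


def disambiguate(source_text: str, surface_form: str) -> Tuple[str, Dict[str, int]]:
    """Return the best-scoring entity name and the score of every candidate."""
    tokens = tokenize(source_text)
    record = SURFACE_FORM_STORE[surface_form]
    scores: Dict[str, int] = {}
    for pointer in record.entity_pointers:
        entity = NAMED_ENTITY_STORE[pointer]  # follow the pointer
        scores[entity.name] = correlation(
            tokens, surface_form, entity.context_indicators)
    best = max(scores, key=lambda name: scores[name])  # first entity wins ties
    return best, scores


text = ("Columbia reached orbit after a flawless launch, "
        "and NASA confirmed the astronauts were safe.")
best, scores = disambiguate(text, "columbia")
print(best, scores)
```

A proximity-based or weighted measure, as in claims 7 and 10, would replace the simple count in `correlation`. The claims do not fix a particular weighting formula, so none is assumed here.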

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9772992B2 cover?
A novel system for automatically indicating the specific identity of ambiguous named entities is provided. An automatic disambiguation data collection is created using a reference resource. Explicit named entities are catalogued from the reference resource, together with various abbreviated, alternative, and casual ways of referring to the named entities. Entity indicators, such as labels and c…
Who is the assignee on this patent?
Cucerzan Silviu-Petru, Schultz Mike, Microsoft Technology Licensing LLC
What technology area does this patent fall under?
Primary CPC classification G06F40/295. Mapped technology areas include Physics.
When was this patent published?
Publication date: Sep 26, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).