Semi-supervised data integration model for named entity classification

US9292797B2 · US · B2

Patent metadata
Field: Value
Publication number: US-9292797-B2
Application number: US-201213714667-A
Country: US
Kind code: B2
Filing date: Dec 14, 2012
Priority date: Dec 14, 2012
Publication date: Mar 22, 2016
Grant date: Mar 22, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

According to one embodiment, a semi-supervised data integration model for named entity classification from a first repository of entity information in view of an auxiliary repository of classification assistance data is provided. Training data are compared to named entity candidates taken from the first repository to form a positive training seed set. A decision tree is populated and classification rules are created for classifying the named entity candidates. A number of entities are sampled from the named entity candidates. The sampled entities are labeled as positive examples and/or negative examples. The positive training seed set is updated to include identified commonality between the positive examples and the auxiliary repository. A negative training seed set is updated to include negative examples which lack commonality with the auxiliary repository. In view of both the updated positive and negative training seed sets, the decision tree and the classification rules are updated.
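
The seed-set update described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the case-insensitive name matching in `normalize`, and the `classify` stand-in (a fixed rule used here in place of rules derived from a decision tree) are all assumptions made for illustration, and a real system would classify on entity content rather than on names alone.

```python
import random

def normalize(name):
    # Hypothetical commonality test: case-insensitive, trimmed equality.
    return name.strip().lower()

def initial_positive_seeds(training_names, candidates):
    """Candidates that also appear in the positive-only training data."""
    known = {normalize(n) for n in training_names}
    return {c for c in candidates if normalize(c) in known}

def update_seeds(candidates, classify, auxiliary, positives, negatives,
                 sample_size, rng):
    """One round: sample, label with the current rules, check the auxiliary data."""
    aux = {normalize(n) for n in auxiliary}
    sampled = rng.sample(sorted(candidates), min(sample_size, len(candidates)))
    new_pos, new_neg = set(positives), set(negatives)
    for entity in sampled:
        labeled_positive = classify(entity)
        in_aux = normalize(entity) in aux
        if labeled_positive and in_aux:
            # Positive example that shares commonality with the auxiliary repository.
            new_pos.add(entity)
        elif not labeled_positive and not in_aux and entity not in new_pos:
            # Negative example that lacks commonality with the auxiliary repository.
            # (Skipping existing positive seeds is a choice made for this sketch.)
            new_neg.add(entity)
    return new_pos, new_neg

# Illustrative data only; none of these names come from the patent.
candidates = {"Acme Corp", "Globex Corp", "Initech Corp", "blue sky", "running shoes"}
training = ["acme corp"]
auxiliary = ["Globex Corp", "Acme Corp"]

def classify(name):
    # Stand-in for rules that would be read off a trained decision tree.
    return name.endswith("Corp")

positives = initial_positive_seeds(training, candidates)
new_pos, new_neg = update_seeds(candidates, classify, auxiliary,
                                positives, set(), sample_size=5,
                                rng=random.Random(0))
```

In this sketch an entity that the rules label positive but that is absent from the auxiliary data ("Initech Corp") joins neither seed set. After each round, the decision tree and its rules would be rebuilt from `new_pos` and `new_neg`.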

First claim

Opening claim text (preview).

What is claimed is:

1. A method for providing a semi-supervised data integration model for named entity classification from a first repository of entity information in view of an auxiliary repository of classification assistance data, the method comprising: comparing training data to named entity candidates taken from the first repository, thereby forming a positive training seed set in view of identified commonality between the training data and the named entity candidates; in view of the positive training seed set, populating a decision tree; in view of populating the decision tree, creating classification rules for classifying the named entity candidates; sampling a number of entities from the named entity candidates; in view of the classification rules, labeling the sampled entities as positive examples and/or negative examples; in view of the positive examples and the auxiliary repository, updating the positive training seed set to include identified commonality between the positive examples and the auxiliary repository; in view of the negative examples and the auxiliary repository, updating a negative training seed set to include negative examples which lack commonality with the auxiliary repository; and in view of both the updated positive and negative training seed sets, updating the decision tree and the classification rules.

2. The method of claim 1, comprising: repeating the sampling, the labeling of the sampled entities, the updating of the positive and negative training seed sets, and the updating of the decision tree and the classification rules until a threshold condition is met, the threshold condition comprising one of: a maximum number of iterations and a change in a number of rules in the classification rules between iterations.

3. The method of claim 1, comprising: performing the method for each of a plurality of named entity types to determine the classification rules for each of the named entity types, wherein the training data comprise a plurality of data sources comprising only positive examples associated with each of the plurality of named entity types.

4. The method of claim 1, comprising: removing aliases from the first repository to determine the named entity candidates; eliminating common stop words and non-content-bearing words from candidate entity content of the named entity candidates; populating a feature dictionary in view of high frequency words in the candidate entity content of the named entity candidates; and representing the candidate entity content as a vector space model by applying weights to each word of the feature dictionary in the candidate entity content.

5. The method of claim 1, comprising: preprocessing the auxiliary repository to remove false positive examples.

6. The method of claim 1, comprising: applying a plurality of resolution rules to identify both exact matches and similar matches.

7. The method of claim 1, wherein the decision tree comprises a plurality of tree nodes, the method comprising: determining whether to grow the decision tree by splitting one or more of the tree nodes into child nodes; and determining whether to prune the decision tree to remove child nodes that lack a meaningful distinction between them.

8. A computer program product for providing a semi-supervised data integration model for named entity classification from a first repository of entity information in view of an auxiliary repository of classification assistance data, the computer program product comprising: a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code being executable by a computer to perform a method comprising: comparing training data to named entity candidates taken from the first repository, thereby forming a positive training seed set in view of identified commonality between the training data and the named entity candidates; in view of the positive training seed set, populating a decision tree; in view of populating the decision tree, creating classification rules for classifying the named entity candidates; sampling a number of entities from the named entity candidates; in view of the classification rules, labeling the sampled entities as positive examples and/or negative examples; in view of the positive examples and the auxiliary repository, updating the positive training seed set to include identified commonality between the positive examples and the auxiliary repository; in view of the negative examples and the auxiliary repository, updating a negative training seed set to include negative examples which lack commonality with the auxiliary repository; and in view of both the updated positive and negative training seed sets, updating the decision tree and the classification rules.

9. The computer program product of claim 8, comprising: repeating the sampling, the labeling of the sampled entities, the updating of the positive and negative training seed sets, and the updating of the decision tree and the classification rules until a threshold condition is met, the threshold condition comprising one of: a maximum number of iterations and a change in a number of rules in the classification rules between iterations.

10. The computer program product of claim 8, comprising: performing the method for each of a plurality of named entity types to determine the classification rules for each of the named entity types, wherein the training data comprise a plurality of data sources comprising only positive examples associated with each of the plurality of named entity types.

11. The computer program product of claim 8, comprising: removing aliases from the first repository to determine the named entity candidates; eliminating common stop words and non-content-bearing words from candidate entity content of the named entity candidates; populating a feature dictionary in view of high frequency words in the candidate entity content of the named entity candidates; representing the candidate entity content as a vector space model by applying weights to each word of the feature dictionary in the candidate entity content; and preprocessing the auxiliary repository to remove false positive examples.

12. The computer program product of claim 8, comprising: applying a plurality of resolution rules to identify both exact matches and similar matches.

13. The computer program product of claim 8, wherein the decision tree comprises a plurality of tree nodes, and the method further comprising: determining whether to grow the decision tree by splitting one or more of the tree nodes into child nodes; and determining whether to prune the decision tree to remove child nodes that lack a meaningful distinction between them.

14. A system for providing a semi-supervised data integration model for named entity classification from a first repository of entity information in view of an auxiliary repository of classification assistance data, the system comprising: memory having computer readable computer instructions; and a processor for executing the computer readable instructions to perform a method comprising: comparing training data to named entity candidates taken from the first repository, thereby forming a positive training seed set in view of identified commonality between the training data and the named entity candidates; in view of the positive training seed set, populating a decision tree; in view of populating the decision tree, creating classification rules for classifying the named entity candidates; sampling a number of entities from the named e
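
Claims 4 and 11 recite a preprocessing pipeline: drop stop words, build a feature dictionary from high-frequency words, and represent each candidate's content as a weighted vector. The sketch below shows one way this could look in Python. The stop-word list, the whitespace tokenizer, the `top_k` cutoff, and the choice of raw term frequency as the weight are all assumptions; the claims do not specify a weighting scheme.

```python
from collections import Counter

# Illustrative stop-word list; the claims do not enumerate one.
STOP_WORDS = {"the", "a", "of", "and", "in", "is"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def build_feature_dictionary(documents, top_k):
    """Keep the top_k most frequent words across all candidate content."""
    counts = Counter(w for doc in documents for w in tokenize(doc))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]

def vectorize(text, features):
    """Weight each dictionary word by its raw term frequency in the text."""
    tf = Counter(tokenize(text))
    return [tf[word] for word in features]

# Illustrative candidate content; not taken from the patent.
content = {
    "Acme Corp": "Acme is a maker of the anvils and rockets",
    "Globex": "Globex is a maker of rockets",
}
features = build_feature_dictionary(content.values(), top_k=3)
vectors = {name: vectorize(text, features) for name, text in content.items()}
```

Vectors of this kind would be the inputs on which a decision tree splits. A weighting such as TF-IDF could replace raw counts without changing the structure of the sketch.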

Assignees

Inventors

Classifications

  • G06N99/005 (Primary)

    Physics · mapped topic

  • Knowledge representation; Symbolic representation · CPC title

  • G06N20/00 (Primary)

    Machine learning · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9292797B2 cover?
According to one embodiment, a semi-supervised data integration model for named entity classification from a first repository of entity information in view of an auxiliary repository of classification assistance data is provided. Training data are compared to named entity candidates taken from the first repository to form a positive training seed set. A decision tree is populated and classifica…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06N99/005. Mapped technology areas include Physics.
When was this patent published?
The publication date is Mar 22, 2016 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).