Semi-supervised data integration model for named entity classification
US-9292797-B2 · Mar 22, 2016 · US
| Field | Value |
|---|---|
| Publication number | US-9836453-B2 |
| Application number | US-201514837687-A |
| Country | US |
| Kind code | B2 |
| Filing date | Aug 27, 2015 |
| Priority date | Aug 27, 2015 |
| Publication date | Dec 5, 2017 |
| Grant date | Dec 5, 2017 |
Official abstract text for this publication.
A method for entity recognition employs document-level entity tags which correspond to mentions appearing in the document, without specifying their locations. A named entity recognition model is trained on features extracted from text samples tagged with document-level entity tags. A text document to be labeled is received, the text document being tagged with at least one document-level entity tag. A document-specific gazetteer is generated, based on the at least one document-level entity tag. The gazetteer includes a set of entries, one entry for each of a set of entity names. For a text sequence of the document, features for tokens of the text sequence are extracted. The features include document-specific features for tokens matching at least a part of the entity name of one of the gazetteer entries. Entity labels are predicted for the tokens in the text sequence with the named entity recognition model, based on the extracted features.
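The gazetteer step in the abstract can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function names `build_gazetteer` and `matching_tokens`, the `(name, type)` tag format, whitespace tokenization, and case-insensitive matching are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical tag format: (entity name, entity type), attached to the whole
# document with no token positions given.
DocTag = Tuple[str, str]
Gazetteer = Dict[Tuple[str, ...], str]


def build_gazetteer(doc_tags: List[DocTag]) -> Gazetteer:
    """Build one gazetteer entry per entity name: tokenized name -> type.

    Lowercasing and whitespace tokenization are simplifying assumptions.
    """
    gazetteer: Gazetteer = {}
    for name, entity_type in doc_tags:
        gazetteer[tuple(name.lower().split())] = entity_type
    return gazetteer


def matching_tokens(tokens: List[str], gazetteer: Gazetteer) -> List[bool]:
    """Flag each token that matches at least a part of some entry's name."""
    vocabulary = {tok for entry in gazetteer for tok in entry}
    return [tok.lower() in vocabulary for tok in tokens]


if __name__ == "__main__":
    tags = [("Barack Obama", "PER"), ("Hawaii", "LOC")]
    gazetteer = build_gazetteer(tags)
    sentence = "Obama was born in Hawaii".split()
    print(list(zip(sentence, matching_tokens(sentence, gazetteer))))
```

In this sketch the gazetteer is rebuilt for each document, so a token such as "Obama" is flagged only in documents whose tags include a name containing it.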
Opening claim text (preview).
What is claimed is:
1. An entity recognition method comprising: providing a named entity recognition model which has been trained on features extracted from training samples tagged with document-level entity tags, each training sample comprising at least one text sequence; receiving a text document to be labeled, the text document being tagged with at least one document-level entity tag; generating a document-specific gazetteer based on the at least one document-level entity tag, the document-specific gazetteer including a set of entries, one entry for each of a set of entity names; for a text sequence of the text document, extracting features for tokens of the text sequence, the features including document-specific features for tokens matching at least a part of the entity name of one of the gazetteer entries, the document-specific features comprising at least 12 document-specific features; predicting entity labels for tokens in the document text sequence with the named entity recognition model, based on the extracted features, and wherein at least one of the generating, extracting, and predicting is performed with a processor.
2. The method of claim 1, further comprising training the named entity recognition model.
3. An entity recognition method comprising: training a named entity recognition model on features extracted from training samples tagged with document-level entity tags, each training sample comprising at least one text sequence, wherein the training comprises: receiving annotated training samples, each training sample being tagged with at least one document-level entity tag having a mention in at least one of the text sequences of the training sample, each text sequences of the training sample being annotated with token-level entity labels; for each training sample, generating a document-specific gazetteer based on the at least one document-level entity tag of the annotated training sample, the document-specific gazetteer including a set of entity names; using the document-specific gazetteer, extracting features for tokens of each text sequence in the training sample, the features including document-specific features; and training the named entity recognition model with the extracted features and the token-level entity labels for each training sequence; receiving a text document to be labeled, the text document being tagged with at least one document-level entity tag; generating a document-specific gazetteer based on the at least one document-level entity tag, the document-specific gazetteer including a set entries, one entry for each of a set of entity names; for a text sequence of the text document, extracting features for tokens of the text sequence, the features including document-specific features for tokens matching at least a part of the entity name of one of the gazetteer entries; predicting entity labels for tokens in the document text sequence with the named entity recognition model, based on the extracted features, and wherein at least one of the generating, extracting, and predicting is performed with a processor.
4. The method of claim 1, wherein the named entity recognition module is a conditional random field model.
5. The method of claim 1, wherein the document-specific features are binary features.
6. The method of claim 1, wherein the document-specific features include features selected from the group consisting of: a feature indicating whether a token matches an initial token of a gazetteer entity name of at least two tokens; a feature indicating whether a token matches an intermediate token of a gazetteer entity name of at least three tokens; a feature indicating whether a token matches a final token of a gazetteer entity name of at least two tokens; and a feature indicating whether a token matches a unigram gazetteer entity name.
7. The method of claim 6, wherein the document-specific features include at least three of the features in the group.
8. The method of claim 6, wherein at least some of the selected document-specific features are each associated with an entity name type selected from a plurality of entity name types.
9. The method of claim 8, wherein the plurality of entity name types includes at least three entity name types.
10. The method of claim 8, wherein the plurality of entity name types is selected from the group consisting of: a person name; a location name; an organization name; and a miscellaneous name which covers entity names that are not in other types.
11. The method of claim 1, wherein the document-specific features for tokens matching at least a part of one of the entity names in the set of entity names comprise at least 12 document-specific features.
12. The method of claim 11, wherein the document-specific features for tokens matching at least a part of one of the entity names in the set of entity names comprise at least 16 document-specific features.
13. The method of claim 7, wherein the document-specific features further include features for gazetteer entries generated from links to other entries, the links being identified from a knowledge base entry corresponding to one of the document-level entity tags.
14. The method of claim 1, wherein the at least one document-level entity tag for the text document to be labeled includes a name and a type selected from a predefined set of types.
15. An entity recognition method, comprising: providing a named entity recognition model which has been trained on features extracted from training samples tagged with document-level entity tags, each training sample comprising at least one text sequence; receiving a text document to be labeled, the text document being tagged with at least one document-level entity tag, the at least one document-level entity tag for the text document to be labeled having at least one mention in the text document which refers to that entity, and the at least one document-level entity tag not being aligned to a specific token or specific sequence of tokens in the text document; providing a named entity recognition model which has been trained on features extracted from training samples tagged with document-level entity tags, each training sample comprising at least one text sequence; receiving a text document to be labeled, the text document being tagged with at least one document-level entity tag; generating a document-specific gazetteer based on the at least one document-level entity tag, the document-specific gazetteer including a set entries, one entry for each of a set of entity names; for a text sequence of the text document, extracting features for tokens of the text sequence, the features including document-specific features for tokens matching at least a part of the entity name of one of the gazetteer entries; predicting entity labels for tokens in the document text sequence with the named entity recognition model, based on the extracted features, and wherein at least one of the generating, extracting, and predicting is performed with a processor.
16. The method of claim 1, wherein the generating of the document-specific gazetteer for the text document to be labeled includes retrieving aliases for the at least one document-level entity tag from a knowledge base entry for a respective entity name.
17. The method of claim 1, wherein the method further comprises outputting information based on the predicted entity labels.
18. A computer program product comprising a non-transitory recording medium storing instructions, which when executed on a computer, causes the computer
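
Claims 5, 6, 10, and 12 together suggest one possible layout for the document-specific features: four token positions (initial, intermediate, final, unigram) crossed with four entity name types, giving 16 binary features per token. The Python sketch below is one reading of that arrangement, not the patented implementation. The names `POSITIONS`, `TYPES`, `build_position_index`, and `document_features`, the type abbreviations, and the lowercase matching are assumptions.

```python
from typing import Dict, List, Set, Tuple

# Assumed labels for the four positions of claim 6 and the four types of claim 10.
POSITIONS = ("initial", "intermediate", "final", "unigram")
TYPES = ("PER", "LOC", "ORG", "MISC")

Gazetteer = Dict[Tuple[str, ...], str]  # tokenized entity name -> entity type
Index = Dict[str, Set[Tuple[str, str]]]  # token -> {(position, type), ...}


def build_position_index(gazetteer: Gazetteer) -> Index:
    """Record, for every gazetteer token, which (position, type) slots it fills."""
    index: Index = {}
    for name_tokens, entity_type in gazetteer.items():
        n = len(name_tokens)
        for i, tok in enumerate(name_tokens):
            if n == 1:
                position = "unigram"       # single-token entity name
            elif i == 0:
                position = "initial"       # first token, name has >= 2 tokens
            elif i == n - 1:
                position = "final"         # last token, name has >= 2 tokens
            else:
                position = "intermediate"  # only possible when name has >= 3 tokens
            index.setdefault(tok, set()).add((position, entity_type))
    return index


def document_features(token: str, index: Index) -> List[int]:
    """Return 16 binary features ordered by position, then by type."""
    hits = index.get(token.lower(), set())
    return [int((p, t) in hits) for p in POSITIONS for t in TYPES]


if __name__ == "__main__":
    gazetteer = {("bank", "of", "england"): "ORG", ("london",): "LOC"}
    index = build_position_index(gazetteer)
    for tok in "The Bank of England is in London".split():
        print(tok, document_features(tok, index))
```

Vectors of this kind would be concatenated with a model's ordinary token features before training or prediction, for example with the conditional random field mentioned in claim 4.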
Clustering; Classification · CPC title
Entity relationship models · CPC title
Document management systems · CPC title
Lexical analysis, e.g. tokenisation or collocates · CPC title
Named entity recognition · CPC title