Document-specific gazetteers for named entity recognition

US9836453B2 · US · B2

Patent metadata
Publication number: US-9836453-B2
Application number: US-201514837687-A
Country: US
Kind code: B2
Filing date: Aug 27, 2015
Priority date: Aug 27, 2015
Publication date: Dec 5, 2017
Grant date: Dec 5, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method for entity recognition employs document-level entity tags which correspond to mentions appearing in the document, without specifying their locations. A named entity recognition model is trained on features extracted from text samples tagged with document-level entity tags. A text document to be labeled is received, the text document being tagged with at least one document-level entity tag. A document-specific gazetteer is generated, based on the at least one document-level entity tag. The gazetteer includes a set of entries, one entry for each of a set of entity names. For a text sequence of the document, features for tokens of the text sequence are extracted. The features include document-specific features for tokens matching at least a part of the entity name of one of the gazetteer entries. Entity labels are predicted for the tokens in the text sequence with the named entity recognition model, based on the extracted features.
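The gazetteer and feature-extraction steps of the abstract can be illustrated with a minimal Python sketch. It is not the patent's implementation: the tag format `(name, type)`, the function names, whitespace tokenization, and case-insensitive matching are all assumptions made for illustration, and the single `in_doc_gazetteer` feature stands in for the richer document-specific features described in the claims.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical document-level tag: (entity name, entity type). A tag says
# which entity the document mentions, not where the mention occurs.
DocTag = Tuple[str, str]


def build_gazetteer(tags: List[DocTag]) -> Dict[str, str]:
    """One gazetteer entry per entity name, mapping the name to its type."""
    return {name: etype for name, etype in tags}


def gazetteer_tokens(gazetteer: Dict[str, str]) -> Set[str]:
    """All lowercased tokens occurring in any gazetteer entity name."""
    return {tok.lower() for name in gazetteer for tok in name.split()}


def extract_features(tokens: List[str],
                     gazetteer: Dict[str, str]) -> List[Dict[str, bool]]:
    """Per-token feature dicts; 'in_doc_gazetteer' is document-specific."""
    known = gazetteer_tokens(gazetteer)
    return [
        {
            "is_capitalized": tok[:1].isupper(),       # ordinary token feature
            "in_doc_gazetteer": tok.lower() in known,  # matches part of an entry
        }
        for tok in tokens
    ]


# The document is tagged with two entities; mention locations are unknown.
tags = [("Barack Obama", "PER"), ("Hawaii", "LOC")]
gazetteer = build_gazetteer(tags)
features = extract_features("Obama was born in Hawaii".split(), gazetteer)
```

In this sketch the feature dictionaries would then be passed to a trained sequence model to predict a label per token; that prediction step is omitted here.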

First claim

Opening claim text (preview).

What is claimed is:
1. An entity recognition method comprising: providing a named entity recognition model which has been trained on features extracted from training samples tagged with document-level entity tags, each training sample comprising at least one text sequence; receiving a text document to be labeled, the text document being tagged with at least one document-level entity tag; generating a document-specific gazetteer based on the at least one document-level entity tag, the document-specific gazetteer including a set of entries, one entry for each of a set of entity names; for a text sequence of the text document, extracting features for tokens of the text sequence, the features including document-specific features for tokens matching at least a part of the entity name of one of the gazetteer entries, the document-specific features comprising at least 12 document-specific features; predicting entity labels for tokens in the document text sequence with the named entity recognition model, based on the extracted features, and wherein at least one of the generating, extracting, and predicting is performed with a processor.
2. The method of claim 1, further comprising training the named entity recognition model.
3. An entity recognition method comprising: training a named entity recognition model on features extracted from training samples tagged with document-level entity tags, each training sample comprising at least one text sequence, wherein the training comprises: receiving annotated training samples, each training sample being tagged with at least one document-level entity tag having a mention in at least one of the text sequences of the training sample, each text sequences of the training sample being annotated with token-level entity labels; for each training sample, generating a document-specific gazetteer based on the at least one document-level entity tag of the annotated training sample, the document-specific gazetteer including a set of entity names; using the document-specific gazetteer, extracting features for tokens of each text sequence in the training sample, the features including document-specific features; and training the named entity recognition model with the extracted features and the token-level entity labels for each training sequence; receiving a text document to be labeled, the text document being tagged with at least one document-level entity tag; generating a document-specific gazetteer based on the at least one document-level entity tag, the document-specific gazetteer including a set entries, one entry for each of a set of entity names; for a text sequence of the text document, extracting features for tokens of the text sequence, the features including document-specific features for tokens matching at least a part of the entity name of one of the gazetteer entries; predicting entity labels for tokens in the document text sequence with the named entity recognition model, based on the extracted features, and wherein at least one of the generating, extracting, and predicting is performed with a processor.
4. The method of claim 1, wherein the named entity recognition module is a conditional random field model.
5. The method of claim 1, wherein the document-specific features are binary features.
6. The method of claim 1, wherein the document-specific features include features selected from the group consisting of: a feature indicating whether a token matches an initial token of a gazetteer entity name of at least two tokens; a feature indicating whether a token matches an intermediate token of a gazetteer entity name of at least three tokens; a feature indicating whether a token matches a final token of a gazetteer entity name of at least two tokens; and a feature indicating whether a token matches a unigram gazetteer entity name.
7. The method of claim 6, wherein the document-specific features include at least three of the features in the group.
8. The method of claim 6, wherein at least some of the selected document-specific features are each associated with an entity name type selected from a plurality of entity name types.
9. The method of claim 8, wherein the plurality of entity name types includes at least three entity name types.
10. The method of claim 8, wherein the plurality of entity name types is selected from the group consisting of: a person name; a location name; an organization name; and a miscellaneous name which covers entity names that are not in other types.
11. The method of claim 1, wherein the document-specific features for tokens matching at least a part of one of the entity names in the set of entity names comprise at least 12 document-specific features.
12. The method of claim 11, wherein the document-specific features for tokens matching at least a part of one of the entity names in the set of entity names comprise at least 16 document-specific features.
13. The method of claim 7, wherein the document-specific features further include features for gazetteer entries generated from links to other entries, the links being identified from a knowledge base entry corresponding to one of the document-level entity tags.
14. The method of claim 1, wherein the at least one document-level entity tag for the text document to be labeled includes a name and a type selected from a predefined set of types.
15. An entity recognition method, comprising: providing a named entity recognition model which has been trained on features extracted from training samples tagged with document-level entity tags, each training sample comprising at least one text sequence; receiving a text document to be labeled, the text document being tagged with at least one document-level entity tag, the at least one document-level entity tag for the text document to be labeled having at least one mention in the text document which refers to that entity, and the at least one document-level entity tag not being aligned to a specific token or specific sequence of tokens in the text document; providing a named entity recognition model which has been trained on features extracted from training samples tagged with document-level entity tags, each training sample comprising at least one text sequence; receiving a text document to be labeled, the text document being tagged with at least one document-level entity tag; generating a document-specific gazetteer based on the at least one document-level entity tag, the document-specific gazetteer including a set entries, one entry for each of a set of entity names; for a text sequence of the text document, extracting features for tokens of the text sequence, the features including document-specific features for tokens matching at least a part of the entity name of one of the gazetteer entries; predicting entity labels for tokens in the document text sequence with the named entity recognition model, based on the extracted features, and wherein at least one of the generating, extracting, and predicting is performed with a processor.
16. The method of claim 1, wherein the generating of the document-specific gazetteer for the text document to be labeled includes retrieving aliases for the at least one document-level entity tag from a knowledge base entry for a respective entity name.
17. The method of claim 1, wherein the method further comprises outputting information based on the predicted entity labels.
18. A computer program product comprising a non-transitory recording medium storing instructions, which when executed on a computer, causes the computer
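Claims 6, 10, and 12 together suggest one way to arrive at 16 binary document-specific features: four token positions within a gazetteer name (initial, intermediate, final, unigram) crossed with four entity name types. The sketch below illustrates that reading only. The short labels `B`, `I`, `E`, `U` and `PER`, `LOC`, `ORG`, `MISC`, the function name, and exact-match whitespace tokenization are assumptions, not terms from the patent.

```python
from typing import Dict, List

TYPES = ["PER", "LOC", "ORG", "MISC"]  # person, location, organization, miscellaneous
POSITIONS = ["B", "I", "E", "U"]       # initial, intermediate, final, unigram


def position_features(tokens: List[str],
                      gazetteer: Dict[str, str]) -> List[Dict[str, bool]]:
    """For each token, 16 binary features named '<position>-<type>'.

    Assumes every gazetteer type is one of TYPES.
    """
    feats = [{f"{p}-{t}": False for p in POSITIONS for t in TYPES}
             for _ in tokens]
    for i, tok in enumerate(tokens):
        for name, etype in gazetteer.items():
            parts = name.split()
            for j, part in enumerate(parts):
                if tok != part:
                    continue
                if len(parts) == 1:
                    pos = "U"   # token is a unigram entity name
                elif j == 0:
                    pos = "B"   # initial token of a name with >= 2 tokens
                elif j == len(parts) - 1:
                    pos = "E"   # final token of a name with >= 2 tokens
                else:
                    pos = "I"   # intermediate token, so the name has >= 3 tokens
                feats[i][f"{pos}-{etype}"] = True
    return feats


gazetteer = {"Bank of America": "ORG", "Obama": "PER"}
feats = position_features("Obama visited Bank of America".split(), gazetteer)
```

A token can fire several features at once, for example when it is the final token of one entry and a unigram entry in its own right; the sequence model is left to weigh such evidence.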

Assignees

Conduent Business Services, LLC

Inventors

Classifications

  • G06F16/35 · Primary

    Clustering; Classification · CPC title

  • Entity relationship models · CPC title

  • Document management systems · CPC title

  • Lexical analysis, e.g. tokenisation or collocates · CPC title

  • G06F40/295 · Primary

    Named entity recognition · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9836453B2 cover?
A method for entity recognition employs document-level entity tags which correspond to mentions appearing in the document, without specifying their locations. A named entity recognition model is trained on features extracted from text samples tagged with document-level entity tags. A text document to be labeled is received, the text document being tagged with at least one document-level entity …
Who is the assignee on this patent?
Conduent Business Services, LLC
What technology area does this patent fall under?
Primary CPC classification G06F16/35. Mapped technology areas include Physics.
When was this patent published?
Publication date: Dec 5, 2017 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).