System and method for entity extraction from semi-structured text documents

US10489439B2 · US · B2

Patent metadata
Publication number: US-10489439-B2
Application number: US-201615098856-A
Country: US
Kind code: B2
Filing date: Apr 14, 2016
Priority date: Apr 14, 2016
Publication date: Nov 26, 2019
Grant date: Nov 26, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method for extracting entities from a text document includes, for at least a section of a text document, providing a first set of entities extracted from the at least a section, clustering at least a subset of the extracted entities in the first set into clusters, based on locations of the entities in the document. Complete ones of the clusters of entities are identified. Patterns for extracting new entities are learned based on the complete clusters. New entities are extracted from incomplete clusters based on the learned patterns.
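
To make the abstract's "first set of entities" and their "locations" concrete, the following is a minimal sketch of a lexicon lookup, one of the first-pass extraction options mentioned in claim 4. The `Entity` record, the toy `LEXICON`, the class labels, and the use of character offsets as locations are all illustrative assumptions, not details taken from the patent.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Entity:
    text: str
    label: str  # entity class, e.g. "ORG" (hypothetical label set)
    start: int  # character offset where the match begins
    end: int    # character offset just past the match


# Hypothetical toy lexicon: surface form -> entity class.
LEXICON: Dict[str, str] = {
    "Software Engineer": "JOB_TITLE",
    "Acme Corp": "ORG",
    "2012": "DATE",
}


def extract_with_lexicon(section: str, lexicon: Dict[str, str]) -> List[Entity]:
    """Return every lexicon match in the section, ordered by location."""
    found = []
    for surface, label in lexicon.items():
        # Whole-word match so "2012" does not fire inside a longer token.
        pattern = r"\b" + re.escape(surface) + r"\b"
        for m in re.finditer(pattern, section):
            found.append(Entity(m.group(0), label, m.start(), m.end()))
    return sorted(found, key=lambda e: e.start)


section = "Software Engineer, Acme Corp, 2012\nResearch Intern, Beta Labs, 2010"
entities = extract_with_lexicon(section, LEXICON)
for e in entities:
    print(e.label, e.start, e.text)
```

In this toy section the second line yields no matches, which is the kind of gap the later pattern-learning steps are meant to fill.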

First claim

Opening claim text (preview).

What is claimed is:

1. An automated method for extracting entities from a text document comprising: for at least a section of a text document, extracting a first set of entities in predefined classes of entity from the at least a section, the extraction of the first set of entities comprising at least one of a rule-based extraction method and a probabilistic extraction method; identifying a location of each of the extracted entities in the at least a section of the document; clustering at least a subset of the extracted entities in the first set into clusters, based on the identified locations of the entities in the document; identifying complete clusters of entities and incomplete clusters of entities from the clusters, based on correlations observed between sequences of entities in the clusters and a number of the classes of entity within each entity cluster; learning patterns for extracting new entities based on the complete clusters; and extracting new entities from the incomplete clusters based on the learned patterns, wherein the extracting of the first set of entities, identifying complete clusters, learning patterns, and extracting new entities are performed with a processor device.

2. The method of claim 1, wherein the text document is a resume.

3. The method of claim 1, further comprising segmenting the document into sections and performing the extracting of entities, identifying complete clusters, learning patterns and extracting new entities for one of the sections.

4. The method of claim 1, wherein the extraction of the first set of entities includes accessing a lexicon of entities to identify text sequences in the section which each match a respective entity in the lexicon.

5. The method of claim 1, wherein the extracted entities are each labeled with an entity class from the predefined classes of entity classes.

6. The method of claim 1, wherein the clustering includes ordering the extracted entities based on their locations in the document, initializing a cluster with a first of the ordered entities, adding a next entity to the first cluster if the distance to a representative location in the cluster is less than a threshold distance, and recomputing the cluster representative location, otherwise if the distance is greater than the threshold, initializing a next cluster.

7. The method of claim 1, wherein the identifying complete clusters of entities from the clusters includes identifying clusters which include at least a threshold number of entity classes.

8. The method of claim 7, wherein the threshold number is determined by identifying correlations between entities of the different classes occurring in a set of documents or document sections.

9. The method of claim 7, wherein a threshold number of entity classes is defined for each section for a cluster in that section to be considered complete.

10. The method of claim 1, wherein the learning patterns comprises training a CRF model based on the complete clusters and the extracting new entities based on the learned patterns comprises predicting new entities in incomplete clusters based on the trained CRF model.

11. The method of claim 1, wherein the learning patterns comprises, for each of the clusters in the set of complete clusters, for a window of text which includes the cluster, chunking the text using a set of rules to generate a sequence of chunks and extracting features of the chunks in the sequence, the features being used to learn the patterns.

12. The method of claim 1, further comprising, after extracting new entities based on the learned patterns, identifying new complete clusters which include the new entities and repeating the learning of patterns and extracting the new entities with the new complete clusters.

13. The method of claim 1, further comprising, after extracting new entities based on the learned patterns, if incomplete clusters remain, applying at least one of: a) a back-off model trained on information extracted from other documents, and b) pseudo-relevance feedback, to identify additional new entities.

14. The method of claim 13, wherein the back-off model is a CRF model.

15. The method of claim 1, further comprising outputting the extracted new entities or information based thereon.

16. A computer program product comprising a non-transitory storage medium storing instructions, which when executed on a computer, causes the computer to perform the method of claim 1.

17. A system comprising memory which stores instructions for performing the method of claim 1 and a processor in communication with the memory for executing the instructions.

18. A system for extracting entities from text documents comprising: a first entity extraction component for extracting a first set of entities from at least a section of a text document, each of the extracted entities being in one of a predefined set of entity classes; a second entity extraction component for extraction of new entities from the at least the section of the text document, the second entity extraction component comprising: a clustering component for clustering at least a subset of the extracted entities in the first set into clusters, based on locations of the entities in the document and their entity classes, a cluster completeness component for identifying complete clusters of entities and incomplete clusters of entities from the clusters, based on correlations observed between sequences of entity classes in the clusters, and a pattern recognition component for learning patterns of the entity classes for extracting new entities based on the complete clusters and extracting new entities from the incomplete clusters based on the learned patterns; and a processor for implementing the first and second entity extraction components, clustering component, cluster completeness component, and pattern recognition component.

19. The system of claim 18, further comprising at least one of: a segmentation component for segmenting the document into sections; a chunking component which for each of the clusters in the set of complete clusters, for a window of text which includes the cluster, chunks the text using a set of rules to generate a sequence of chunks and extracting features of the chunks in the sequence, the features being used to learn the patterns; and an output component which outputs the extracted new entities or information based thereon.

20. A method for extracting entities from a resume comprising: segmenting the resume into sections; extracting a first set of entities and respective entity class labels from the section with at least one of grammar rules, a probabilistic model, and a lexicon; clustering at least a subset of the extracted entities in the first set into clusters, based on locations of the entities in the resume; identifying complete clusters of entities and incomplete clusters of entities from the clusters, based on correlations observed between sequences of entity class labels in the clusters; learning patterns for extracting new entities based on the class labels of the entities in the complete clusters; extracting new entities from the incomplete clusters, based on the learned patterns; and outputting information based on the extracted new entities in the resume, wherein the segmenting, extracting the first set of entities, clustering, identifying complete clusters, learning patterns, and extracting new entities are performed with a processor device.
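
The location-based clustering of claim 6 and the class-count completeness test of claim 7 can be sketched as follows. This is one possible reading, not the patented implementation: it assumes locations are line numbers, that the "representative location" is the mean of the member locations, that each new entity is compared against the most recently opened cluster, and that an entity at exactly the threshold distance opens a new cluster. The names `Cluster`, `cluster_by_location`, and `split_by_completeness` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# An extracted entity is reduced to (class label, location) for this sketch.
LabeledLocation = Tuple[str, int]


@dataclass
class Cluster:
    members: List[LabeledLocation] = field(default_factory=list)

    @property
    def representative(self) -> float:
        # Assumption: the representative location is the mean member location.
        return sum(loc for _, loc in self.members) / len(self.members)


def cluster_by_location(entities: List[LabeledLocation], threshold: float) -> List[Cluster]:
    """Order entities by location and grow clusters greedily (claim 6)."""
    clusters: List[Cluster] = []
    for ent in sorted(entities, key=lambda e: e[1]):
        current = clusters[-1] if clusters else None
        if current is not None and abs(ent[1] - current.representative) < threshold:
            # Close enough: join the cluster; the mean is recomputed on access.
            current.members.append(ent)
        else:
            # First entity, or too far away: start a new cluster.
            clusters.append(Cluster([ent]))
    return clusters


def split_by_completeness(clusters: List[Cluster], min_classes: int):
    """A cluster is complete if it has at least min_classes distinct classes (claim 7)."""
    complete, incomplete = [], []
    for c in clusters:
        n_classes = len({label for label, _ in c.members})
        (complete if n_classes >= min_classes else incomplete).append(c)
    return complete, incomplete


# Toy input: locations are line numbers in a resume section (assumption).
entities = [("JOB_TITLE", 0), ("ORG", 1), ("DATE", 2), ("JOB_TITLE", 10), ("DATE", 11)]
clusters = cluster_by_location(entities, threshold=3)
complete, incomplete = split_by_completeness(clusters, min_classes=3)
print(len(clusters), len(complete), len(incomplete))
```

With these toy values the first three entities form a complete cluster, while the last two form an incomplete cluster that is missing an `ORG`. In the claimed method, such incomplete clusters are where patterns learned from the complete clusters are applied to find new entities.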

Assignees

  • Xerox Corp

Inventors

Classifications

  • Probabilistic graphical models, e.g. probabilistic networks · CPC title

  • using natural language analysis · CPC title

  • G06F16/353 · Primary

    into predefined classes · CPC title

  • Document management systems · CPC title

  • Reformulation based on results of preceding query · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10489439B2 cover?
A method for extracting entities from a text document includes, for at least a section of a text document, providing a first set of entities extracted from the at least a section, clustering at least a subset of the extracted entities in the first set into clusters, based on locations of the entities in the document. Complete ones of the clusters of entities are identified. Patterns for extract…
Who is the assignee on this patent?
Xerox Corp
What technology area does this patent fall under?
Primary CPC classification G06F16/353. Mapped technology areas include Physics.
When was this patent published?
Publication date Nov 26, 2019 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).