Systematic tuning of text analytic annotators
US-2016048499-A1 · Feb 18, 2016 · US
US10169334B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10169334-B2 |
| Application number | US-201514669555-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 26, 2015 |
| Priority date | Aug 14, 2014 |
| Publication date | Jan 1, 2019 |
| Grant date | Jan 1, 2019 |
Official abstract text for this publication.
A data structure is generated containing enumerators for data types of a domain, text forms of the enumerators and context patterns for the text forms. The data structure also includes information extraction rules that are associated with the enumerators. The data structure is updated with additional context patterns and text forms that are identified within a set of documents to which text analytic annotators are to be tuned. The set of documents are analyzed against the updated data structure and additional extraction rules are generated based on the analysis.
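As a rough illustration of the data structure the abstract describes, the sketch below models data types, enumerators, text forms, context patterns, and extraction rules in Python. The class names (`Enumerator`, `DomainModel`), the representation of a context pattern as a (previous token, next token) pair, and the representation of rules as plain strings are all assumptions made for this example; the patent does not prescribe any of them.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set, Tuple

# Assumption: a context pattern is modelled as (previous token, next token).
ContextPattern = Tuple[str, str]


@dataclass
class Enumerator:
    """One enumerator of a data type, with its text forms and context patterns."""
    name: str
    text_forms: Set[str] = field(default_factory=set)
    context_patterns: Set[ContextPattern] = field(default_factory=set)
    # Assumption: extraction rules are kept as opaque strings here.
    extraction_rules: List[str] = field(default_factory=list)


@dataclass
class DomainModel:
    """Maps each data type (information category) to its enumerators."""
    data_types: Dict[str, List[Enumerator]] = field(default_factory=dict)

    def add_enumerator(self, data_type: str, enumerator: Enumerator) -> None:
        self.data_types.setdefault(data_type, []).append(enumerator)

    def update(
        self,
        data_type: str,
        name: str,
        text_forms: Iterable[str] = (),
        context_patterns: Iterable[ContextPattern] = (),
    ) -> Enumerator:
        """Add newly identified text forms and context patterns to an enumerator."""
        for enumerator in self.data_types.get(data_type, []):
            if enumerator.name == name:
                enumerator.text_forms.update(text_forms)
                enumerator.context_patterns.update(context_patterns)
                return enumerator
        raise KeyError(f"no enumerator {name!r} under data type {data_type!r}")
```

A hypothetical usage: create a `DomainModel`, register an `Enumerator` under a data type, then call `update` with the text forms and context patterns found in the target documents. The `update` method corresponds loosely to the abstract's step of updating the data structure with additional context patterns and text forms.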
Opening claim text (preview).
What is claimed is:

1. A computer-implemented method of tuning one or more text analytic annotators comprising:
generating, via a processor, a data structure including domain information and one or more information extraction rules, wherein the domain information includes one or more enumerators associated with one or more data types defining respective information categories of the domain, one or more text forms associated with one or more of the enumerators representing forms of the enumerators appearing in text, and one or more context patterns associated with one or more of the text forms, wherein the one or more extraction rules are associated with the enumerators, and wherein the domain information is generic with respect to requirements of more than one organization;
tuning, via the processor, the one or more extraction rules to a specified set of unannotated documents with specialized information including domain specific terminology of a particular organization, wherein the tuning includes:
identifying one or more additional new context patterns within the set of unannotated documents for the enumerators of the generic domain information of the data structure in a first iteration through the set of unannotated documents, wherein the first iteration:
determines exact matches between tokens within the set of unannotated documents and the one or more text forms associated with the enumerators of the generic domain information;
identifies enumerators of the generic domain information in the set of unannotated documents in response to context patterns of tokens of exact matches within the set of unannotated documents matching context patterns associated with the enumerators of the generic domain information; and
extracts the new context patterns from the set of unannotated documents for enumerators of the specialized information in response to context patterns of tokens of exact matches within the set of unannotated documents not matching context patterns associated with the enumerators of the generic domain information;
identifying one or more additional new context patterns and new text forms within the set of unannotated documents for enumerators of the specialized information in a second iteration through the set of unannotated documents, wherein the second iteration:
determines partial matches between tokens within the set of unannotated documents and the one or more text forms associated with enumerators of the generic domain information, wherein the partial matches are based on matching n-grams having a length less than the tokens and text forms;
extracts the new context patterns for tokens of the partial matches within the set of unannotated documents; and
identifies the additional new context patterns and text forms in response to the extracted context patterns for the partial matches matching one of: the context patterns of the enumerators of the generic domain information and the context patterns of the specialized information from the first iteration;
updating the data structure with the additional new context patterns and text forms from the first and second iterations without user intervention to expand the generic domain information to cover the specialized information; and
analyzing the set of unannotated documents based on the updated data structure and generating one or more additional extraction rules based on the analysis;
configuring, via the processor, one or more text analytic annotators for the specialized information based at least on the additional extraction rules, identified enumerators of the generic domain information, and enumerators of the specialized information; and
processing documents with the specialized information via the configured text analytic annotators.

2. The computer-implemented method of claim 1, wherein generating the data structure includes:
generating the data structure from a closed set of the one or more data types that encompasses a closed set of information categories of a specified domain of discourse.

3. The computer-implemented method of claim 2, wherein generating the data structure from the closed set of data types includes:
generating the data structure from a closed set of the one or more enumerators for each of the closed set of data types that encompasses a closed set of information sub-categories of the information categories of the specified domain of discourse.

4. The computer-implemented method of claim 1, wherein the domain information further includes one or more document section rules associated with one or more of the enumerators and including one or more section context patterns for content within a corresponding document section, wherein the computer-implemented method further includes:
identifying document section rules of the generic domain information in the set of unannotated documents in the first iteration in response to context patterns of tokens of exact matches within the set of unannotated documents exactly matching section context patterns associated with the document section rules;
identifying document section rules for enumerators of the specialized information in response to context patterns of tokens of exact or partial matches partially matching section context patterns associated with the document section rules; and
identifying document section rules for the specialized information in the second iteration in response to context patterns of tokens of partial matches within the set of unannotated documents exactly matching at least one from a group of the section context patterns associated with the document section rules for the generic information and the specialized information.
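The two iterations recited in claim 1 can be sketched roughly as follows. This is a simplified illustration, not the patented implementation. It assumes a single enumerator, documents that are already tokenized, a context pattern modelled as a (previous token, next token) pair, and partial matching by shared character n-grams. The function names, the default `n=3`, and the boundary markers `<s>` and `</s>` are hypothetical. For brevity, the second pass only collects new text forms and leaves out the additional context patterns the claim also mentions.

```python
from typing import List, Set, Tuple

ContextPattern = Tuple[str, str]


def char_ngrams(text: str, n: int) -> Set[str]:
    """Return the set of character n-grams of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def context_of(tokens: List[str], i: int) -> ContextPattern:
    """Assumed context pattern: the tokens immediately before and after position i."""
    prev_tok = tokens[i - 1] if i > 0 else "<s>"
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return (prev_tok, next_tok)


def tune(
    text_forms: Set[str],
    contexts: Set[ContextPattern],
    docs: List[List[str]],
    n: int = 3,
):
    """Two passes over unannotated, tokenized documents for one enumerator."""
    identified = []  # (document index, token index) of confirmed occurrences
    new_contexts: Set[ContextPattern] = set()

    # First iteration: exact matches between tokens and known text forms.
    for d, tokens in enumerate(docs):
        for i, tok in enumerate(tokens):
            if tok in text_forms:
                ctx = context_of(tokens, i)
                if ctx in contexts:
                    identified.append((d, i))  # known form in a known context
                else:
                    new_contexts.add(ctx)      # known form in a new context

    # Second iteration: partial matches via n-grams shorter than token and form.
    known_contexts = contexts | new_contexts
    new_forms: Set[str] = set()
    for tokens in docs:
        for i, tok in enumerate(tokens):
            if tok in text_forms or len(tok) <= n:
                continue
            grams = char_ngrams(tok, n)
            partial = any(
                len(form) > n and grams & char_ngrams(form, n)
                for form in text_forms
            )
            # Accept the token as a new text form only if its context is known.
            if partial and context_of(tokens, i) in known_contexts:
                new_forms.add(tok)

    return identified, new_contexts, new_forms
```

In this sketch the context check in the second pass acts as a filter: a token that merely shares an n-gram with a known text form is accepted only when it also appears in a context already associated with the enumerator, either from the generic domain information or from the first pass.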
Annotation, e.g. comment data or footnotes · CPC title
Filtering based on additional data, e.g. user or group profiles (filtering in web context G06F16/9535, G06F16/9536) · CPC title
of structured data, e.g. relational data · CPC title
Search customisation based on user profiles and personalisation · CPC title
Clustering; Classification · CPC title