Systematic tuning of text analytic annotators with specialized information

US10169334B2 · US · B2

Patent metadata
Field               Value
Publication number  US-10169334-B2
Application number  US-201514669555-A
Country             US
Kind code           B2
Filing date         Mar 26, 2015
Priority date       Aug 14, 2014
Publication date    Jan 1, 2019
Grant date          Jan 1, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A data structure is generated containing enumerators for data types of a domain, text forms of the enumerators and context patterns for the text forms. The data structure also includes information extraction rules that are associated with the enumerators. The data structure is updated with additional context patterns and text forms that are identified within a set of documents to which text analytic annotators are to be tuned. The set of documents are analyzed against the updated data structure and additional extraction rules are generated based on the analysis.
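
As a rough illustration of the data structure the abstract describes, the sketch below models enumerators with their text forms, context patterns, and associated extraction rules, plus an update step that adds newly found forms and patterns. The class names, field names, the medical example values, and the representation of a context pattern as a (previous token, next token) pair are all assumptions made for illustration; the patent text shown here does not prescribe any of them.

```python
from dataclasses import dataclass, field


@dataclass
class Enumerator:
    """One value of a domain data type (hypothetical example: 'hypertension' of type 'diagnosis')."""
    name: str
    data_type: str
    text_forms: set = field(default_factory=set)        # surface forms seen in text
    context_patterns: set = field(default_factory=set)  # assumed: (prev_token, next_token) pairs
    extraction_rules: list = field(default_factory=list)


@dataclass
class DomainModel:
    """Container for all enumerators of a domain, keyed by enumerator name."""
    enumerators: dict = field(default_factory=dict)

    def add(self, enumerator):
        self.enumerators[enumerator.name] = enumerator

    def update(self, name, text_forms=(), context_patterns=()):
        # Extend an existing enumerator with forms and patterns found in the tuning documents.
        entry = self.enumerators[name]
        entry.text_forms.update(text_forms)
        entry.context_patterns.update(context_patterns)


model = DomainModel()
model.add(Enumerator("hypertension", "diagnosis",
                     {"hypertension"}, {("of", "was")}, ["rule-1"]))
model.update("hypertension",
             text_forms={"hypertensive"},
             context_patterns={("dx", "confirmed")})
```

Keeping text forms and context patterns as sets makes the update step idempotent: re-adding a form or pattern that is already known changes nothing.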

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method of tuning one or more text analytic annotators comprising:

  generating, via a processor, a data structure including domain information and one or more information extraction rules, wherein the domain information includes one or more enumerators associated with one or more data types defining respective information categories of the domain, one or more text forms associated with one or more of the enumerators representing forms of the enumerators appearing in text, and one or more context patterns associated with one or more of the text forms, wherein the one or more extraction rules are associated with the enumerators, and wherein the domain information is generic with respect to requirements of more than one organization;

  tuning, via the processor, the one or more extraction rules to a specified set of unannotated documents with specialized information including domain specific terminology of a particular organization, wherein the tuning includes:

    identifying one or more additional new context patterns within the set of unannotated documents for the enumerators of the generic domain information of the data structure in a first iteration through the set of unannotated documents, wherein the first iteration:

      determines exact matches between tokens within the set of unannotated documents and the one or more text forms associated with the enumerators of the generic domain information;

      identifies enumerators of the generic domain information in the set of unannotated documents in response to context patterns of tokens of exact matches within the set of unannotated documents matching context patterns associated with the enumerators of the generic domain information; and

      extracts the new context patterns from the set of unannotated documents for enumerators of the specialized information in response to context patterns of tokens of exact matches within the set of unannotated documents not matching context patterns associated with the enumerators of the generic domain information;

    identifying one or more additional new context patterns and new text forms within the set of unannotated documents for enumerators of the specialized information in a second iteration through the set of unannotated documents, wherein the second iteration:

      determines partial matches between tokens within the set of unannotated documents and the one or more text forms associated with enumerators of the generic domain information, wherein the partial matches are based on matching n-grams having a length less than the tokens and text forms;

      extracts the new context patterns for tokens of the partial matches within the set of unannotated documents; and

      identifies the additional new context patterns and text forms in response to the extracted context patterns for the partial matches matching one of: the context patterns of the enumerators of the generic domain information and the context patterns of the specialized information from the first iteration;

    updating the data structure with the additional new context patterns and text forms from the first and second iterations without user intervention to expand the generic domain information to cover the specialized information; and

    analyzing the set of unannotated documents based on the updated data structure and generating one or more additional extraction rules based on the analysis;

  configuring, via the processor, one or more text analytic annotators for the specialized information based at least on the additional extraction rules, identified enumerators of the generic domain information, and enumerators of the specialized information; and

  processing documents with the specialized information via the configured text analytic annotators.

2. The computer-implemented method of claim 1, wherein generating the data structure includes:

  generating the data structure from a closed set of the one or more data types that encompasses a closed set of information categories of a specified domain of discourse.

3. The computer-implemented method of claim 2, wherein generating the data structure from the closed set of data types includes:

  generating the data structure from a closed set of the one or more enumerators for each of the closed set of data types that encompasses a closed set of information sub-categories of the information categories of the specified domain of discourse.

4. The computer-implemented method of claim 1, wherein the domain information further includes one or more document section rules associated with one or more of the enumerators and including one or more section context patterns for content within a corresponding document section, wherein the computer-implemented method further includes:

  identifying document section rules of the generic domain information in the set of unannotated documents in the first iteration in response to context patterns of tokens of exact matches within the set of unannotated documents exactly matching section context patterns associated with the document section rules;

  identifying document section rules for enumerators of the specialized information in response to context patterns of tokens of exact or partial matches partially matching section context patterns associated with the document section rules; and

  identifying document section rules for the specialized information in the second iteration in response to context patterns of tokens of partial matches within the set of unannotated documents exactly matching at least one from a group of the section context patterns associated with the document section rules for the generic information and the specialized information.
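
The two iterations recited in claim 1 can be sketched roughly as follows. This is a simplified illustration, not the patented implementation: it assumes documents are already tokenized, treats a context pattern as the pair of neighbouring tokens, and uses shared character trigrams as the partial-match test. The function names (`context_of`, `partial_match`, `tune`) and the sample data are hypothetical.

```python
def context_of(tokens, i):
    """Assumed context pattern: the tokens immediately before and after position i."""
    prev_tok = tokens[i - 1] if i > 0 else "<s>"
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return (prev_tok, next_tok)


def ngrams(word, n):
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def partial_match(token, form, n=3):
    """True if token and form differ but share a character n-gram shorter than both."""
    if token == form or len(token) <= n or len(form) <= n:
        return False
    return bool(ngrams(token, n) & ngrams(form, n))


def tune(docs, text_forms, contexts, n=3):
    """Return (new_forms, new_contexts) per enumerator found in the token lists `docs`."""
    new_contexts = {enum: set() for enum in text_forms}
    new_forms = {enum: set() for enum in text_forms}

    # First iteration: exact matches. An unknown context around a known
    # text form is recorded as a new context pattern.
    for tokens in docs:
        for i, tok in enumerate(tokens):
            for enum, forms in text_forms.items():
                if tok in forms:
                    ctx = context_of(tokens, i)
                    if ctx not in contexts[enum]:
                        new_contexts[enum].add(ctx)

    # Second iteration: partial matches. A partially matching token is
    # accepted as a new text form only if its context is already known,
    # either from the generic model or from the first iteration.
    for tokens in docs:
        for i, tok in enumerate(tokens):
            for enum, forms in text_forms.items():
                if tok in forms:
                    continue
                if any(partial_match(tok, form, n) for form in forms):
                    ctx = context_of(tokens, i)
                    if ctx in contexts[enum] or ctx in new_contexts[enum]:
                        new_forms[enum].add(tok)

    return new_forms, new_contexts


docs = [
    ["history", "of", "hypertension", "was", "noted"],
    ["dx", "hypertension", "confirmed"],
    ["dx", "hypertensive", "confirmed"],
    ["the", "hyper", "child"],
]
text_forms = {"hypertension": {"hypertension"}}
contexts = {"hypertension": {("of", "was")}}
new_forms, new_contexts = tune(docs, text_forms, contexts)
```

In this sample, "hypertensive" is picked up as a new text form because it shares trigrams with "hypertension" and appears in a context learned during the first iteration, while "hyper" also matches partially but is rejected because its context is unknown. Requiring a known context is what keeps loose n-gram matching from flooding the model with unrelated tokens.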

Assignees

IBM

Inventors

Classifications

  • G06F40/169 (Primary)

    Annotation, e.g. comment data or footnotes · CPC title

  • Filtering based on additional data, e.g. user or group profiles (filtering in web context G06F16/9535, G06F16/9536) · CPC title

  • of structured data, e.g. relational data · CPC title

  • Search customisation based on user profiles and personalisation · CPC title

  • Clustering; Classification · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10169334B2 cover?
A data structure is generated containing enumerators for data types of a domain, text forms of the enumerators and context patterns for the text forms. The data structure also includes information extraction rules that are associated with the enumerators. The data structure is updated with additional context patterns and text forms that are identified within a set of documents to which text ana…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F40/169. Mapped technology areas include Physics.
When was this patent published?
Publication date Jan 1, 2019 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).