Information extraction and annotation systems and methods for documents

US10102193B2 · US · B2

Patent metadata
Publication number: US-10102193-B2
Application number: US-201414452484-A
Country: US
Kind code: B2
Filing date: Aug 5, 2014
Priority date: Jul 22, 2013
Publication date: Oct 16, 2018
Grant date: Oct 16, 2018

Abstract

Official abstract text for this publication.

Information extraction and annotation systems and methods for use in annotating and determining annotation instances are provided herein. Exemplary methods include receiving annotated documents, the annotated documents comprising annotated fields, analyzing the annotated documents to determine contextual information for each of the annotated fields, determining discriminative sequences using the contextual information, generating a proposed rule or a feature set using the discriminative sequences and annotated fields, and providing the proposed rule or the feature set to a document annotator.
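
The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names (`longest_common_substring`, `propose_rules`, `rule_to_extractor`), the thresholds `min_count` and `min_length`, and the choice to treat the text immediately preceding an annotated field as its "contextual information" are all assumptions made for illustration. The sketch finds the longest contiguous common sequence for each pair of contexts, counts how often each sequence recurs, and proposes the frequent ones as candidate rules.

```python
from collections import Counter
from itertools import combinations
import re


def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous run of characters shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def propose_rules(contexts, min_count=2, min_length=3):
    """Count pairwise common sequences and keep the frequent ones.

    `contexts` holds the text preceding each annotated field (an assumption
    about what the contextual information looks like).
    """
    counts = Counter()
    for left, right in combinations(contexts, 2):
        seq = longest_common_substring(left, right).strip()
        if len(seq) >= min_length:
            counts[seq] += 1
    return [seq for seq, n in counts.most_common() if n >= min_count]


def rule_to_extractor(prefix, value_pattern=r"(\S+)"):
    """Turn a proposed prefix rule into a regex that captures the value after it."""
    return re.compile(re.escape(prefix) + r"\s*" + value_pattern)
```

Given a few contexts that share the label "Invoice Date:", `propose_rules` would return that label as a candidate rule, and `rule_to_extractor` would compile it into a pattern that captures the token following the label in a target document. In the patent, a document annotator reviews the proposed rule before it is used; the sketch omits that step.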

First claim

Opening claim text (preview).

What is claimed is:

1. A method, comprising: receiving, by a context analysis module, annotated documents, the annotated documents comprising annotated fields; analyzing, by the context analysis module, the annotated documents to determine contextual information for each of the annotated fields; determining discriminative sequences using the contextual information by: determining, by a contiguity heuristics module, longest contiguous common subsequences between aligned pairs of strings of the annotated documents; determining, by the contiguity heuristics module, a frequency of occurrence of similar longest contiguous common subsequences; and wherein the contiguity heuristics module generates a proposed rule from longest contiguous common subsequences having a desired frequency of occurrence; providing, by the context analysis module, the proposed rule to a document annotator; and applying, by a rule-based annotator, a rule-based extractor to a target document to create an annotated document, the rule-based extractor being generated from the proposed rule.

2. The method of claim 1, wherein discriminative sequences are identified within the annotated documents and used to develop both rule-based extractors and feature-based extractors.

3. The method of claim 1, wherein sequence alignment is utilized to identify discriminative sequences in the context of a specified field.

4. The method of claim 1, wherein annotation of a document may include associating a word, date, or other group of characters with a particular field.

5. The method of claim 1, further comprising: executing, by the context analysis module, a base annotation of original documents to create documents with base annotations, the base annotations comprising basic categories of words or groups of characters; and providing the documents with base annotations to the document annotator via a user interface.

6. The method of claim 5, further comprising highlighting each of the base annotations within the user interface.

7. The method of claim 1, further comprising receiving feedback from the document annotator; and using the feedback to any of approve the proposed rule, modify the proposed rule, and reject the proposed rule.

8. The method of claim 1, further comprising converting, by a rule-based extractor generator, the proposed rule into the rule-based extractor.

9. The method of claim 1, wherein determining longest contiguous common subsequences comprises: aligning pairs of strings having possible contextual matches; normalizing the pairs of strings by extracting matching segments having a given length; and aggregating the normalized pairs of strings.

10. The method of claim 9, further comprising applying a greedy contiguity heuristic to the aggregated normalized pairs of strings.

11. The method of claim 10, wherein the greedy contiguity heuristic evaluates any of a number of matching segments, a number of gaps between segments, and variances between segment lengths.

12. A system, comprising: a processor; and logic encoded in one or more tangible media for execution by the processor, the logic when executed by the processor causing the system to perform operations comprising: receiving annotated documents comprising annotated fields; analyzing the annotated documents to determine contextual information for each of the annotated fields; determining discriminative sequences using the contextual information by: determining, by a contiguity heuristics module, longest contiguous common subsequences between aligned pairs of strings of the annotated documents; determining, by the contiguity heuristics module, a frequency of occurrence of similar longest contiguous common subsequences, wherein the contiguity heuristics module generates a proposed rule from longest contiguous common subsequences having a desired frequency of occurrence, wherein the proposed rule is generated using the discriminative sequences, the annotated fields and a feature set; and providing the proposed rule or the feature set to a document annotator; and wherein a rule-based annotator applies a rule-based extractor to a target document to create an annotated document, the rule-based extractor being generated from the proposed rule.

13. The system of claim 12, wherein the discriminative sequences are identified within the annotated documents and used to develop both rule-based extractors and feature-based extractors.

14. The system of claim 12, wherein the processor further executes the logic to perform operations of: executing a base annotation of original documents to create documents with base annotations, the base annotations comprising basic categories of words or groups of characters; and providing the documents with base annotations to the document annotator via a user interface.

15. The system of claim 14, wherein the processor further executes the logic to perform operations of highlighting each of the base annotations within the user interface.

16. The system of claim 12, wherein the processor further executes the logic to perform operations of receiving feedback from a document annotator; and using the feedback to any of approve the proposed rule, modify the proposed rule, and reject the proposed rule.

17. The system of claim 12, further comprising a rule-based extractor generator that converts the proposed rule into the rule-based extractor.

18. The system according to claim 17, further comprising the rule-based annotator that applies the rule-based extractor to a target document to create an annotated document.

19. A method, comprising: receiving, by a context analysis module, annotated documents, the annotated documents comprising annotated fields; analyzing, by the context analysis module, the annotated documents to determine contextual information for each of the annotated fields; determining discriminative sequences using the contextual information by: determining, by a contiguity heuristics module, longest contiguous common subsequences between aligned pairs of strings of the annotated documents; and determining, by the contiguity heuristics module, a frequency of occurrence of similar longest contiguous common subsequences, wherein the contiguity heuristics module generates a proposed rule from longest contiguous common subsequences having a desired frequency of occurrence; generating, by the context analysis module, a feature set using the discriminative sequences and the annotated fields; providing, by the context analysis module, the feature set to a document annotator; and applying, by a feature-based annotator, a feature-based extractor to a target document to create an annotated document, the feature-based extractor being generated from the feature set.
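
Claims 9 to 11 describe extracting matching segments from aligned string pairs and ranking them with a greedy contiguity heuristic that considers the number of segments, the gaps between them, and the variance of their lengths. The claims do not give a formula, so the following Python sketch is only one possible reading. The helper names (`matching_segments`, `contiguity_score`, `pick_best`), the `min_length` threshold, and the scoring formula are hypothetical; alignment is approximated with the standard library's `difflib.SequenceMatcher`, which the patent does not mention.

```python
from difflib import SequenceMatcher
from statistics import pvariance


def matching_segments(a, b, min_length=3):
    """Extract contiguous matching segments of at least min_length characters."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        a[m.a:m.a + m.size]
        for m in matcher.get_matching_blocks()
        if m.size >= min_length
    ]


def contiguity_score(segments):
    """Hypothetical score: reward matched length, penalize gaps and uneven segments."""
    if not segments:
        return float("-inf")
    lengths = [len(s) for s in segments]
    gaps = len(segments) - 1  # one gap between each pair of adjacent segments
    return sum(lengths) - gaps - pvariance(lengths)


def pick_best(pairs, min_length=3):
    """Greedily keep the segments of the single best-scoring aligned pair."""
    best_segments, best_score = [], float("-inf")
    for a, b in pairs:
        segments = matching_segments(a, b, min_length)
        score = contiguity_score(segments)
        if score > best_score:
            best_segments, best_score = segments, score
    return best_segments
```

Under this scoring, a pair that matches in one long unbroken segment outranks a pair whose match is split across several shorter segments, which is one way to express a preference for contiguity. A different weighting of segments, gaps, and variance would be equally consistent with the claim language.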

Assignees

Recommind Inc, Open Text Holdings Inc

Inventors

Classifications

  • G06F40/169 · Primary

    Annotation, e.g. comment data or footnotes · CPC title

  • Information retrieval; Database structures therefor; File system structures therefor · CPC title

  • Digital computing or data processing equipment or methods, specially adapted for specific functions (information retrieval, database structures or file system structures therefor G06F16/00) · CPC title

  • Physics · mapped topic

  • G06F17/241 · Primary

    Physics · mapped topic

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10102193B2 cover?
Information extraction and annotation systems and methods for use in annotating and determining annotation instances are provided herein. Exemplary methods include receiving annotated documents, the annotated documents comprising annotated fields, analyzing the annotated documents to determine contextual information for each of the annotated fields, determining discriminative sequences using th…
Who is the assignee on this patent?
Recommind Inc, Open Text Holdings Inc
What technology area does this patent fall under?
Primary CPC classification G06F40/169. Mapped technology areas include Physics.
When was this patent published?
Publication date Oct 16, 2018 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).