Misaligned annotation processing

US9922017B2 · US · B2

Patent metadata
Field: Value
Publication number: US-9922017-B2
Application number: US-201615158891-A
Country: US
Kind code: B2
Filing date: May 19, 2016
Priority date: May 19, 2016
Publication date: Mar 20, 2018
Grant date: Mar 20, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Aspects of processing misaligned annotations include receiving a tokenized document and offset annotation file at a processor. The tokenized document includes a source document and corresponding tokens resulting from a low-level segmentation process. Annotations from the annotation file are applied, in conjunction with tokenization rules, to the source document, and a misalignment responsive to the applying is determined. If the misalignment is caused by an offset mismatch, an offset number of characters between the position counts in the annotation file and the source document is calculated, and the position count in the annotation file is adjusted to coincide with the position count in the source document. If the misalignment is not caused by an offset mismatch, a current position count in the source document is reset to a position count of a previous location in which a most recent alignment between the annotation file and the source document was ascertained.
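The two branches described in the abstract can be illustrated with a minimal Python sketch. This is not the patented implementation: the annotation format (start, end, surface text, label), the `window` search radius, and the names `realign`, `shift`, and `last_aligned` are all assumptions made for illustration. An offset mismatch is treated here as "the annotated text is found nearby at a different character position"; anything else falls back to the most recent aligned position.

```python
from typing import List, Tuple

# Hypothetical annotation record: (start offset, end offset, surface text, label).
Annotation = Tuple[int, int, str, str]


def realign(source: str, annotations: List[Annotation], window: int = 20):
    """Apply offset annotations to `source`, correcting offset mismatches.

    Returns (aligned, skipped). `window` is an assumed search radius in
    characters; the patent does not specify one.
    """
    shift = 0          # running correction applied to annotation offsets
    last_aligned = 0   # source position of the most recent confirmed alignment
    aligned: List[Annotation] = []
    skipped: List[Annotation] = []

    for start, end, text, label in annotations:
        s, e = max(0, start + shift), max(0, end + shift)
        if source[s:e] == text:
            # Annotation and source agree at this position count.
            aligned.append((s, e, text, label))
            last_aligned = e
            continue

        # Misalignment: look for the annotated text near the expected spot.
        lo = max(last_aligned, s - window)
        hi = min(len(source), e + window)
        found = source.find(text, lo, hi)
        if found != -1:
            # Offset mismatch: compute the character offset and adjust.
            shift += found - s
            aligned.append((found, found + len(text), text, label))
            last_aligned = found + len(text)
        else:
            # Not an offset mismatch: stay at the last aligned position
            # and record the annotation as unresolved.
            skipped.append((start, end, text, label))

    return aligned, skipped


# Offsets below were computed against the text without its two leading spaces.
source = "  Dr. Smith visited New York."
annotations = [
    (4, 9, "Smith", "PERSON"),
    (18, 26, "New York", "GPE"),
    (30, 35, "Paris", "GPE"),   # refers to a token that does not exist
]
aligned, skipped = realign(source, annotations)
```

In this example the first annotation triggers the offset-mismatch branch (a shift of two characters), the second then aligns directly, and the third is left unresolved because no matching text exists after the last aligned position.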

First claim

Opening claim text (preview).

What is claimed is:
1. A method, comprising: receiving a tokenized document and an offset annotation file at a computer processor, the tokenized document including a source document and corresponding tokens resulting from a low-level segmentation process; applying annotations from the offset annotation file, in conjunction with a set of tokenization rules, to the source document; determining an occurrence of a misalignment responsive to the applying, the misalignment indicating at least one of an annotation and associated position count in the annotation file does not coincide with a term at a corresponding position count in the source document, and the annotation file indicates a non-existent or inconsistent token in the source document; upon determining the misalignment is caused by an offset mismatch, calculating an offset number of characters between the position counts in the annotation file and the source document, and adjusting the position count in the annotation file to coincide with the position count in the source document; and upon determining the misalignment is not caused by an offset mismatch, resetting a current position count in the source document to at least one of a position count of a previous location in which a most recent alignment between the annotation file and the source document was ascertained, and a next identified calibration annotation in the annotation file.
2. The method of claim 1, wherein the set of tokenization rules is a first set of tokenization rules applied to the tokenized document, the method further comprising iteratively selecting and applying sets of tokenization rules until all sets of the tokenization rules have been exhausted or the tokenized document is successfully completed.
3. The method of claim 1, further comprising tagging, during the applying the set of tokenization rules, a current position count in the annotation file with a calibration annotation, the calibration annotation used in determining the location of the most recent alignment.
4. The method of claim 1, wherein the calibration annotation identifies a location of a known offset, character count, and token position in the source tokenized document.
5. The method of claim 1, wherein the annotation file includes at least one of part-of-speech tags, named entity tags, syntactic inference tags, and semantic inference tags.
6. The method of claim 1, further comprising storing one or more of the sets of the tokenization rules learned from previously processed annotation files.
7. The method of claim 1, wherein the set of tokenization rules defines how elements of the tokenized document are processed, the elements including punctuation, hyphenation, and whitespace.
8. A system, comprising: a memory having computer readable instructions; and a processor for executing the computer readable instructions, the computer readable instructions including: receiving a tokenized document and an offset annotation file at the processor, the tokenized document including a source document and corresponding tokens resulting from a low-level segmentation process; applying annotations from the offset annotation file, in conjunction with a set of tokenization rules, to the source document; determining an occurrence of a misalignment responsive to the applying, the misalignment indicating at least one of an annotation and associated position count in the annotation file does not coincide with a term at a corresponding position count in the source document, and the annotation file refers to a non-existent or inconsistent token in the source document; upon determining the misalignment is caused by an offset mismatch, calculating an offset number of characters between the position counts in the annotation file and the source document, and adjusting the position count in the annotation file to coincide with the position count in the source document; and upon determining the misalignment is not caused by an offset mismatch, resetting a current position count in the source document to at least one of a position count of a previous location in which a most recent alignment between the annotation file and the source document was ascertained, and a next identified calibration annotation in the annotation file.
9. The system of claim 8, wherein the set of tokenization rules is a first set of tokenization rules applied to the tokenized document, wherein the instructions further include iteratively selecting and applying sets of tokenization rules until all sets of the tokenization rules have been exhausted or the tokenized document is successfully completed.
10. The system of claim 8, wherein the instructions further include tagging, during the applying the set of tokenization rules, a current position count in the annotation file with a calibration annotation, the calibration annotation used in determining the location of the most recent alignment.
11. The system of claim 8, wherein the calibration annotation identifies a location of a known offset, character count, and token position in the source tokenized document.
12. The system of claim 8, wherein the annotation file includes at least one of part-of-speech tags and named entity tags.
13. The system of claim 8, wherein the instructions further include storing one or more of the sets of the tokenization rules learned from previously processed annotation files.
14. The system of claim 8, wherein the set of tokenization rules defines how elements of the tokenized document are processed, the elements including punctuation, hyphenation, and whitespace.
15. A computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor to cause the computer processor to perform: receiving a tokenized document and an offset annotation file, the tokenized document including a source document and corresponding tokens resulting from a low-level segmentation process; applying annotations from the offset annotation file, in conjunction with a set of tokenization rules, to the source document; determining an occurrence of a misalignment responsive to the applying, the misalignment indicating at least one of an annotation and associated position count in the annotation file does not coincide with a term at a corresponding position count in the source document, and the annotation file refers to a non-existent or inconsistent token in the source document; upon determining the misalignment is caused by an offset mismatch, calculating an offset number of characters between the position counts in the annotation file and the source document, and adjusting the position count in the annotation file to coincide with the position count in the source document; and upon determining the misalignment is not caused by an offset mismatch, resetting a current position count in the source document to at least one of a position count of a previous location in which a most recent alignment between the annotation file and the source document was ascertained, and a next identified calibration annotation in the annotation file.
16. The computer program product of claim 15, wherein the set of tokenization rules is a first set of tokenization rules applied to the tokenized document, and wherein the program instructions executable by the processor further cause the computer processor to perform: iteratively selecting and applying sets of tokenization rules until all sets of the tokenization rules have been exhausted or the tokenized document is successfully completed.
17.
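
Claims 2 and 7 describe trying successive sets of tokenization rules (covering punctuation, hyphenation, and whitespace) until one succeeds or all are exhausted. A rough Python sketch of that loop follows. The three regular-expression rule sets, their names, and the success test (tokens match those expected by the annotation file) are assumptions chosen for illustration; the claims do not define concrete rules.

```python
import re
from typing import Dict, List, Optional

# Hypothetical rule sets, each expressed as one regex for re.findall.
RULE_SETS: Dict[str, str] = {
    # Split on whitespace only; punctuation stays attached to words.
    "whitespace_only": r"\S+",
    # Split off punctuation but keep hyphenated words together.
    "split_punctuation": r"\w+(?:-\w+)*|[^\w\s]",
    # Split off punctuation and treat hyphens as separate tokens.
    "split_hyphens": r"\w+|[^\w\s]",
}


def tokenize(text: str, pattern: str) -> List[str]:
    """Tokenize `text` with a single regex rule set."""
    return re.findall(pattern, text)


def select_rule_set(text: str, expected_tokens: List[str]) -> Optional[str]:
    """Try each rule set in order; return the first whose output matches
    the tokens the annotation file expects, or None if all are exhausted."""
    for name, pattern in RULE_SETS.items():
        if tokenize(text, pattern) == expected_tokens:
            return name
    return None


text = "A well-known fix, finally."
expected = ["A", "well", "-", "known", "fix", ",", "finally", "."]
chosen = select_rule_set(text, expected)
```

Here the first two rule sets produce token sequences that differ from `expected`, so the loop moves on until `split_hyphens` matches. A rule set that succeeds could be stored for reuse with later annotation files, in the spirit of claim 6.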

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9922017B2 cover?
Aspects of processing misaligned annotations include receiving a tokenized document and offset annotation file at a processor. The tokenized document includes a source document and corresponding tokens resulting from a low-level segmentation process. Annotations from the annotation file are applied, in conjunction with tokenization rules, to the source document, and a misalignment responsive to…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F40/169. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 20, 2018 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).