Augmented Text Search with Syntactic Information
US-2016371253-A1 · Dec 22, 2016 · US
US9922017B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9922017-B2 |
| Application number | US-201615158891-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 19, 2016 |
| Priority date | May 19, 2016 |
| Publication date | Mar 20, 2018 |
| Grant date | Mar 20, 2018 |
Official abstract text for this publication.
Aspects of processing misaligned annotations include receiving a tokenized document and offset annotation file at a processor. The tokenized document includes a source document and corresponding tokens resulting from a low-level segmentation process. Annotations from the annotation file are applied, in conjunction with tokenization rules, to the source document, and a misalignment responsive to the applying is determined. If the misalignment is caused by an offset mismatch, an offset number of characters between the position counts in the annotation file and the source document is calculated, and the position count in the annotation file is adjusted to coincide with the position count in the source document. If the misalignment is not caused by an offset mismatch, a current position count in the source document is reset to a position count of a previous location in which a most recent alignment between the annotation file and the source document was ascertained.
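The two branches described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that each annotation carries a character position and the term it covers, that an "offset mismatch" means the term can be found within a small window of the expected position, and that "resetting" means resuming the search from the end of the most recent successful alignment. The names `Annotation`, `find_shift`, `align`, and the `window` parameter are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Annotation:
    start: int  # character position claimed by the annotation file
    text: str   # term the annotation is supposed to cover
    label: str  # e.g. a part-of-speech tag


def find_shift(source: str, ann: Annotation, shift: int, window: int) -> Optional[int]:
    """Look for ann.text within `window` characters of its expected position.

    Returns the corrected cumulative shift, or None if no nearby match exists
    (treated here as "not an offset mismatch").
    """
    expected = ann.start + shift
    for delta in sorted(range(-window, window + 1), key=abs):
        pos = expected + delta
        if pos >= 0 and source.startswith(ann.text, pos):
            return shift + delta
    return None


def align(source: str, annotations: List[Annotation], window: int = 5
          ) -> Tuple[List[Tuple[int, str, str]], List[Annotation]]:
    shift = 0         # running correction added to annotation positions
    last_aligned = 0  # source position just past the most recent alignment
    aligned: List[Tuple[int, str, str]] = []
    unresolved: List[Annotation] = []
    for ann in annotations:
        pos = ann.start + shift
        if not (pos >= 0 and source.startswith(ann.text, pos)):
            # Misalignment detected: first try to explain it as an offset mismatch.
            new_shift = find_shift(source, ann, shift, window)
            if new_shift is None:
                # Assumed fallback: resume from the last known-good position.
                found = source.find(ann.text, last_aligned)
                if found == -1:
                    unresolved.append(ann)
                    continue
                new_shift = found - ann.start
            shift = new_shift
            pos = ann.start + shift
        aligned.append((pos, ann.text, ann.label))
        last_aligned = pos + len(ann.text)
    return aligned, unresolved
```

For example, if the annotation file was produced from a whitespace-normalized copy of the source, an extra space in the source shifts every later position by one character; in this sketch the first affected annotation triggers the offset calculation and the following annotations reuse the corrected shift.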
Opening claim text (preview).
What is claimed is:
1. A method, comprising: receiving a tokenized document and an offset annotation file at a computer processor, the tokenized document including a source document and corresponding tokens resulting from a low-level segmentation process; applying annotations from the offset annotation file, in conjunction with a set of tokenization rules, to the source document; determining an occurrence of a misalignment responsive to the applying, the misalignment indicating at least one of an annotation and associated position count in the annotation file does not coincide with a term at a corresponding position count in the source document, and the annotation file indicates a non-existent or inconsistent token in the source document; upon determining the misalignment is caused by an offset mismatch, calculating an offset number of characters between the position counts in the annotation file and the source document, and adjusting the position count in the annotation file to coincide with the position count in the source document; and upon determining the misalignment is not caused by an offset mismatch, resetting a current position count in the source document to at least one of a position count of a previous location in which a most recent alignment between the annotation file and the source document was ascertained, and a next identified calibration annotation in the annotation file.
2. The method of claim 1, wherein the set of tokenization rules is a first set of tokenization rules applied to the tokenized document, the method further comprising iteratively selecting and applying sets of tokenization rules until all sets of the tokenization rules have been exhausted or the tokenized document is successfully completed.
3. The method of claim 1, further comprising tagging, during the applying the set of tokenization rules, a current position count in the annotation file with a calibration annotation, the calibration annotation used in determining the location of the most recent alignment.
4. The method of claim 1, wherein the calibration annotation identifies a location of a known offset, character count, and token position in the source tokenized document.
5. The method of claim 1, wherein the annotation file includes at least one of part-of-speech tags, named entity tags, syntactic inference tags, and semantic inference tags.
6. The method of claim 1, further comprising storing one or more of the sets of the tokenization rules learned from previously processed annotation files.
7. The method of claim 1, wherein the set of tokenization rules defines how elements of the tokenized document are processed, the elements including punctuation, hyphenation, and whitespace.
8. A system, comprising: a memory having computer readable instructions; and a processor for executing the computer readable instructions, the computer readable instructions including: receiving a tokenized document and an offset annotation file at the processor, the tokenized document including a source document and corresponding tokens resulting from a low-level segmentation process; applying annotations from the offset annotation file, in conjunction with a set of tokenization rules, to the source document; determining an occurrence of a misalignment responsive to the applying, the misalignment indicating at least one of an annotation and associated position count in the annotation file does not coincide with a term at a corresponding position count in the source document, and the annotation file refers to a non-existent or inconsistent token in the source document; upon determining the misalignment is caused by an offset mismatch, calculating an offset number of characters between the position counts in the annotation file and the source document, and adjusting the position count in the annotation file to coincide with the position count in the source document; and upon determining the misalignment is not caused by an offset mismatch, resetting a current position count in the source document to at least one of a position count of a previous location in which a most recent alignment between the annotation file and the source document was ascertained, and a next identified calibration annotation in the annotation file.
9. The system of claim 8, wherein the set of tokenization rules is a first set of tokenization rules applied to the tokenized document, wherein the instructions further include iteratively selecting and applying sets of tokenization rules until all sets of the tokenization rules have been exhausted or the tokenized document is successfully completed.
10. The system of claim 8, wherein the instructions further include tagging, during the applying the set of tokenization rules, a current position count in the annotation file with a calibration annotation, the calibration annotation used in determining the location of the most recent alignment.
11. The system of claim 8, wherein the calibration annotation identifies a location of a known offset, character count, and token position in the source tokenized document.
12. The system of claim 8, wherein the annotation file includes at least one of part-of-speech tags and named entity tags.
13. The system of claim 8, wherein the instructions further include storing one or more of the sets of the tokenization rules learned from previously processed annotation files.
14. The system of claim 8, wherein the set of tokenization rules defines how elements of the tokenized document are processed, the elements including punctuation, hyphenation, and whitespace.
15. A computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor to cause the computer processor to perform: receiving a tokenized document and an offset annotation file, the tokenized document including a source document and corresponding tokens resulting from a low-level segmentation process; applying annotations from the offset annotation file, in conjunction with a set of tokenization rules, to the source document; determining an occurrence of a misalignment responsive to the applying, the misalignment indicating at least one of an annotation and associated position count in the annotation file does not coincide with a term at a corresponding position count in the source document, and the annotation file refers to a non-existent or inconsistent token in the source document; upon determining the misalignment is caused by an offset mismatch, calculating an offset number of characters between the position counts in the annotation file and the source document, and adjusting the position count in the annotation file to coincide with the position count in the source document; and upon determining the misalignment is not caused by an offset mismatch, resetting a current position count in the source document to at least one of a position count of a previous location in which a most recent alignment between the annotation file and the source document was ascertained, and a next identified calibration annotation in the annotation file.
16. The computer program product of claim 15, wherein the set of tokenization rules is a first set of tokenization rules applied to the tokenized document, and wherein the program instructions executable by the processor further cause the computer processor to perform: iteratively selecting and applying sets of tokenization rules until all sets of the tokenization rules have been exhausted or the tokenized document is successfully completed.
17.
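Claims 2, 9, and 16 recite iteratively selecting and applying sets of tokenization rules until one succeeds or all are exhausted, and claims 7 and 14 say the rules govern punctuation, hyphenation, and whitespace. The Python sketch below illustrates that loop under simplifying assumptions: each rule set is reduced to a single regular expression, and "successfully completed" is taken to mean that the rule set reproduces the token sequence implied by the annotation file. `RuleSet`, `RULE_SETS`, `tokenize`, and `select_rule_set` are hypothetical names, not terms from the patent.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class RuleSet:
    name: str
    pattern: str  # regular expression whose matches are the tokens


# Three illustrative rule sets that differ in how they treat
# whitespace, punctuation, and hyphenation.
RULE_SETS = [
    RuleSet("whitespace_only", r"\S+"),
    RuleSet("keep_hyphens", r"\w+(?:-\w+)*|[^\w\s]"),
    RuleSet("split_everything", r"\w+|[^\w\s]"),
]


def tokenize(text: str, rules: RuleSet) -> List[str]:
    """Apply one rule set to the source text."""
    return re.findall(rules.pattern, text)


def select_rule_set(text: str, expected_tokens: Sequence[str],
                    rule_sets: Sequence[RuleSet] = RULE_SETS) -> Optional[RuleSet]:
    """Try each rule set in turn; stop at the first one whose tokens match
    the annotation file's tokens. Returns None when every set is exhausted."""
    for rules in rule_sets:
        if tokenize(text, rules) == list(expected_tokens):
            return rules
    return None
```

A rule set that succeeds could be cached and tried first on later annotation files, which is one way to read the "storing ... tokenization rules learned from previously processed annotation files" language of claims 6 and 13; the sketch leaves that step out.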
Document management systems · CPC title
Annotation, e.g. comment data or footnotes · CPC title
Physics · mapped topic