Systems and methods for detection and correction of OCR text

US2023083000A1 · US · A1

Patent metadata
Publication number: US-2023083000-A1
Application number: US-202217895818-A
Country: US
Kind code: A1
Filing date: Aug 25, 2022
Priority date: Aug 27, 2021
Publication date: Mar 16, 2023
Grant date: not listed

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

OCR-text correction system and method embodiments are described. The OCR-text correction embodiments comprise or cooperate with a transformer-based sequence-to-sequence language model. The model is pretrained to denoise corrupted text and is fine-tuned using OCR-correction-specific examples. Text obtained at least in part through OCR is applied to the fine-tuned pretrained transformer model to detect at least one error in a subset of the text. Responsive to detecting the at least one error, the fine-tuned pretrained transformer model outputs an updated subset of the text to correct the at least one error.
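
The abstract describes a model that both detects an error in a subset of the text and outputs an updated subset. One way to picture this is to compare the OCR input with the corrected output and report the spans that changed. The sketch below is illustrative only: `toy_corrector` is a hypothetical lookup-based stand-in for the fine-tuned transformer, and the function names `find_corrections` and `detect_and_correct` are assumptions that do not come from the patent.

```python
import difflib
from typing import Callable, List, Tuple


def find_corrections(ocr_text: str, corrected_text: str) -> List[Tuple[str, str]]:
    """Return (original span, updated span) pairs where the two texts differ."""
    src = ocr_text.split()
    dst = corrected_text.split()
    matcher = difflib.SequenceMatcher(a=src, b=dst, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Each non-equal opcode marks a subset of the text that was changed.
            changes.append((" ".join(src[i1:i2]), " ".join(dst[j1:j2])))
    return changes


def toy_corrector(text: str) -> str:
    """Hypothetical stand-in for the fine-tuned model: a fixed lookup table."""
    fixes = {"tbe": "the", "coun ty": "county"}
    for wrong, right in fixes.items():
        text = text.replace(wrong, right)
    return text


def detect_and_correct(
    ocr_text: str, corrector: Callable[[str], str]
) -> Tuple[str, List[Tuple[str, str]]]:
    """Run the corrector once, then derive the changed spans from its output."""
    corrected = corrector(ocr_text)
    return corrected, find_corrections(ocr_text, corrected)


if __name__ == "__main__":
    corrected, changes = detect_and_correct("born in tbe coun ty of Kent", toy_corrector)
    print(corrected)
    print(changes)
```

In this sketch the corrector runs once and the changed spans are read off its output, which loosely mirrors the single-step detection and correction named in claims 5 and 17. A real system would replace `toy_corrector` with a call to the fine-tuned sequence-to-sequence model.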

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method, comprising: accessing a machine learning model that is pretrained by a first training dataset, the machine learning model pretrained to perform a non-optical character recognition (non-OCR) task; adjusting the machine learning model using a second training dataset, the second training dataset comprising OCR samples, the machine learning model adjusted to perform an OCR task; receiving a document that includes text obtained at least in part through OCR; applying the adjusted machine learning model to the text to detect at least one error in a subset of the text; and outputting an updated subset of the text to correct the at least one error in the subset of the text.
2. The method of claim 1, wherein the pretrained transformer model is bidirectional autoregressive transformer model, the bidirectional autoregressive transformer model including: a bidirectional encoder configured to receive the text; and an autoregressive decoder configured to detect the at least one error in the text and correct the at least one error in the text by predicting original text.
3. The method of claim 1, wherein the first training dataset includes one or more of the following: token masking, token deletion, sentence permutation, document rotation, or text infilling.
4. The method of any of claim 1, wherein the second training dataset includes monograph and periodical example sentences.
5. The method of claim 1, wherein the fine-tuned pretrained transformer model is configured to perform the detection and correction of the at least one error in a single step.
6. The method of claim 1, wherein the transformer model is configured to correct the at least one error in the text without being trained on alignment characters.
7. The method of claim 1, wherein the first training dataset comprises fewer than 1,000 documents.
8. The method of claim 1, wherein the at least one error includes an oversegmentation error caused by incorrectly segmenting a single word into two separate words by OCR.
9. The method of claim 1, wherein the at least one error includes an undersegmentation error caused by incorrectly combining a plurality of words into a single word by OCR.
10. The method of claim 1, wherein the at least one error includes a misrecognized character error caused by incorrectly recognizing a character by OCR.
11. The method of claim 1, wherein the at least one error includes a missing character error caused by incorrectly omitting a character by OCR.
12. The method of claim 1, wherein the at least one error includes a hallucination error caused by incorrectly inserting a non-existing character by OCR.
13. A computer system for detecting and/or correcting text, comprising: a processor; and memory in communication with the processor, the memory configured to store instructions that, when executed by the processor, cause the processor to: access a pretrained transformer model pretrained using a first training dataset; fine-tune the pretrained transformer model using a second training dataset; provide text obtained at least in part through optical character recognition (OCR); apply the text to the fine-tuned pretrained transformer model to detect at least one error in a subset of the text; and output an updated subset of the text by the fine-tuned pretrained transformer model to correct the at least one error in the subset of the text.
14. The computer system of claim 13, wherein the pretrained transformer model is bidirectional autoregressive transformer model including: a bidirectional encoder configured to receive the text; and an autoregressive decoder configured to detect the at least one error in the text and correct the at least one error in the text by predicting original text.
15. The computer system of claim 13 wherein the first training dataset includes one or more of the following: token masking, token deletion, sentence permutation, document rotation, or text infilling.
16. The computer system of claim 13, wherein the second training dataset includes monograph and periodical example sentences.
17. The computer system of claim 13, wherein the fine-tuned pretrained transformer model is configured to perform the detection and correction of the at least one error in a single step.
18. The computer system of claim 13, wherein the transformer model is configured to correct the at least one error in the text without being trained on alignment characters.
19. The computer system of claim 13, wherein the first training dataset comprises fewer than 1,000 documents.
20. A non-transitory computer readable storage medium configured to store code comprising instructions, wherein the instructions, when executed by a processor, cause the processor to: access a pretrained transformer model pretrained using a first training dataset; fine-tune the pretrained transformer model using a second training dataset; provide text obtained at least in part through optical character recognition (OCR); apply the text to the fine-tuned pretrained transformer model to detect at least one error in a subset of the text; and output an updated subset of the text by the fine-tuned pretrained transformer model to correct the at least one error in the subset of the text.
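
Claims 8 to 12 name five OCR error types: oversegmentation, undersegmentation, misrecognized character, missing character, and hallucination. The sketch below shows what each looks like by injecting it into clean text. It is an illustration under stated assumptions, not the patent's data pipeline: the claims describe the second training dataset as OCR samples, whereas the synthetic injection here, the helper names, and the `make_training_pair` function are all hypothetical.

```python
import random
import string
from typing import Callable, List, Tuple

ALPHABET = string.ascii_lowercase


def oversegment(text: str, rng: random.Random) -> str:
    """Split one word into two (claim 8)."""
    words = text.split()
    candidates = [i for i, w in enumerate(words) if len(w) >= 2]
    if not candidates:
        return text
    i = rng.choice(candidates)
    cut = rng.randint(1, len(words[i]) - 1)
    words[i] = words[i][:cut] + " " + words[i][cut:]
    return " ".join(words)


def undersegment(text: str, rng: random.Random) -> str:
    """Merge two adjacent words into one (claim 9)."""
    words = text.split()
    if len(words) < 2:
        return text
    i = rng.randrange(len(words) - 1)
    words[i:i + 2] = [words[i] + words[i + 1]]
    return " ".join(words)


def _letter_positions(text: str) -> List[int]:
    return [i for i, c in enumerate(text) if c.isalpha()]


def misrecognize(text: str, rng: random.Random) -> str:
    """Replace one letter with a different letter (claim 10)."""
    positions = _letter_positions(text)
    if not positions:
        return text
    i = rng.choice(positions)
    substitute = rng.choice([c for c in ALPHABET if c != text[i].lower()])
    return text[:i] + substitute + text[i + 1:]


def drop_character(text: str, rng: random.Random) -> str:
    """Omit one letter (claim 11)."""
    positions = _letter_positions(text)
    if not positions:
        return text
    i = rng.choice(positions)
    return text[:i] + text[i + 1:]


def hallucinate(text: str, rng: random.Random) -> str:
    """Insert one letter that is not in the source (claim 12)."""
    i = rng.randint(0, len(text))
    return text[:i] + rng.choice(ALPHABET) + text[i:]


ERROR_TYPES: List[Callable[[str, random.Random], str]] = [
    oversegment, undersegment, misrecognize, drop_character, hallucinate,
]


def make_training_pair(clean: str, rng: random.Random) -> Tuple[str, str]:
    """Hypothetical helper: return a (noisy input, clean target) pair."""
    corrupt = rng.choice(ERROR_TYPES)
    return corrupt(clean, rng), clean


if __name__ == "__main__":
    rng = random.Random(0)
    for corrupt in ERROR_TYPES:
        print(corrupt.__name__, "->", corrupt("the quick brown fox", rng))
```

Each helper changes the text in exactly one place, so a (noisy, clean) pair isolates a single error type. Real OCR output usually mixes several errors per line.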

Assignees

Ancestry Com Operations Inc

Inventors

Classifications

  • Techniques for post-processing, e.g. correcting the recognition result · CPC title

  • Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title

  • G06V30/133 · Primary

    Evaluation of quality of the acquired characters · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2023083000A1 cover?
OCR-text correction system and method embodiments are described. The OCR-text correction embodiments comprise or cooperate with a transformer-based sequence-to-sequence language model. The model is pretrained to denoise corrupted text and is fine-tuned using OCR-correction-specific examples. Text obtained at least in part through OCR is applied to the fine-tuned pretrained transformer model to …
Who is the assignee on this patent?
Ancestry Com Operations Inc
What technology area does this patent fall under?
Primary CPC classification G06V30/133. Mapped technology areas include Physics.
When was this patent published?
Publication date: Mar 16, 2023 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).