Systems and methods for detection and correction of OCR text

US12456317B2 · US · B2

Patent metadata
Field                Value
Publication number   US-12456317-B2
Application number   US-202217895818-A
Country              US
Kind code            B2
Filing date          Aug 25, 2022
Priority date        Aug 27, 2021
Publication date     Oct 28, 2025
Grant date           Oct 28, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

OCR-text correction system and method embodiments are described. The OCR-text correction embodiments comprise or cooperate with a transformer-based sequence-to-sequence language model. The model is pretrained to denoise corrupted text and is fine-tuned using OCR-correction-specific examples. Text obtained at least in part through OCR is applied to the fine-tuned pretrained transformer model to detect at least one error in a subset of the text. Responsive to detecting the at least one error, the fine-tuned pretrained transformer model outputs an updated subset of the text to correct the at least one error.
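
To make "pretrained to denoise corrupted text" more concrete, the sketch below gives simplified, illustrative versions of the corruption schemes that claim 3 lists (token masking, token deletion, text infilling, sentence permutation, document rotation). It is not the patent's implementation: the function names, the whitespace tokenization, the fixed span arguments, and the `make_denoising_pair` helper are all assumptions made for illustration. During pretraining, a denoising model would be shown the corrupted side and asked to reproduce the original.

```python
import random

MASK = "<mask>"  # placeholder marker for a masked token (illustrative)


def token_masking(tokens, rate, rng):
    """Replace each token with MASK with probability `rate`."""
    return [MASK if rng.random() < rate else t for t in tokens]


def token_deletion(tokens, rate, rng):
    """Drop each token with probability `rate`."""
    return [t for t in tokens if rng.random() >= rate]


def text_infilling(tokens, start, length):
    """Replace the span tokens[start:start+length] with a single MASK.

    Simplification: the span is passed in explicitly instead of sampled.
    """
    return tokens[:start] + [MASK] + tokens[start + length:]


def sentence_permutation(sentences, rng):
    """Return the sentences in a shuffled order (input list is not modified)."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def document_rotation(tokens, pivot):
    """Rotate the token list so that it starts at index `pivot`."""
    return tokens[pivot:] + tokens[:pivot]


def make_denoising_pair(text, rng, rate=0.3):
    """Hypothetical helper: build one (corrupted, original) training pair."""
    tokens = text.split()
    corrupted = token_deletion(token_masking(tokens, rate, rng), rate, rng)
    return " ".join(corrupted), text


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so repeated runs give the same pair
    print(make_denoising_pair("the quick brown fox jumps over the lazy dog", rng))
```

Fine-tuning on OCR-specific examples would reuse the same input-to-target setup, with OCR output as the input side and the intended text as the target.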

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method, comprising: receiving a document that includes text obtained at least in part through OCR; applying an adjusted bidirectional-and-auto-regressive-transformers (BART) model to the text to detect at least one error in a subset of the text, the adjusted BART model having been adjusted from a BART model pretrained to perform a non-optical character recognition (non-OCR) task using a first training dataset comprising corrupted text data and the adjusted BART model further being adjusted from the BART model to perform an OCR task using a second training dataset comprising OCR samples; and generating, by applying the adjusted BART model to the text of the document, an updated subset of the text correcting the at least one error in the subset of the text.

2. The computer-implemented method of claim 1, wherein the BART model comprises: a bidirectional encoder configured to receive the text; and an autoregressive decoder configured to detect the at least one error in the text and correct the at least one error in the text by predicting original text.

3. The computer-implemented method of claim 1, wherein the first training dataset includes one or more of the following: token masking, token deletion, sentence permutation, document rotation, or text infilling.

4. The computer-implemented method of any of claim 1, wherein the second training dataset includes monograph and periodical example sentences.

5. The computer-implemented method of claim 1, wherein the adjusted BART model is configured to perform detection and correction of the at least one error in a single step.

6. The computer-implemented method of claim 1, wherein the adjusted BART model is configured to correct the at least one error in the text without being trained on alignment of characters between inputs to an encoder and outputs to a decoder.

7. The computer-implemented method of claim 1, wherein the first training dataset comprises fewer than 1,000 documents.

8. The computer-implemented method of claim 1, wherein the adjusted BART model has been fine-tuned from a BART-base checkpoint comprising weights arrived at based on pretraining.

9. The computer-implemented method of claim 1, wherein the at least one error includes an undersegmentation error caused by incorrectly combining a plurality of words into a single word by OCR or an oversegmentation error caused by incorrectly segmenting a single word into two separate words by OCR.

10. The computer-implemented method of claim 1, wherein the adjusted BART model has been fine-tuned utilizing a hugging face package.

11. The computer-implemented method of claim 1, wherein the at least one error includes a missing character error caused by incorrectly omitting a character by OCR or a misrecognized character error caused by incorrectly recognizing a character by OCR.

12. The computer-implemented method of claim 1, wherein the at least one error includes a hallucination error caused by incorrectly inserting a non-existing character by OCR.

13. A computer system for detecting and/or correcting text, comprising: a processor; and memory in communication with the processor, the memory configured to store instructions that, when executed by the processor, cause the processor to: access an adjusted bidirectional-and-auto-regressive-transformers (BART) model, the adjusted BART model having been adjusted from a BART model pretrained to perform a non-optical character recognition (non-OCR) task using a first training dataset comprising corrupted text data and the adjusted BART model further being adjusted from the BART model to perform an OCR task using a second training dataset comprising OCR samples; provide text obtained at least in part through optical character recognition (OCR); apply the text to the adjusted BART model to detect at least one error in a subset of the text; and generate an updated subset of the text by the adjusted BART model correcting the at least one error in the subset of the text.

14. The computer system of claim 13, wherein the BART model comprises: a bidirectional encoder configured to receive the text; and an autoregressive decoder configured to detect the at least one error in the text and correct the at least one error in the text by predicting original text.

15. The computer system of claim 13 wherein the first training dataset includes one or more of the following: token masking, token deletion, sentence permutation, document rotation, or text infilling.

16. The computer system of claim 13, wherein the second training dataset includes monograph and periodical example sentences.

17. The computer system of claim 13, wherein the adjusted BART model is configured to perform detection and correction of the at least one error in a single step.

18. The computer system of claim 13, wherein the adjusted BART model is configured to correct the at least one error in the text without being trained on alignment of characters between inputs to an encoder and outputs to a decoder.

19. The computer system of claim 13, wherein the first training dataset comprises fewer than 1,000 documents.

20. A non-transitory computer readable storage medium configured to store code comprising instructions, wherein the instructions, when executed by a processor, cause the processor to: access an adjusted bidirectional-and-auto-regressive-transformers (BART) model, the adjusted BART model having been adjusted from a BART model pretrained to perform a non-optical character recognition (non-OCR) task using a first training dataset comprising corrupted text data and the adjusted BART model further being adjusted from the BART model to perform an OCR task using a second training dataset comprising OCR samples; provide text obtained at least in part through optical character recognition (OCR); apply the text to the adjusted BART model to detect at least one error in a subset of the text; and generate an updated subset of the text by the adjusted BART model correcting the at least one error in the subset of the text.
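
The error types named in claims 9, 11, and 12 are easier to picture with examples. The sketch below pairs a made-up OCR string with its intended text for each type, then uses a word-level diff to show which subset of the text an update would touch. Everything here is illustrative: the example strings, the `EXAMPLES` table, and the `changed_spans` helper are assumptions, and the diff is only a way to inspect a before/after pair. In the claimed method the corrected text would come from the adjusted BART model, which per claim 5 detects and corrects in a single step rather than through a separate diff.

```python
from difflib import SequenceMatcher

# Hypothetical (OCR output, intended text) pairs, one per error type
# named in claims 9, 11, and 12. The strings are invented for illustration.
EXAMPLES = {
    "undersegmentation": ("born inNew York", "born in New York"),
    "oversegmentation": ("born in New Yo rk", "born in New York"),
    "missing character": ("born in Nw York", "born in New York"),
    "misrecognized character": ("born in New Y0rk", "born in New York"),
    "hallucination": ("born in New York'", "born in New York"),
}


def changed_spans(ocr_text, corrected_text):
    """Return (original, updated) word-span pairs where the two texts differ."""
    a, b = ocr_text.split(), corrected_text.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Join the differing words back into readable spans.
            spans.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return spans


if __name__ == "__main__":
    for error_type, (ocr_text, intended) in EXAMPLES.items():
        print(error_type, changed_spans(ocr_text, intended))
```

Working at the word level keeps segmentation errors visible as a single span (for example, one merged word mapping to two words), which a character-level diff would split into smaller pieces.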

Assignees

Ancestry Com Operations Inc

Inventors

Classifications

  • Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title

  • Techniques for post-processing, e.g. correcting the recognition result · CPC title

  • G06V30/133 · Primary

    Evaluation of quality of the acquired characters · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12456317B2 cover?
OCR-text correction system and method embodiments are described. The OCR-text correction embodiments comprise or cooperate with a transformer-based sequence-to-sequence language model. The model is pretrained to denoise corrupted text and is fine-tuned using OCR-correction-specific examples. Text obtained at least in part through OCR is applied to the fine-tuned pretrained transformer model to …
Who is the assignee on this patent?
Ancestry Com Operations Inc
What technology area does this patent fall under?
Primary CPC classification G06V30/133. Mapped technology areas include Physics.
When was this patent published?
Publication date Oct 28, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).