US10489682B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-10489682-B1 |
| Application number | US-201715851617-A |
| Country | US |
| Kind code | B1 |
| Filing date | Dec 21, 2017 |
| Priority date | Dec 21, 2017 |
| Publication date | Nov 26, 2019 |
| Grant date | Nov 26, 2019 |
Official abstract text for this publication.
An optical character recognition system employs a deep learning system that is trained to process a plurality of images within a particular domain to identify images representing text within each image and to convert the images representing text to textually encoded data. The deep learning system is trained with training data generated from a corpus of real-life text segments that are generated by a plurality of OCR modules. Each of the OCR modules produces a real-life image/text tuple, and at least some of the OCR modules produce a confidence value corresponding to each real-life image/text tuple. Each OCR module is characterized by a conversion accuracy substantially below a desired accuracy for an identified domain. Synthetically generated text segments are produced by programmatically converting text strings to a corresponding image where each text string and corresponding image form a synthetic image/text tuple.
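The synthetic half of the training data described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: the `ImageTextTuple` record, the `render` function, and the tiny 3x5 digit bitmap table `GLYPHS` stand in for whatever font rasterizer and storage format an actual system would use. The sketch shows only the core idea that the text string is known before the image is produced, so the label of each synthetic image/text tuple is exact by construction.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical 3x5 bitmap glyphs (digits and space only). A real system
# would rasterize arbitrary text with actual fonts; this table is a stand-in.
GLYPHS = {
    "0": ["111", "101", "101", "101", "111"],
    "1": ["010", "110", "010", "010", "111"],
    "2": ["111", "001", "111", "100", "111"],
    "3": ["111", "001", "111", "001", "111"],
    "4": ["101", "101", "111", "001", "001"],
    "5": ["111", "100", "111", "001", "111"],
    "6": ["111", "100", "111", "101", "111"],
    "7": ["111", "001", "001", "001", "001"],
    "8": ["111", "101", "111", "101", "111"],
    "9": ["111", "101", "111", "001", "111"],
    " ": ["000", "000", "000", "000", "000"],
}
GLYPH_HEIGHT = 5

# An image is a tuple of pixel rows; 1 is ink, 0 is background.
Image = Tuple[Tuple[int, ...], ...]


@dataclass(frozen=True)
class ImageTextTuple:
    """Hypothetical record pairing an image with its text label."""
    image: Image
    text: str
    source: str                        # "synthetic" or an OCR module name
    confidence: Optional[float] = None  # only some OCR modules report one


def render(text: str) -> Image:
    """Convert a text string to a bitmap, one blank column between glyphs."""
    rows = []
    for y in range(GLYPH_HEIGHT):
        row: List[int] = []
        for i, ch in enumerate(text):
            if i > 0:
                row.append(0)  # one-pixel gap between characters
            row.extend(int(bit) for bit in GLYPHS[ch][y])
        rows.append(tuple(row))
    return tuple(rows)


def make_synthetic_tuples(strings: List[str]) -> List[ImageTextTuple]:
    """Build synthetic image/text tuples; the label is the input string."""
    return [ImageTextTuple(render(s), s, "synthetic") for s in strings]
```

A call such as `make_synthetic_tuples(["42", "7"])` yields two records whose `text` fields are the input strings and whose `confidence` is `None`, since no OCR module was involved in producing them.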
Opening claim text (preview).
What is claimed is:

1. A computer-implemented method for training a computerized deep learning system utilized by an optical character recognition system comprising the computer-implemented operations of: generating a plurality of synthetic text segments, by programmatically converting each of a plurality of text strings to a corresponding image, each text string and corresponding image forming a synthetic image/text tuple; generating a plurality of real-life text segments by processing from a corpus of document images, at least a subset of images from the corpus, with a plurality of OCR programs, each of the OCR programs processing each image from the subset to produce a real-life image/text tuple, and at least some of the OCR programs producing a confidence value corresponding to each real-life image/text tuple, and wherein each OCR program is characterized by a conversion accuracy substantially below a desired accuracy for an identified domain; storing the synthetic image/text tuple and the real-life image/text tuple to data storage as training data in a format accessible by the computerized deep learning system for training; and training the computerized deep learning system with the training data.

2. The computer-implemented method of claim 1 further comprising: augmenting the synthetic image/text tuples and the real-life image/text tuples data by adding noise to image portions of the tuples.

3. The computer-implemented method of claim 2 wherein adding noise to image portions of the tuples comprises: randomly selecting image portions of the tuples and superimposing to the selected image portions, noise selected from the group consisting of random speckled noise, random lines, random binarization threshold, white on black text.

4. The computer-implemented method of claim 2 wherein adding noise to image portions of the tuples comprises: randomly selecting image portions of the tuples and superimposing patterned noise to the selected image portions.

5. The computer-implemented method of claim 1 further comprising processing the image portions of the tuples to format the image portions into a fixed normative input employed by the computerized deep learning system.

6. The computer-implemented method of claim 5 wherein processing the image portions of the tuples to format the image portions into a fixed normative input employed by the computerized deep learning system comprises: scaling the image portion of each of the tuples to fit in a field of view of the computerized deep learning system.

7. The computer-implemented method of claim 5 wherein processing the image portions of the tuples to format the image portions into a fixed normative input employed by the computerized deep learning system comprises: centering the image portion of each of the tuples within a field of view of the computerized deep learning system.

8. The computer-implemented method of claim 1 further comprising: processing, for storage as training data, output of the OCR programs by employing statistical metrics to identify the highest quality tuples generated by the OCR programs.

9. The computer-implemented method of claim 8 wherein employing the statistical metrics comprises: selecting, between confidence metrics of equal value generated by two or more OCR programs, a confidence metric generated from a deep-learning based OCR program over confidence metrics generated from OCR programs not based on computerized deep learning; selecting segments in order of OCR confidence as indicated by confidence metric generated by an OCR program; and selecting segments for which the same text is generated by the OCR programs, and if the same text is not generated by the OCR programs then selecting segments having the least edit distance.

10. The computer-implemented method of claim 9 further comprising: identifying a subset of the real-life image/text tuples for labeling by humans, the subset characterized by a range of confidence values and differing outputs among the OCR programs for given segments.

11. The computer-implemented method of claim 1 further comprising modifying a font of the image portion of at least a subset of the synthetic image/text tuples.

12. The computer-implemented method of claim 1 wherein generating a plurality of synthetic text segments comprises randomly selecting sets of consecutive words from a text corpus comprising a set of fully-formed English language sentences.

13. The computer-implemented method of claim 1 wherein generating a plurality of synthetic text segments comprises randomly selecting sets of consecutive words from a text corpus characterized by common text elements in the identified domain.

14. The computer-implemented method of claim 12 further comprising modifying the selected sets of consecutive words to reflect biases of character types that occur in the identified domain.

15. The computer-implemented method of claim 13 further comprising modifying the selected sets of consecutive words to reflect biases of character types that occur in the identified domain.

16. The computer-implemented method of claim 14 further comprising generating the image portion of the synthetic image/text tuple in accordance with a randomly chosen font and font size.

17. The computer-implemented method of claim 15 further comprising generating the image portion of the synthetic image/text tuple in accordance with a randomly chosen font and font size.

18. A computerized optical character recognition system comprising: a computerized deep learning system trained to process a plurality of encoded images within a particular domain to identify images representing text within each encoded image and converting the encoded images representing text to textually encoded data; data storage for storing the encoded images representing text and textually encoded data; wherein the computerized deep learning system is trained with training data generated from a corpus of, real-life text segments generated by processing from a corpus of encoded document images, at least a subset of encoded images from the corpus, with a plurality of OCR modules, each of the OCR modules processing each encoded image from the corpus to produce a real-life image/text tuple, and at least some of the OCR modules producing a confidence value corresponding to each real-life image/text tuple, and wherein each OCR module is characterized by an conversion accuracy substantially below a desired accuracy for an identified domain; and synthetically generated text segments, generated by programmatically converting each of a plurality of text strings to a corresponding encoded image, each text string and corresponding encoded image forming a synthetic image/text tuple.

19. The computerized optical character recognition system of claim 18 wherein the real-life image/text tuples are processed to fit within a field of view of the computerized deep learning system, and wherein the synthetic image/text tuples are processed to reflect textual characteristics of the identified domain.

20. A computerized system for training a computerized deep learning system utilized by an optical character recognition system comprising: a processor configured to execute instructions that when executed cause the processor to: generate a plurality of synthetic text segments, by programmatically converting each of a plurality of text strings to a corresponding image, each text string and corresponding image forming a synthetic image/text tuple; and generate a plurality of
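
The selection criteria recited in claim 9 (confidence ordering, a tie-break in favor of a deep-learning based OCR program, and agreement or least edit distance among OCR outputs) can be sketched as follows. This is one possible reading, not the patented implementation: the `OcrOutput` record, the function names, and the particular way the three criteria are combined into a single sort key are all assumptions made for illustration. The edit distance is the standard Levenshtein distance.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class OcrOutput:
    """Hypothetical record of one OCR program's result for one segment."""
    program: str
    text: str
    confidence: float
    deep_learning: bool  # True if the program is deep-learning based


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def best_output(outputs: List[OcrOutput]) -> OcrOutput:
    """Highest confidence wins; equal confidence favors deep learning."""
    return max(outputs, key=lambda o: (o.confidence, o.deep_learning))


def disagreement(outputs: List[OcrOutput]) -> int:
    """Largest pairwise edit distance; 0 means all programs agree."""
    return max((edit_distance(x.text, y.text)
                for x, y in combinations(outputs, 2)), default=0)


def rank_segments(
    segments: Dict[str, List[OcrOutput]]
) -> List[Tuple[str, OcrOutput]]:
    """Order segments by least disagreement, then by highest confidence."""
    ranked = sorted(
        segments.items(),
        key=lambda kv: (disagreement(kv[1]), -best_output(kv[1]).confidence),
    )
    return [(seg_id, best_output(outs)) for seg_id, outs in ranked]
```

Under this reading, segments on which every program produced the same text come first, followed by segments whose outputs differ by the fewest edits. Segments near the bottom of the ranking, where outputs differ substantially, would be natural candidates for the human labeling described in claim 10.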
Interactive pattern learning with a human teacher · CPC title
Backpropagation, e.g. using gradient descent · CPC title
Quantising the image signal · CPC title
Character recognition · CPC title
Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title