Data capture from images of documents with fixed structure

US9754187B2 · US · B2

Patent metadata
Field: Value
Publication number: US-9754187-B2
Application number: US-201414571979-A
Country: US
Kind code: B2
Filing date: Dec 16, 2014
Priority date: Mar 31, 2014
Publication date: Sep 5, 2017
Grant date: Sep 5, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

For extracting data from a document with fixed structure, we recognize key words in an image of the document; identify reference object based on these key words, create templates based on the identified reference objects; match the created templates against the image of the document while recognizing fields in the image of the document these templates; and select the best template using quality of the recognized field.
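
The selection step in the abstract (score each candidate template against the recognized key words, then keep the best one) can be sketched as below. This is a minimal illustration and not the patent's implementation: the `Template` class, the `match_quality` score (fraction of a template's field names found among the key words), and the sample templates are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Template:
    """Hypothetical template: a document type plus the field names it expects."""
    name: str
    field_names: tuple[str, ...]


def match_quality(template: Template, recognized: set[str]) -> float:
    """Assumed quality score: fraction of the template's field names
    that appear among the recognized key words (case-insensitive)."""
    if not template.field_names:
        return 0.0
    hits = sum(1 for name in template.field_names if name.lower() in recognized)
    return hits / len(template.field_names)


def select_template(templates: list[Template], key_words: list[str]) -> Template:
    """Return the template with the highest match quality."""
    recognized = {word.lower() for word in key_words}
    return max(templates, key=lambda t: match_quality(t, recognized))


# Illustrative data only: key words as an OCR step might return them.
templates = [
    Template("passport", ("surname", "nationality", "passport")),
    Template("driving licence", ("surname", "licence", "class")),
]
key_words = ["Surname", "Nationality", "Passport", "No"]

best = select_template(templates, key_words)
print(best.name)
```

A real system would likely weigh positions and recognition confidence as well; the single ratio here only shows where a quality measure plugs into the selection.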

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: acquiring an electronic image of a document with a fixed structure, wherein the fixed structure comprises field names and field values corresponding to the field names, and wherein the field names and the field values are located at set locations in the document; recognizing key words in the electronic image of the document, wherein the key words comprise the field names and the field values; matching one or more templates from a plurality of templates with the document, wherein the one or more templates comprise reference objects that specify areas in the electronic image of the document where permitted field values corresponding to field names are to be extracted, and wherein matching the one or more templates comprises matching the field names and the permitted field values from the one or more templates with the identified field names and the field values from the recognized key words; selecting, by a processor device, a template from the one or more templates based on a quality of a match between the field names and the permitted field values from the template with the identified field names and the field values from the recognized key words; and extracting the field values from the electronic image of the document using the selected template.
2. The method of claim 1, further comprising performing distortion correction of the electronic image of the document.
3. The method of claim 2, wherein performing the distortion correction comprises performing at least one of alignment of lines in the electronic image of the document, correction of skewing in the electronic image of the document, correction of geometry in the electronic image of the document, color correction in the electronic image of the document, restoration of blurred and unfocused areas in the electronic image of the document, and removal of noise from the electronic image of the document.
4. The method of claim 2, wherein performing the distortion correction comprises identifying boundaries within the electronic image of the document.
5. The method of claim 4, further comprising cropping the electronic image along the identified boundaries.
6. The method of claim 1, further comprising applying at least one filter to the electronic image of the document.
7. The method of claim 1, further comprising determining a type of the document based on the selected template.
8. The method of claim 1, wherein the reference objects comprise regular expressions.
9. The method of claim 1, wherein recognizing the key words in the electronic image of the document is based on additional information about the recognized key words.
10. The method of claim 1, further comprising: computing qualities of matches between the field names and the permitted field values from the one or more templates and the identified field names and the field values from the recognized key words; identifying the one or more templates from the plurality of templates which have the qualities that are greater than a predetermined threshold; and retaining the identified one or more templates.
11. The method of claim 1, further comprising computing a quality of the recognized key words based on recognized text in the recognized key words.
12. The method of claim 11, further comprising, if the quality of the recognized key words is greater than a threshold value, exporting the recognized text.
13. The method of claim 1, wherein the plurality of templates comprises at least one preexisting template.
14. The method of claim 1, further comprising creating at least one of the plurality of templates based on at least one of the reference objects.
15. The method of claim 1, further comprising recognizing the electronic image of the document using the selected template.
16. A system comprising: a processor device to: acquire an electronic image of a document with a fixed structure, wherein the fixed structure comprises field names and field values corresponding to the field names, and wherein the field names and the field values are located at set locations in the document; recognize key words in the electronic image of the document, wherein the key words comprise the field names and the field values; match one or more templates from a plurality of templates with the document, wherein the one or more templates comprise reference objects that specify areas in the electronic image of the document where permitted field values corresponding to field names are to be extracted, and wherein, to match the one or more templates, the processor device is further to match the field names and the permitted field values from the one or more templates with the identified field names and the field values from the recognized key words; select a template from the one or more templates based on a quality of a match between the field names and the permitted field values from the template with the identified field names and the field values from the recognized key words; and extract the field values from the electronic image of the document using the selected template.
17. The system of claim 16, wherein the processor device is further to perform a distortion correction of the electronic image of the document.
18. The system of claim 17, wherein, to perform the distortion correction, the processor device is to perform at least one of alignment of lines in the electronic image of the document, correction of skewing in the electronic image of the document, correction of geometry in the electronic image of the document, color correction in the electronic image of the document, restoration of blurred and unfocused areas in the electronic image of the document, and removal of noise from the electronic image of the document.
19. The system of claim 17, wherein, to perform the distortion correction, the processor device is to identify boundaries within the electronic image of the document.
20. The system of claim 19, wherein the processor device is further to crop the electronic image along the identified boundaries.
21. The system of claim 16, wherein the processor device is further to apply at least one filter to the electronic image of the document.
22. The system of claim 16, wherein the processor device is further to determine a type of the document based on the selected template.
23. The system of claim 16, wherein the reference objects comprises regular expressions.
24. The system of claim 16, wherein the processor device is to recognize the key words in the electronic image of the document based on additional information about the recognized key words.
25. The system of claim 16, wherein the processor device is further to: compute qualities of matches between the field names and the permitted field values from the one or more templates and the identified field names and the field values from the recognized key words; identify the one or more templates from the plurality of templates which have the qualities that are greater than a predetermined threshold; and retain the identified one or more templates.
26. The system of claim 16, the processor device is further to compute the quality of the recognized key words based on recognized text in the recognized key words.
27. The system of claim 26, wherein, if the quality of the recognized key words is greater than a threshold value, the processor device is
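
To make claims 1, 8 and 10 more concrete, the sketch below models a reference object as an image area plus a regular expression for permitted values, extracts the words that fall inside each area, and retains only templates whose match quality exceeds a threshold. It is a hypothetical illustration under stated assumptions, not the claimed implementation: the `Word`, `ReferenceObject` and `Template` classes, the coordinate convention, the quality formula, and the sample data are all invented for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    """A recognized word with an assumed (x, y) position in image pixels."""
    text: str
    x: int
    y: int


@dataclass
class ReferenceObject:
    """Assumed shape of a reference object: where to look and what is permitted."""
    field_name: str
    area: tuple[int, int, int, int]  # left, top, right, bottom
    pattern: str                     # regular expression for permitted values


@dataclass
class Template:
    name: str
    refs: list[ReferenceObject]


def words_in(area: tuple[int, int, int, int], words: list[Word]) -> list[Word]:
    """Words whose position lies inside the rectangular area (inclusive)."""
    left, top, right, bottom = area
    return [w for w in words if left <= w.x <= right and top <= w.y <= bottom]


def extract(template: Template, words: list[Word]) -> dict[str, str]:
    """For each reference object, take the first word in its area that fully
    matches the permitted-value pattern."""
    values = {}
    for ref in template.refs:
        for word in words_in(ref.area, words):
            if re.fullmatch(ref.pattern, word.text):
                values[ref.field_name] = word.text
                break
    return values


def quality(template: Template, words: list[Word]) -> float:
    """Assumed quality measure: fraction of reference objects that yielded a value."""
    if not template.refs:
        return 0.0
    return len(extract(template, words)) / len(template.refs)


def retain(templates: list[Template], words: list[Word], threshold: float) -> list[Template]:
    """Keep templates whose quality is strictly greater than the threshold."""
    return [t for t in templates if quality(t, words) > threshold]


# Illustrative data only.
words = [
    Word("Date", 10, 10), Word("2014-12-16", 120, 10),
    Word("No", 10, 40), Word("AB123456", 120, 40),
]
date_re, number_re = r"\d{4}-\d{2}-\d{2}", r"[A-Z]{2}\d{6}"
layout_a = Template("layout A", [
    ReferenceObject("date", (100, 0, 300, 20), date_re),
    ReferenceObject("number", (100, 30, 300, 50), number_re),
])
layout_b = Template("layout B", [  # same fields, areas swapped
    ReferenceObject("date", (100, 30, 300, 50), date_re),
    ReferenceObject("number", (100, 0, 300, 20), number_re),
])

kept = retain([layout_a, layout_b], words, threshold=0.5)
print([t.name for t in kept], extract(layout_a, words))
```

Here the two layouts differ only in where each value is expected, so the regular expressions decide which layout fits the recognized words.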

Assignees

Inventors

Classifications

  • Determining representative reference patterns, e.g. averaging or distorting patterns; Generating dictionaries, e.g. user dictionaries · CPC title

  • Determining representative reference patterns, e.g. by averaging or distorting; Generating dictionaries · CPC title

  • Character recognition · CPC title

  • Physics · mapped topic

  • G06K9/6255 · Primary

    Physics · mapped topic

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9754187B2 cover?
For extracting data from a document with fixed structure, we recognize key words in an image of the document; identify reference object based on these key words, create templates based on the identified reference objects; match the created templates against the image of the document while recognizing fields in the image of the document these templates; and select the best template using quality…
Who is the assignee on this patent?
Abbyy Dev Llc
What technology area does this patent fall under?
Primary CPC classification G06V30/1914. Mapped technology areas include Physics.
When was this patent published?
Publication date: Sep 5, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).