Methods and apparatus to extract text from imaged documents

US9684842B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9684842-B2
Application number  US-201514927014-A
Country             US
Kind code           B2
Filing date         Oct 29, 2015
Priority date       Oct 29, 2015
Publication date    Jun 20, 2017
Grant date          Jun 20, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods and apparatus to extract text from imaged documents are disclosed. Example methods include segmenting an image of a document into localized sub-images corresponding to individual characters in the document. The example methods further include grouping respective ones of the sub-images into a cluster based on a visual correlation of the respective ones of the sub-images to a reference sub-image. The visual correlation between the reference sub-image and the respective ones of the sub-images grouped into the cluster exceeding a correlation threshold. The example methods also include identifying a designated character for the cluster based on the sub-images grouped into the cluster. The example methods further include associating the designated character with locations in the image of the document associated with the respective ones of the sub-images grouped into the cluster.
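The grouping and labeling steps in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the Pearson correlation measure, the greedy "first member is the reference" strategy, the 0.9 threshold, and the names correlation, cluster_sub_images, and label_document are all assumptions made for illustration. Sub-images are assumed to be already segmented, equally sized, and given as lists of pixel rows paired with a location.

```python
from math import sqrt


def correlation(a, b):
    """Pearson correlation between two equally sized grayscale sub-images."""
    xs = [p for row in a for p in row]
    ys = [p for row in b for p in row]
    if len(xs) != len(ys):
        raise ValueError("sub-images must have the same size")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den if den else 0.0


def cluster_sub_images(sub_images, threshold=0.9):
    """Group (location, image) pairs; threshold is an assumed value.

    The first sub-image placed in a cluster serves as its reference, and a
    later sub-image joins the first cluster whose reference it correlates
    with above the threshold.
    """
    clusters = []
    for location, image in sub_images:
        for cluster in clusters:
            if correlation(image, cluster["reference"]) > threshold:
                cluster["locations"].append(location)
                break
        else:
            clusters.append({"reference": image, "locations": [location]})
    return clusters


def label_document(clusters, identify):
    """Identify one character per cluster and attach it to every member location.

    identify is a hypothetical callback (an OCR call or a human prompt).
    """
    text_by_location = {}
    for cluster in clusters:
        character = identify(cluster["reference"])
        for location in cluster["locations"]:
            text_by_location[location] = character
    return text_by_location
```

The point of the design is that identify runs once per cluster rather than once per character, so a document with thousands of characters but only a few dozen distinct glyph shapes needs only a few dozen identifications.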

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: segmenting, by executing an instruction with a processor, an image of a document into localized sub-images corresponding to individual characters in the document; grouping, by executing an instruction with the processor, respective ones of the sub-images into a cluster based on visual correlations of the respective ones of the sub-images to a reference sub-image, the visual correlations between the reference sub-image and the respective ones of the sub-images grouped into the cluster exceeding a correlation threshold; identifying, by executing an instruction with the processor, a designated character for one representative sub-image associated with the cluster; assigning, by executing an instruction with the processor, the designated character to the respective ones of the sub-images grouped into the cluster; and associating, by executing an instruction with the processor, the designated character with locations in the image of the document associated with the respective ones of the sub-images grouped into the cluster.

2. The method of claim 1, wherein the designated character is identified without using an optical character recognition engine.

3. The method of claim 1, further including identifying the designated character for the representative sub-image by: presenting the representative sub-image associated with the cluster to a human reviewer; and receiving feedback from the human reviewer indicating the designated character.

4. The method of claim 1, wherein the designated character is identified based on optical character recognition of the representative sub-image associated with the cluster.

5. The method of claim 4, wherein the representative sub-image corresponds to a first one of the sub-images grouped into the cluster.

6. The method of claim 4, wherein the representative sub-image is a composite of the respective ones of the sub-images grouped into the cluster.

7. A method comprising: segmenting, by executing an instruction with a processor, an image of a document into localized sub-images corresponding to individual characters in the document; grouping, by executing an instruction with the processor, respective ones of the sub-images into a cluster based on visual correlations of the respective ones of the sub-images to a reference sub-image, the visual correlations between the reference sub-image and the respective ones of the sub-images grouped into the cluster exceeding a correlation threshold; identifying, by executing an instruction with the processor, a designated character for the cluster based on the sub-images grouped into the cluster; and associating, by executing an instruction with the processor, the designated character with locations in the image of the document associated with the respective ones of the sub-images grouped into the cluster, the method further including determining the visual correlation of a first one of the sub-images to the reference sub-image by: transforming the first one of the sub-images to have a spatial orientation corresponding to the reference sub-image to determine a transformed sub-image; adding a margin around the transformed sub-image; calculating a correlation value between the transformed sub-image and the reference sub-image for different positions of the reference sub-image relative to the transformed sub-image within a boundary defined by the margin; and assigning a largest one of the correlation values as the visual correlation of the first one of the sub-images to the reference sub-image.

8. The method of claim 1, further including: determining a reliability of the designated character based on an output of an optical character recognition analysis of the representative sub-image for the cluster; and automatically assigning the designated character to the respective ones of the sub-images grouped into the cluster when the designated character is determined to be reliable.

9. The method of claim 8, further including: comparing the representative sub-image to a stored sub-image associated with a stored designated character previously verified by a human reviewer when the designated character is determined to be unreliable; and automatically assigning the stored designated character as the designated character when a visual correlation between the representative sub-image and the stored sub-image exceeds the correlation threshold.

10. The method of claim 8, further including prompting a human reviewer for verification of the designated character when the designated character is determined to be unreliable.

11. The method of claim 10, wherein the designated character is determined to be unreliable when a location error value generated by the optical character recognition analysis of the representative sub-image for the cluster does not satisfy a location error threshold, the location error value corresponding to a difference between a location of a boundary of the designated character identified within the representative sub-image and a boundary of the representative sub-image.

12. The method of claim 10, wherein the designated character is determined to be unreliable when a confidence value generated by the optical character recognition analysis does not satisfy a confidence threshold.

13. The method of claim 12, wherein prompting the human reviewer for verification of the designated character includes: when the confidence value does not satisfy the confidence threshold and satisfies a confirmation threshold, displaying the designated character alongside the representative sub-image and requesting the human reviewer to confirm the designated character corresponds to the representative sub-image, and when the confidence value does not satisfy the confirmation threshold, displaying the representative sub-image and requesting the human reviewer to identify the representative sub-image.

14. A method comprising: segmenting, by executing an instruction with a processor, an image of a document into localized sub-images corresponding to individual characters in the document; grouping, by executing an instruction with the processor, respective ones of the sub-images into a cluster based on visual correlations of the respective ones of the sub-images to a reference sub-image, the visual correlations between the reference sub-image and the respective ones of the sub-images grouped into the cluster exceeding a correlation threshold; identifying, by executing an instruction with the processor, a designated character for the cluster based on the sub-images grouped into the cluster; and associating, by executing an instruction with the processor, the designated character with locations in the image of the document associated with the respective ones of the sub-images grouped into the cluster, the method further including: determining, by executing an instruction with the processor, a reliability of the designated character based on an output of an optical character recognition analysis of a representative sub-image for the cluster; automatically assigning, by executing an instruction with the processor, the designated character to the cluster when the designated character is determined to be reliable; and prompting, by executing an instruction with the processor, a human reviewer for verification of the designated character when the designated character is determined to be unreliable, wherein the designated character is determined to be unreliable when an amount of foreground pixels within the representative sub-image and outside a boundary of the designated character identified within the representative sub-
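The correlation procedure recited in claim 7 (add a margin, try the reference at different positions inside it, keep the largest value) might be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the sub-image is assumed to be already transformed to the reference's orientation and size, Pearson correlation stands in for the unspecified correlation measure, background pixels are assumed to be 0, and the names pad, window, correlation, and visual_correlation are hypothetical.

```python
from math import sqrt


def correlation(a, b):
    """Pearson correlation between two equally sized grayscale sub-images."""
    xs = [p for row in a for p in row]
    ys = [p for row in b for p in row]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den if den else 0.0


def pad(image, margin, fill=0.0):
    """Surround the image with `margin` background pixels on every side."""
    width = len(image[0]) + 2 * margin
    top = [[fill] * width for _ in range(margin)]
    body = [[fill] * margin + list(row) + [fill] * margin for row in image]
    bottom = [[fill] * width for _ in range(margin)]
    return top + body + bottom


def window(image, top, left, height, width):
    """Cut a height x width window whose upper-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


def visual_correlation(sub_image, reference, margin=1):
    """Largest correlation over every placement of the reference in the margin."""
    height, width = len(reference), len(reference[0])
    if (len(sub_image), len(sub_image[0])) != (height, width):
        raise ValueError("sub-image must already match the reference size")
    padded = pad(sub_image, margin)
    # Offsets 0..2*margin cover every position the reference can take
    # without leaving the padded boundary.
    return max(
        correlation(window(padded, top, left, height, width), reference)
        for top in range(2 * margin + 1)
        for left in range(2 * margin + 1)
    )
```

Taking the maximum over small shifts makes the comparison tolerant of segmentation jitter: two crops of the same glyph that are offset by a pixel can score poorly when compared directly yet score close to 1 at the best alignment.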

Assignees

Nielsen Co Us Llc

Inventors

Classifications

  • with the intervention of an operator · CPC title

  • Non-hierarchical techniques, e.g. based on statistics of modelling distributions · CPC title

  • G06V30/153 (Primary)

    using recognition of characters or words · CPC title

  • Clustering techniques · CPC title

  • using statistics or function optimisation, e.g. modelling of probability density functions · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9684842B2 cover?
Methods and apparatus to extract text from imaged documents are disclosed. Example methods include segmenting an image of a document into localized sub-images corresponding to individual characters in the document. The example methods further include grouping respective ones of the sub-images into a cluster based on a visual correlation of the respective ones of the sub-images to a reference su…
Who is the assignee on this patent?
Nielsen Co Us Llc
What technology area does this patent fall under?
Primary CPC classification G06V30/153. Mapped technology areas include Physics.
When was this patent published?
Publication date Jun 20, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).