Automated document processing for detecting, extracting, and analyzing tables and tabular data

US11977533B2 · US · B2

Patent metadata
Publication number: US-11977533-B2
Application number: US-202217571327-A
Country: US
Kind code: B2
Filing date: Jan 7, 2022
Priority date: Apr 2, 2021
Publication date: May 7, 2024
Grant date: May 7, 2024

Abstract

Official abstract text for this publication.

According to one embodiment, a method for detecting, extracting information from, and classifying tables within an original image includes: pre-processing the original image to generate processed image data; detecting one or more tables within the processed image data; extracting the one or more tables from the processed image data; and classifying either: the one or more extracted tables; portions of the one or more extracted tables; or a combination thereof. Additional techniques for pre-processing image data to facilitate detection, extraction of information from, and classification of tables (or portions thereof) are also featured. Corresponding systems and computer program products are included in the scope of the invention. The inventive concepts are also applicable to tabular data arrangements that may not fit a strict definition of a “table.”

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for detecting and classifying tables and/or tabular data arrangements within an original image, comprising: pre-processing the original image to generate processed image data, wherein pre-processing the original image comprises identifying one or more delineating lines depicted in the original image, wherein identifying the one or more delineating lines comprises: obtaining a third set of rules defining criteria of delineating lines; evaluating the original image against the third set of rules; and generating a set of delineating lines based on the evaluation; detecting one or more tables and/or one or more tabular data arrangements within the processed image data; extracting the one or more tables and/or the one or more tabular data arrangements from the processed image data; and classifying either: the one or more extracted tables; portions of the one or more extracted tables; the one or more extracted tabular data arrangements; portions of the one or more extracted tabular data arrangements; or a combination of: the one or more extracted tables; the portions of the one or more extracted tables; the one or more extracted tabular data arrangements; and/or the portions of the one or more extracted tabular data arrangements.

2. The method as recited in claim 1, wherein pre-processing the original image comprises grouping words into phrases, and wherein grouping the words into the phrases comprises: determining whether one or more boundaries between textual elements depicted in the original image are characterized by a width greater than an average width of whitespace characters depicted in the original image; and in response to determining at least one of the one or more boundaries is not characterized by a width greater than the average width of the whitespace characters depicted in the original image, grouping the corresponding textual elements to form one or more phrases.

3. The method as recited in claim 1, wherein pre-processing the original image comprises detecting subpages, wherein detecting the subpages comprises: obtaining a set of rules defining criteria of subpages, wherein the criteria of subpages comprise: the original image including a vertical graphical line that spans a vertical extent of a page of a document depicted in the original image; and/or the original image depicting horizontally adjacent regions each having a plurality of textual elements and/or horizontal graphical lines exhibiting at least one common alignment characteristic; and evaluating the original image against the set of rules; and defining one or more subpages within the original image based on the evaluation.

4. The method as recited in claim 1, wherein pre-processing the original image comprises performing layout analysis on the original image, wherein the layout analysis comprises identifying one or more excluded zones within the original image.

5. The method as recited in claim 1, wherein detecting the one or more tables and/or the one or more tabular data arrangements comprises: performing grid-based detection; denoting one or more areas within the original image that include a grid-like table and/or a grid-like tabular data arrangement as an excluded zone; and performing non-grid-based detection on portions of the original image that are not denoted as excluded zones.

6. The method as recited in claim 1, wherein pre-processing the original image comprises generating a first representation of the original image; and wherein the first representation excludes textual characters represented in the original image.

7. The method as recited in claim 6, wherein generating the first representation does not create any graphical lines that are not represented in the original image.

8. The method as recited in claim 1, wherein pre-processing the image data comprises: generating a first representation of the original image; identifying one or more horizontal graphical lines depicted in the original image, and/or one or more vertical graphical lines depicted in the original image; identifying one or more gaps in the one or more horizontal graphical lines and/or the one or more vertical graphical lines of the first representation; and restoring the one or more horizontal graphical lines and/or the one or more vertical graphical lines by filling in the one or more gaps.

9. A computer-implemented method for detecting one or more non-grid-like tables and/or one or more non-grid-like tabular data arrangements depicted in image data, the method comprising: conducting a first evaluation of the image data against a first set of rules defining characteristics of column seeds, and identifying a set of column seed candidates based on the first evaluation; conducting a second evaluation of the image data against a second set of rules defining characteristics of column clusters, and identifying a set of column cluster candidates based on the second evaluation; conducting a third evaluation of the image data against a third set of rules defining criteria for updating column clusters, and either or both of: reformulating one or more existing column definitions based on the third evaluation; and modifying a definition of some or all of the column cluster candidates based on the third evaluation; conducting a fourth evaluation of the image data against a fourth set of rules defining characteristics of row title columns, and identifying a set of row title column candidates based on the fourth evaluation; and defining a structure and a content of the one or more tables and/or the one or more tabular data arrangements based on a result of some or all of: the first evaluation; the second evaluation; the third evaluation; and the fourth evaluation.

10. The method as recited in claim 9, wherein the characteristics of column seeds comprise: being an adjacent or nearly adjacent pair of elements that are located in a region of the original image that is not an excluded zone; being an adjacent or nearly adjacent pair of elements each independently comprising a same type of textual element, and not being separated by a different type of textual element; and/or being an adjacent or nearly adjacent pair of elements exhibiting a common alignment characteristic.

11. The method as recited in claim 9, wherein the characteristics of column clusters comprise: including two or more column candidates that are horizontally connected, and wherein horizontal connectedness is a transitive property.

12. The method as recited in claim 9, wherein reformulating the one or more existing column cluster definitions comprises expanding one or more boundaries of some or all of the existing columns.

13. The method as recited in claim 9, wherein the second evaluation, the third evaluation, and the fourth evaluation are performed iteratively until a convergence criterion is satisfied.

14. The method as recited in claim 9, comprising: refining a top edge of the one or more tables and/or the one or more tabular data arrangements.

15. A computer-implemented method for extracting information from one or more non-grid-like tables and/or one or more non-grid-like tabular data arrangements depicted in image data, the method comprising: determining one or more properties of each text line depicted in the image data; determining, based at least in part on the text lines, one or more regions of the one or more tables and/or one or more tabular data arrangements; identifying one or more vertical graphical lines, one or more implied vertical lines, and/or one or more h
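
The phrase-grouping step recited in claim 2 can be illustrated with a minimal sketch. It is not the patented implementation: the `Word` record, the function name, and the assumption that words arrive as left and right x-coordinates on a single text line are all hypothetical. It shows only the stated rule, namely that neighboring words are grouped when the gap between them is not wider than the average whitespace width.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical record for one recognized word on a single text line.
    text: str
    left: float   # x-coordinate of the word's left edge
    right: float  # x-coordinate of the word's right edge


def group_words_into_phrases(words, avg_space_width):
    """Join neighboring words unless the gap between them exceeds avg_space_width."""
    phrases = []
    current = []
    for word in sorted(words, key=lambda w: w.left):
        # A gap wider than the average whitespace width ends the current phrase.
        if current and word.left - current[-1].right > avg_space_width:
            phrases.append(" ".join(w.text for w in current))
            current = []
        current.append(word)
    if current:
        phrases.append(" ".join(w.text for w in current))
    return phrases


line = [Word("Total", 0, 40), Word("due", 44, 70), Word("1,250.00", 120, 180)]
print(group_words_into_phrases(line, avg_space_width=6.0))
# prints ['Total due', '1,250.00']
```

In this example the 4-unit gap between "Total" and "due" does not exceed the assumed average of 6, so the two words form one phrase, while the 50-unit gap before the amount starts a new one.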

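Claim 11 states that horizontal connectedness between column candidates is a transitive property. One common way to build clusters from a transitive relation is a union-find (disjoint-set) structure, sketched below. The claim text does not define the connectedness test itself, so `rows_overlap` (overlap of vertical extents) and the `(top, bottom)` tuple representation are assumptions made purely for illustration.

```python
def cluster_columns(columns, connected):
    """Group columns into clusters using the transitive closure of `connected`."""
    parent = list(range(len(columns)))

    def find(i):
        # Follow parent links to the cluster representative, halving the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            if connected(columns[i], columns[j]):
                parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i, column in enumerate(columns):
        clusters.setdefault(find(i), []).append(column)
    # Per claim 11, a cluster includes two or more column candidates.
    return [members for members in clusters.values() if len(members) >= 2]


def rows_overlap(a, b):
    # Hypothetical connectedness test: columns are (top, bottom) tuples and
    # count as connected when their vertical extents overlap.
    return a[0] < b[1] and b[0] < a[1]


candidates = [(0, 10), (8, 20), (18, 30), (50, 60)]
print(cluster_columns(candidates, rows_overlap))
# prints [[(0, 10), (8, 20), (18, 30)]]
```

The first and third candidates do not overlap directly, but both overlap the second, so transitivity places all three in one cluster. The fourth candidate is connected to nothing and is dropped.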
Assignees

Kofax Inc

Classifications (CPC titles)

  • Tablespace storage structures; Management thereof

  • using pattern recognition or machine learning (optical pattern recognition or electronic computations therefor G06V10/88)

  • Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables

  • Classification of content, e.g. text, photographs or tables

  • Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11977533B2 cover?
A method for detecting, extracting information from, and classifying tables within an original image. The image is pre-processed to generate processed image data, one or more tables are detected in and extracted from that data, and the extracted tables, portions of them, or a combination thereof are classified. The concepts also apply to tabular data arrangements that may not fit a strict definition of a “table.”
Who is the assignee on this patent?
Kofax Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/2282. Mapped technology areas include Physics.
When was this patent published?
Publication date May 7, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).