System and method for extracting tabular data from electronic document
US-10970535-B2 · Apr 6, 2021 · US
US11977533B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11977533-B2 |
| Application number | US-202217571327-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 7, 2022 |
| Priority date | Apr 2, 2021 |
| Publication date | May 7, 2024 |
| Grant date | May 7, 2024 |
Official abstract text for this publication.
According to one embodiment, a method for detecting, extracting information from, and classifying tables within an original image includes: pre-processing the original image to generate processed image data; detecting one or more tables within the processed image data; extracting the one or more tables from the processed image data; and classifying either: the one or more extracted tables; portions of the one or more extracted tables; or a combination thereof. Additional techniques for pre-processing image data to facilitate detection, extraction of information from, and classification of tables (or portions thereof) are also featured. Corresponding systems and computer program products are included in the scope of the invention. The inventive concepts are also applicable to tabular data arrangements that may not fit a strict definition of a “table.”
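One of the pre-processing techniques recited in the claims (claim 2) groups words into phrases by comparing the gap between neighbouring textual elements with the average width of whitespace characters. The following is a minimal sketch of that rule, not the patented implementation. It assumes the words of a single text line are already available with horizontal coordinates (for example from OCR) and that the average whitespace width has been estimated elsewhere. The names `Word`, `group_words_into_phrases`, `phrase_texts`, and `avg_space_width` are hypothetical and do not appear in the patent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """Hypothetical container for one textual element on a text line."""
    text: str
    left: float   # x-coordinate of the left edge, in pixels
    right: float  # x-coordinate of the right edge, in pixels


def group_words_into_phrases(words: List[Word], avg_space_width: float) -> List[List[Word]]:
    """Group the words of one text line into phrases.

    Adjacent words are merged when the gap between them is NOT wider than
    the average whitespace width; a wider gap starts a new phrase.
    """
    phrases: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w.left):
        if phrases and word.left - phrases[-1][-1].right <= avg_space_width:
            phrases[-1].append(word)   # narrow gap: same phrase
        else:
            phrases.append([word])     # wide gap (or first word): new phrase
    return phrases


def phrase_texts(phrases: List[List[Word]]) -> List[str]:
    """Join each phrase back into a display string."""
    return [" ".join(w.text for w in phrase) for phrase in phrases]


# Assumed example data: two closely spaced words and one distant number.
line = [
    Word("Total", 10, 50),
    Word("due", 56, 80),
    Word("1,250.00", 200, 260),
]
print(phrase_texts(group_words_into_phrases(line, avg_space_width=8.0)))
# prints ['Total due', '1,250.00']
```

In this sketch the wide gap before the number keeps it separate from the label, which is the kind of boundary a later column-detection step could rely on.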
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method for detecting and classifying tables and/or tabular data arrangements within an original image, comprising: pre-processing the original image to generate processed image data, wherein pre-processing the original image comprises identifying one or more delineating lines depicted in the original image, wherein identifying the one or more delineating lines comprises: obtaining a third set of rules defining criteria of delineating lines; evaluating the original image against the third set of rules; and generating a set of delineating lines based on the evaluation; detecting one or more tables and/or one or more tabular data arrangements within the processed image data; extracting the one or more tables and/or the one or more tabular data arrangements from the processed image data; and classifying either: the one or more extracted tables; portions of the one or more extracted tables; the one or more extracted tabular data arrangements; portions of the one or more extracted tabular data arrangements; or a combination of: the one or more extracted tables; the portions of the one or more extracted tables; the one or more extracted tabular data arrangements; and/or the portions of the one or more extracted tabular data arrangements.
2. The method as recited in claim 1, wherein pre-processing the original image comprises grouping words into phrases, and wherein grouping the words into the phrases comprises: determining whether one or more boundaries between textual elements depicted in the original image are characterized by a width greater than an average width of whitespace characters depicted in the original image; and in response to determining at least one of the one or more boundaries is not characterized by a width greater than the average width of the whitespace characters depicted in the original image, grouping the corresponding textual elements to form one or more phrases.
3. The method as recited in claim 1, wherein pre-processing the original image comprises detecting subpages, wherein detecting the subpages comprises: obtaining a set of rules defining criteria of subpages, wherein the criteria of subpages comprise: the original image including a vertical graphical line that spans a vertical extent of a page of a document depicted in the original image; and/or the original image depicting horizontally adjacent regions each having a plurality of textual elements and/or horizontal graphical lines exhibiting at least one common alignment characteristic; and evaluating the original image against the set of rules; and defining one or more subpages within the original image based on the evaluation.
4. The method as recited in claim 1, wherein pre-processing the original image comprises performing layout analysis on the original image, wherein the layout analysis comprises identifying one or more excluded zones within the original image.
5. The method as recited in claim 1, wherein detecting the one or more tables and/or the one or more tabular data arrangements comprises: performing grid-based detection; denoting one or more areas within the original image that include a grid-like table and/or a grid-like tabular data arrangement as an excluded zone; and performing non-grid-based detection on portions of the original image that are not denoted as excluded zones.
6. The method as recited in claim 1, wherein pre-processing the original image comprises generating a first representation of the original image; and wherein the first representation excludes textual characters represented in the original image.
7. The method as recited in claim 6, wherein generating the first representation does not create any graphical lines that are not represented in the original image.
8. The method as recited in claim 1, wherein pre-processing the image data comprises: generating a first representation of the original image; identifying one or more horizontal graphical lines depicted in the original image, and/or one or more vertical graphical lines depicted in the original image; identifying one or more gaps in the one or more horizontal graphical lines and/or the one or more vertical graphical lines of the first representation; and restoring the one or more horizontal graphical lines and/or the one or more vertical graphical lines by filling in the one or more gaps.
9. A computer-implemented method for detecting one or more non-grid-like tables and/or one or more non-grid-like tabular data arrangements depicted in image data, the method comprising: conducting a first evaluation of the image data against a first set of rules defining characteristics of column seeds, and identifying a set of column seed candidates based on the first evaluation; conducting a second evaluation of the image data against a second set of rules defining characteristics of column clusters, and identifying a set of column cluster candidates based on the second evaluation; conducting a third evaluation of the image data against a third set of rules defining criteria for updating column clusters, and either or both of: reformulating one or more existing column definitions based on the third evaluation; and modifying a definition of some or all of the column cluster candidates based on the third evaluation; conducting a fourth evaluation of the image data against a fourth set of rules defining characteristics of row title columns, and identifying a set of row title column candidates based on the fourth evaluation; and defining a structure and a content of the one or more tables and/or the one or more tabular data arrangements based on a result of some or all of: the first evaluation; the second evaluation; the third evaluation; and the fourth evaluation.
10. The method as recited in claim 9, wherein the characteristics of column seeds comprise: being an adjacent or nearly adjacent pair of elements that are located in a region of the original image that is not an excluded zone; being an adjacent or nearly adjacent pair of elements each independently comprising a same type of textual element, and not being separated by a different type of textual element; and/or being an adjacent or nearly adjacent pair of elements exhibiting a common alignment characteristic.
11. The method as recited in claim 9, wherein the characteristics of column clusters comprise: including two or more column candidates that are horizontally connected, and wherein horizontal connectedness is a transitive property.
12. The method as recited in claim 9, wherein reformulating the one or more existing column cluster definitions comprises expanding one or more boundaries of some or all of the existing columns.
13. The method as recited in claim 9, wherein the second evaluation, the third evaluation, and the fourth evaluation are performed iteratively until a convergence criterion is satisfied.
14. The method as recited in claim 9, comprising: refining a top edge of the one or more tables and/or the one or more tabular data arrangements.
15. A computer-implemented method for extracting information from one or more non-grid-like tables and/or one or more non-grid-like tabular data arrangements depicted in image data, the method comprising: determining one or more properties of each text line depicted in the image data; determining, based at least in part on the text lines, one or more regions of the one or more tables and/or one or more tabular data arrangements; identifying one or more vertical graphical lines, one or more implied vertical lines, and/or one or more h
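Claim 11 states that column clusters contain two or more horizontally connected column candidates and that horizontal connectedness is transitive. A transitive relation of this kind can be closed with a union-find structure, as in the hedged sketch below. The claim text shown here does not define the pairwise connectedness test, so this example ASSUMES that two candidates are connected when their vertical extents overlap. The names `Span`, `horizontally_connected`, and `cluster_columns` are hypothetical and are not taken from the patent.

```python
from typing import Dict, List, Tuple

# A column candidate is reduced here to its vertical extent (top, bottom).
Span = Tuple[float, float]


def horizontally_connected(a: Span, b: Span) -> bool:
    """ASSUMED pairwise test: the two vertical extents overlap."""
    return a[0] < b[1] and b[0] < a[1]


def cluster_columns(candidates: List[Span]) -> List[List[int]]:
    """Return clusters (lists of candidate indices) with at least two members,
    formed under the transitive closure of the pairwise test."""
    parent = list(range(len(candidates)))

    def find(i: int) -> int:
        # Walk to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Union every directly connected pair.
    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            if horizontally_connected(candidates[i], candidates[j]):
                parent[find(i)] = find(j)

    # Collect members per root and keep only clusters of two or more.
    groups: Dict[int, List[int]] = {}
    for i in range(len(candidates)):
        groups.setdefault(find(i), []).append(i)
    return sorted(g for g in groups.values() if len(g) >= 2)


# Candidates 0 and 2 do not overlap directly, but both overlap candidate 1,
# so all three land in one cluster; candidate 3 stays unclustered.
candidates = [(0, 50), (40, 100), (90, 150), (300, 350)]
print(cluster_columns(candidates))
# prints [[0, 1, 2]]
```

The pairwise loop is quadratic in the number of candidates, which is usually acceptable for the handful of columns on a page; a different connectedness test can be swapped in without changing the clustering logic.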
Tablespace storage structures; Management thereof · CPC title
using pattern recognition or machine learning (optical pattern recognition or electronic computations therefor G06V10/88) · CPC title
Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables · CPC title
Classification of content, e.g. text, photographs or tables · CPC title
Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text · CPC title