System and method for extracting tabular data from electronic document
US-10970535-B2 · Apr 6, 2021 · US
US11977534B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11977534-B2 |
| Application number | US-202217850835-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 27, 2022 |
| Priority date | Apr 2, 2021 |
| Publication date | May 7, 2024 |
| Grant date | May 7, 2024 |
Official abstract text for this publication.
According to one embodiment, a computer-implemented method for classifying one or more tables and/or one or more tabular data arrangements depicted in image data includes: training a machine learning model, using a training dataset representing a plurality of different tables and/or tabular data arrangements, based at least in part on a plurality of recognized textual elements within the training dataset; and outputting a trained classification model based on the training, wherein the trained classification model is configured to classify one or more tables and/or one or more tabular data arrangements represented within a test dataset according to: one or more table classifications; one or more tabular data arrangement classifications; and/or one or more column classifications; and classifying the one or more tables and/or the one or more tabular data arrangements represented within the test dataset using the trained classification model. Methods for detecting, extracting, and classifying tables are also disclosed.
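To make "recognized textual elements" concrete, the sketch below computes tf-idf weights, one of the relevancy criteria named in the claims, over the words recognized in a few tables. It is only an illustration under stated assumptions: the function name `tf_idf`, the sample word lists, and the plain `tf * log(N / df)` weighting are hypothetical choices, not details taken from the patent.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting: (term count / document length) * log(N / document frequency).
    """
    n_docs = len(documents)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical OCR output: each table reduced to its recognized words.
tables = [
    ["invoice", "item", "quantity", "price", "total"],
    ["patient", "dose", "date", "total"],
    ["item", "price", "discount", "total"],
]
weights = tf_idf(tables)
```

Under this weighting a word that occurs in every table, such as "total" here, gets weight zero, while a word unique to one table, such as "invoice", gets the largest weight in that table. Weights of this kind could serve as input features for a classifier.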
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method for detecting and classifying tables and/or tabular data arrangements within an original image, comprising: pre-processing the original image to generate processed image data; detecting one or more tables and/or one or more tabular data arrangements within the processed image data; extracting the one or more tables and/or the one or more tabular data arrangements from the processed image data; and classifying either: the one or more extracted tables; portions of the one or more extracted tables; the one or more extracted tabular data arrangements; portions of the one or more extracted tabular data arrangements; or a combination of: the one or more extracted tables; the portions of the one or more extracted tables; the one or more extracted tabular data arrangements; and/or the portions of the one or more extracted tabular data arrangements; and wherein classifying the one or more extracted tables; the portions of the one or more extracted tables; the one or more extracted tabular data arrangements; and/or the portions of the one or more extracted tabular data arrangements comprises: evaluating a test dataset using a trained classification model; determining at least one score vector based on the evaluation; identifying a highest score within the at least one score vector; determining whether the highest score for the at least one score vector is greater than a corresponding one of a plurality of optimized score thresholds; and in response to determining the highest score for the at least one score vector is greater than the corresponding one of the plurality of optimized score thresholds, returning a positive result for the one of the corresponding classification of the one or more extracted tables; the portions of the one or more extracted tables; the one or more extracted tabular data arrangements; and/or the portions of the one or more extracted tabular data arrangements.
2. The method as recited in claim 1, wherein the one or more extracted tables, the portions of the one or more extracted tables, the one or more extracted tabular data arrangements, and/or the portions of the one or more extracted tabular data arrangements collectively comprise: one or more classifications of interest; and at least one classification not of interest.
3. The method as recited in claim 1, wherein the classification of: the one or more extracted tables; the portions of the one or more extracted tables; the one or more extracted tabular data arrangements; and/or the portions of the one or more extracted tabular data arrangements are each independently based on tables and/or tabular data arrangements represented in a training dataset.
4. The method as recited in claim 1, wherein the classifying includes training a machine learning model, wherein the training comprises training the machine learning model to recognize a plurality of relevancy criteria, and wherein each relevancy criterion is independently indicative of a given one of the one or more tables and/or a given one of the one or more tabular data arrangements corresponding to either: one of the one or more table classifications; one of the one or more tabular data arrangement classifications; or one of the one or more column classifications.
5. The method as recited in claim 4, wherein the plurality of relevancy criteria comprise: a frequency of one or more terms represented in the one or more tables and/or the one or more tabular data arrangements; a term-frequency/inverse-document frequency (tf-idf) corresponding to the one or more terms and one or more documents representing the one or more tables and/or the one or more tabular data arrangements; a structure of a sub-region of the one or more tables and/or the one or more tabular data arrangements; and/or structured information describing some or all of the one or more tables and/or the one or more tabular data arrangements.
6. The method as recited in claim 1, comprising training at least one machine learning model, using a training dataset representing a plurality of different types of tables and/or tabular data arrangements, based at least in part on: a plurality of recognized textual elements within the training dataset; and a plurality of recognized regions and/or subregions of the different types of tables and/or tabular data arrangements represented by the training set.
7. The method as recited in claim 6, wherein the training comprises generating a score matrix comprising a plurality of score vectors that each independently comprise a plurality of scores for a single table or tabular data arrangement represented in the training dataset; and wherein each of the plurality of scores for each score vector independently corresponds to a possible classification of the single table or tabular data arrangement represented in the training dataset.
8. The method as recited in claim 6, further comprising: associating a known classification type with each score vector of the score matrix; and identifying an optimal score threshold for each known classification type.
9. A computer-implemented method for classifying one or more tables and/or one or more tabular data arrangements represented within a test dataset, the method comprising: using a trained classification model to classify the one or more tables and/or one or more tabular data arrangements represented within a test dataset according to: one or more table classifications; one or more tabular data arrangement classifications; and/or one or more column classifications; and wherein classifying the one or more tables and/or the one or more tabular data arrangements comprises: evaluating the test dataset using the trained classification model; determining at least one score vector based on the evaluation; identifying a highest score within the at least one score vector; determining whether the highest score for the at least one score vector is greater than a corresponding one of a plurality of optimized score thresholds; and in response to determining the highest score for the at least one score vector is greater than the corresponding one of the plurality of optimized score thresholds, returning a positive result for the one of the corresponding classification of the one or more tables and/or the one or more extracted tabular data arrangements.
10. The method as recited in claim 9, comprising training a machine learning model, using a training dataset representing a plurality of different tables and/or tabular data arrangements, based at least in part on a plurality of recognized textual elements within the training dataset; wherein training the machine learning model comprises training the machine learning model to: recognize the textual elements within the training dataset; and understand a structure of the different tables and/or tabular data arrangements.
11. The method as recited in claim 10, wherein the training is further based at least in part on a structure of the one or more tables and/or the one or more tabular data arrangements.
12. The method as recited in claim 10, wherein the training is further based at least in part on a plurality of recognized regions and/or subregions of the different types of tables and/or tabular data arrangements represented by the training set.
13. The method as recited in claim 10, wherein the training comprises generating a score matrix comprising a plurality of score vectors that each independently comprise a plurality of scores for a single table or tabular data arrangement represented in the training dataset; and wherein each of th
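The score-vector decision recited in claims 1 and 9, and the per-class thresholds of claims 7 and 8, can be sketched roughly as follows. This is a minimal illustration, not the patented implementation. The claim text shown here does not say how the thresholds are optimized, so picking the threshold that maximizes F1 on the training score matrix is an assumption, as are the names `classify` and `optimize_thresholds`, the labels, and the scores.

```python
def top_index(score_vector):
    """Index of the highest score in a score vector."""
    return max(range(len(score_vector)), key=lambda i: score_vector[i])


def classify(score_vector, thresholds, labels):
    """Return the top-scoring label if its score exceeds that label's threshold, else None."""
    best = top_index(score_vector)
    label = labels[best]
    return label if score_vector[best] > thresholds[label] else None


def optimize_thresholds(score_matrix, known_labels, labels):
    """Assumed criterion: per class, keep the candidate threshold with the best F1."""
    thresholds = {}
    for label in labels:
        # Top scores of training rows whose highest-scoring class is `label`,
        # paired with whether that prediction matches the known classification.
        tops = [(vec[top_index(vec)], known == label)
                for vec, known in zip(score_matrix, known_labels)
                if labels[top_index(vec)] == label]
        n_true = sum(known == label for known in known_labels)
        best_f1, best_t = -1.0, 0.0
        # Candidates are 0.0 and every observed top score, in ascending order.
        for t in sorted({0.0} | {score for score, _ in tops}):
            tp = sum(1 for score, ok in tops if score > t and ok)
            fp = sum(1 for score, ok in tops if score > t and not ok)
            fn = n_true - tp
            denom = 2 * tp + fp + fn
            f1 = 2 * tp / denom if denom else 0.0
            if f1 > best_f1:  # strict comparison keeps the lowest tied threshold
                best_f1, best_t = f1, t
        thresholds[label] = best_t
    return thresholds


# Hypothetical training score matrix: one score vector per table.
labels = ["invoice", "lab_report", "other"]
score_matrix = [
    [0.90, 0.05, 0.05],
    [0.55, 0.30, 0.15],
    [0.80, 0.10, 0.10],
    [0.10, 0.85, 0.05],
    [0.20, 0.60, 0.20],
    [0.30, 0.30, 0.40],
]
known_labels = ["invoice", "other", "invoice", "lab_report", "lab_report", "other"]

thresholds = optimize_thresholds(score_matrix, known_labels, labels)
accepted = classify([0.70, 0.20, 0.10], thresholds, labels)
rejected = classify([0.50, 0.30, 0.20], thresholds, labels)
```

With this toy data the "invoice" threshold comes out at 0.55, because the one mislabelled table scored 0.55 for that class. A new table whose top score is 0.70 is therefore accepted as an invoice, while one whose top score is only 0.50 yields no positive result.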
Tablespace storage structures; Management thereof · CPC title
using pattern recognition or machine learning (optical pattern recognition or electronic computations therefor G06V10/88) · CPC title
Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables · CPC title
Classification of content, e.g. text, photographs or tables · CPC title
Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text · CPC title