System and method for extracting tabular data from electronic document

US10970535B2 · US · B2

Patent metadata
Field: Value
Publication number: US-10970535-B2
Application number: US-201916366428-A
Country: US
Kind code: B2
Filing date: Mar 27, 2019
Priority date: Jun 11, 2018
Publication date: Apr 6, 2021
Grant date: Apr 6, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed is a system for extracting tabular data from an electronic document, the system having a data processing arrangement comprising: a tabular data detection module that is operable to: (i) receive the electronic document; (ii) determine a location of the tabular data within the electronic document; and (iii) extract an image of the tabular data from the electronic document; and a tabular data extraction module that receives the extracted image of the tabular data from the tabular data detection module, wherein the tabular data extraction module is operable to: (i) convert the received image of the tabular data into a greyscale image; (ii) extract a grid structure from the greyscale image; (iii) remove the grid structure from the greyscale image; (iv) determine a position for placement of horizontal and vertical lines in the greyscale image; (v) generate horizontal and vertical lines on the greyscale image; (vi) perform optical character recognition of text associated with the tabular data from the received image; and (vii) extract the tabular data by combining information of the grid structure with the text, to generate the tabular data.
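The final step of the abstract, combining grid-structure information with recognized text, can be illustrated with a minimal Python sketch. It is not the patented implementation. The function name `words_to_csv`, the `(text, x_centre, y_centre)` word format, and the representation of the grid as sorted lists of line positions are all assumptions made for illustration; the patent does not specify them.

```python
import csv
import io
from bisect import bisect_right


def words_to_csv(words, row_lines, col_lines):
    """Assign OCR words to grid cells and serialise the grid as CSV.

    words:     iterable of (text, x_centre, y_centre) tuples
               (hypothetical OCR output format).
    row_lines: sorted y positions of horizontal separators,
               including the top and bottom table borders.
    col_lines: sorted x positions of vertical separators,
               including the left and right table borders.
    """
    n_rows = len(row_lines) - 1
    n_cols = len(col_lines) - 1
    cells = [[[] for _ in range(n_cols)] for _ in range(n_rows)]

    # Visit words top-to-bottom, then left-to-right. This keeps reading
    # order inside a cell only if words on one text line share the same y.
    for text, x, y in sorted(words, key=lambda w: (w[2], w[1])):
        r = bisect_right(row_lines, y) - 1  # index of the row band containing y
        c = bisect_right(col_lines, x) - 1  # index of the column band containing x
        if 0 <= r < n_rows and 0 <= c < n_cols:
            cells[r][c].append(text)        # words outside the grid are dropped

    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in cells:
        writer.writerow(" ".join(cell) for cell in row)
    return buf.getvalue()


if __name__ == "__main__":
    demo_words = [("Name", 10, 5), ("Age", 60, 5), ("Ada", 10, 25), ("36", 60, 25)]
    print(words_to_csv(demo_words, row_lines=[0, 20, 40], col_lines=[0, 50, 100]))
```

For the demo input, a 2x2 grid with one word per cell, the function returns the two CSV rows `Name,Age` and `Ada,36`. Using the centre point of each word is a simplification; a word that straddles a separator line would need an overlap-based rule instead.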

First claim

Opening claim text (preview).

What is claimed is:

1. A system for extracting tabular data from an electronic document, the system having a data processing arrangement that is to: (i) receive the electronic document; (ii) determine a location of the tabular data within the electronic document; (iii) extract an image of the tabular data from the electronic document; (iv) convert the received image of the tabular data into a greyscale image of the tabular data; (v) extract a grid structure of the tabular data from the greyscale image; (vi) remove the grid structure of the tabular data from the greyscale image, by superimposing the extracted grid structure on to the greyscale image; (vii) determine a position for placement of horizontal and vertical lines in the greyscale image without the grid structure; (viii) generate horizontal and vertical lines on the greyscale image without the grid structure, to obtain information of grid structure of the tabular data; (ix) perform optical character recognition of the text associated with the tabular data from the received image, to obtain information of text associated with the tabular data; and (x) extract the tabular data by combining the information of the grid structure of the tabular data with the text associated with the tabular data, to generate the tabular data.

2. The system of claim 1, further comprising a database communicatively coupled to the data processing arrangement, wherein the database is configured to store the electronic document and/or the generated tabular data.

3. The system of claim 1, wherein the data processing arrangement employs deep-learning.

4. The system of claim 3, wherein the data processing arrangement is configured to determine the location of the tabular data within the electronic document, based on a confidence score associated with detection of the tabular data within the electronic document being higher than a predefined threshold score.

5. The system of claim 1, wherein the data processing arrangement is configured to extract the image of the tabular data by generating a bounding box around the tabular data within the electronic document.

6. The system of claim 1, wherein the data processing arrangement is further configured to perform thresholding of the greyscale image subsequent to converting the received image into the greyscale image, wherein the thresholding of the greyscale image is performed by employing adaptive Gaussian technique.

7. The system of claim 6, wherein the data processing arrangement is further configured to perform bilateral filtering of the greyscale image, subsequent to performing the thresholding of the greyscale image.

8. The system of claim 1, wherein the data processing arrangement is configured to extract the grid structure from the greyscale image by performing morphological dilation and morphological erosion.

9. The system of claim 8, wherein the data processing arrangement is configured to perform the morphological dilation and the morphological erosion by using a structural element having a specific size, and wherein the size of the structural element is determined based on a page size of the electronic document.

10. The system of claim 1, wherein the data processing arrangement is further configured to remove the grid structure of the tabular data from the received image, by superimposing the grid structure extracted from the greyscale image on to the received image.

11. The system of claim 10, wherein the data processing arrangement is further configured to remove salt and pepper noise from the greyscale image without the grid structure and/or the received image without the grid structure subsequent to removing the grid structure therefrom, and wherein the salt and pepper noise is removed using median filtering.

12. The system of claim 1, wherein the data processing arrangement is configured to determine the position for placement of horizontal and vertical lines in the greyscale image without the grid structure, by using a sliding window to perform a bitwise ANDing operation of each pixel of the greyscale image without the grid structure and an array of ones, and wherein an output of the bitwise ANDing operation is a pixel array.

13. The system of claim 12, wherein the data processing arrangement is further configured to perform morphological dilation on the greyscale image without the grid structure prior to performing the bitwise ANDing operation.

14. The system of claim 1, wherein the data processing arrangement is further configured to determine an ideal position for placement of each horizontal and vertical line in the greyscale image without the grid structure, by filtering redundant positions from all possible positions for placement of horizontal and vertical lines in the greyscale image without the grid structure.

15. The system of claim 1, wherein the data processing arrangement is configured to generate the horizontal lines and vertical lines on the greyscale image without the grid structure by: generating, by rotating orthogonally the greyscale image without the grid structure, horizontal lines on the greyscale image without the grid structure; and generating vertical lines on the greyscale image without the grid structure.

16. The system of claim 1, wherein the data processing arrangement is further configured to generate the tabular data as a comma separated values (CSV) file.

17. A method for extracting tabular data from an electronic document, the method comprising: (i) receiving the electronic document; (ii) determining a location of the tabular data within the electronic document; (iii) extracting an image of the tabular data from the electronic document; (iv) converting the extracted image of the tabular data into a greyscale image of the tabular data; (v) extracting a grid structure of the tabular data from the greyscale image; (vi) removing the grid structure of the tabular data from the greyscale image, by superimposing the grid structure on to the greyscale image; (vii) determining a position for placement of horizontal and vertical lines in the greyscale image without the grid structure; (viii) generating horizontal lines and vertical lines on the greyscale image without the grid structure, to obtain information of grid structure of the tabular data; (ix) performing optical character recognition of the text associated with the tabular data from the extracted image, to obtain information of text associated with the tabular data; and (x) extracting the tabular data by combining the information of the grid structure of the tabular data with the text associated with the tabular data, to generate the tabular data.

18. A software product recorded on machine-readable non-transient data storage media, wherein the software product is executable upon computing hardware to implement the method of claim 17.
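Claims 12, 14 and 15 describe finding line positions with a sliding window and a bitwise AND against an array of ones, filtering redundant positions, and reusing one routine for both orientations. The Python sketch below is one possible reading of those claims, not the patented implementation. It assumes a binarized image given as a list of rows of 0/1 integers (1 = ink) with the grid already removed. The function names are hypothetical, collapsing each run of adjacent candidates to its midpoint is an assumed filtering rule, and a transpose stands in for the orthogonal rotation.

```python
def candidate_columns(binary, window=1):
    """Return x positions where a `window`-wide vertical strip holds no ink.

    binary: list of rows, each a list of 0/1 ints (1 = ink).
    """
    height, width = len(binary), len(binary[0])
    ones = [1] * window  # the "array of ones" each strip is ANDed with
    candidates = []
    for x in range(width - window + 1):
        # AND every pixel in the strip with the array of ones; the strip is
        # a candidate separator position when no resulting bit is set.
        has_ink = any(
            binary[y][x + k] & ones[k]
            for y in range(height)
            for k in range(window)
        )
        if not has_ink:
            candidates.append(x)
    return candidates


def filter_redundant(positions):
    """Collapse each run of consecutive candidates to its midpoint (assumed rule)."""
    if not positions:
        return []
    lines = []
    start = prev = positions[0]
    for p in positions[1:]:
        if p != prev + 1:                  # a gap ends the current run
            lines.append((start + prev) // 2)
            start = p
        prev = p
    lines.append((start + prev) // 2)      # close the final run
    return lines


def transpose(binary):
    """Swap rows and columns so the column routine can also find row separators."""
    return [list(col) for col in zip(*binary)]


if __name__ == "__main__":
    img = [
        [1, 1, 0, 0, 1, 1, 0],
        [0, 0, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 1, 0],
    ]
    print(filter_redundant(candidate_columns(img)))             # vertical line positions
    print(filter_redundant(candidate_columns(transpose(img))))  # horizontal line positions
```

On the demo image, columns 2, 3 and 6 contain no ink, so the vertical candidates reduce to positions 2 and 6, and the only empty row is row 1. A real pipeline would likely also discard candidates in the outer margins and enforce a minimum gap width, details that the claims leave open.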

Assignees

Inventors

Classifications

  • Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors · CPC title

  • G06V30/412 · Primary

    Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables · CPC title

  • Character recognition · CPC title

  • G06V30/40 · Primary

    Document-oriented image-based pattern recognition · CPC title

  • Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10970535B2 cover?
Disclosed is a system for extracting tabular data from an electronic document, the system having a data processing arrangement comprising: a tabular data detection module that is operable to: (i) receive the electronic document; (ii) determine a location of the tabular data within the electronic document; and (iii) extract an image of the tabular data from the electronic document; and a tabular data extraction module that receives ex…
Who is the assignee on this patent?
Innoplexus AG
What technology area does this patent fall under?
Primary CPC classification G06V30/412. Mapped technology areas include Physics.
When was this patent published?
Publication date: Apr 6, 2021 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).