System and method for extracting data from a non-structured document

US10740372B2 · US · B2

Patent metadata
Publication number: US-10740372-B2
Application number: US-201615085781-A
Country: US
Kind code: B2
Filing date: Mar 30, 2016
Priority date: Apr 2, 2015
Publication date: Aug 11, 2020
Grant date: Aug 11, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A data object representing an electronic document having a plurality of data items each having at least one data value associated therewith is loaded from memory. The data object is searched for plurality of data items by keyword search for at least one candidate target data item. A target data item is selected by identifying at least one ancillary data item known to be located within the electronic document proximate to the at least one candidate. A target field within the electronic document is generated to encapsulate the at least one data value associated with the selected target data item. A format of the at least one data value is compared with a predetermined data value format and extracted from the target field in response to the format of the at least one data value matching the predetermined data value format for storage in a table of a database.
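The steps in the abstract can be sketched in code. The following is a minimal illustration, not the patented implementation: the `Item` layout, the centre-to-centre distance measure, the `max_distance` threshold, the same-row test, and the sample invoice data are all assumptions made for the example. It shows a keyword search that yields two candidates, selection of the candidate that has a known ancillary item nearby, and extraction of a value only when its format matches a predetermined pattern.

```python
import re
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Item:
    """One recognized text item with a top-left origin (hypothetical layout)."""
    text: str
    x: float
    y: float
    width: float
    height: float

    @property
    def center(self):
        return (self.x + self.width / 2, self.y + self.height / 2)


def distance(a, b):
    """Centre-to-centre distance between two items (an assumed proximity measure)."""
    (ax, ay), (bx, by) = a.center, b.center
    return hypot(ax - bx, ay - by)


def select_target(items, keyword, ancillary, max_distance):
    """Keyword search, then keep the first candidate with the ancillary item nearby."""
    candidates = [i for i in items if i.text.lower() == keyword.lower()]
    anchors = [i for i in items if i.text.lower() == ancillary.lower()]
    for cand in candidates:
        if any(distance(cand, a) <= max_distance for a in anchors):
            return cand
    return None


def extract_value(items, target, value_format):
    """Return the first item right of the target, on its row, that matches the format."""
    same_row = [
        i for i in items
        if i.x > target.x + target.width and abs(i.y - target.y) < target.height / 2
    ]
    for item in sorted(same_row, key=lambda i: i.x):
        if re.fullmatch(value_format, item.text):
            return item.text
    return None


# Hypothetical invoice: "Total" appears twice, only one is followed by "Due".
items = [
    Item("Total", 10, 100, 40, 12), Item("Items", 55, 100, 40, 12),
    Item("3", 300, 100, 10, 12),
    Item("Total", 10, 200, 40, 12), Item("Due", 55, 200, 30, 12),
    Item("$1,234.56", 300, 200, 70, 12),
]
target = select_target(items, "Total", "Due", max_distance=60)
value = extract_value(items, target, r"\$\d{1,3}(,\d{3})*\.\d{2}")
print(value)
```

In this sketch the ancillary item "Due" disambiguates the two "Total" labels, and the currency pattern rejects "Due" itself as a value even though it sits on the same row.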

First claim

Opening claim text (preview).

What is claimed is:

1. A method of extracting data from an electronic document comprising: loading, from a memory, a data object representing an electronic document having a plurality of data items each having at least one data value associated therewith; searching the plurality of data items in the electronic document by keyword search for at least one candidate target data items; selecting a target data item from the at least one candidate target data items by identifying at least one ancillary data item known to be located within the electronic document proximate to the at least one candidate target data items; generating a target field within the electronic document to encapsulate the at least one data value associated with the selected target data item, the target field having a predetermined height substantially equal to a height of the target data item, the target field extending horizontally within the electronic document in a direction away from the target data item and extending to a predetermined position in the electronic document; comparing a format of the at least one data value with a predetermined data value format; and extracting the at least one data value from the target field in response to the format of the at least one data value matching the predetermined data value format for storage in a table of a database.

2. The method according to claim 1, further comprising generating a compound data item field around the candidate data item and the at least one ancillary data item; comparing a format of the compound data item field with a predetermined format; and selecting, as the target data item, the candidate target data item when the format of the compound data item field matches the predetermined format.

3. The method according to claim 1, further comprising in response to identifying a plurality of candidate data items, generating a data field around each of the candidate data items; extending the data field to form a compound data item field in a direction away from each candidate data item to identify the at least one ancillary data item; and selecting, as the target data item, the candidate data item within the compound data item field including a predetermined first ancillary data item within a predetermined distance from the respective candidate data item.

4. The method according to claim 3, wherein in response to determining that more than one candidate data item is within a predetermined distance to the first ancillary data item, extending the compound data item field further in one of a same direction or different direction to identify at least one further ancillary data item within a predetermined distance from each of the candidate data item and the first ancillary data item; and selecting, as the target data item, the candidate data item from within the compound data item field that is a predetermined distance from each of the first ancillary data item and at least one further ancillary data item.

5. The method according to claim 1, wherein the step of generating a target field further comprises creating the target field having a predetermined height substantially equal to a height of the target data item, the target field beginning at a position in the electronic document a predetermined distance from a margin thereof and aligned with the selected target data item, the target field begin sequentially extending horizontally within the electronic document in a direction towards the target data item.

6. The method according to claim 1, further comprising receiving electronic document data; and performing an optical character recognition process on the electronic document data to create the data object.

7. A server apparatus that extracts data from an electronic document, the server comprising: a controller; a memory coupled to the controller storing instructions that, when executed by the controller control the server to load, from a memory, a data object representing an electronic document having a plurality of data items each having at least one data value associated therewith; search the plurality of data items in the electronic document by keyword search for at least one candidate target data items; select a target data item from the at least one candidate target data items by identifying at least one ancillary data item known to be located within the electronic document proximate to the at least one candidate target data items; generate a target field within the electronic document to encapsulate the at least one data value associated with the selected target data item, the target field having a predetermined height substantially equal to a height of the target data item, the target field extending horizontally within the electronic document in a direction away from the target data item and extending to a predetermined position in the electronic document; compare a format of the at least one data value with a predetermined data value format; and extract the at least one data value from the target field in response to the format of the at least one data value matching the predetermined data value format for storage in a table of a database.

8. The server apparatus according to claim 7, wherein execution of the instructions causes the server apparatus to generate a compound data item field around the candidate target data item and the at least one ancillary data item; compare a format of the compound data item field with a predetermined format; and select, as the target data item, the candidate data item when the format of the compound data item field matches the predetermined format.

9. The server apparatus according to claim 7, wherein execution of the instructions causes the server apparatus to in response to identifying a plurality of candidate data items, generate a data field around each of the candidate data items; extend the data field to form a compound data item field in a direction away from each candidate data item to identify the at least one ancillary data item; and select, as the target data item, the candidate data item within the compound data item field including a predetermined first ancillary data item within a predetermined distance from the respective candidate data item.

10. The server apparatus according to claim 9, wherein execution of the instructions causes the server apparatus to in response to determining that more than one candidate data item is within a predetermined distance to the first ancillary data item, extend the compound data item field further in one of a same direction or different direction to identify at least one further ancillary data item within a predetermined distance from each of the candidate data item and the first ancillary data item; and select, as the target data item, the candidate data item from within the compound data item field that is a predetermined distance from each of the first ancillary data item and at least one further ancillary data item.

11. The server apparatus according to claim 9, wherein execution of the instructions causes the server apparatus to receive electronic document data; and perform an optical character recognition process on the electronic document data to create the data object.

12. The server apparatus according to claim 7, wherein generation of the a target field further includes creating the target field having a predetermined height substantially equal to a height of the target data item, the target field beginning at a position in the electronic document a predetermined distance from a margin thereof and aligned with the selected target data item, the target field begin sequentially extend
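The target field geometry recited in claims 1 and 7 can be illustrated with a small sketch. This is one possible reading under stated assumptions, not the patented implementation: the `Box` type, the coordinate convention (origin at top-left, y increasing downward), the vertical-midpoint containment test used to approximate "substantially equal" height, and the `from_margin` flag (a loose stand-in for the margin-first variant of claims 5 and 12) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle; origin top-left, x grows right, y grows down."""
    left: float
    top: float
    right: float
    bottom: float


def target_field(target: Box, end_x: float) -> Box:
    """Field with the target's height, running away from the target up to end_x."""
    if end_x >= target.right:
        # Extend to the right of the target data item.
        return Box(target.right, target.top, end_x, target.bottom)
    # Otherwise extend to the left of the target data item.
    return Box(end_x, target.top, target.left, target.bottom)


def in_field(field: Box, item: Box) -> bool:
    """True if the item fits horizontally and its vertical midpoint is in the field."""
    mid_y = (item.top + item.bottom) / 2
    return (field.left <= item.left and item.right <= field.right
            and field.top <= mid_y <= field.bottom)


def values_in_field(field: Box, items, from_margin: bool = False):
    """Texts inside the field, ordered outward from the target or inward from the margin."""
    hits = [(box.left, text) for text, box in items if in_field(field, box)]
    return [text for _, text in sorted(hits, reverse=from_margin)]


# Hypothetical page: the label sits at the left, values lie on the same row.
label = Box(10, 200, 50, 212)
page_items = [
    ("Due", Box(55, 200, 85, 212)),
    ("$1,234.56", Box(300, 199, 370, 213)),  # slightly taller than the label
    ("3", Box(300, 100, 310, 112)),          # different row, must be ignored
]
field = target_field(label, end_x=400)
print(values_in_field(field, page_items))
print(values_in_field(field, page_items, from_margin=True))
```

Testing the vertical midpoint rather than strict containment lets a value that is a pixel taller than its label still fall inside the field, which is one way to tolerate the small height differences that recognition output tends to produce.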

Assignees

Canon Information & Imaging Solutions Inc, Canon Usa Inc

Inventors

Classifications

  • G06V30/412 (Primary)

    Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables

  • Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text

  • of printed characters having additional code marks or containing code marks

  • Authentication

  • Document management systems

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Canon Information & Imaging Solutions Inc, Canon Usa Inc
What technology area does this patent fall under?
Primary CPC classification G06V30/412. Mapped technology areas include Physics.
When was this patent published?
Publication date: Aug 11, 2020 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).