Extracting searchable information from a digitized document

US10318593B2 · US · B2

Patent metadata
Field               Value
Publication number  US-10318593-B2
Application number  US-201715681833-A
Country             US
Kind code           B2
Filing date         Aug 21, 2017
Priority date       Jun 21, 2017
Publication date    Jun 11, 2019
Grant date          Jun 11, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Data extraction and automatic validation from digitized documents in non-editable formats is disclosed. Paper documents are digitized or converted into formats suitable for storage on computers or other digital devices. The digitized documents are classified into one of a plurality of document types and based on the document type, document processing rules are selected for analyzing the digitized documents to enable data extraction and automatic validation. The positions and values of the data fields in the digitized documents are obtained using machine learning techniques. The data field values are automatically validated and assigned confidence scores. Data fields with low confidence scores are flagged for manual review.
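
The validation step in the abstract (score each extracted field, then flag low scores for manual review) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the field names, the validation conditions, the scoring rule (fraction of conditions satisfied), and the 0.75 threshold do not come from the patent, which only says that confidence scores reflect compliance with validation conditions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A validation condition is any predicate over the extracted string value.
Condition = Callable[[str], bool]


@dataclass
class FieldResult:
    name: str
    value: str
    confidence: float
    needs_review: bool


def score_field(value: str, conditions: List[Condition]) -> float:
    """Assumed scoring rule: fraction of validation conditions the value meets."""
    if not conditions:
        return 0.0  # nothing to check against, so treat as unverified
    return sum(1 for cond in conditions if cond(value)) / len(conditions)


def validate(extracted: Dict[str, str],
             rules: Dict[str, List[Condition]],
             threshold: float = 0.75) -> List[FieldResult]:
    """Score every extracted field and flag those below the threshold."""
    results = []
    for name, value in extracted.items():
        confidence = score_field(value, rules.get(name, []))
        results.append(FieldResult(name, value, confidence, confidence < threshold))
    return results


# Hypothetical rules for two hypothetical fields.
RULES: Dict[str, List[Condition]] = {
    "invoice_number": [str.isdigit, lambda v: len(v) == 6],
    "total": [lambda v: v.replace(".", "", 1).isdigit(), lambda v: "." in v],
}

results = validate({"invoice_number": "12345A", "total": "199.50"}, RULES)
flagged = [r.name for r in results if r.needs_review]
print(flagged)  # fields that would be routed to manual review
```

In this sketch the invoice number meets only one of its two conditions, so it falls below the threshold and is flagged, while the total passes both checks.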

First claim

Opening claim text (preview).

What is claimed is:

1. A system that extracts searchable data from digitized documents comprising: one or more processors; and a non-transitory data storage comprising instructions that cause the processors to: access a root file comprising a plurality of digitized documents that are generated from a plurality of paper documents, wherein the digitized documents comprise one or more of duplicate copies and multiple versions of one or more of the paper documents; classify the root file under a document type selected from a plurality of document types based on a purpose associated with the paper documents; select document processing rules in accordance with the classification of the root file; select a subset of significant documents from the plurality of the digitized documents by excluding the duplicate copies and multiple versions of one or more of the paper documents from the root file such that a unique copy of the one or more paper documents is selected from the root file; generate an input file comprising the subset of significant documents such that each of the significant documents corresponds to one of the unique copies selected from the root file; access a list of data fields that are to be identified from the input file from data field information included in the document processing rules; determine values and locations of one or more of the data fields within the input file based on the data field information; build an index structure that enables locating the one or more data fields within the input file; validate at least one of the one or more data fields for which the locations and values are identified from the input file; enable access to the one or more data fields and the input file via a user interface with controls populated with data from the index structure, the user interface configured to display a source image portion including the input file and an extracted data portion including the controls with the values and the locations of the one or more data fields in the input file, the user interface further configured to display at least a subset of the one or more data fields that have not been validated via a color coding for manual verification; receive a user selection of one of the controls on the extracted data portion; and display, in the source image portion, a portion of the input file including one or more of the data fields that correspond to the selected control.

2. The system of claim 1, the instructions to access the root file further comprising instructions that cause the processors to: receive an image generated in a non-editable format by scanning the plurality of paper documents.

3. The system of claim 1, the instructions to classify the root file further comprising instructions that cause the processors to: access one or more of images, logos and form layouts associated with the plurality of document types from a data store; and classify the root file under one of the plurality of document types based on a match between the images, logos and form layouts in the root file and the images, logos and form layouts accessed from the data store.

4. The system of claim 3, the instructions to classify the root file under one of the plurality of document types based on a match further comprising instructions that cause the processors to: employ document classifiers trained on image processing techniques to identify the match between the images, logos and form layouts in the root file and the images, logos and form layouts accessed from the data store.

5. The system of claim 1, the instructions to select a subset of significant documents from the plurality of the digitized documents further comprising instructions that cause the processors to: access document processing rules associated with the document type of the root file; and select the subset of significant documents based on the document processing rules.

6. The system of claim 1, the instructions to determine values and locations of the one or more data fields within the input file further comprising instructions that cause the processors to: access a plurality of field models respectively corresponding to each of the data fields, each of the plurality of field models including classifiers trained to identify the data fields from the input file; and obtain a page number and position coordinates within a page identified by the page number of each of the one or more data fields within the input file.

7. The system of claim 6, the instructions to build an index structure further comprising instructions that cause the processors to: build the index structure that stores for each of the one or more data fields, identity of a respective significant document of the subset of significant documents bearing the data field, a page number of the respective significant document within the input file and the position coordinates of the data field within a page of the respective significant document.

8. The system of claim 1, further comprising instructions that cause the processors to: receive user input identifying a location of at least one of the data fields within the input file wherein the location of the at least one data field could not be determined.

9. The system of claim 8, the instructions for manual verification further comprising instructions that cause the processors to: explicitly train a respective field model of the data field on the user input for enabling locating the at least one data field.

10. The system of claim 1, further comprising instructions that cause the processors to: upload validated data from the index structure to an external system; and generate a data file within the external system comprising the uploaded data.

11. A method of extracting and validating data comprising: receiving a root file that comprises a plurality of digitized documents obtained by imaging respective paper documents from a document package; classifying the root file into one of a plurality of document types based on a purpose associated with the document package; selecting document processing rules for processing the root file based on the document type under which the root file is classified; splitting the root file into individual digitized documents based on the document processing rules, the individual digitized documents including multiple versions of at least one document; selecting a subset of the individual digitized documents to form an input file based on document identification information included in the document processing rules; extracting data values and positions of one or more of a plurality of data fields comprised in the input file; calculating respective confidence scores for the one or more data fields, the confidence scores indicating an extent of compliance of the one or more data fields with respective validation conditions; generating an index structure from the input file, the index structure including the data values, the positions and the confidence scores for each of the one or more data fields; displaying an image of the input file within a source image portion of a user interface, the user interface having controls populated with the data values from the index structure and the user interface configured to display at least a subset of the one or more data fields that have not been validated via a color coding for manual verification; displaying within an extracted data portion of the user interface, the controls with the values and positions of the one or more data fields; receiving a user selection of one of the controls within the extracted data portion; and displaying within the source image portion, a portion of the input
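
The index structure recited in the claims (for each data field: the document bearing it, a page number, and position coordinates) might look roughly like the sketch below. This is an illustration under stated assumptions, not the patented implementation: the class names `FieldLocation` and `FieldIndex`, the `(x, y, width, height)` box layout, and the sample field names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class FieldLocation:
    document_id: str                 # significant document bearing the field
    page: int                        # page number within the input file
    box: Tuple[int, int, int, int]   # assumed layout: (x, y, width, height)
    value: str                       # extracted field value
    validated: bool                  # False means "needs manual verification"


class FieldIndex:
    """Maps a data field name to where it sits in the input file."""

    def __init__(self) -> None:
        self._entries: Dict[str, FieldLocation] = {}

    def add(self, field_name: str, location: FieldLocation) -> None:
        self._entries[field_name] = location

    def locate(self, field_name: str) -> Optional[FieldLocation]:
        # What a UI would call when the user selects a control:
        # returns the page and box to show in the source image portion.
        return self._entries.get(field_name)

    def unvalidated(self) -> List[str]:
        # Fields a UI could color-code for manual verification.
        return [name for name, loc in self._entries.items() if not loc.validated]


index = FieldIndex()
index.add("borrower_name",
          FieldLocation("loan_application", 3, (120, 340, 200, 18), "Jane Doe", True))
index.add("loan_amount",
          FieldLocation("promissory_note", 7, (90, 512, 140, 18), "250000", False))

hit = index.locate("loan_amount")
print(hit.page, hit.box)
print(index.unvalidated())
```

A dictionary keyed by field name keeps the lookup behind each control a constant-time operation; a real system would likely also need to handle fields that occur more than once, which this sketch ignores.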

Assignees

Accenture Global Solutions Ltd

Inventors

Classifications

  • G06F40/197 · Primary

    Version control (for software G06F8/71) · CPC title

  • Character encoding · CPC title

  • File access structures, e.g. distributed indices (arrangements of input from, or output to, record carriers G06F3/06) · CPC title

  • Details of archiving (lifecycle management in storage systems G06F3/0649; point-in-time backing up or restoration of persistent data G06F11/1446) · CPC title

  • Active pattern learning · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10318593B2 cover?
Data extraction and automatic validation from digitized documents in non-editable formats is disclosed. Paper documents are digitized or converted into formats suitable for storage on computers or other digital devices. The digitized documents are classified into one of a plurality of document types and based on the document type, document processing rules are selected for analyzing the digitiz…
Who is the assignee on this patent?
Accenture Global Solutions Ltd
What technology area does this patent fall under?
Primary CPC classification G06F40/197. Mapped technology areas include Physics.
When was this patent published?
Publication date Jun 11, 2019 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).