US2019102375A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2019102375-A1 |
| Application number | US-201815935591-A |
| Country | US |
| Kind code | A1 |
| Filing date | Mar 26, 2018 |
| Priority date | Sep 29, 2017 |
| Publication date | Apr 4, 2019 |
| Grant date | — |
Official abstract text for this publication.
With the scale of information available today and the diversity of communication channels, manual processing of information is becoming a challenge, and companies across industries are under tremendous pressure to lower transactional costs. Artificial Intelligence based automation of business transactions has faced regulatory hurdles due to the probabilistic nature of its outcomes. The main challenge lies in processing transactions that contain unstructured information. Systems and methods of the present disclosure use deterministic as well as probabilistic approaches to maximize accuracy. Greater use of the deterministic approach, with configurable components and ontologies, helps to improve accuracy, precision and reduce recall. The probabilistic approach is used when quality information is absent or there is too little information for learning. Confidence indicators are also provided at the attribute level of the data being processed and at each decision level.
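The deterministic-first strategy with attribute-level confidence described in the abstract, together with the weighted overall score of claim 9, can be sketched roughly as follows. This is a minimal illustration and not the patented implementation: the names `Extraction`, `extract_attribute` and `overall_confidence`, the choice of confidence 1.0 for a rule hit, and the weighted-mean formula are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

@dataclass
class Extraction:
    attribute: str
    value: Optional[str]
    method: str        # "deterministic", "probabilistic" or "none"
    confidence: float  # attribute-level confidence in [0, 1]

Rule = Callable[[str], Optional[str]]
Fallback = Callable[[str, str], Optional[Tuple[str, float]]]

def extract_attribute(attribute: str, text: str,
                      rules: Dict[str, Rule], fallback: Fallback) -> Extraction:
    """Try a configured rule first; use the probabilistic fallback otherwise."""
    rule = rules.get(attribute)
    if rule is not None:
        value = rule(text)
        if value is not None:
            # Assumption: a rule hit is treated as fully confident.
            return Extraction(attribute, value, "deterministic", 1.0)
    guess = fallback(attribute, text)
    if guess is not None:
        value, score = guess
        return Extraction(attribute, value, "probabilistic", score)
    return Extraction(attribute, None, "none", 0.0)

def overall_confidence(extractions: List[Extraction],
                       weights: Dict[str, float]) -> float:
    """Weighted mean of attribute-level confidences (assumed formula)."""
    total = sum(weights.get(e.attribute, 0.0) for e in extractions)
    if total == 0:
        return 0.0
    weighted = sum(weights.get(e.attribute, 0.0) * e.confidence
                   for e in extractions)
    return weighted / total

# Hypothetical rule and fallback, for illustration only.
def age_rule(text: str) -> Optional[str]:
    m = re.search(r"(\d+)[- ]year[- ]old", text)
    return m.group(1) if m else None

def stub_model(attribute: str, text: str) -> Optional[Tuple[str, float]]:
    # Stand-in for a trained model; returns a fixed guess and score.
    return ("headache", 0.6) if attribute == "reaction" else None
```

Calling `extract_attribute("age", "A 45-year-old patient reported pain.", {"age": age_rule}, stub_model)` takes the deterministic path, while a request for `"reaction"` has no rule and falls through to the stub. Recording `method` next to each value is one simple way to support the decision traceability the claims mention.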
Opening claim text (preview).
What is claimed is:

1. A processor implemented method (200) comprising: extracting metadata associated with one or more source documents, wherein the one or more source documents are identified as having a structured form, a semi-structured form, an unstructured form, an image form or a combination thereof, based on the extracted metadata (202); processing the one or more source documents for extracting data comprising entities and attributes thereof (204); extracting data from the one or more source documents in either native language or English language based on cognitive processing of the one or more source documents to obtain an Enterprise-to Business (E2B) Extensible Markup Language (XML) form having a pre-defined set of templates (206); evaluating the Enterprise-to Business (E2B) XML form for accuracy and completion of the step of extracting data (208); and deciding validity of the one or more source documents based on existence of content in the pre-defined set of templates (210).

2. The processor implemented method of claim 1, wherein the step of extracting data comprises using at least one of a deterministic approach and a probabilistic approach.

3. The processor implemented method of claim 1, wherein the structured form and the unstructured form of the one or more source documents are processed by: converting the one or more source documents to a formatted Extensible Markup Language (XML) form, wherein the formatted XML form includes in a raw form of one or more of (i) page wise information pertaining to coordinates, font style, font type of text contained therein at a character level and (ii) information pertaining to one or more of cells, border lines associated with the cells and images contained therein; and converting the formatted XML form to an intermediate XML form having a format conforming to a format of the corresponding one or more source documents.

4. The processor implemented method of claim 3, wherein the step of extracting data from the structured form of the one or more source documents comprises: identifying sections comprised in the intermediate XML form as parent nodes and extracting data contained in each of the sections based on a first set of pre-defined rules pertaining to the identified sections, wherein the sections include horizontal or vertical tables, forms, key-value pairs and plain text; storing the extracted data pertaining to each of the sections in an extracted XML form wherein entities and attributes thereof in each of the sections represents a child node having a value associated thereof; performing a context dictionary match for the entities and the attributes to obtain matched entities and attributes; and populating the Enterprise-to Business (E2B) XML form based on at least a part of the matched entities and attributes.

5. The processor implemented method of claim 4, wherein the step of extracting data from the unstructured form of the one or more source documents comprises: creating a master map of elements comprised in each page of the intermediate XML form, wherein the elements include page numbers and groups based on the attributes; determining a physical layout of each page based on the created master map; identifying the one or more source documents having the unstructured form based on a type associated thereof; creating an extracted XML form having a page by page flow based on the physical layout; segmenting the extracted XML into sentences and further extracting a set of sentence clauses from the sentences by: eliminating word joins and symbols in the sentences; annotating the sentences using a dependency parser; extracting the set of sentence clauses from the annotated sentences based on noun chunks, verb spans and dependencies between words in the sentences and a second set of pre-defined rules, wherein the dependencies are stored as a dependency tree in the form of a graph; parsing subject clauses and object clauses from the set of sentence clauses for the context dictionary match to obtain one or more entities; validating the obtained one or more entities based on either the context dictionary match or a probabilistic approach; extracting one or more validated entities along with attributes thereof as the extracted data; and populating the Enterprise-to Business (E2B) XML form based on at least a part of the extracted data.

6. The processor implemented method of claim 5, wherein the step of extracting the set of sentence clauses is preceded by Parts of Speech (POS) tagging.

7. The processor implemented method of claim 5, wherein the context dictionary match comprises performing at least one of: checking for an exact match by: comparing one or more words in the set of sentence clauses for the context dictionary match, wherein the context dictionary is pre-defined; identifying an exact match for a single word; checking for a partial match and processing a new match for multiple words; and checking for a fuzzy match by: performing a similarity match between the sentences; computing edit distance between two sentences and an associated similarity score; generating a fuzzy match output by either extracting values based on the computed similarity score, based on a pre-defined number of best matches, or based on a best match.

8. The processor implemented method of claim 5, wherein the context dictionary is created by: receiving one or more training documents; annotating sentences contained in the one or more training documents and identifying entities therein; extracting sentence clauses from the annotated sentences and identifying sentence clauses having the identified entities; analyzing context association of the identified entities with verb spans in the sentence clauses; computing frequency of the context association based on a context mining method; and selecting the context association to be included in the context dictionary based on the computed frequency thereof.

9. The processor implemented method of claim 1, wherein the step of evaluating the Enterprise-to Business (E2B) XML form comprises: correlating the set of templates obtained from the Enterprise-to Business (E2B) XML form to check similarity across the one or more source documents; and computing a confidence score of extraction of entities and attributes in each of the Enterprise-to Business (E2B) XML form; and computing an overall confidence score for each of the Enterprise-to Business (E2B) XML form based on the confidence score of each of the extraction of entities and attributes and pre-defined weightages thereof.

10. The processor implemented method of claim 9, wherein the step of computing a confidence score of extraction of entities and attributes is based on one or more of: the form of the one or more source document; the method of validating the one or more entities based on a context dictionary match or a probabilistic approach; and accuracy of the context dictionary match.

11. The processor implemented method of claim 1 further comprising classifying the validated one or more source documents based on analyses of the content in the pre-defined set of templates using neural networks.

12. The processor implemented method of claim 11 further comprising performing decision traceability pertaining to at least the steps of: validating the one or more entities based on a context dictionary match or a probabilistic approach; correlating the set of templates obtained from the Enterprise-to Business (E2B) XML form; deciding on validity of the one or more source documents; and classifying the validated one or more source docum
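The exact and fuzzy branches of the context dictionary match in claim 7 can be illustrated with a short sketch. It assumes a Levenshtein edit distance and a similarity score normalised by the longer string; the claim names neither, so both are choices made for this example, as are the function names and the sample dictionary entries. The partial-match branch for multi-word phrases is not covered here.

```python
from typing import List, Optional, Tuple

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical (assumed normalisation)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest

def dictionary_match(phrase: str, dictionary: List[str],
                     threshold: float = 0.8,
                     top_n: Optional[int] = None) -> List[Tuple[str, float]]:
    """Exact match first, then fuzzy matches ranked by similarity.

    With top_n set, return that many best matches (top_n=1 gives the
    single best match); otherwise return every entry scoring at or
    above the threshold.
    """
    key = phrase.lower().strip()
    entries = [e.lower() for e in dictionary]
    if key in entries:
        return [(key, 1.0)]
    scored = sorted(((e, similarity(key, e)) for e in entries),
                    key=lambda pair: pair[1], reverse=True)
    if top_n is not None:
        return scored[:top_n]
    return [pair for pair in scored if pair[1] >= threshold]

# Hypothetical dictionary entries, for illustration only.
terms = ["headache", "nausea", "dizziness"]
```

The three output modes in the claim map onto the two parameters: `threshold` selects by similarity score, `top_n` returns a pre-defined number of best matches, and `top_n=1` returns the single best match. The returned score could also feed the "accuracy of the context dictionary match" input that claim 10 lists for the confidence calculation.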
Templates · CPC title
Mark-up to mark-up conversion (conversion for visualization in web browsing G06F16/9577) · CPC title
Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually · CPC title
Dictionaries · CPC title
Selection or weighting of terms for indexing · CPC title