What technology area does this patent fall under?

Primary CPC classification G06F16/313. Mapped technology areas include Physics.

When was this patent published?

Publication date: Apr 4, 2019 (kind code A1). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Automated cognitive processing of source agnostic data

US2019102375A1 · US · A1

Patent metadata
Field	Value
Publication number	US-2019102375-A1
Application number	US-201815935591-A
Country	US
Kind code	A1
Filing date	Mar 26, 2018
Priority date	Sep 29, 2017
Publication date	Apr 4, 2019
Grant date	—

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

With the scale of information available today along with the existing diverse channels of communication, manual processing of information is becoming a challenge and companies across industries are under tremendous pressure to lower transactional costs. Artificial Intelligence based automation of business transactions has seen regulatory hurdles due to probabilistic nature of the outcome. The main challenge lies in processing of transactions with unstructured information. Systems and methods of the present disclosure uses deterministic as well as probabilistic approaches to maximize accuracy. The larger use of deterministic approach with configurable components and ontologies helps to improvise accuracy, precision and reduce recall. The probabilistic approach is used when there is absence of quality information or less information for learning. Also, confidence indicators are provided at attribute level of data being processed and at each decision level.
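
The combination of approaches described in the abstract can be pictured with a small sketch. This is not the patented implementation: the lookup table `CONTEXT_DICTIONARY`, the stand-in `probabilistic_guess` function, the `Extraction` record, and the confidence values 1.0 and 0.5 are all hypothetical. It only illustrates trying a deterministic match first, falling back to a probabilistic guess when no rule applies, and attaching a confidence indicator to each extracted attribute.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical configurable dictionary: normalised phrase -> canonical value.
CONTEXT_DICTIONARY = {
    "usa": "United States",
    "u.s.": "United States",
    "uk": "United Kingdom",
}


@dataclass
class Extraction:
    attribute: str
    value: str
    method: str        # "deterministic" or "probabilistic"
    confidence: float  # illustrative indicator in [0, 1]


def probabilistic_guess(text: str) -> Tuple[str, float]:
    """Stand-in for a learned model (assumption): title-cases the input
    and reports a fixed, lower confidence."""
    return text.strip().title(), 0.5


def extract(attribute: str, text: str) -> Extraction:
    """Prefer the deterministic dictionary; fall back to the model otherwise."""
    key = text.strip().lower()
    if key in CONTEXT_DICTIONARY:
        return Extraction(attribute, CONTEXT_DICTIONARY[key], "deterministic", 1.0)
    value, confidence = probabilistic_guess(text)
    return Extraction(attribute, value, "probabilistic", confidence)
```

Calling `extract("country", " USA ")` takes the deterministic path, while `extract("city", "springfield")` falls through to the stand-in model and carries the lower confidence, so a downstream reviewer can see which attributes need attention.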

First claim

Opening claim text (preview).

What is claimed is:

1. A processor implemented method (200) comprising: extracting metadata associated with one or more source documents, wherein the one or more source documents are identified as having a structured form, a semi-structured form, an unstructured form, an image form or a combination thereof, based on the extracted metadata (202); processing the one or more source documents for extracting data comprising entities and attributes thereof (204); extracting data from the one or more source documents in either native language or English language based on cognitive processing of the one or more source documents to obtain an Enterprise-to Business (E2B) Extensible Markup Language (XML) form having a pre-defined set of templates (206); evaluating the Enterprise-to Business (E2B) XML form for accuracy and completion of the step of extracting data (208); and deciding validity of the one or more source documents based on existence of content in the pre-defined set of templates (210).

2. The processor implemented method of claim 1, wherein the step of extracting data comprises using at least one of a deterministic approach and a probabilistic approach.

3. The processor implemented method of claim 1, wherein the structured form and the unstructured form of the one or more source documents are processed by: converting the one or more source documents to a formatted Extensible Markup Language (XML) form, wherein the formatted XML form includes in a raw form of one or more of (i) page wise information pertaining to coordinates, font style, font type of text contained therein at a character level and (ii) information pertaining to one or more of cells, border lines associated with the cells and images contained therein; and converting the formatted XML form to an intermediate XML form having a format conforming to a format of the corresponding one or more source documents.

4. The processor implemented method of claim 3, wherein the step of extracting data from the structured form of the one or more source documents comprises: identifying sections comprised in the intermediate XML form as parent nodes and extracting data contained in each of the sections based on a first set of pre-defined rules pertaining to the identified sections, wherein the sections include horizontal or vertical tables, forms, key-value pairs and plain text; storing the extracted data pertaining to each of the sections in an extracted XML form wherein entities and attributes thereof in each of the sections represents a child node having a value associated thereof; performing a context dictionary match for the entities and the attributes to obtain matched entities and attributes; and populating the Enterprise-to Business (E2B) XML form based on at least a part of the matched entities and attributes.

5. The processor implemented method of claim 4, wherein the step of extracting data from the unstructured form of the one or more source documents comprises: creating a master map of elements comprised in each page of the intermediate XML form, wherein the elements include page numbers and groups based on the attributes; determining a physical layout of each page based on the created master map; identifying the one or more source documents having the unstructured form based on a type associated thereof; creating an extracted XML form having a page by page flow based on the physical layout; segmenting the extracted XML into sentences and further extracting a set of sentence clauses from the sentences by: eliminating word joins and symbols in the sentences; annotating the sentences using a dependency parser; extracting the set of sentence clauses from the annotated sentences based on noun chunks, verb spans and dependencies between words in the sentences and a second set of pre-defined rules, wherein the dependencies are stored as a dependency tree in the form of a graph; parsing subject clauses and object clauses from the set of sentence clauses for the context dictionary match to obtain one or more entities; validating the obtained one or more entities based on either the context dictionary match or a probabilistic approach; extracting one or more validated entities along with attributes thereof as the extracted data; and populating the Enterprise-to Business (E2B) XML form based on at least a part of the extracted data.

6. The processor implemented method of claim 5, wherein the step of extracting the set of sentence clauses is preceded by Parts of Speech (POS) tagging.

7. The processor implemented method of claim 5, wherein the context dictionary match comprises performing at least one of: checking for an exact match by: comparing one or more words in the set of sentence clauses for the context dictionary match, wherein the context dictionary is pre-defined; identifying an exact match for a single word; checking for a partial match and processing a new match for multiple words; and checking for a fuzzy match by: performing a similarity match between the sentences; computing edit distance between two sentences and an associated similarity score; generating a fuzzy match output by either extracting values based on the computed similarity score, based on a pre-defined number of best matches, or based on a best match.

8. The processor implemented method of claim 5, wherein the context dictionary is created by: receiving one or more training documents; annotating sentences contained in the one or more training documents and identifying entities therein; extracting sentence clauses from the annotated sentences and identifying sentence clauses having the identified entities; analyzing context association of the identified entities with verb spans in the sentence clauses; computing frequency of the context association based on a context mining method; and selecting the context association to be included in the context dictionary based on the computed frequency thereof.

9. The processor implemented method of claim 1, wherein the step of evaluating the Enterprise-to Business (E2B) XML form comprises: correlating the set of templates obtained from the Enterprise-to Business (E2B) XML form to check similarity across the one or more source documents; and computing a confidence score of extraction of entities and attributes in each of the Enterprise-to Business (E2B) XML form; and computing an overall confidence score for each of the Enterprise-to Business (E2B) XML form based on the confidence score of each of the extraction of entities and attributes and pre-defined weightages thereof.

10. The processor implemented method of claim 9, wherein the step of computing a confidence score of extraction of entities and attributes is based on one or more of: the form of the one or more source document; the method of validating the one or more entities based on a context dictionary match or a probabilistic approach; and accuracy of the context dictionary match.

11. The processor implemented method of claim 1 further comprising classifying the validated one or more source documents based on analyses of the content in the pre-defined set of templates using neural networks.

12. The processor implemented method of claim 11 further comprising performing decision traceability pertaining to at least the steps of: validating the one or more entities based on a context dictionary match or a probabilistic approach; correlating the set of templates obtained from the Enterprise-to Business (E2B) XML form; deciding on validity of the one or more source documents; and classifying the validated one or more source docum
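
The fuzzy match in claim 7 and the weighted overall score in claim 9 can be illustrated with a short sketch. It is an assumption-laden illustration rather than the method the patent discloses: the claims do not say which edit distance is used or how the score is normalised, so this example assumes Levenshtein distance, a similarity of one minus the distance divided by the longer length, and an overall confidence computed as a weight-normalised average. The function names and parameters (`fuzzy_match`, `min_score`, `top_n`, `overall_confidence`) are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def fuzzy_match(clause: str, dictionary: List[str],
                min_score: Optional[float] = None,
                top_n: Optional[int] = None) -> List[Tuple[str, float]]:
    """Rank dictionary entries against a clause, best first.

    Output modes mirror claim 7: every entry at or above `min_score`,
    the `top_n` best entries, or (by default) the single best entry.
    """
    scored = sorted(
        ((entry, similarity(clause.lower(), entry.lower())) for entry in dictionary),
        key=lambda pair: pair[1],
        reverse=True,
    )
    if min_score is not None:
        return [pair for pair in scored if pair[1] >= min_score]
    if top_n is not None:
        return scored[:top_n]
    return scored[:1]


def overall_confidence(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-attribute confidence scores (assumed formula)."""
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total
```

For instance, matching the misspelled clause `"invoice numbr"` against the entries `"invoice number"`, `"invoice date"` and `"order number"` ranks `"invoice number"` first, since only one insertion separates the two strings. Per-attribute scores such as these could then be combined with `overall_confidence` using whatever weightages a deployment defines.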

Assignees

Tata Consultancy Services Ltd

Inventors

Classifications

G06F40/186
Templates · CPC title
G06F16/88
Mark-up to mark-up conversion (conversion for visualization in web browsing G06F16/9577) · CPC title
G06F16/907
Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually · CPC title
G06F40/242
Dictionaries · CPC title
G06F16/313 (Primary)
Selection or weighting of terms for indexing · CPC title

Patent family

Related publications grouped by family.

View patent family 61911348

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2019102375A1 cover?: With the scale of information available today along with the existing diverse channels of communication, manual processing of information is becoming a challenge and companies across industries are under tremendous pressure to lower transactional costs. Artificial Intelligence based automation of business transactions has seen regulatory hurdles due to probabilistic nature of the outcome. The m…
Who is the assignee on this patent?: Tata Consultancy Services Ltd