Intelligent use of extraction techniques

US11176158B2 · US · B2

Patent metadata
Publication number: US-11176158-B2
Application number: US-201916527734-A
Country: US
Kind code: B2
Filing date: Jul 31, 2019
Priority date: Jul 31, 2019
Publication date: Nov 16, 2021
Grant date: Nov 16, 2021

Abstract

Official abstract text for this publication.

A set of documents is received for processing and extraction. A set of processing engines is received, and each processing engine has an expected benefit when processing a document with associated document metadata. The set of documents is analyzed to determine document metadata to be associated with each document. An expected benefit is determined for each of the documents of the set of documents when the respective document is processed by a respective processing engine in the set of processing engines. An expected cost for processing is determined for each of the documents in each of the set of processing engines. A processing plan for the set of documents is determined, wherein the processing plan identifies a selection of documents to be run in respective processing engines of the set of processing engines based on a cost versus benefit analysis. The processing plan is executed, extracting information from the identified selection of documents.
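
The cost versus benefit selection summarized in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the names `Document`, `Engine`, `expected_benefit`, `expected_cost` and `plan`, the per-type benefit table, and the size-proportional cost model do not come from the patent, which does not state here how benefit or cost is computed.

```python
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    doc_type: str  # metadata produced by the analysis step (hypothetical field)
    size_kb: int   # metadata produced by the analysis step (hypothetical field)


@dataclass
class Engine:
    name: str
    benefit_by_type: dict  # assumed: expected benefit per document type
    cost_per_kb: float     # assumed: cost grows linearly with document size


def expected_benefit(doc, engine):
    # Unknown document types are assumed to yield no benefit.
    return engine.benefit_by_type.get(doc.doc_type, 0.0)


def expected_cost(doc, engine):
    return engine.cost_per_kb * doc.size_kb


def plan(docs, engines):
    """Pick, per document, the engine with the highest net benefit.

    A document is left out of the plan when no engine is expected
    to return more benefit than it costs.
    """
    selected = []
    for doc in docs:
        best = max(
            engines,
            key=lambda e: expected_benefit(doc, e) - expected_cost(doc, e),
        )
        if expected_benefit(doc, best) - expected_cost(doc, best) > 0:
            selected.append((doc.doc_id, best.name))
    return selected


docs = [Document("d1", "contract", 100), Document("d2", "memo", 400)]
engines = [
    Engine("ner", {"contract": 8.0, "memo": 1.0}, 0.01),
    Engine("deep", {"contract": 15.0, "memo": 3.0}, 0.05),
]
processing_plan = plan(docs, engines)
```

In this made-up data, the contract is routed to the more expensive engine because its net benefit is higher there, while the memo is dropped because every engine costs more than it is expected to return.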

First claim

Opening claim text (preview).

The invention claimed is:

1. A method for determining a processing plan for a set of documents comprising: receiving a set of documents for processing; receiving a set of processing engines wherein each processing engine has an expected benefit when processing a document with associated document metadata; analyzing each of the set of documents to determine document metadata to be associated with the document; determining an expected benefit for each of the documents of the set of document when the respective document is processed by a respective processing engine in the set processing engines; determining an expected cost for processing each of the documents in each of the set of processing engines; determining a processing plan for the set of documents, wherein the processing plan identifies a selection of documents to be run in respective processing engines of the set of processing engines based on a cost versus benefit analysis; and executing the processing plan thereby extracting information from the identified selection of documents.

2. The method as recited in claim 1, wherein the set of processing engines comprises a set of natural language processing (NLP) engines, wherein the expected benefit is an information gain from processing the document, wherein the determining an expected benefit further comprises determining an expected for a set of goals wherein ones of the set of goals have higher priorities than other ones of the set of goals.

3. The method as recited in claim 2, wherein applying the cost versus benefit analysis to determine the processing plan for the set of documents applies the analysis to respective ones of the documents and compares document content of the respective ones to documents already selected for the processing plan is not selected for the process.

4. The method as recited in claim 1, wherein the analyzing determines the document metadata including metadata selected from a group consisting of semantic markers, document similarity, document type, document content, document size, document formatting and document quality.

5. The method as recited in claim 1, wherein a first selected portion of a first document is identified for processing by a first processing engine and a second selected portion of the first document is identified for processing by a second processing engine.

6. The method as recited in claim 3, wherein the cost versus benefit analysis includes a maximum allowable cost for processing the set of documents.

7. The method as recited in claim 2, where a respective document from the set of documents is not selected for the processing plan if insufficient information gain is expected.

8. The method as recited in claim 2, further comprising optimizing documents by removing document portions not expected to provide a threshold information gain.

9. The method as recited in claim 2, further comprising: presenting a plurality of processing plans; responsive to user input, executing a selected processing plan by sending the selected documents to a set of internal or external extraction and processing engines in the selected processing plan; receiving a set of results from the engines; and evaluating an information gain for each document from a processing engine by using a machine learning model running with the extracted features.

10. The method as recited in claim 9, further comprising training a machine learning model to more accurately predict the information gain.

11. Apparatus, comprising: a processor; computer memory holding computer program instructions executed by the processor for determining a processing plan for a set of documents, the computer program instructions comprising: program code, operative to receive a set of documents for processing; program code, operative to receive a set of processing engines wherein each processing engine has an expected benefit when processing a document with associated document metadata; program code, operative to analyze each of the set of documents to determine document metadata to be associated with the document; program code, operative to determine an expected benefit for each of the documents of the set of document when the respective document is processed by a respective processing engine in the set processing engines; program code, operative to determine an expected cost for processing each of the documents in each of the set of processing engines; program code, operative to determine a processing plan for the set of documents, wherein the processing plan identifies a selection of documents to be run in respective processing engines of the set of processing engines based on a cost versus benefit analysis; and program code, operative to execute the processing plan thereby extracting information from the identified selection of documents.

12. The apparatus as recited in claim 11, wherein the expected benefit comprises an information gain from processing the set of documents, and the apparatus further comprises program code, operative to predict the information gain based on a document type and a document date of each of the set of documents.

13. The apparatus as recited in claim 11, wherein a document type and size are used for predicting a cost of processing a document.

14. The apparatus as recited in claim 11, further comprising: program code, operative to present a plurality of processing plans; program code, operative to execute a selected processing plan by sending the selected documents to a set of internal or external extraction and processing engines in the selected processing plan; program code, operative to receive a set of results from the engines; and program code, operative to evaluate an information gain for each document from a processing engine by using a machine learning model running with the extracted features.

15. A computer program product in a non-transitory computer readable medium for use in a data processing system, the computer program product holding computer program instructions executed by the data processing system for determining a processing plan for a set of documents, the computer program instructions comprising: program code, operative to receive a set of documents for processing; program code, operative to receive a set of processing engines wherein each processing engine has an expected benefit when processing a document with associated document metadata; program code, operative to analyze each of the set of documents to determine document metadata to be associated with the document; program code, operative to determine an expected benefit for each of the documents of the set of document when the respective document is processed by a respective processing engine in the set processing engines; program code, operative to determine an expected cost for processing each of the documents in each of the set of processing engines; program code, operative to determine a processing plan for the set of documents, wherein the processing plan identifies a selection of documents to be run in respective processing engines of the set of processing engines based on a cost versus benefit analysis; and program code, operative to execute the processing plan thereby extracting information from the identified selection of documents.

16. The computer program product as recited in claim 15, further comprising program code, operative to select a processing engine based on semantic markers and document content.

17. The computer program product as recited in claim 15, further comprising program code, operative to select a p
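
Claims 6 and 7 add two constraints to the plan: a maximum allowable cost for the whole set, and exclusion of documents with insufficient expected information gain. A minimal Python sketch of one way to combine them is given here. The greedy gain-per-cost ordering, the function name `plan_with_budget`, the tuple layout, and the one-engine-per-document simplification are all assumptions for illustration; the claims do not prescribe a selection algorithm, and claim 5 in fact allows different portions of one document to go to different engines.

```python
def plan_with_budget(candidates, max_cost, min_gain):
    """Select (doc_id, engine) pairs under a total cost cap.

    candidates: list of (doc_id, engine_name, expected_gain, expected_cost)
    max_cost:   maximum allowable cost for the whole plan (claim 6 idea)
    min_gain:   gain below which a candidate is not considered (claim 7 idea)
    """
    # Drop candidates whose expected information gain is insufficient.
    viable = [c for c in candidates if c[2] >= min_gain and c[3] > 0]

    # Greedy heuristic (an assumption): best gain per unit of cost first.
    viable.sort(key=lambda c: c[2] / c[3], reverse=True)

    chosen, spent, seen = [], 0.0, set()
    for doc_id, engine, gain, cost in viable:
        if doc_id in seen:
            continue  # simplification: at most one engine per document
        if spent + cost > max_cost:
            continue  # this candidate would exceed the allowable cost
        chosen.append((doc_id, engine))
        spent += cost
        seen.add(doc_id)
    return chosen, spent


candidates = [
    ("d1", "ner", 8.0, 1.0),
    ("d1", "deep", 15.0, 5.0),
    ("d2", "ner", 0.5, 4.0),   # below min_gain, never selected
    ("d3", "deep", 9.0, 6.0),
]
chosen, spent = plan_with_budget(candidates, max_cost=8.0, min_gain=1.0)
```

A greedy ordering is simple to follow but is not guaranteed to find the best plan under the cap; an exact approach would treat the selection as a knapsack-style optimization.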

Assignees

Inventors

Classifications

  • Document management systems

  • G06F40/30 (Primary): Semantic analysis

  • Lexical analysis, e.g. tokenisation or collocates

  • Recognition of textual entities

  • Data mining

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11176158B2 cover?
A set of documents is received for processing and extraction. A set of processing engines is received, and each processing engine has an expected benefit when processing a document with associated document metadata. The set of documents is analyzed to determine document metadata to be associated with the document. An expected benefit is determined for each of the documents of the set of documen…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F40/30. Mapped technology areas include Physics.
When was this patent published?
Publication date Nov 16, 2021 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).