US11921676B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11921676-B2 |
| Application number | US-202117537470-A |
| Country | US |
| Kind code | B2 |
| Filing date | Nov 29, 2021 |
| Priority date | Nov 29, 2021 |
| Publication date | Mar 5, 2024 |
| Grant date | Mar 5, 2024 |
Official abstract text for this publication.
Techniques are described relating to unstructured document processing. An associated computer-implemented method includes identifying a plurality of deduplicated data blocks associated with a collection of unstructured documents. The method further includes sorting the plurality of deduplicated data blocks in descending order based upon at least one block frequency metric, selecting a highest sorted unprocessed deduplicated data block, applying text analytics to the selected deduplicated data block, and applying at least one result of the text analytics to any document among the collection of unstructured documents including the selected deduplicated data block. The method is terminated responsive to satisfaction of at least one stopping condition.
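The loop described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration, not the patented implementation: blocks are taken to be paragraphs deduplicated by SHA-256 hash, the block frequency metric is the number of distinct documents containing a block, and `analyze` is a hypothetical stand-in for whatever text analytics is applied. Two stopping conditions are shown: a document impact threshold and an optional cap on processed blocks.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Block:
    """A deduplicated block and the documents that contain it (hypothetical type)."""
    block_id: str
    text: str
    doc_ids: set[str]


def identify_blocks(documents: dict[str, str]) -> list[Block]:
    """Split each document into paragraphs and deduplicate them by content hash.

    Paragraph-level blocks are an assumption; the source does not fix a block size.
    """
    by_hash: dict[str, Block] = {}
    for doc_id, text in documents.items():
        for para in text.split("\n\n"):
            para = para.strip()
            if not para:
                continue
            key = hashlib.sha256(para.encode("utf-8")).hexdigest()
            block = by_hash.setdefault(key, Block(key, para, set()))
            block.doc_ids.add(doc_id)
    return list(by_hash.values())


def process_blocks(
    blocks: list[Block],
    analyze: Callable[[str], str],
    doc_impact_threshold: int = 2,
    max_blocks: Optional[int] = None,
) -> dict[str, list[str]]:
    """Analyze blocks from most to least shared and propagate results to documents."""
    results: dict[str, list[str]] = {}
    # Sort in descending order of document usage frequency.
    ordered = sorted(blocks, key=lambda b: len(b.doc_ids), reverse=True)
    processed = 0
    for block in ordered:
        # Stopping condition: the next block affects too few documents.
        if len(block.doc_ids) < doc_impact_threshold:
            break
        # Stopping condition: the budget of analyzed blocks is used up.
        if max_blocks is not None and processed >= max_blocks:
            break
        result = analyze(block.text)  # analyzed once per unique block
        for doc_id in block.doc_ids:  # applied to every containing document
            results.setdefault(doc_id, []).append(result)
        processed += 1
    return results
```

In this sketch a block shared by many documents is analyzed once and its result reused, which is the saving the block-based approach aims for; blocks below the threshold are never analyzed.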
Opening claim text (preview).
What is claimed is:

1. A computer-implemented unstructured document processing method that enables block-based text analytics, the method comprising: identifying a plurality of deduplicated data blocks associated with a collection of unstructured documents; sorting the plurality of deduplicated data blocks in descending order based upon at least one block frequency metric; selecting a highest sorted unprocessed deduplicated data block; applying text analytics to the selected deduplicated data block by facilitating application of at least one natural language processing technique; and applying at least one result of the text analytics to any document among the collection of unstructured documents including the selected deduplicated data block, wherein applying the at least one result comprises labelling document data based upon data attributes determined from the text analytics and characterizing the labelled document data based upon application of at least one machine learning technique.

2. The method of claim 1, further comprising: terminating the unstructured document processing method responsive to determining that unique document usage frequency of a next unprocessed deduplicated data block to be selected is below a predetermined document impact threshold.

3. The method of claim 1, further comprising: terminating the unstructured document processing method responsive to determining that unique block occurrence frequency of a next unprocessed deduplicated data block to be selected is below a predetermined block occurrence threshold.

4. The method of claim 1, further comprising: terminating the unstructured document processing method responsive to determining that a predetermined unstructured document assessment period has expired.

5. The method of claim 1, further comprising: terminating the unstructured document processing method responsive to determining that a number of deduplicated data blocks among the plurality of deduplicated data blocks to which text analytics have been applied exceeds a predetermined block text analytics threshold.

6. The method of claim 1, further comprising: terminating the unstructured document processing method responsive to determining that a number of documents among the collection of unstructured documents to which at least one text analytics result has been applied exceeds a predetermined analytics result assignment threshold.

7. The method of claim 1, wherein applying text analytics to the selected deduplicated data block comprises: determining a data sensitivity value of the selected deduplicated data block by evaluating block data in view of a text analytics learning model.

8. The method of claim 7, wherein applying text analytics to the selected deduplicated data block further comprises: responsive to determining that the data sensitivity value of the selected deduplicated data block exceeds a sensitive information threshold, classifying as sensitive the selected deduplicated data block.

9. The method of claim 7, wherein determining the data sensitivity value of the selected deduplicated data block comprises: determining respective data sensitivity values of a plurality of portions of the selected deduplicated data block by evaluating portion data in view of the text analytics learning model; and calculating the data sensitivity value of the selected deduplicated data block by aggregating the respective data sensitivity values of the plurality of portions of the selected deduplicated data block.

10. The method of claim 7, wherein configuring the text analytics learning model comprises: sampling text analytics results from a plurality of previously processed unstructured document collections.

11. The method of claim 10, wherein configuring the text analytics learning model further comprises: archiving sensitive data based upon the sampled text analytics results; and facilitating training of the text analytics learning model based upon the archived sensitive data.

12. The method of claim 1, wherein the at least one block frequency metric includes a unique document usage frequency value corresponding to a number or a percentage of documents among the collection of unstructured documents in which a block among the plurality of deduplicated data blocks is located.

13. The method of claim 1, wherein the at least one block frequency metric includes a unique block occurrence frequency value corresponding to a number of unique occurrences of a block among the plurality of deduplicated data blocks within the collection of unstructured documents.

14. A computer program product comprising a computer readable storage medium having unstructured document processing program instructions embodied therewith that enable block-based text analytics, the unstructured document processing program instructions executable by a computing device to cause the computing device to: identify a plurality of deduplicated data blocks associated with a collection of unstructured documents; sort the plurality of deduplicated data blocks in descending order based upon at least one block frequency metric; select a highest sorted unprocessed deduplicated data block; apply text analytics to the selected deduplicated data block by facilitating application of at least one natural language processing technique; and apply at least one result of the text analytics to any document among the collection of unstructured documents including the selected deduplicated data block, wherein applying the at least one result comprises labelling document data based upon data attributes determined from the text analytics and characterizing the labelled document data based upon application of at least one machine learning technique.

15. The computer program product of claim 14, wherein applying text analytics to the selected deduplicated data block comprises: determining a data sensitivity value of the selected deduplicated data block by evaluating block data in view of a text analytics learning model.

16. The computer program product of claim 15, wherein applying text analytics to the selected deduplicated data block further comprises: responsive to determining that the data sensitivity value of the selected deduplicated data block exceeds a sensitive information threshold, classifying as sensitive the selected deduplicated data block.

17. The computer program product of claim 15, wherein configuring the text analytics learning model comprises: sampling text analytics results from a plurality of previously processed unstructured document collections.

18. A system comprising: at least one processor; and a memory storing an application program, which, when executed on the at least one processor, performs an unstructured document processing operation that enables block-based text analytics, the operation comprising: identifying a plurality of deduplicated data blocks associated with a collection of unstructured documents; sorting the plurality of deduplicated data blocks in descending order based upon at least one block frequency metric; selecting a highest sorted unprocessed deduplicated data block; applying text analytics to the selected deduplicated data block facilitating application of at least one natural language processing technique; and applying at least one result of the text analytics to any document among the collection of unstructured documents including the selected deduplicated data block, wherein applying the at least one result comprises labelling document data based upon data attributes determined fr
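The sensitivity scoring recited in claims 7 to 9 (score portions of a block, aggregate into a block value, compare against a threshold) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions only: `keyword_score` is a toy stand-in for the text analytics learning model, portions are taken to be sentences split on periods, `max` is one possible aggregation among many, and the term list and threshold values are hypothetical.

```python
import re
from typing import Callable, Iterable

# Hypothetical vocabulary; a trained model would replace this in practice.
SENSITIVE_TERMS = {"ssn", "password", "salary"}


def keyword_score(portion: str) -> float:
    """Toy scorer: fraction of words in the portion that are sensitive terms."""
    words = re.findall(r"[a-z]+", portion.lower())
    if not words:
        return 0.0
    hits = sum(word in SENSITIVE_TERMS for word in words)
    return hits / len(words)


def block_sensitivity(
    block_text: str,
    score_portion: Callable[[str], float] = keyword_score,
    aggregate: Callable[[Iterable[float]], float] = max,
) -> float:
    """Score each portion of a block, then aggregate into one block value."""
    # Assumption: a "portion" is a period-delimited sentence.
    portions = [p.strip() for p in block_text.split(".") if p.strip()]
    scores = [score_portion(p) for p in portions]
    return aggregate(scores) if scores else 0.0


def is_sensitive(block_text: str, threshold: float = 0.2) -> bool:
    """Classify the block as sensitive when its value exceeds the threshold."""
    return block_sensitivity(block_text) > threshold
```

Using `max` makes a block as sensitive as its most sensitive portion; a mean or weighted sum would be equally consistent with the word "aggregating" and would trade recall for fewer false positives.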
De-duplication implemented within the file system, e.g. based on file segments (de-duplication techniques in storage systems for the management of data blocks G06F3/0641) · CPC title
Indexing; Data structures therefor; Storage structures · CPC title
of unstructured textual data (document management systems G06F16/93) · CPC title
Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title
Clustering; Classification · CPC title