Analyzing deduplicated data blocks associated with unstructured documents

US11921676B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11921676-B2
Application number: US-202117537470-A
Country: US
Kind code: B2
Filing date: Nov 29, 2021
Priority date: Nov 29, 2021
Publication date: Mar 5, 2024
Grant date: Mar 5, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques are described relating to unstructured document processing. An associated computer-implemented method includes identifying a plurality of deduplicated data blocks associated with a collection of unstructured documents. The method further includes sorting the plurality of deduplicated data blocks in descending order based upon at least one block frequency metric, selecting a highest sorted unprocessed deduplicated data block, applying text analytics to the selected deduplicated data block, and applying at least one result of the text analytics to any document among the collection of unstructured documents including the selected deduplicated data block. The method is terminated responsive to satisfaction of at least one stopping condition.
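The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `Block` class, the `analyze` stand-in, and the parameters `min_doc_count` and `max_blocks` are hypothetical names chosen for this example, and the "text analytics" step is reduced to a trivial digit check. The block frequency metric assumed here is the number of distinct documents containing a block.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical model of a deduplicated block and the documents containing it."""
    block_id: str
    text: str
    doc_ids: set[str]
    processed: bool = False


def analyze(text: str) -> dict:
    """Stand-in for text analytics; a real system would run an NLP pipeline here."""
    return {"contains_digits": any(ch.isdigit() for ch in text)}


def process_blocks(blocks, min_doc_count=2, max_blocks=None):
    """Analyze blocks in descending order of document usage until a stop condition holds."""
    labels = {}  # doc_id -> list of (block_id, analytics result)
    # Sort once, most widely shared block first.
    ordered = sorted(blocks, key=lambda b: len(b.doc_ids), reverse=True)
    done = 0
    for block in ordered:  # each iteration picks the highest sorted unprocessed block
        # Assumed stop condition: the next block touches too few documents.
        if len(block.doc_ids) < min_doc_count:
            break
        # Assumed stop condition: a cap on the number of analyzed blocks.
        if max_blocks is not None and done >= max_blocks:
            break
        result = analyze(block.text)
        block.processed = True
        done += 1
        # Apply the single result to every document that includes the block.
        for doc_id in block.doc_ids:
            labels.setdefault(doc_id, []).append((block.block_id, result))
    return labels


blocks = [
    Block("b1", "Call 555-0100", {"d1", "d2", "d3"}),
    Block("b2", "Hello", {"d1", "d2"}),
    Block("b3", "rare text", {"d3"}),
]
labels = process_blocks(blocks, min_doc_count=2)
```

Because each block is analyzed once and its result is copied to every containing document, the work scales with the number of unique blocks rather than with total document size, which appears to be the point of processing the most frequent blocks first.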

First claim

Opening claim text (preview).

What is claimed is:
1. A computer-implemented unstructured document processing method that enables block-based text analytics, the method comprising: identifying a plurality of deduplicated data blocks associated with a collection of unstructured documents; sorting the plurality of deduplicated data blocks in descending order based upon at least one block frequency metric; selecting a highest sorted unprocessed deduplicated data block; applying text analytics to the selected deduplicated data block by facilitating application of at least one natural language processing technique; and applying at least one result of the text analytics to any document among the collection of unstructured documents including the selected deduplicated data block, wherein applying the at least one result comprises labelling document data based upon data attributes determined from the text analytics and characterizing the labelled document data based upon application of at least one machine learning technique.
2. The method of claim 1, further comprising: terminating the unstructured document processing method responsive to determining that unique document usage frequency of a next unprocessed deduplicated data block to be selected is below a predetermined document impact threshold.
3. The method of claim 1, further comprising: terminating the unstructured document processing method responsive to determining that unique block occurrence frequency of a next unprocessed deduplicated data block to be selected is below a predetermined block occurrence threshold.
4. The method of claim 1, further comprising: terminating the unstructured document processing method responsive to determining that a predetermined unstructured document assessment period has expired.
5. The method of claim 1, further comprising: terminating the unstructured document processing method responsive to determining that a number of deduplicated data blocks among the plurality of deduplicated data blocks to which text analytics have been applied exceeds a predetermined block text analytics threshold.
6. The method of claim 1, further comprising: terminating the unstructured document processing method responsive to determining that a number of documents among the collection of unstructured documents to which at least one text analytics result has been applied exceeds a predetermined analytics result assignment threshold.
7. The method of claim 1, wherein applying text analytics to the selected deduplicated data block comprises: determining a data sensitivity value of the selected deduplicated data block by evaluating block data in view of a text analytics learning model.
8. The method of claim 7, wherein applying text analytics to the selected deduplicated data block further comprises: responsive to determining that the data sensitivity value of the selected deduplicated data block exceeds a sensitive information threshold, classifying as sensitive the selected deduplicated data block.
9. The method of claim 7, wherein determining the data sensitivity value of the selected deduplicated data block comprises: determining respective data sensitivity values of a plurality of portions of the selected deduplicated data block by evaluating portion data in view of the text analytics learning model; and calculating the data sensitivity value of the selected deduplicated data block by aggregating the respective data sensitivity values of the plurality of portions of the selected deduplicated data block.
10. The method of claim 7, wherein configuring the text analytics learning model comprises: sampling text analytics results from a plurality of previously processed unstructured document collections.
11. The method of claim 10, wherein configuring the text analytics learning model further comprises: archiving sensitive data based upon the sampled text analytics results; and facilitating training of the text analytics learning model based upon the archived sensitive data.
12. The method of claim 1, wherein the at least one block frequency metric includes a unique document usage frequency value corresponding to a number or a percentage of documents among the collection of unstructured documents in which a block among the plurality of deduplicated data blocks is located.
13. The method of claim 1, wherein the at least one block frequency metric includes a unique block occurrence frequency value corresponding to a number of unique occurrences of a block among the plurality of deduplicated data blocks within the collection of unstructured documents.
14. A computer program product comprising a computer readable storage medium having unstructured document processing program instructions embodied therewith that enable block-based text analytics, the unstructured document processing program instructions executable by a computing device to cause the computing device to: identify a plurality of deduplicated data blocks associated with a collection of unstructured documents; sort the plurality of deduplicated data blocks in descending order based upon at least one block frequency metric; select a highest sorted unprocessed deduplicated data block; apply text analytics to the selected deduplicated data block by facilitating application of at least one natural language processing technique; and apply at least one result of the text analytics to any document among the collection of unstructured documents including the selected deduplicated data block, wherein applying the at least one result comprises labelling document data based upon data attributes determined from the text analytics and characterizing the labelled document data based upon application of at least one machine learning technique.
15. The computer program product of claim 14, wherein applying text analytics to the selected deduplicated data block comprises: determining a data sensitivity value of the selected deduplicated data block by evaluating block data in view of a text analytics learning model.
16. The computer program product of claim 15, wherein applying text analytics to the selected deduplicated data block further comprises: responsive to determining that the data sensitivity value of the selected deduplicated data block exceeds a sensitive information threshold, classifying as sensitive the selected deduplicated data block.
17. The computer program product of claim 15, wherein configuring the text analytics learning model comprises: sampling text analytics results from a plurality of previously processed unstructured document collections.
18. A system comprising: at least one processor; and a memory storing an application program, which, when executed on the at least one processor, performs an unstructured document processing operation that enables block-based text analytics, the operation comprising: identifying a plurality of deduplicated data blocks associated with a collection of unstructured documents; sorting the plurality of deduplicated data blocks in descending order based upon at least one block frequency metric; selecting a highest sorted unprocessed deduplicated data block; applying text analytics to the selected deduplicated data block facilitating application of at least one natural language processing technique; and applying at least one result of the text analytics to any document among the collection of unstructured documents including the selected deduplicated data block, wherein applying the at least one result comprises labelling document data based upon data attributes determined fr
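Two details from the dependent claims lend themselves to a short sketch: the block frequency metrics of claims 12 and 13, and the portion-wise sensitivity scoring of claims 7 to 9. The code below is an illustration under stated assumptions. The function names, the fixed `portion_size`, the use of `max` as the aggregation, and the digit-ratio scorer standing in for the "text analytics learning model" are all hypothetical. Reading the "unique block occurrence frequency" of claim 13 as the total number of occurrences across the collection is likewise an interpretation, not something the claim text settles.

```python
from __future__ import annotations

from collections import Counter


def document_usage_frequency(docs: dict[str, list[str]]) -> Counter:
    """Per block ID, count the distinct documents that contain it (cf. claim 12)."""
    counts = Counter()
    for block_ids in docs.values():
        counts.update(set(block_ids))  # each document counts at most once per block
    return counts


def block_occurrence_frequency(docs: dict[str, list[str]]) -> Counter:
    """Per block ID, count every occurrence across the collection (assumed reading of claim 13)."""
    counts = Counter()
    for block_ids in docs.values():
        counts.update(block_ids)
    return counts


def portion_score(portion: str) -> float:
    """Toy stand-in for a learned model: the fraction of characters that are digits."""
    if not portion:
        return 0.0
    return sum(ch.isdigit() for ch in portion) / len(portion)


def block_sensitivity(text: str, portion_size: int = 16, aggregate=max) -> float:
    """Score fixed-size portions of a block and aggregate them into one value (cf. claim 9)."""
    portions = [text[i:i + portion_size] for i in range(0, len(text), portion_size)]
    if not portions:
        return 0.0
    return aggregate(portion_score(p) for p in portions)


def is_sensitive(text: str, threshold: float = 0.5) -> bool:
    """Classify a block as sensitive when its score exceeds the threshold (cf. claim 8)."""
    return block_sensitivity(text) > threshold


# Each document is represented as the ordered list of block IDs it is built from.
docs = {"d1": ["a", "b", "a"], "d2": ["a"], "d3": ["b", "c"]}
usage = document_usage_frequency(docs)
occurrences = block_occurrence_frequency(docs)
```

Taking the maximum over portions means a single highly sensitive portion is enough to flag the whole block; a mean or a weighted sum would be equally consistent with the word "aggregating" in claim 9.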

Assignees

Inventors

Classifications

  • De-duplication implemented within the file system, e.g. based on file segments (de-duplication techniques in storage systems for the management of data blocks G06F3/0641) · CPC title

  • Indexing; Data structures therefor; Storage structures · CPC title

  • G06F16/30 · Primary

    of unstructured textual data (document management systems G06F16/93) · CPC title

  • G06F16/215 · Primary

    Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title

  • Clustering; Classification · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11921676B2 cover?
Techniques are described relating to unstructured document processing. An associated computer-implemented method includes identifying a plurality of deduplicated data blocks associated with a collection of unstructured documents. The method further includes sorting the plurality of deduplicated data blocks in descending order based upon at least one block frequency metric, selecting a highest s…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/1748. Mapped technology areas include Physics.
When was this patent published?
Publication date: Mar 5, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).