What technology area does this patent fall under?

Primary CPC classification G06F16/1748. Mapped technology areas include Physics.

When was this patent published?

Publication date: Mar 5, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Analyzing deduplicated data blocks associated with unstructured documents

US11921676B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11921676-B2
Application number	US-202117537470-A
Country	US
Kind code	B2
Filing date	Nov 29, 2021
Priority date	Nov 29, 2021
Publication date	Mar 5, 2024
Grant date	Mar 5, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques are described relating to unstructured document processing. An associated computer-implemented method includes identifying a plurality of deduplicated data blocks associated with a collection of unstructured documents. The method further includes sorting the plurality of deduplicated data blocks in descending order based upon at least one block frequency metric, selecting a highest sorted unprocessed deduplicated data block, applying text analytics to the selected deduplicated data block, and applying at least one result of the text analytics to any document among the collection of unstructured documents including the selected deduplicated data block. The method is terminated responsive to satisfaction of at least one stopping condition.
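The loop summarized in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Block` class, the `analyze_text` placeholder, and the parameters `impact_threshold` and `max_blocks` are hypothetical names, and the block frequency metric used here is the share of documents containing a block.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Block:
    """A deduplicated data block (hypothetical structure for illustration)."""
    block_id: str
    text: str
    doc_ids: set[str]  # documents in the collection that contain this block


def analyze_text(text: str) -> dict:
    """Placeholder for real text analytics; an assumption, not the patent's model."""
    return {"mentions_ssn": "ssn" in text.lower()}


def process_blocks(blocks, total_docs, impact_threshold=0.5, max_blocks=None):
    """Analyze the most widely shared blocks first and fan results out to documents."""
    def usage(block):
        # Assumed metric: fraction of documents in which the block appears.
        return len(block.doc_ids) / total_docs

    ordered = sorted(blocks, key=usage, reverse=True)  # descending frequency
    labels: dict[str, list] = {}
    processed = 0
    for block in ordered:
        # Stopping condition: the next block touches too few documents.
        if usage(block) < impact_threshold:
            break
        # Stopping condition (approximation): enough blocks already analyzed.
        if max_blocks is not None and processed >= max_blocks:
            break
        result = analyze_text(block.text)  # analyze the block once
        for doc_id in block.doc_ids:       # reuse the result for every document
            labels.setdefault(doc_id, []).append((block.block_id, result))
        processed += 1
    return labels


blocks = [
    Block("b1", "Employee SSN: 123-45-6789", {"d1", "d2", "d3"}),
    Block("b2", "Quarterly revenue summary", {"d1", "d2"}),
    Block("b3", "Misc note", {"d4"}),
]
labels = process_blocks(blocks, total_docs=4, impact_threshold=0.5)
```

In this sketch each block is analyzed once and the result is attached to every document that contains it, so widely shared blocks yield the most labelled documents per analysis call. Block `b3` falls below the assumed threshold, so processing stops before it and document `d4` receives no label.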

First claim

Opening claim text (preview).

What is claimed is:
1. A computer-implemented unstructured document processing method that enables block-based text analytics, the method comprising: identifying a plurality of deduplicated data blocks associated with a collection of unstructured documents; sorting the plurality of deduplicated data blocks in descending order based upon at least one block frequency metric; selecting a highest sorted unprocessed deduplicated data block; applying text analytics to the selected deduplicated data block by facilitating application of at least one natural language processing technique; and applying at least one result of the text analytics to any document among the collection of unstructured documents including the selected deduplicated data block, wherein applying the at least one result comprises labelling document data based upon data attributes determined from the text analytics and characterizing the labelled document data based upon application of at least one machine learning technique.
2. The method of claim 1, further comprising: terminating the unstructured document processing method responsive to determining that unique document usage frequency of a next unprocessed deduplicated data block to be selected is below a predetermined document impact threshold.
3. The method of claim 1, further comprising: terminating the unstructured document processing method responsive to determining that unique block occurrence frequency of a next unprocessed deduplicated data block to be selected is below a predetermined block occurrence threshold.
4. The method of claim 1, further comprising: terminating the unstructured document processing method responsive to determining that a predetermined unstructured document assessment period has expired.
5. The method of claim 1, further comprising: terminating the unstructured document processing method responsive to determining that a number of deduplicated data blocks among the plurality of deduplicated data blocks to which text analytics have been applied exceeds a predetermined block text analytics threshold.
6. The method of claim 1, further comprising: terminating the unstructured document processing method responsive to determining that a number of documents among the collection of unstructured documents to which at least one text analytics result has been applied exceeds a predetermined analytics result assignment threshold.
7. The method of claim 1, wherein applying text analytics to the selected deduplicated data block comprises: determining a data sensitivity value of the selected deduplicated data block by evaluating block data in view of a text analytics learning model.
8. The method of claim 7, wherein applying text analytics to the selected deduplicated data block further comprises: responsive to determining that the data sensitivity value of the selected deduplicated data block exceeds a sensitive information threshold, classifying as sensitive the selected deduplicated data block.
9. The method of claim 7, wherein determining the data sensitivity value of the selected deduplicated data block comprises: determining respective data sensitivity values of a plurality of portions of the selected deduplicated data block by evaluating portion data in view of the text analytics learning model; and calculating the data sensitivity value of the selected deduplicated data block by aggregating the respective data sensitivity values of the plurality of portions of the selected deduplicated data block.
10. The method of claim 7, wherein configuring the text analytics learning model comprises: sampling text analytics results from a plurality of previously processed unstructured document collections.
11. The method of claim 10, wherein configuring the text analytics learning model further comprises: archiving sensitive data based upon the sampled text analytics results; and facilitating training of the text analytics learning model based upon the archived sensitive data.
12. The method of claim 1, wherein the at least one block frequency metric includes a unique document usage frequency value corresponding to a number or a percentage of documents among the collection of unstructured documents in which a block among the plurality of deduplicated data blocks is located.
13. The method of claim 1, wherein the at least one block frequency metric includes a unique block occurrence frequency value corresponding to a number of unique occurrences of a block among the plurality of deduplicated data blocks within the collection of unstructured documents.
14. A computer program product comprising a computer readable storage medium having unstructured document processing program instructions embodied therewith that enable block-based text analytics, the unstructured document processing program instructions executable by a computing device to cause the computing device to: identify a plurality of deduplicated data blocks associated with a collection of unstructured documents; sort the plurality of deduplicated data blocks in descending order based upon at least one block frequency metric; select a highest sorted unprocessed deduplicated data block; apply text analytics to the selected deduplicated data block by facilitating application of at least one natural language processing technique; and apply at least one result of the text analytics to any document among the collection of unstructured documents including the selected deduplicated data block, wherein applying the at least one result comprises labelling document data based upon data attributes determined from the text analytics and characterizing the labelled document data based upon application of at least one machine learning technique.
15. The computer program product of claim 14, wherein applying text analytics to the selected deduplicated data block comprises: determining a data sensitivity value of the selected deduplicated data block by evaluating block data in view of a text analytics learning model.
16. The computer program product of claim 15, wherein applying text analytics to the selected deduplicated data block further comprises: responsive to determining that the data sensitivity value of the selected deduplicated data block exceeds a sensitive information threshold, classifying as sensitive the selected deduplicated data block.
17. The computer program product of claim 15, wherein configuring the text analytics learning model comprises: sampling text analytics results from a plurality of previously processed unstructured document collections.
18. A system comprising: at least one processor; and a memory storing an application program, which, when executed on the at least one processor, performs an unstructured document processing operation that enables block-based text analytics, the operation comprising: identifying a plurality of deduplicated data blocks associated with a collection of unstructured documents; sorting the plurality of deduplicated data blocks in descending order based upon at least one block frequency metric; selecting a highest sorted unprocessed deduplicated data block; applying text analytics to the selected deduplicated data block facilitating application of at least one natural language processing technique; and applying at least one result of the text analytics to any document among the collection of unstructured documents including the selected deduplicated data block, wherein applying the at least one result comprises labelling document data based upon data attributes determined fr
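
The sensitivity steps in claims 7 to 9 (score portions of a block, aggregate the portion scores, compare against a threshold) could look roughly like the sketch below. Everything specific here is an assumption made for illustration: the regular-expression weights stand in for the text analytics learning model, portions are taken to be lines, and the aggregation is a maximum, none of which the claims specify.

```python
import re

# Hypothetical stand-in for a trained text analytics learning model:
# each pattern maps to an assumed sensitivity weight between 0 and 1.
PATTERNS = {
    r"\b\d{3}-\d{2}-\d{4}\b": 0.9,     # SSN-like number
    r"[\w.+-]+@[\w-]+\.[\w.]+": 0.6,   # email-like string
}


def portion_sensitivity(portion: str) -> float:
    """Score one portion of a block; 0.0 when no pattern matches."""
    return max(
        (weight for pattern, weight in PATTERNS.items() if re.search(pattern, portion)),
        default=0.0,
    )


def block_sensitivity(block_text: str) -> float:
    """Aggregate portion scores into a block score (assumed aggregation: max)."""
    portions = block_text.splitlines()  # assumed portioning: one portion per line
    return max((portion_sensitivity(p) for p in portions), default=0.0)


def is_sensitive(block_text: str, threshold: float = 0.5) -> bool:
    """Classify the block as sensitive when its score exceeds the threshold."""
    return block_sensitivity(block_text) > threshold


record = "Name: Jane Doe\nSSN: 123-45-6789\nNotes: none"
memo = "Meeting at noon\nBring snacks"
```

With these assumed weights, `is_sensitive(record)` is true because one line scores 0.9, while `is_sensitive(memo)` is false. A mean or a weighted sum would be an equally valid aggregation under the claim language.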

Assignees

Inventors

Classifications

G06F16/1748 · Primary
De-duplication implemented within the file system, e.g. based on file segments (de-duplication techniques in storage systems for the management of data blocks G06F3/0641) · CPC title
G06F16/31
Indexing; Data structures therefor; Storage structures · CPC title
G06F16/30 · Primary
of unstructured textual data (document management systems G06F16/93) · CPC title
G06F16/215 · Primary
Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title
G06F16/35
Clustering; Classification · CPC title

Patent family

Related publications grouped by family.

View patent family 86449648

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11921676B2 cover?: Techniques are described relating to unstructured document processing. An associated computer-implemented method includes identifying a plurality of deduplicated data blocks associated with a collection of unstructured documents. The method further includes sorting the plurality of deduplicated data blocks in descending order based upon at least one block frequency metric, selecting a highest s…
Who is the assignee on this patent?: IBM

Related patents

System and method for automatically summarizing documents pertaining to a predefined domain

Hierarchical, parallel models for extracting in real time high-value information from data streams and system and method for creation of same

Efficient data compression by grouping similar data within a data segment

Distributed ledger based generation of electronic documents

Database entity sensitivity classification
