Real-time identification of data candidates for classification based compression
US-2015317381-A1 · Nov 5, 2015 · US
US9564918B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9564918-B2 |
| Application number | US-201313738300-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 10, 2013 |
| Priority date | Jan 10, 2013 |
| Publication date | Feb 7, 2017 |
| Grant date | Feb 7, 2017 |
Official abstract text for this publication.
Real-time reduction of CPU overhead for data compression is performed by a processor device in a computing environment. Non-compressing heuristics are applied on a randomly selected data sample from data sequences for determining whether to compress the data sequences. A compression potential is calculated based on the non-compressing heuristics. The compression potential is compared to a threshold value. The data sequences are either compressed if the compress threshold is matched, compressed using Huffman coding if Huffman coding threshold is matched, or stored without compression.
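The decision flow in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names (`sample_bytes`, `byte_entropy`, `core_characters`, `decide`), the sample and chunk sizes, the 90% core-character fraction, and the thresholds (5.5 and 7.5 bits per byte, 50 core characters) are all assumptions chosen for illustration. It also covers only two of the heuristics (byte entropy and core characters) and omits the character-pair heuristics.

```python
import math
import random
from collections import Counter


def sample_bytes(data: bytes, sample_size: int = 2048, chunk: int = 16, seed=None) -> bytes:
    """Draw random fixed-size chunks from the data (hypothetical sampling scheme)."""
    if len(data) <= sample_size:
        return data  # small inputs are used whole
    rng = random.Random(seed)
    starts = [rng.randrange(0, len(data) - chunk + 1) for _ in range(sample_size // chunk)]
    return b"".join(data[s:s + chunk] for s in starts)


def byte_entropy(sample: bytes) -> float:
    """Shannon entropy of the sample in bits per byte (between 0 and 8)."""
    if not sample:
        return 0.0
    n = len(sample)
    return -sum((c / n) * math.log2(c / n) for c in Counter(sample).values())


def core_characters(sample: bytes, fraction: float = 0.9) -> list[int]:
    """Most frequent byte values that together cover `fraction` of the sample."""
    target = fraction * len(sample)
    core, covered = [], 0
    for value, count in Counter(sample).most_common():
        if covered >= target:
            break
        core.append(value)
        covered += count
    return core


def decide(data: bytes, seed=None) -> str:
    """Return 'compress', 'huffman', or 'store' without running a compressor."""
    sample = sample_bytes(data, seed=seed)
    entropy = byte_entropy(sample)
    core_size = len(core_characters(sample))
    # Assumed thresholds: low entropy or a small core set suggests high potential.
    if entropy < 5.5 or core_size < 50:
        return "compress"
    # Assumed middle band: skewed byte frequencies only, so entropy coding may suffice.
    if entropy < 7.5:
        return "huffman"
    return "store"
```

Because only a small sample is inspected and no compressor is invoked, the cost of the decision stays low compared with compressing data that turns out to be incompressible, which is the CPU saving the abstract describes.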
Opening claim text (preview).
What is claimed is:
1. A method for real-time reduction of CPU overhead for data compression by a processor device in a computing environment, the method comprising: applying non-compressing heuristics on a randomly selected data sample from data sequences for determining whether to compress the data sequences by calculating a compression potential based on the non-compressing heuristics, wherein the compression potential is compared to a threshold value; and performing: compressing the data sequences if the compress threshold is matched, compressing the data sequences using Huffman coding if Huffman coding threshold is matched, storing the data sequences without compression, using at least one of the non-compressing heuristics, selecting a randomly selected data sample, and computing core characters that compose a predefined percentage of bytes in the randomly selected data sample, an entropy of the data, a relation between appearances of character pairs and a random distribution of the character pairs, and the entropy of the character pairs for determining whether to compress the data sequences, for real-time reduction of CPU overhead for data compression.
2. The method of claim 1, further including calculating the non-compressing heuristics one after another for developing a heuristic score for making a decision for determining whether to compress the data sequences, wherein the CPU speed is optimized.
3. The method of claim 2, further including calculating the compression potential based on the heuristic score of the non-compressing heuristics.
4. The method of claim 1, further including: estimating a relation between appearances of character pairs and a random distribution of the character pairs by: comparing a number of pairs from core characters in the randomly selected data sample against the number of pairs from the core characters expected to appear in a random distribution in the randomly selected data sample, and calculating a L2 norm distance between a first vector of distributions representing an observed distribution of the number of pairs from the core characters in the randomly selected data sample and a second vector of distributions representing an expected distribution of single values assuming there is no correlation between subsequent pairs from the core characters.
5. The method of claim 1, further including performing one of turning off and turning on the calculating the compression potential, and determining whether to compress the data sequences according to a predefined setting.
6. The method of claim 1, wherein the predefined setting, further includes at least one of: continuously applying the non-compressing heuristics and providing an indication as to whether to compress the data sequences, applying the non-compressing heuristics on demand when a compression ratio is above the predetermined threshold for a predefined number of the data sequences, applying the non-compressing heuristics according to a size of a buffer, and applying a prefix compression estimation and deciding whether to compress the data sequences based on a prefix compression ratio when the prefix compression ratio of the data sequences is below a threshold.
7. The method of claim 1, further including applying the non-compressing heuristics to a relation between character pairs of core characters and random distributions of the character pairs of the core characters from the randomly selected data sample from data sequences for determining whether to compress the data sequences by calculating the compression potential based on the non-compressing heuristics.
8. A system for real-time reduction of CPU overhead for data compression in a computing environment, the system comprising: a processor device operable in the computing storage environment, wherein the processor device: applies non-compressing heuristics on a randomly selected data sample from data sequences for determining whether to compress the data sequences by calculating a compression potential based on the non-compressing heuristics, wherein the compression potential is compared to a threshold value, and performs: compressing the data sequences if the compress threshold is matched, compressing the data sequences using Huffman coding if Huffman coding threshold is matched, storing the data sequences without compression, using at least one of the non-compressing heuristics, selecting a randomly selected data sample, and computing core characters that compose a predefined percentage of bytes in the randomly selected data sample, an entropy of the data, a relation between appearances of character pairs and a random distribution of the character pairs, and the entropy of the character pairs for determining whether to compress the data sequences, for real-time reduction of CPU overhead for data compression.
9. The system of claim 8, wherein the processor device calculates the non-compressing heuristics one after another for developing a heuristic score for making a decision for determining whether to compress the data sequences, wherein the CPU speed is optimized.
10. The system of claim 9, wherein the processor device calculates the compression potential based on the heuristic score of the non-compressing heuristics.
11. The system of claim 8, wherein the processor device: estimates a relation between appearances of character pairs and a random distribution of the character pairs by: comparing a number of pairs from core characters in the randomly selected data sample against the number of pairs from the core characters expected to appear in a random distribution in the randomly selected data sample, and calculating a L2 norm distance between a first vector of distributions representing an observed distribution of the number of pairs from the core characters in the randomly selected data sample and a second vector of distributions representing an expected distribution of single values assuming there is no correlation between subsequent pairs from the core characters.
12. The system of claim 8, wherein the processor device performs one of turning off and turning on the calculating the compression potential, and determining whether to compress the data sequences according to a predefined setting.
13. The system of claim 8, wherein the processor device performs at least one of: continuously applying the non-compressing heuristics and providing an indication as to whether to compress the data sequences, applying the non-compressing heuristics on demand when a compression ratio is above the predetermined threshold for a predefined number of the data sequences, applying the non-compressing heuristics according to a size of a buffer, and applying a prefix compression estimation and deciding whether to compress the data sequences based on a prefix compression ratio when the prefix compression ratio of the data sequences is below a threshold.
14. The system of claim 8, wherein the processor device applies the non-compressing heuristics to a relation between character pairs of core characters and random distributions of the character pairs of the core characters from the randomly selected data sample from data sequences for determining whether to compress the data sequences by calculating the compression potential based on the non-compressing heuristics.
15. A computer program product for real-time reduction of CPU overhead for data compression by a processor device by a processor device, the computer program product comprising a non-transitory computer-readable storage medium having computer-readable program code port
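The character-pair comparison recited in claims 4 and 11 can be sketched in Python as follows. This is one possible reading of the claim language, not the patented implementation: the function name `pair_l2_distance` is hypothetical, and it assumes the "expected" vector is the product of single-character frequencies (that is, what adjacent pairs would look like with no correlation between neighbors).

```python
import math
from collections import Counter


def pair_l2_distance(sample: bytes, core: set[int]) -> float:
    """L2 distance between observed and expected core-character pair frequencies."""
    n = len(sample)
    n_pairs = n - 1
    if n_pairs < 1 or not core:
        return 0.0
    singles = Counter(sample)                 # counts of single byte values
    pairs = Counter(zip(sample, sample[1:]))  # counts of adjacent byte pairs
    total = 0.0
    for a in core:
        for b in core:
            observed = pairs.get((a, b), 0) / n_pairs
            # Assumption: with no correlation, P(a, b) = P(a) * P(b).
            expected = (singles[a] / n) * (singles[b] / n)
            total += (observed - expected) ** 2
    return math.sqrt(total)


# Usage: strictly alternating data is far from the independent model,
# while a less regular arrangement of the same bytes is closer to it.
alternating = pair_l2_distance(b"abababab", {ord("a"), ord("b")})
mixed = pair_l2_distance(b"aabbaabb", {ord("a"), ord("b")})
```

A large distance indicates that adjacent characters are correlated, which suggests repetition a compressor could exploit; a distance near zero suggests the pairs look random even if single-character frequencies are skewed.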
Selection between different types of compressors · CPC title
according to the data type · CPC title
Prefix coding · CPC title