Data Deduplication Using Multi-Chunk Predictive Encoding
US-2018196609-A1 · Jul 12, 2018 · US
US12373100B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12373100-B2 |
| Application number | US-202318401582-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 31, 2023 |
| Priority date | Oct 30, 2017 |
| Publication date | Jul 29, 2025 |
| Grant date | Jul 29, 2025 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
Techniques for compressing data that use mismatch probability estimation to improve entropy encoding methods to account for, and efficiently handle, previously unseen data in data to be compressed. A mismatch probability estimate is calculated comprising an estimated frequency at which any given data sourceblock received during encoding will not have a codeword in the codebook. Entropy encoding is used to generate codebooks comprising codewords for data sourceblocks based on the frequency of occurrence of each sourceblock. A “mismatch codeword” is inserted into the codebook based on the mismatch probability estimate to represent those cases when a block of data to be encoded does not have a codeword in the codebook.
Opening claim text (preview).
What is claimed is: 1. A system for compressing data using mismatch probability estimation, comprising: a plurality of computing devices each comprising at least a processor, a memory, and a network interface; wherein a plurality of programming instructions stored in one or more of the memories and operating on one or more of the processors of the plurality of computing devices causes the plurality of computing devices to: calculate a mismatch probability estimate comprising a probability that any given sourceblock in a non-training dataset will not match any sourceblock that was contained in a training data set; generate a mismatch sourceblock representing sourceblocks that were not contained in the training data set, and assign the mismatch probability estimate to the mismatch sourceblock as the frequency of occurrence of the mismatch sourceblock; and generate a codebook from the sourceblocks of the training data set and the mismatch sourceblock using an entropy encoding method wherein codewords are assigned to each sourceblock based on its frequency of occurrence. 2. The system of claim 1 , further wherein the computing devices: receive the non-training data set for encoding, the non-training data set comprising sourceblocks of data; for each sourceblock of the non-training data set, look up and return the codeword for that sourceblock in the codebook and insert that codeword into an encoded data stream; where the returned codeword is the codeword for the mismatch sourceblock, generate a new codeword for the looked up sourceblock using a secondary encoding method, and insert the new codeword into the encoded data stream. 3. The system of claim 1 , further wherein the computing devices: receive an encoded data stream comprising codewords; for each codeword in the encoded data stream, look up and return the sourceblock for that codeword in the codebook and insert that sourceblock into a decoded data stream; and where the returned sourceblock is the mismatch sourceblock, determine the sourceblock for that codeword using the secondary encoding method, and insert the determined sourceblock into the decoded data stream. 4. The system of claim 1 , wherein the training data set is a low-entropy data set, either having a small subset of sourceblocks of a given size relative to the total possible number of sourceblocks of that size or having a set of sourceblocks closely matching the set of sourceblocks expected in the non-training data set. 5. The system of claim 1 , wherein the entropy encoding method is Huffman coding or a known variant thereof. 6. The system of claim 1 , wherein the mismatch probability estimate, q, is calculated as q=M/N, where: M is the number of times a previously-unobserved sourceblock appeared in the training data set; and N is the total number of sourceblocks observed in the training data set. 7. The system of claim 6 , wherein the mismatch probability estimate, q, is calculated as q=M/N=(Σ j=1 N X j )/N, where: X j = { 1 if S j ∉ { S i : 1 ≤ i < j } 0 otherwise ; N is the total number of sourceblocks observed in the training data set. 8. The system of claim 7 , wherein an exponentially-weighted moving average is applied to the calculation of q=M/N=(Σ j=1 N X j )/N. 9. The system of claim 8 , wherein the exponentially-weighted moving average is a modified form of an exponentially-weighted moving average of the form: μ j = { 1 if j = 0 ( 1 - β j ) μ j - 1 + β j X j if j > 0 ; β j =C log(j)/j and β 1 =1, for some constant C. 10. A method for compressing data using mismatch probability estimation, comprising the steps of: calculating a mismatch probability estimate comprising a probability that any given sourceblock in a non-training data set to be later received for encoding will not b
Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS] · CPC title
in relation to content · CPC title
Command handling arrangements, e.g. command buffers, queues, command scheduling · CPC title
Encoder aspects · CPC title
Decoder aspects · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.