US11797204B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11797204-B2 |
| Application number | US-202117464904-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 2, 2021 |
| Priority date | Jun 17, 2019 |
| Publication date | Oct 24, 2023 |
| Grant date | Oct 24, 2023 |
Official abstract text for this publication.
A data processing method includes obtaining a plurality of data blocks, determining a first data block and a second data block from the data blocks, where the first data block has a first hash value, and the second data block has a second hash value, where the first hash value is obtained by performing calculation on the first data block based on a hash algorithm and the second hash value is obtained by performing calculation on the second data block based on the hash algorithm, and combining and compressing the first data block and the second data block based on a degree of similarity of the first data block and the second data block.
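The benefit of combining similar blocks before compression can be illustrated with a minimal sketch. It assumes zlib as the compressor and uses hypothetical helper names (`separate_size`, `combined_size`); the publication does not name a specific compression library, so this is an illustration of the idea rather than the claimed implementation.

```python
import zlib


def separate_size(block_a: bytes, block_b: bytes) -> int:
    """Total compressed size when each block is compressed on its own."""
    return len(zlib.compress(block_a)) + len(zlib.compress(block_b))


def combined_size(block_a: bytes, block_b: bytes) -> int:
    """Compressed size when the two blocks are concatenated first.

    When the blocks are similar, the compressor can encode the second
    block mostly as back-references into the first one.
    """
    return len(zlib.compress(block_a + block_b))
```

For two nearly identical blocks whose content is otherwise hard to compress, `combined_size` should come out well below `separate_size`; for unrelated blocks the two are expected to be roughly equal, which is why a similarity check precedes the combination step.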
Opening claim text (preview).
What is claimed is:

1. A data processing method comprising: obtaining a plurality of data blocks; determining a first data block and a second data block from the data blocks, wherein the first data block has a first hash value that is based on a first calculation on the first data block using a hash algorithm, and wherein the second data block has a second hash value that is based on a second calculation on the second data block using the hash algorithm; determining that the first data block and the second data block meet a similarity condition based on a degree of similarity of the first hash value and the second hash value; determining whether a data reduction ratio corresponding to a target data block to be obtained by combining and compressing the first data block and the second data block reaches a reduction ratio threshold; when determining that the data reduction ratio reaches the reduction ratio threshold, combining and compressing the first data block and the second data block using a first compression algorithm; and when determining that the data reduction ratio does not reach the reduction ratio threshold, separately compressing the first data block and the second data block using a second compression algorithm different from the first compression algorithm.
2. The data processing method of claim 1, wherein the hash algorithm is a locality-sensitive hash algorithm.
3. The data processing method of claim 2, further comprising: segmenting the first data block into a plurality of data sub-blocks of different lengths; calculating a hash value of each of the data sub-blocks; performing combination calculation on hash values of the data sub-blocks to obtain a locality-sensitive hash value corresponding to the first data block; and setting the locality-sensitive hash value as the first hash value.
4. The data processing method of claim 3, further comprising: identifying that a difference between the first hash value and the second hash value is less than a similarity threshold; and obtaining the degree of similarity based on the identifying.
5. The data processing method of claim 4, further comprising identifying that a Jaccard distance between the first hash value and the second hash value is less than a first distance threshold.
6. The data processing method of claim 4, further comprising identifying that a Euclidean distance between the first hash value and the second hash value is less than a second distance threshold.
7. The data processing method of claim 4, further comprising identifying that a Hamming distance between the first hash value and the second hash value is less than a third distance threshold.
8. The data processing method of claim 1, after combining and compressing the first data block and the second data block, the method further comprising: adding a first combination compression identifier to first metadata information corresponding to the first data block to indicate that a first compression manner of the first data block is combination compression; and adding a second combination compression identifier to second metadata information corresponding to the second data block to indicate that a second compression manner of the second data block is the combination compression.
9. The data processing method of claim 8, wherein after combining and compressing the first data block and the second data block, the data processing method further comprises: adding a first location identifier to the first metadata information to indicate a first location of the first data block in the target data block; and adding a second location identifier to the second metadata information to indicate a second location of the second data block in the target data block.
10. The data processing method of claim 1, after combining and compressing the first data block and the second data block, the method further comprising: determining whether a data length of a combined and compressed target data block exceeds a storage granularity; when determining that the data length of the combined and compressed target data block exceeds the storage granularity, splitting the combined and compressed target data block into several granularities based on a granularity unit and adding a flag to an end of each segment of data to identify consecutive data block address; and when determining that the data length of the combined and compressed target data block is less than the storage granularity, adding 0 to an end of the combined and compressed target data block.
11. A data processing apparatus comprising: a communications interface; and a processor coupled to the communications interface and configured to execute instructions stored in a memory to cause the data processing apparatus to: obtain a plurality of data blocks; determine a first data block and a second data block from the data blocks, wherein the first data block has a first hash value that is based on a first calculation on the first data block based on a hash algorithm, and wherein the second data block has a second hash value that is based on a second calculation on the second data block based on the hash algorithm; determine that the first data block and the second data block meet a similarity condition based on a degree of similarity of the first hash value and the second hash value; determine whether a data reduction ratio corresponding to a target data block to be obtained by combining and compressing the first data block and the second data block reaches a reduction ratio threshold; when determining that the data reduction ratio reaches the reduction ratio threshold, combine and compress the first data block and the second data block using a first compression algorithm; and when determining that the data reduction ratio does not reach the reduction ratio threshold, separately compress the first data block and the second data block using a second compression algorithm different from the first compression algorithm.
12. The data processing apparatus of claim 11, wherein the hash algorithm is a locality-sensitive hash algorithm.
13. The data processing apparatus of claim 12, wherein the processor further causes the data processing apparatus to: segment the first data block into a plurality of data sub-blocks of different lengths; calculate a hash value of each of the data sub-blocks; perform combination calculation on hash values of the data sub-blocks to obtain a locality-sensitive hash value corresponding to the first data block; and set the locality-sensitive hash value as the first hash value.
14. The data processing apparatus of claim 13, wherein the processor further causes the data processing apparatus to: identify that a Jaccard distance between the first hash value and the second hash value is less than a first distance threshold; identify that a Euclidean distance between the first hash value and the second hash value is less than a second distance threshold; or identify that a Hamming distance between the first hash value and the second hash value is less than a third distance threshold.
15. The data processing apparatus of claim 11, wherein the processor further causes the data processing apparatus to: identify that a difference between the first hash value and the second hash value is less than a similarity threshold; and obtain the degree of similarity based on the difference between the first hash value and the second hash value.
16. The data processing apparatus of claim 11, wherein after combining and compressing the first data block and the second data block, the pro
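The flow recited in claims 1 to 3 and 7 can be sketched roughly as follows. Everything specific here is an assumption made for illustration and is not taken from the publication: the SimHash-style majority vote used as the "combination calculation", the sub-block lengths in `SUB_BLOCK_LENGTHS`, BLAKE2b as the per-sub-block hash, zlib and bz2 as the first and second compression algorithms, the reduction ratio measured by trial compression as original size divided by compressed size, and the default thresholds. The function names are hypothetical.

```python
import bz2
import hashlib
import zlib
from typing import List, Tuple

SUB_BLOCK_LENGTHS = (32, 48, 64)  # assumed cycle of differing sub-block lengths
HASH_BITS = 64


def split_sub_blocks(block: bytes) -> List[bytes]:
    """Segment a block into consecutive sub-blocks of differing lengths."""
    parts, pos, i = [], 0, 0
    while pos < len(block):
        size = SUB_BLOCK_LENGTHS[i % len(SUB_BLOCK_LENGTHS)]
        parts.append(block[pos:pos + size])
        pos += size
        i += 1
    return parts


def lsh_value(block: bytes) -> int:
    """SimHash-style fingerprint: per-bit majority vote over sub-block hashes."""
    votes = [0] * HASH_BITS
    for part in split_sub_blocks(block):
        digest = hashlib.blake2b(part, digest_size=HASH_BITS // 8).digest()
        h = int.from_bytes(digest, "big")
        for bit in range(HASH_BITS):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(HASH_BITS) if votes[bit] > 0)


def hamming_distance(x: int, y: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(x ^ y).count("1")


def compress_pair(a: bytes, b: bytes, max_hamming: int = 16,
                  min_ratio: float = 1.5) -> Tuple[str, List[bytes]]:
    """Combine and compress similar blocks, else compress them separately."""
    if hamming_distance(lsh_value(a), lsh_value(b)) < max_hamming:
        combined = zlib.compress(a + b)  # assumed "first" algorithm
        ratio = (len(a) + len(b)) / len(combined)
        if ratio >= min_ratio:
            return "combined", [combined]
    # Assumed "second" algorithm, applied to each block on its own.
    return "separate", [bz2.compress(a), bz2.compress(b)]
```

Calling `compress_pair(a, b)` returns a mode string and the compressed payloads. A real system might estimate the reduction ratio without a full trial compression and would also record the metadata identifiers described in claims 8 and 9, which this sketch leaves out.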
Management of blocks · CPC title
Saving storage space on storage systems · CPC title
Single storage device · CPC title
Compression (speech analysis-synthesis for redundancy reduction G10L19/00; for image communication H04N); Expansion; Suppression of unnecessary data, e.g. redundancy reduction · CPC title
Hybrid storage combining heterogeneous device types, e.g. hierarchical storage, hybrid arrays · CPC title