Method and system for I/O parallel distributed garbage collection of a deduplicated datasets
US-10990518-B1 · Apr 27, 2021 · US
US11314724B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11314724-B2 |
| Application number | US-202016865617-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 4, 2020 |
| Priority date | Oct 19, 2017 |
| Publication date | Apr 26, 2022 |
| Grant date | Apr 26, 2022 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
Techniques for data deduplication may include: receiving write operations that write first data; partitioning the first data into a plurality of data portions; generating, using a first hash function, a plurality of data deduplication hash values for the plurality of data portions, wherein a first data deduplication hash value of the plurality of data deduplication hash values is produced by said generating for a first of the plurality of data portions; performing first processing using a Bloom filter to determine whether the first data deduplication hash value has a corresponding first entry in a data store of deduplication hash values; and responsive to the first processing determining the first data deduplication hash value does not have the corresponding first entry in the data store of deduplication hash values, performing second processing, said second processing including adding the corresponding first entry in the data store of deduplication hash values.
Opening claim text (preview).
What is claimed is: 1. A method of performing data deduplication in a system comprising: receiving one or more write operations that write first data; partitioning the first data into a plurality of data portions; generating, using a first hash function, a plurality of data deduplication hash values for the plurality of data portions, wherein a first data portion is included in the plurality of data portions and a first data deduplication hash value of the plurality of data deduplication hash values is produced by said generating for the first data portion, wherein each of the plurality of data deduplication hash values generated using the first hash function uniquely represents a corresponding one of the plurality of data portions; performing first processing using a Bloom filter to determine whether the first data deduplication hash value has a corresponding first entry in a data store of deduplication hash values; and responsive to the first processing determining the first data deduplication hash value does not have the corresponding first entry in the data store of deduplication hash values, performing second processing, said second processing including adding the corresponding first entry in the data store of deduplication hash values, and wherein the system comprises a central processing unit (CPU) domain and a second domain, wherein the CPU domain includes one or more main processors and wherein the second domain includes a first device comprising a plurality of processors, wherein at least said generating is performed in the second domain to offload processing from the CPU domain, and wherein said generating includes the plurality of processors of the first device of the second domain executing a same first instruction stream in parallel, wherein the same first instruction stream includes code of the first hash function and each of the plurality of processors of the first device receives as input a different one of the plurality of data portions and generates a different one of the plurality of data deduplication hash values that corresponds to said different one of the plurality of data portions. 2. The method of claim 1 , wherein the first processing further includes: determining, in accordance with Bloom filter hash functions and the first data deduplication hash value, a first set of bit positions of the Bloom filter, wherein each bit position in the first set identifies a bit position in the Bloom filter; querying the Bloom filter to determine whether any bit position of the first set has a corresponding bit position in the Bloom filter that is set to zero; and responsive to determining that at least one bit position of the first set has a corresponding bit position in the Bloom filter that is set to zero, determining that the first data deduplication hash value does not have the corresponding first entry in the data store of data deduplication hash values. 3. The method of claim 2 , further comprising: responsive to determining that no bit position of the first set has a corresponding bit position in the Bloom filter that is set to zero, determining that the first data deduplication hash value may have the corresponding first entry in the data store of data deduplication hash values and performing additional processing to definitely determine whether the first data deduplication hash value has the corresponding first entry in the data store of data deduplication hash values. 4. The method of claim 3 , wherein said additional processing includes: querying the data store of data deduplication hash values to determine whether the first data deduplication hash value has the corresponding first entry in the data store of data deduplication hash values. 5. The method of claim 4 , further comprising: responsive to said querying the data store of data deduplication hash values determining the first data deduplication hash value has the corresponding first entry in the data store of data deduplication hash values, discarding the first data portion and determining that the first data portion is already stored in a data store of deduplicated data portions. 6. The method of claim 4 , further comprising: responsive to said querying the data store of data deduplication hash values determining the first data deduplication hash value does not have the corresponding first entry in the data store of data deduplication hash values, performing other processing including: adding the corresponding first entry in the data store of deduplication hash values; storing the first data portion in the data store of deduplication data portions; and mapping the corresponding first entry to the first data portion as stored in the data store of deduplication data portions. 7. The method of claim 6 , wherein the other processing includes updating the Bloom filter in accordance with the first set of bit positions, wherein each bit position in the first set identifies a bit position in the Bloom filter which is set to one by said updating. 8. The method of claim 1 , wherein the Bloom filter is a probabilistic data structure that provides a definitive indication of particular data deduplication hash values that do not have corresponding entries in the data store of data deduplication hash values, and provides an indefinite indication of particular data deduplication hash values that have corresponding entries in the data store of data deduplication hash values. 9. The method of claim 8 , wherein the indefinite indication is a probabilistic indication as to whether particular data deduplication hash values have corresponding entries in the data store of data deduplication hash values. 10. The method of claim 1 , wherein the second processing includes storing the first data portion in a data store of deduplicated data portions and updating the Bloom filter in accordance with the first data portion. 11. The method of claim 10 , wherein the data store of deduplication data portions includes only a single unique instance of each data portion processed in connection with data deduplication. 12. The method of claim 1 , wherein said first processing is performed in the second domain and includes executing second code, by one or more processors of the second domain, that uses the Bloom filter to determine whether the first data deduplication hash value has a corresponding first entry in a data store of deduplication hash values. 13. The method of claim 12 , further comprising: journaling write operations that add new entries to the data store of deduplication hash values. 14. A system comprising: one or more processors; and one or more memories comprising code stored thereon that, when executed, performs method of performing data deduplication comprising: receiving one or more write operations that write first data; partitioning the first data into a plurality of data portions; generating, using a first hash function, a plurality of data deduplication hash values for the plurality of data portions, wherein a first data portion is included in the plurality of data portions and a first data deduplication hash value of the plurality of data deduplication hash values is produced by said generating for the first data portion, wherein each of the plurality of data deduplication hash values generated using the first hash function uniquely represents a corresponding one of the plurality of data portions; performing first processing using a Bloom filter to determine whether the first data deduplication hash value has a corresponding first entry in a data store of deduplication hash values; and responsive to the first processing determining the first data de
Ensuring data consistency and integrity · CPC title
De-duplication implemented within the file system, e.g. based on file segments (de-duplication techniques in storage systems for the management of data blocks G06F3/0641) · CPC title
Search customisation based on user profiles and personalisation · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.