Elastic, ephemeral in-line deduplication service
US-11537573-B2 · Dec 27, 2022 · US
US12353370B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12353370-B2 |
| Application number | US-202218071790-A |
| Country | US |
| Kind code | B2 |
| Filing date | Nov 30, 2022 |
| Priority date | Sep 25, 2015 |
| Publication date | Jul 8, 2025 |
| Grant date | Jul 8, 2025 |
Official abstract text for this publication.
A deduplication service can be provided to a storage domain from a services framework that expands and contracts to both meet service demand and to conform to resource management of a compute domain. The deduplication service maintains a fingerprint database and reference count data in compute domain resources, but persists these into the storage domain for use in the case of a failure or interruption of the deduplication service in the compute domain. The deduplication service responds to service requests from the storage domain with indications of paths in a user namespace and whether or not a piece of data had a fingerprint match in the fingerprint database. The indication of a match guides the storage domain to either store the piece of data into the storage backend or to reference another piece of data. The deduplication service uses the fingerprints to define paths for corresponding pieces of data.
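The request/response behaviour described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the class name `Deduplicator`, the `/chunks/...` path layout, and the choice of SHA-256 as the fingerprint function are all assumptions made for illustration, and the persistence of the fingerprint database into the storage domain is omitted.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Deduplicator:
    """Hypothetical in-memory stand-in for the deduplication service."""

    # Fingerprint database: fingerprint (hex string) -> path of the stored data.
    fingerprints: dict = field(default_factory=dict)

    @staticmethod
    def fingerprint(data: bytes) -> str:
        # Assumption: SHA-256. The abstract does not name a hash function.
        return hashlib.sha256(data).hexdigest()

    @staticmethod
    def path_for(fp: str) -> str:
        # Assumption: the path is derived from the fingerprint, with two
        # fan-out levels taken from its leading characters.
        return f"/chunks/{fp[:2]}/{fp[2:4]}/{fp}"

    def handle(self, data: bytes) -> tuple:
        """Return (path, matched) for one piece of data.

        matched=False tells the caller to store the data at `path`;
        matched=True tells it to reference the data already at `path`.
        """
        fp = self.fingerprint(data)
        matched = fp in self.fingerprints
        if not matched:
            self.fingerprints[fp] = self.path_for(fp)
        return self.fingerprints[fp], matched


service = Deduplicator()
first_path, first_match = service.handle(b"example payload")
second_path, second_match = service.handle(b"example payload")
```

In this sketch the first request reports no match, so the caller would store the data, while the second request reports a match and returns the same path, so the caller would only record a reference.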
Opening claim text (preview).
What is claimed is:
1. A method, comprising: in response to receiving a write request targeting a data unit, dividing the data unit into sub-units according to a sub-unit size; determining, by a redirector, a number of deduplicator instances to instantiate for deduplicating the sub-units based upon a deduplication service policy specifying a threshold amount of data that can be processed by a single executing deduplicator instance; creating a data unit manifest for the data unit with an indication of an order and count of the sub-units, wherein the data unit manifest is populated with paths to the sub-units according to a hierarchical namespace or a flat namespace where a path is a namespace identifier used to obtain data of a constituent sub-unit; and requesting deduplication for the sub-units by the number of deduplicator instances using the paths within the data unit manifest.
2. The method of claim 1, wherein each deduplicator instance is assigned up to the threshold amount of data to deduplicate as specified by the deduplication service policy.
3. The method of claim 1, wherein determining the deduplicator instances to instantiate comprises: determining the number of deduplicator instances to instantiate and execute for deduplicating the sub-units based upon a size of the data unit.
4. The method of claim 1, wherein determining the deduplicator instances to instantiate comprises: determining the number of deduplicator instances to instantiate and execute for deduplicating the sub-units based upon a number of sub-units into which the data unit is divided.
5. The method of claim 1, comprising: obtaining, from a service dispatcher, location information for the deduplicator instances, wherein the location information corresponds to network addresses and ports of the deduplicator instances.
6. The method of claim 1, comprising: caching, by the redirector into a cache, location information retrieved from a service dispatcher for the deduplicator instances; and utilizing the location information within the cache for processing a subsequent deduplication request.
7. The method of claim 1, comprising: contacting, off a request path associated with processing the write request, a service dispatcher to refresh a cache used by the redirector to cache location information retrieved from the service dispatcher for the deduplicator instances.
8. The method of claim 1, comprising: in response to determining that a deduplication service is unavailable based upon insufficient resources in a compute domain, notifying the deduplicator instance, by a service dispatcher, that the deduplication service is unavailable.
9. The method of claim 1, comprising: hosting, by a deduplication service, a garbage collector to maintain reference count data for donor file data chunks managed within a service space in accordance with the reference count data.
10. The method of claim 1, comprising: scanning, by a garbage collector of a deduplication service, a user space within a storage backend based upon checkpoints to identify data unit manifests added to the user space since a last scan; and incrementing reference counts for a set of sub-units indicated by the data unit manifests.
11. A non-transitory machine readable medium comprising instructions for performing a method, which when executed by a machine, causes the machine to perform operations comprising: in response to receiving a write request targeting a data unit, dividing the data unit into sub-units according to a sub-unit size; determining, by a redirector, a number of deduplicator instances to instantiate for deduplicating the sub-units based upon a deduplication service policy specifying a threshold amount of data that can be processed by a single executing deduplicator instance; creating a data unit manifest for the data unit with an indication of an order and count of the sub-units, wherein the data unit manifest is populated with paths to the sub-units according to a hierarchical namespace or a flat namespace where a path is a namespace identifier used to obtain data of a constituent sub-unit; and requesting deduplication for the sub-units by the number of deduplicator instances using the paths within the data unit manifest.
12. The non-transitory machine readable medium of claim 11, wherein each deduplicator instance is assigned up to the threshold amount of data to deduplicate as specified by the deduplication service policy.
13. The non-transitory machine readable medium of claim 11, wherein determining the deduplicator instances to instantiate comprises: determining the number of deduplicator instances to instantiate and execute for deduplicating the sub-units based upon a size of the data unit.
14. The non-transitory machine readable medium of claim 11, wherein determining the deduplicator instances to instantiate comprises: determining the number of deduplicator instances to instantiate and execute for deduplicating the sub-units based upon a number of sub-units into which the data unit is divided.
15. The non-transitory machine readable medium of claim 11, comprising: obtaining, from a service dispatcher, location information for the deduplicator instances, wherein the location information corresponds to network addresses and ports of the deduplicator instances.
16. The non-transitory machine readable medium of claim 11, comprising: caching, by the redirector into a cache, location information retrieved from a service dispatcher for the deduplicator instances; and utilizing the location information within the cache for processing a subsequent deduplication request.
17. The non-transitory machine readable medium of claim 11, comprising: contacting, off a request path associated with processing the write request, a service dispatcher to refresh a cache used by the redirector to cache location information retrieved from the service dispatcher for the deduplicator instances.
18. A computing device comprising: a memory comprising machine executable code for performing a method; and a processor coupled to the memory, the processor configured to execute the machine executable code to cause the processor to: in response to receiving a write request targeting a data unit, divide the data unit into sub-units according to a sub-unit size; determine, by a redirector, a number of deduplicator instances to instantiate for deduplicating the sub-units based upon a deduplication service policy specifying a threshold amount of data that can be processed by a single executing deduplicator instance; create a data unit manifest for the data unit with an indication of an order and count of the sub-units, wherein the data unit manifest is populated with paths to the sub-units according to a hierarchical namespace or a flat namespace where a path is a namespace identifier used to obtain data of a constituent sub-unit; and request deduplication for the sub-units by the number of deduplicator instances using the paths within the data unit manifest.
19. The computing device of claim 18, wherein each deduplicator instance is assigned up to the threshold amount of data to deduplicate as specified by the deduplication service policy.
20. The computing device of claim 18, wherein the machine executable code causes the processor to: cache, by the redirector into a cache, location information retrieved from a service dispatcher for the deduplicator instances; and utilize the location information within the cache for processing a subsequent dedupl
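The steps recited in claim 1 (divide, size the deduplicator pool, build the manifest, hand out work by path) can be sketched as follows. This is one possible reading under stated assumptions, not the claimed implementation: the names `DataUnitManifest` and `plan_write`, the `<name>/sub-NNNNNN` path scheme, and the rule of packing whole sub-units into each instance up to the policy threshold are hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class DataUnitManifest:
    count: int   # number of sub-units in the data unit
    paths: list  # sub-unit paths, listed in sub-unit order


def divide(data: bytes, sub_unit_size: int) -> list:
    """Split a data unit into sub-units; the last one may be shorter."""
    return [data[i:i + sub_unit_size] for i in range(0, len(data), sub_unit_size)]


def plan_write(name: str, data: bytes, sub_unit_size: int, threshold: int):
    """Return (manifest, batches), one batch of paths per deduplicator instance.

    `threshold` plays the role of the policy limit on how much data a single
    deduplicator instance may process.
    """
    if sub_unit_size <= 0 or threshold < sub_unit_size:
        raise ValueError("threshold must hold at least one sub-unit")

    sub_units = divide(data, sub_unit_size)

    # Assumed path scheme: the list position encodes the sub-unit order.
    paths = [f"{name}/sub-{i:06d}" for i in range(len(sub_units))]
    manifest = DataUnitManifest(count=len(paths), paths=paths)

    # Whole sub-units that fit under the per-instance threshold.
    per_instance = threshold // sub_unit_size
    n_instances = math.ceil(len(sub_units) / per_instance)

    # Each instance receives a contiguous slice of manifest paths.
    batches = [
        paths[k * per_instance:(k + 1) * per_instance]
        for k in range(n_instances)
    ]
    return manifest, batches


manifest, batches = plan_write("vol1/obj", bytes(10_000), 1024, 4096)
print(manifest.count, [len(b) for b in batches])  # prints: 10 [4, 4, 2]
```

With a 10,000-byte data unit, 1,024-byte sub-units, and a 4,096-byte threshold, the sketch produces ten sub-units and three instances, none of which is assigned more than the threshold amount of data.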
De-duplication implemented within the file system, e.g. based on file segments (de-duplication techniques in storage systems for the management of data blocks G06F3/0641) · CPC title
Distributed queries · CPC title
Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title