Optimizing inline deduplication during copies

US10824359B2 · US · B2

Patent metadata
Publication number: US-10824359-B2
Application number: US-201715798740-A
Country: US
Kind code: B2
Filing date: Oct 31, 2017
Priority date: Oct 31, 2017
Publication date: Nov 3, 2020
Grant date: Nov 3, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A technique for storing data in a data storage system detects that a read is being performed pursuant to a data copy request. In response, the data storage system stores a digest of the data being read in an entry of a digest cache. Later, when a write pursuant to the same copy request arrives, the storage system obtains the entry from the digest cache and completes the write request without creating a duplicate copy of the data.
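
The flow in the abstract can be illustrated with a minimal Python sketch. It is a toy model, not the patented implementation: the class `DedupStore`, its dictionaries, the `for_copy` flag, and the choice of SHA-256 as the hash are all assumptions made for illustration. A read performed for a copy records the digest and the source pointer in the digest cache, and the matching write finds that entry and reuses the pointer instead of storing the data again.

```python
import hashlib


class DedupStore:
    """Toy block store (hypothetical): addresses map to pointers, pointers to data."""

    def __init__(self):
        self.blocks = {}        # pointer -> stored bytes
        self.pointers = {}      # logical address -> pointer
        self.digest_cache = {}  # hash digest -> pointer of already-stored data
        self._next_ptr = 0

    def _allocate(self, data):
        # Store a new physical copy and return its pointer.
        ptr = self._next_ptr
        self._next_ptr += 1
        self.blocks[ptr] = data
        return ptr

    def read(self, addr, for_copy=False):
        ptr = self.pointers[addr]
        data = self.blocks[ptr]
        if for_copy:
            # Digest-caching operation: remember digest and source pointer.
            self.digest_cache[hashlib.sha256(data).digest()] = ptr
        return data

    def write(self, addr, data):
        digest = hashlib.sha256(data).digest()
        ptr = self.digest_cache.get(digest)
        if ptr is None:
            # No cache entry: fall back to an ordinary write.
            ptr = self._allocate(data)
        # On a cache hit the target simply shares the source pointer.
        self.pointers[addr] = ptr

    def copy(self, src, dst):
        self.write(dst, self.read(src, for_copy=True))


store = DedupStore()
store.write("src", b"payload")
store.copy("src", "dst")
print(store.pointers["src"] == store.pointers["dst"])  # True
print(len(store.blocks))                               # 1
```

In this sketch only copy-driven reads populate the cache, so two unrelated writes of identical data would still produce two stored blocks; handling those is outside what the abstract describes.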

First claim

Opening claim text (preview).

What is claimed is:

1. A method of copying data in a data storage system, the method comprising: receiving a copy request that specifies copying of a set of data from a source logical address to a target logical address, the data storage system providing a source pointer for the source logical address and a target pointer for the target logical address, the source pointer having a value that points to the set of data; when reading the set of data from the source logical address pursuant to the copy request, performing a digest-caching operation by (i) obtaining a hash digest of the set of data and (ii) creating an entry in a digest cache, the entry storing the hash digest of the set of data and the value of the source pointer; and when writing the set of data to the target logical address pursuant to the copy request, (i) calculating the hash digest of the set of data, (ii) performing a lookup into the digest cache for the calculated hash digest, and (iii) upon finding the calculated hash digest in the entry of the digest cache, performing an inline deduplication operation by setting the target pointer to the value of the source pointer as provided in the entry of the digest cache, such that both the source pointer and the target pointer point to the set of data.

2. The method of claim 1, wherein reading the set of data is performed by processing a read request by the data storage system, wherein processing the read request includes applying an HD (highly-dedupable) tag to the read request, and wherein the digest-caching operation is performed in response to detecting that an HD tag has been applied to the read request.

3. The method of claim 2, wherein applying the HD tag to the read request is performed in response to detecting that the read request is issued pursuant to a copy request.

4. The method of claim 3 wherein, when performing the digest-caching operation, obtaining the hash digest of the set of data is performed by accessing the hash digest from memory-resident mapping metadata for mapping the source logical address to the set of data.

5. The method of claim 3 wherein, when performing the digest-caching operation, obtaining the hash digest of the set of data is performed by calculating the hash digest from the set of data after reading the set of data.

6. The method of claim 3, wherein the data storage system is configured to maintain multiple deduplication domains, wherein the digest cache is global to the multiple deduplication domains, and wherein creating the entry in the digest cache for the set of data further includes storing, in the entry, an identifier of a deduplication domain in which the source logical address is located.

7. The method of claim 6, further comprising: receiving a second copy request that specifies copying of the set of data from the source logical address to a second target logical address, the second target logical address residing in a second deduplication domain different from that of the source logical address; and when writing the set of data to the second target logical address pursuant to the second copy request, (i) calculating the hash digest of the set of data, (ii) performing a lookup into the digest cache for both the calculated hash digest and an identifier of the second deduplication domain, and (iii) upon failing to find any entry in the digest cache that matches both the calculated hash digest and the identifier of the second deduplication domain, completing a write operation of the set of data without performing an inline deduplication operation.

8. The method of claim 7, wherein the data storage system provides a second target pointer for the second target logical address, the second target pointer having a value that points to the set of data stored in the second deduplication domain, and wherein, upon failing to find any entry in the digest cache that matches both the calculated hash digest and the identifier of the second deduplication domain, the method further comprises creating a new entry in the digest cache, the new entry storing (i) the hash digest of the set of data, (ii) the value of the second target pointer, and (iii) an identifier of the second deduplication domain.

9. The method of claim 8, wherein writing the set of data pursuant to the second copy request is performed by issuing a write request that includes an HD (highly-dedupable) tag, the tag applied to the write request in response to detecting that the write request was issued pursuant to a copy request, and wherein creating the new entry in the digest cache is contingent upon detecting the HD tag in the write request.

10. The method of claim 6, wherein the data storage system maintains the deduplication domains as respective file systems, and wherein storing the identifier of the deduplication domain in the entry of the digest cache includes storing a file system identifier of the file system in which the source logical address is located.

11. The method of claim 10, wherein the source logical address is a logical address within a source file and wherein the target logical address is a logical address within a target file.

12. The method of claim 11, wherein the source file and the target file reside within different file systems in the data storage system.

13. A computerized apparatus, comprising control circuitry that includes a set of processing units coupled to memory, the control circuitry constructed and arranged to: receive a copy request that specifies copying of a set of data from a source logical address to a target logical address, the computerized apparatus providing a source pointer for the source logical address and a target pointer for the target logical address, the source pointer having a value that points to the set of data; when reading the set of data from the source logical address pursuant to the copy request, perform a digest-caching operation by (i) obtaining a hash digest of the set of data and (ii) creating an entry in a digest cache, the entry storing the hash digest of the set of data and the value of the source pointer; and when writing the set of data to the target logical address pursuant to the copy request, (i) calculate the hash digest of the set of data, (ii) perform a lookup into the digest cache for the calculated hash digest, and (iii) upon finding the calculated hash digest in the entry of the digest cache, perform an inline deduplication operation by setting the target pointer to the value of the source pointer as provided in the entry of the digest cache, such that both the source pointer and the target pointer point to the set of data.

14. A computer program product including a set of non-transitory, computer-readable media having instructions which, when executed by control circuitry of a computerized apparatus, cause the control circuitry to perform a method for copying data, the method comprising: receiving a copy request that specifies copying of a set of data from a source logical address to a target logical address; providing a source pointer for the source logical address and a target pointer for the target logical address, the source pointer having a value that points to the set of data; when reading the set of data from the source logical address pursuant to the copy request, performing a digest-caching operation by (i) obtaining a hash digest of the set of data and (ii) creating an entry in a digest cache, the entry storing the hash digest of the set of data and the value of the source pointer; and when writing the set of data to the target logical address pursuant to the copy request, (i) calculating the hash digest of the set of data
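
The domain-aware behavior in dependent claims 6 to 9 can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the claimed implementation: `MultiDomainStore`, the tuple-shaped pointers, the `hd_tag` parameter, and the use of SHA-256 are all hypothetical. The digest cache is shared across domains, each entry is keyed by both digest and domain identifier, and a write that misses stores the data normally and, when HD-tagged, creates a new entry for its own domain.

```python
import hashlib


def digest_of(data):
    # Hash choice is an assumption; the claims only say "hash digest".
    return hashlib.sha256(data).digest()


class MultiDomainStore:
    """Toy store with one global digest cache spanning several dedup domains."""

    def __init__(self):
        self.blocks = {}        # pointer -> bytes; pointer = (domain, block number)
        self.pointers = {}      # (domain, logical address) -> pointer
        self.digest_cache = {}  # (digest, domain id) -> pointer
        self._next_block = 0

    def _allocate(self, domain, data):
        ptr = (domain, self._next_block)
        self._next_block += 1
        self.blocks[ptr] = data
        return ptr

    def read(self, domain, addr, hd_tag=False):
        ptr = self.pointers[(domain, addr)]
        data = self.blocks[ptr]
        if hd_tag:
            # Entry records the digest, the source pointer, and the source domain.
            self.digest_cache[(digest_of(data), domain)] = ptr
        return data

    def write(self, domain, addr, data, hd_tag=False):
        key = (digest_of(data), domain)
        ptr = self.digest_cache.get(key)
        if ptr is None:
            # No entry matches both digest and domain: ordinary write.
            ptr = self._allocate(domain, data)
            if hd_tag:
                # New entry so later copies into this domain can deduplicate.
                self.digest_cache[key] = ptr
        self.pointers[(domain, addr)] = ptr
        return ptr

    def copy(self, src_domain, src_addr, dst_domain, dst_addr):
        # Reads and writes issued for a copy carry the HD tag.
        data = self.read(src_domain, src_addr, hd_tag=True)
        return self.write(dst_domain, dst_addr, data, hd_tag=True)


store = MultiDomainStore()
store.write("fs1", 0, b"data")
store.copy("fs1", 0, "fs1", 1)   # same domain: target shares the source pointer
store.copy("fs1", 0, "fs2", 0)   # different domain: miss, new block, new entry
store.copy("fs1", 0, "fs2", 1)   # hits the entry created by the previous copy
print(len(store.blocks))         # 2
```

Keying the cache on the pair of digest and domain identifier keeps pointers from crossing domain boundaries while still letting a single cache serve every domain.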

Assignees

EMC IP Holding Co LLC

Classifications

  • De-duplication implemented within the file system, e.g. based on file segments (de-duplication techniques in storage systems for the management of data blocks G06F3/0641)

  • Replication mechanisms

  • Hash-based (content-based indexing of textual data G06F16/31)

  • Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

  • Saving storage space on storage systems

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10824359B2 cover?
A technique for storing data in a data storage system detects that a read is being performed pursuant to a data copy request. In response, the data storage system stores a digest of the data being read in an entry of a digest cache. Later, when a write pursuant to the same copy request arrives, the storage system obtains the entry from the digest cache and completes the write request without cr…
Who is the assignee on this patent?
EMC IP Holding Co LLC
What technology area does this patent fall under?
Primary CPC classification G06F16/1748. Mapped technology areas include Physics.
When was this patent published?
Published Nov 3, 2020 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).