Efficient deduplication using block-based convergent encryption

US11582025B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11582025-B2
Application number  US-202017037369-A
Country             US
Kind code           B2
Filing date         Sep 29, 2020
Priority date       Sep 29, 2020
Publication date    Feb 14, 2023
Grant date          Feb 14, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods are described for providing secure storage of data sets while enabling efficient deduplication of data. Each data set can be divided into fixed-length blocks. The plaintext of each block can be convergently encrypted, such as by using a hash of the plaintext as an encryption key, to result in block-level ciphertext that can be stored. If two data sets share blocks, the resulting block-level ciphertext can be expected to overlap, and thus duplicative block-level ciphertexts need not be stored. A manifest can be created to facilitate re-creation of the data set, which manifest identifies the block-level ciphertexts of the data set and a key by which each block-level ciphertext was encrypted. By use of block-level encryption, nearly identical data sets can be largely deduplicated, even if they are not perfectly identical.
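The flow the abstract describes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the 4096-byte block size, the zero-padding of the final block, the manifest layout, and all function names are hypothetical, and the SHAKE-256 keystream XOR is a toy stand-in for a real cipher such as AES.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed; the abstract only says "fixed-length"


def _toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR data with a SHAKE-256 keystream derived from the key.

    Illustration only: a deterministic stand-in for a real cipher.
    Applying it twice with the same key returns the original data.
    """
    stream = hashlib.shake_256(key).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


def store_data_set(data: bytes, block_store: dict) -> dict:
    """Split, convergently encrypt, deduplicate, and return a manifest."""
    blocks = []
    for offset in range(0, len(data), BLOCK_SIZE):
        # Zero-pad the last block so every block has the same length (assumption).
        block = data[offset:offset + BLOCK_SIZE].ljust(BLOCK_SIZE, b"\0")
        key = hashlib.sha256(block).digest()        # key derived from the plaintext
        ciphertext = _toy_cipher(key, block)        # same plaintext -> same ciphertext
        block_id = hashlib.sha256(ciphertext).hexdigest()
        block_store.setdefault(block_id, ciphertext)  # store only if not already present
        blocks.append((block_id, key))
    # The manifest records which ciphertext blocks make up the data set
    # and the key for each; "length" lets the reader strip the padding.
    return {"length": len(data), "blocks": blocks}


def load_data_set(manifest: dict, block_store: dict) -> bytes:
    """Re-create a data set from its manifest."""
    plaintext = b"".join(
        _toy_cipher(key, block_store[block_id])
        for block_id, key in manifest["blocks"]
    )
    return plaintext[:manifest["length"]]
```

Storing two data sets that differ in only one block adds a single new ciphertext block for the second one, which is the "nearly identical data sets can be largely deduplicated" behavior the abstract points to.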

First claim

Opening claim text (preview).

What is claimed is:

1. A storage system facilitating deduplication of encrypted data, the storage system comprising: one or more data stores configured to store: block-level ciphertext objects, each block-level ciphertext object representing a fixed-length plaintext object convergently encrypted to result in the block-level ciphertext object; and data set manifests, each data set manifest corresponding to a data set and identifying (i) a collection of block-level ciphertext objects within the one or more data stores that collectively represent the data set, and (ii) for each block-level ciphertext object within the collection, an encryption key by which the block-level ciphertext object was encrypted, wherein each encryption key is deterministically derived based at least in part on a fixed-length plaintext object represented by the block-level ciphertext object; one or more computing devices including a hardware processor and associated with the storage system, the one or more computing devices configured to: obtain, from a computing device of a first user, a request to store a particular data set on the storage system; divide the particular data set into a set of fixed-length plaintext objects, wherein a length of each fixed-length plaintext object is shared with fixed-length plaintext objects generated from one or more additional data sets, of one or more additional users distinct from the first user, stored on the storage system; generate a set of block-level ciphertext objects representing the set of fixed-length plaintext objects in encrypted form, wherein generation of the set of block-level ciphertext objects comprises, for each fixed-length plaintext object of the set of fixed-length plaintext objects, convergently encrypting the fixed-length plaintext object with an encryption key selected based at least in part on the fixed-length plaintext object to result in a block-level ciphertext object representing the fixed-length plaintext object; store, within the one or more data stores, those block-level ciphertext objects of the set of block-level ciphertext objects that are not duplicative of a block-level ciphertext object already stored within the one or more data stores; generate a manifest for the particular data set, the manifest for the particular data set identifying the set of block-level ciphertext objects that collectively represent the particular data set and, for each block-level ciphertext object of the set of block-level ciphertext objects, the encryption key with which the block-level ciphertext object is encrypted; and store the manifest for the particular data set within the one or more data stores.

2. The system of claim 1, wherein the encryption key of each fixed-length plaintext object is a hash value for the fixed-length plaintext object generated by processing the fixed-length plaintext object through a cryptographic hash algorithm.

3. The system of claim 2, wherein the cryptographic hash algorithm is one of a Secure Hash Algorithm (SHA) family algorithm or a BLAKE family algorithm.

4. The system of claim 1, wherein convergently encrypting the fixed-length plaintext object comprises encrypting the fixed-length plaintext object using at least one of a block cipher or a stream cipher.

5. The system of claim 1, wherein the particular data set is a disk image containing code executable on a serverless code execution system, and wherein the one or more computing devices are further configured to: obtain a request to execute the code on the serverless code execution system; obtain, from the one or more data stores, the manifest for the particular data set; identify, from the manifest for the particular data set, a subset of the set of block-level ciphertext objects that are not cached at the serverless code execution system; retrieve the subset of block-level ciphertext objects; decrypt the set of block-level ciphertext objects using the encryption keys included within the manifest to result in the particular data set; provision a virtualized execution environment of the serverless code execution system with the particular data set; and execute the code within the virtualized execution environment.

6. A method implemented by at least one computing device comprising a processor and associated with a storage system, the method comprising: obtaining from a computing device of a first user a request to store a data set on the storage system; dividing the data set into a set of fixed-length plaintext objects, wherein a length of each fixed-length plaintext object is shared with fixed-length plaintext objects generated from one or more additional data sets, of one or more additional users distinct from the first user, stored on the storage system; generating a set of block-level ciphertext objects representing the set of fixed-length plaintext objects in encrypted form, wherein generating the set of block-level ciphertext objects comprises, for each fixed-length plaintext object of the set of fixed-length plaintext objects, convergently encrypting the fixed-length plaintext object with an encryption key derived based at least in part on the fixed-length plaintext object to result in a block-level ciphertext object representing the fixed-length plaintext object; storing, within the storage system, those block-level ciphertext objects of the set of block-level ciphertext objects that are not duplicative of a block-level ciphertext object that is already stored within the storage system; generating a manifest for the data set, the manifest identifying the set of block-level ciphertext objects that collectively represent the data set and, for each block-level ciphertext object of the set of block-level ciphertext objects, the encryption key with which the block-level ciphertext object is encrypted; and storing the manifest within the storage system.

7. The method of claim 6 further comprising encrypting, using an additional key, a portion of the manifest containing the encryption keys for each block-level ciphertext object of the set of block-level ciphertext objects.

8. The method of claim 6 further comprising generating an identifier for each block-level ciphertext object of the set of block-level ciphertext objects, wherein the identifier for each block-level ciphertext object is at least one of a message authentication code (MAC) of the block-level ciphertext object or a hash value of the block-level ciphertext object.

9. The method of claim 8, wherein the identifier for each block-level ciphertext object includes a message authentication code (MAC) of the block-level ciphertext object, and wherein the MAC of the block-level ciphertext object is generated based on the block-level ciphertext object and the encryption key by which the block-level ciphertext object is encrypted.

10. The method of claim 8, wherein the identifier for each block-level ciphertext object includes a message authentication code (MAC) of the block-level ciphertext object, and wherein the MAC is at least one of a hash-based MAC (HMAC), a Galois/Counter Mode MAC (GMAC), or a Poly1305 MAC.

11. The method of claim 8 further comprising determining those block-level ciphertext objects of the set of block-level ciphertext objects that are not duplicative by comparing the MAC of each block-level ciphertext object of the set of block-level ciphertext objects against MACs of block-level ciphertext objects that are already stored within the storage system.

12. The method of claim 6, wherein convergently encrypting the fixed-length plaintext object comprises encrypting the fixed-length plaintext object using at least one of Advanced Encryption Standard (AES) encryption or
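Claims 8 through 11 and claim 5 describe two bookkeeping steps that are easy to picture in code: identifying each ciphertext block by a MAC and skipping blocks whose MAC is already known, and working out which blocks listed in a manifest are missing from a local cache. The Python sketch below is a hedged illustration only. It assumes HMAC-SHA-256 as the MAC (one of the options claim 10 lists), treats ciphertexts as opaque bytes, represents a manifest as a list of `(block_id, key)` pairs, and uses hypothetical function names throughout.

```python
import hashlib
import hmac


def block_identifier(ciphertext: bytes, key: bytes) -> str:
    """Identifier for a ciphertext block: an HMAC over the ciphertext,
    keyed with the block's own encryption key (as in claim 9)."""
    return hmac.new(key, ciphertext, hashlib.sha256).hexdigest()


def store_if_new(ciphertext: bytes, key: bytes, block_store: dict):
    """Store the block only if no block with the same MAC exists.

    Returns (block_id, is_new).
    """
    block_id = block_identifier(ciphertext, key)
    is_new = block_id not in block_store
    if is_new:
        block_store[block_id] = ciphertext
    return block_id, is_new


def missing_blocks(manifest: list, cache: dict) -> list:
    """Identifiers listed in the manifest that are not yet cached locally."""
    return [block_id for block_id, _key in manifest if block_id not in cache]


def fill_cache(manifest: list, cache: dict, block_store: dict) -> int:
    """Fetch only the uncached blocks; returns how many were retrieved."""
    needed = missing_blocks(manifest, cache)
    for block_id in needed:
        cache[block_id] = block_store[block_id]
    return len(needed)
```

Because the identifier depends only on the ciphertext and its key, two users who upload the same block arrive at the same identifier independently, so the second upload is recognized as a duplicate without comparing block contents.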

Assignees

Amazon Tech Inc

Inventors

Classifications

  • to a system of files or objects, e.g. local or distributed file system or database

  • based on file chunks

  • De-duplication techniques

  • in relation to content

  • H04L9/0643 (Primary): Hash functions, e.g. MD5, SHA, HMAC or f9 MAC

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11582025B2 cover?
Systems and methods are described for providing secure storage of data sets while enabling efficient deduplication of data. Each data set can be divided into fixed-length blocks. The plaintext of each block can be convergently encrypted, such as by using a hash of the plaintext as an encryption key, to result in block-level ciphertext that can be stored. If two data sets share blocks, the resul…
Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification H04L9/0643. Mapped technology areas include Electricity.
When was this patent published?
Publication date Feb 14, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).