Efficient deduplication database validation
US-9639274-B2 · May 2, 2017 · US
US11455280B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11455280-B2 |
| Application number | US-202016919721-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jul 2, 2020 |
| Priority date | Dec 7, 2017 |
| Publication date | Sep 27, 2022 |
| Grant date | Sep 27, 2022 |
Official abstract text for this publication.
A client machine writes to and reads from a virtual disk on a remote storage platform. Metadata is generated and stored in replicas on different metadata nodes of the storage platform. A modified log-structured merge tree is used to store and compact string-sorted tables of metadata. During file storage and compaction, a consistent file identification scheme is used across all metadata nodes. A fingerprint file is calculated for each SST (metadata) file on disk that includes hash values corresponding to regions of the SST file. To synchronize, the fingerprint files of two SST files are compared, and if any hash values are missing from a fingerprint file then the key-value-timestamp triples corresponding to these missing hash values are sent to the SST file that is missing them. The SST file is compacted with the missing triples to create a new version of the SST file. The synchronization is bi-directional.
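The comparison described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the patented implementation: it assumes one region per key-value-timestamp triple, uses SHA-256 as the hash, and assumes that compaction keeps the triple with the newest timestamp for each key. The names `region_hash`, `fingerprint`, `compact`, and `sync_bidirectional` are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple

Triple = Tuple[str, str, int]  # (key, value, timestamp)


def region_hash(triple: Triple) -> str:
    """Hash one region; assumption: a region holds exactly one triple."""
    key, value, ts = triple
    return hashlib.sha256(f"{key}\x00{value}\x00{ts}".encode("utf-8")).hexdigest()


def fingerprint(sst: List[Triple]) -> Dict[str, Triple]:
    """Map each region hash to the triple it covers (stand-in for a fingerprint file)."""
    return {region_hash(t): t for t in sst}


def compact(sst: List[Triple], incoming: List[Triple]) -> List[Triple]:
    """Merge triples into a new sorted version.

    Assumption: the newest timestamp wins when a key appears more than once.
    """
    newest: Dict[str, Triple] = {}
    for t in sst + incoming:
        key = t[0]
        if key not in newest or t[2] > newest[key][2]:
            newest[key] = t
    return sorted(newest.values())


def sync_bidirectional(a: List[Triple], b: List[Triple]) -> Tuple[List[Triple], List[Triple]]:
    """Send each replica the triples whose hashes it lacks, then compact both."""
    fp_a, fp_b = fingerprint(a), fingerprint(b)
    missing_from_b = [fp_a[h] for h in fp_a.keys() - fp_b.keys()]
    missing_from_a = [fp_b[h] for h in fp_b.keys() - fp_a.keys()]
    return compact(a, missing_from_a), compact(b, missing_from_b)


replica_a = [("k1", "v1", 1), ("k2", "v2", 2)]
replica_b = [("k2", "v2", 2), ("k3", "v3", 3)]
new_a, new_b = sync_bidirectional(replica_a, replica_b)
```

After the call, both replicas hold the same three triples. Only the triples behind the differing hashes need to be exchanged, which is the point of comparing fingerprints rather than whole files.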
Opening claim text (preview).
We claim:
1. A system comprising: a plurality of computer nodes, wherein each computer node among the plurality of computer nodes comprises one or more data storage drives and is configured to: retrieve from one of the plurality of computer nodes a first fingerprint file that includes a plurality of first hash values, wherein each first hash value among the plurality of first hash values corresponds to a first region of a first metadata file, wherein the first metadata file includes a first plurality of key-value-timestamp triples each of which uniquely identifies a portion of metadata that pertains to a particular block of data that has been stored to a computer node among the plurality of computer nodes, and wherein each first region of the first metadata file comprises at least part of a key-value-timestamp triple among the first plurality of key-value-timestamp triples; based on indicia that a second metadata file is a replica of the first metadata file, bi-directionally synchronize the first metadata file and the second metadata file, wherein synchronizing bi-directionally comprises: retrieve from a computer node among the plurality of computer nodes a second fingerprint file that includes a plurality of second hash values, wherein each second hash value corresponds to a second region of the second metadata file, wherein each second region comprises at least part of a key-value-timestamp triple among a second plurality of key-value-timestamp triples in the second metadata file, and based on determining that first hash values are not present among the plurality of second hash values, identify in the first metadata file one or more key-value-timestamp triples among the first plurality of key-value-timestamp triples that correspond to the first hash values not present among the plurality of second hash values, and update the second metadata file with the one or more key-value-timestamp triples among the first plurality of key-value-timestamp triples that were identified in the first metadata file, wherein the particular block of data has been stored to a first computer node among the plurality of computer nodes, which is associated with the first metadata file, and has also been stored to a second computer node among the plurality of computer nodes, which is distinct from the first computer node and is associated with the second metadata file, and determine whether all second hash values are present among the plurality of first hash values.
2. The system of claim 1, wherein each computer node among the plurality of computer nodes is further configured to: based on determining that second hash values are not present among the plurality of first hash values, identify in the second metadata file one or more key-value-timestamp triples among the second plurality of key-value-timestamp triples that correspond to the second hash values not present among the plurality of first hash values; and update the first metadata file with the one or more key-value-timestamp triples among the second plurality of key-value-timestamp triples that were identified in the second metadata file.
3. The system of claim 2, wherein updating of the first metadata file and updating of the second metadata file synchronizes bi-directionally, between distinct computer nodes among the plurality of computer nodes, metadata files corresponding to the portion of metadata that pertains to the particular block of data.
4. The system of claim 1, wherein the first hash values that are not present among the plurality of second hash values correspond to missing regions of the second metadata file.
5. The system of claim 1, wherein each computer node among the plurality of computer nodes is further configured to create a new version of the second metadata file by compacting the second metadata file as updated with the one or more key-value-timestamp triples among the first plurality of key-value-timestamp triples.
6. The system of claim 1, wherein the first metadata file and the second metadata file are located on different computer nodes among the plurality of computer nodes and have a same file identifier, and wherein the indicia that the second metadata file is a replica of the first metadata file is based on the same file identifier.
7. The system of claim 1, wherein the first metadata file and the second metadata file are stored on disk by respective computer nodes among the plurality of computer nodes that use a same file identification scheme.
8. The system of claim 1, wherein the first fingerprint file and the second fingerprint file are retrieved from a same computer node.
9. The system of claim 1, wherein each first hash value is part of a start-length-hash value triple that uniquely identifies a first region of the first metadata file as stored on disk.
10. The system of claim 1, wherein each of the first metadata file and the second metadata file is organized as a string-sorted-table (SST).
11. A method comprising: retrieving, from a computer node of a data storage platform, a first fingerprint file that includes a plurality of first hash values, wherein each first hash value among the plurality of first hash values corresponds to a first region of a first metadata file, wherein the first metadata file includes a first plurality of key-value-timestamp triples each of which uniquely identifies a portion of metadata that pertains to a particular block of data that has been stored to a computer node of the data storage platform, and wherein each first region of the first metadata file comprises at least part of a key-value-timestamp triple among the first plurality of key-value-timestamp triples; based on indicia that a second metadata file is a replica of the first metadata file, retrieving from a computer node of the data storage platform a second fingerprint file that includes a plurality of second hash values, wherein each second hash value corresponds to a second region of the second metadata file, wherein each second region comprises at least part of a key-value-timestamp triple among a second plurality of key-value-timestamp triples in the second metadata file; based on determining that first hash values are not present among the plurality of second hash values, identifying in the first metadata file one or more key-value-timestamp triples among the first plurality of key-value-timestamp triples that correspond to the first hash values not present among the plurality of second hash values; updating the second metadata file with the one or more key-value-timestamp triples among the first plurality of key-value-timestamp triples identified in the first metadata file; based on determining that second hash values are not present among the plurality of first hash values correspond to missing regions of the first metadata file, identifying in the second metadata file one or more key-value-timestamp triples among the second plurality of key-value-timestamp triples that correspond to the second hash values not present among the plurality of first hash values; and updating the first metadata file with the one or more key-value-timestamp triples among the second plurality of key-value-timestamp triples identified in the second metadata file; and wherein each computer node of the data storage platform comprises one or more data storage drives.
12. The method of claim 11, wherein the first hash values that are not present among the plurality of second hash values correspond to missing regions of the second metadata file, and wherein the second hash values that are not present among the plurality of first hash values correspond to missing regions of the first metadata file.
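Claim 9 describes each hash value as part of a start-length-hash triple that identifies a region of the metadata file as stored on disk. A rough Python sketch of that layout follows. The on-disk record format (one tab-separated triple per line, with no tabs or newlines inside keys or values), the choice of SHA-256, and the names `Region`, `serialize`, `read_region`, and `missing_regions` are all assumptions made for illustration; the claims do not specify them.

```python
import hashlib
from typing import List, NamedTuple, Tuple

Triple = Tuple[str, str, int]  # (key, value, timestamp)


class Region(NamedTuple):
    start: int    # byte offset of the region in the file
    length: int   # number of bytes in the region
    digest: str   # hash of the region's bytes


def serialize(triples: List[Triple]) -> Tuple[bytes, List[Region]]:
    """Lay triples out as bytes and record one start-length-hash entry per record."""
    data = bytearray()
    regions: List[Region] = []
    for key, value, ts in triples:
        # Assumed record format: tab-separated fields, newline-terminated.
        record = f"{key}\t{value}\t{ts}\n".encode("utf-8")
        regions.append(Region(len(data), len(record), hashlib.sha256(record).hexdigest()))
        data += record
    return bytes(data), regions


def read_region(data: bytes, region: Region) -> Triple:
    """Recover the triple stored in a region using only its start and length."""
    record = data[region.start:region.start + region.length]
    key, value, ts = record.decode("utf-8").rstrip("\n").split("\t")
    return key, value, int(ts)


def missing_regions(local: List[Region], remote: List[Region]) -> List[Region]:
    """Regions of the local file whose hashes are absent from the remote fingerprint."""
    remote_digests = {r.digest for r in remote}
    return [r for r in local if r.digest not in remote_digests]


file_a, fp_a = serialize([("a", "1", 10), ("b", "2", 20)])
file_b, fp_b = serialize([("b", "2", 20)])
to_send = [read_region(file_a, r) for r in missing_regions(fp_a, fp_b)]
```

Here `to_send` contains only the triple for key `a`. Storing the start and length next to each hash lets a node locate the missing triples directly in its own file, without rescanning or rehashing the whole file.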
Techniques for file synchronisation in file systems · CPC title
using data annotations, e.g. user-defined metadata · CPC title
Distributed queries · CPC title
Distributed file systems · CPC title
Synchronous replication · CPC title