Efficient deduplication database validation
US-2016306560-A1 · Oct 20, 2016 · US
US11775484B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11775484-B2 |
| Application number | US-201916552965-A |
| Country | US |
| Kind code | B2 |
| Filing date | Aug 27, 2019 |
| Priority date | Aug 27, 2019 |
| Publication date | Oct 3, 2023 |
| Grant date | Oct 3, 2023 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
The disclosure provides techniques for deduplicating files. The techniques include, upon creating or modifying a file, placing a logical timestamp of the current logical time, within a queue associated with the directory of the file. The techniques further include placing the logical timestamp within a queue of each parent directory of the directory of the file. To determine a set of files for deduplication, the techniques disclosed herein identify files that have been modified within a logical time range. The set of files modified within a logical time is identified by traversing directories of a storage system, the directories being organized within a tree structure. If a directory's queue does not contain a timestamp that is within the logical time range, then all child directories can be skipped over for further processing, such that no files within the child directories end up being within the set of files for deduplication.
Opening claim text (preview).
What is claimed is: 1. A method of deduplicating one or more files within a storage system, the one or more files organized within a tree of directories, the method comprising: modifying or creating a first file at a first time, wherein the first file is located within a first directory and is a child of the first directory; responsive to the modifying or creating the first file, (a) marking the first directory with a timestamp associated with the first time, and (b) marking with the timestamp each parent directory of the first directory that is not already marked with the timestamp; and deduplicating the one or more files, wherein deduplicating the one or more files comprises: traversing the tree of directories from parent to child starting at a root to identify directories of the tree of directories marked with a corresponding timestamp within a determined time range, wherein the determined time range comprises a start time prior in time than the first time and an end time later in time than the first time; while traversing, locating at least one directory not marked with a corresponding timestamp within the determined time range; skipping traversing at least one child directory of the at least one directory; searching within the identified directories to identify files modified or created during the determined time range, wherein the identified files comprise the first file; and deduplicating the identified files. 2. The method of claim 1 , wherein directories of the tree of directories other than the identified directories are not searched to identify files modified or created during the determined time range. 3. The method of claim 1 , further comprising: maintaining a logical clock comprising a counter, wherein each value of the logical clock maps to an actual time, wherein timestamps of directories indicate a value with respect to the logical clock. 4. The method of claim 3 , further comprising responsive to the modifying or creating the first file, setting a modification time of the first file to a current actual time, wherein files modified or created during the determined time range comprise files with a corresponding modification time within the determined time range. 5. The method of claim 1 , wherein the start time of the determined time range comprises a current time such that the determined time range comprises a time range excluding a time period between a previous time and the current time. 6. The method of claim 1 , wherein marking with the timestamp each parent directory of the first directory that is not already marked with the timestamp comprises: traversing the tree of directories from child to parent starting at the first directory and terminating traversing if a directory that is marked with the timestamp is reached. 7. The method of claim 1 , wherein deduplicating the one or more files further comprises: determining the determined time range, where the determined time range comprises: a time range later in time than a previous time range used for a previous deduplication, when the one or more files were previously deduplicated, or any time range, when the one or more files were not previously deduplicated. 8. A non-transitory computer readable medium comprising instructions that, when executed by one or more processors of a computing system, cause the computing system to perform operations for deduplicating one or more files within a storage system, the one or more files organized within a tree of directories, the operations comprising: modifying or creating a first file at a first time, wherein the first file is located within a first directory and is a child of the first directory; responsive to the modifying or creating the first file, (a) marking the first directory with a timestamp associated with the first time, and (b) marking with the timestamp each parent directory of the first directory that is not already marked with the timestamp; and deduplicating the one or more files, wherein deduplicating the one or more files comprises: traversing the tree of directories from parent to child starting at a root directory to identify directories of the tree of directories marked with a corresponding timestamp within a determined time range, wherein the determined time range comprises a start time prior in time than the first time and an end time later in time than the first time; while traversing, locating at least one directory not marked with a corresponding timestamp within the determined time range; skipping traversing at least one child directory of the at least one directory; searching within the identified directories to identify files modified or created during the determined time range, wherein the identified files comprise the first file; and deduplicating the identified files. 9. The non-transitory computer readable medium of claim 8 , wherein directories of the tree of directories other than the identified directories are not searched to identify files modified or created during the determined time range. 10. The non-transitory computer readable medium of claim 8 , wherein the operations further comprise: maintaining a logical clock comprising a counter, wherein each value of the logical clock maps to an actual time, wherein timestamps of directories indicate a value with respect to the logical clock. 11. The non-transitory computer readable medium of claim 10 , wherein the operations further comprise responsive to the modifying or creating the first file, setting a modification time of the first file to a current actual time, wherein files modified or created during the determined time range comprise files with a corresponding modification time within the determined time range. 12. The non-transitory computer readable medium of claim 8 , wherein the start time of the determined time range comprises a current time such that the determined time range comprises a time range excluding a time period between a previous time and the current time. 13. The non-transitory computer readable medium of claim 8 , wherein marking with the timestamp each parent directory of the first directory that is not already marked with the timestamp comprises: traversing the tree of directories from child to parent starting at the first directory and terminating traversing if a directory that is marked with the timestamp is reached. 14. The non-transitory computer readable medium of claim 8 , wherein deduplicating the one or more files further comprises: determining the determined time range, where the determined time range comprises: a time range later in time than a previous time range used for a previous deduplication, when the one or more files were previously deduplicated, or any time range, when the one or more files were not previously deduplicated. 15. A computer system comprising: one or more processors; and at least one memory, the one or more processors and the at least one memory configured to: modify or create a first file at a first time, wherein the first file is located within a first directory and is a child of the first directory; responsive to the modifying or creating the first file, (a) mark the first directory with a timestamp associated with the first time, and (b) mark with the timestamp each parent directory of the first directory that is not already marked with the timestamp; and deduplicate one or more files, wherein to deduplicate comprises to: traverse a tree of directories from parent to child starting at a root directory to identify directories of the tree of directories marked with a corresponding timestamp within a determined time rang
based on file chunks · CPC title
using file content signatures, e.g. hash values · CPC title
Trees · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.