Fast algorithm to find file system difference for deduplication

US11775484B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-11775484-B2
Application numberUS-201916552965-A
CountryUS
Kind codeB2
Filing dateAug 27, 2019
Priority dateAug 27, 2019
Publication dateOct 3, 2023
Grant dateOct 3, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The disclosure provides techniques for deduplicating files. The techniques include, upon creating or modifying a file, placing a logical timestamp of the current logical time, within a queue associated with the directory of the file. The techniques further include placing the logical timestamp within a queue of each parent directory of the directory of the file. To determine a set of files for deduplication, the techniques disclosed herein identify files that have been modified within a logical time range. The set of files modified within a logical time is identified by traversing directories of a storage system, the directories being organized within a tree structure. If a directory's queue does not contain a timestamp that is within the logical time range, then all child directories can be skipped over for further processing, such that no files within the child directories end up being within the set of files for deduplication.

First claim

Opening claim text (preview).

What is claimed is: 1. A method of deduplicating one or more files within a storage system, the one or more files organized within a tree of directories, the method comprising: modifying or creating a first file at a first time, wherein the first file is located within a first directory and is a child of the first directory; responsive to the modifying or creating the first file, (a) marking the first directory with a timestamp associated with the first time, and (b) marking with the timestamp each parent directory of the first directory that is not already marked with the timestamp; and deduplicating the one or more files, wherein deduplicating the one or more files comprises: traversing the tree of directories from parent to child starting at a root to identify directories of the tree of directories marked with a corresponding timestamp within a determined time range, wherein the determined time range comprises a start time prior in time than the first time and an end time later in time than the first time; while traversing, locating at least one directory not marked with a corresponding timestamp within the determined time range; skipping traversing at least one child directory of the at least one directory; searching within the identified directories to identify files modified or created during the determined time range, wherein the identified files comprise the first file; and deduplicating the identified files. 2. The method of claim 1 , wherein directories of the tree of directories other than the identified directories are not searched to identify files modified or created during the determined time range. 3. The method of claim 1 , further comprising: maintaining a logical clock comprising a counter, wherein each value of the logical clock maps to an actual time, wherein timestamps of directories indicate a value with respect to the logical clock. 4. The method of claim 3 , further comprising responsive to the modifying or creating the first file, setting a modification time of the first file to a current actual time, wherein files modified or created during the determined time range comprise files with a corresponding modification time within the determined time range. 5. The method of claim 1 , wherein the start time of the determined time range comprises a current time such that the determined time range comprises a time range excluding a time period between a previous time and the current time. 6. The method of claim 1 , wherein marking with the timestamp each parent directory of the first directory that is not already marked with the timestamp comprises: traversing the tree of directories from child to parent starting at the first directory and terminating traversing if a directory that is marked with the timestamp is reached. 7. The method of claim 1 , wherein deduplicating the one or more files further comprises: determining the determined time range, where the determined time range comprises: a time range later in time than a previous time range used for a previous deduplication, when the one or more files were previously deduplicated, or any time range, when the one or more files were not previously deduplicated. 8. A non-transitory computer readable medium comprising instructions that, when executed by one or more processors of a computing system, cause the computing system to perform operations for deduplicating one or more files within a storage system, the one or more files organized within a tree of directories, the operations comprising: modifying or creating a first file at a first time, wherein the first file is located within a first directory and is a child of the first directory; responsive to the modifying or creating the first file, (a) marking the first directory with a timestamp associated with the first time, and (b) marking with the timestamp each parent directory of the first directory that is not already marked with the timestamp; and deduplicating the one or more files, wherein deduplicating the one or more files comprises: traversing the tree of directories from parent to child starting at a root directory to identify directories of the tree of directories marked with a corresponding timestamp within a determined time range, wherein the determined time range comprises a start time prior in time than the first time and an end time later in time than the first time; while traversing, locating at least one directory not marked with a corresponding timestamp within the determined time range; skipping traversing at least one child directory of the at least one directory; searching within the identified directories to identify files modified or created during the determined time range, wherein the identified files comprise the first file; and deduplicating the identified files. 9. The non-transitory computer readable medium of claim 8 , wherein directories of the tree of directories other than the identified directories are not searched to identify files modified or created during the determined time range. 10. The non-transitory computer readable medium of claim 8 , wherein the operations further comprise: maintaining a logical clock comprising a counter, wherein each value of the logical clock maps to an actual time, wherein timestamps of directories indicate a value with respect to the logical clock. 11. The non-transitory computer readable medium of claim 10 , wherein the operations further comprise responsive to the modifying or creating the first file, setting a modification time of the first file to a current actual time, wherein files modified or created during the determined time range comprise files with a corresponding modification time within the determined time range. 12. The non-transitory computer readable medium of claim 8 , wherein the start time of the determined time range comprises a current time such that the determined time range comprises a time range excluding a time period between a previous time and the current time. 13. The non-transitory computer readable medium of claim 8 , wherein marking with the timestamp each parent directory of the first directory that is not already marked with the timestamp comprises: traversing the tree of directories from child to parent starting at the first directory and terminating traversing if a directory that is marked with the timestamp is reached. 14. The non-transitory computer readable medium of claim 8 , wherein deduplicating the one or more files further comprises: determining the determined time range, where the determined time range comprises: a time range later in time than a previous time range used for a previous deduplication, when the one or more files were previously deduplicated, or any time range, when the one or more files were not previously deduplicated. 15. A computer system comprising: one or more processors; and at least one memory, the one or more processors and the at least one memory configured to: modify or create a first file at a first time, wherein the first file is located within a first directory and is a child of the first directory; responsive to the modifying or creating the first file, (a) mark the first directory with a timestamp associated with the first time, and (b) mark with the timestamp each parent directory of the first directory that is not already marked with the timestamp; and deduplicate one or more files, wherein to deduplicate comprises to: traverse a tree of directories from parent to child starting at a root directory to identify directories of the tree of directories marked with a corresponding timestamp within a determined time rang

Assignees

Inventors

Classifications

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11775484B2 cover?
The disclosure provides techniques for deduplicating files. The techniques include, upon creating or modifying a file, placing a logical timestamp of the current logical time, within a queue associated with the directory of the file. The techniques further include placing the logical timestamp within a queue of each parent directory of the directory of the file. To determine a set of files for …
Who is the assignee on this patent?
Vmware Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/1752. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Oct 03 2023 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).