Targeted sweep method for key-value data storage
US-12182106-B2 · Dec 31, 2024 · US
US12561300B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-12561300-B1 |
| Application number | US-202519033400-A |
| Country | US |
| Kind code | B1 |
| Filing date | Jan 21, 2025 |
| Priority date | Nov 22, 2024 |
| Publication date | Feb 24, 2026 |
| Grant date | Feb 24, 2026 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
A system for scaling parallelism in sweeping stale data associated with multi-version data management is disclosed. The system is programmed to manage a targeted sweep of stale data stored on distributed storage associated with a set of computing nodes, where the stale data includes metadata for database transactions. The system is programmed to dynamically determine, on a computing node, how many threads to simultaneously perform deletion for the targeted sweep based on historical performance of the computing nodes, historical performance of the distributed storage associated with the computing nodes, or other factors. The system is programmed to then cause a certain number of threads to run on the computing node, where each thread identifies and deletes specific metadata for data transactions within the scope of the targeted sweep.
Opening claim text (preview).
What is claimed is: 1 . A method of scaling parallelism in sweeping stale data associated with multi-version data management, comprising: managing versions of a cell of a collection of cells in a first database and metadata entries each indicating a timestamp of a write transaction leading to a new version of the cell in a second database, the metadata entries for the collection of cells being stored on a plurality of shards associated with one or more computing nodes; selecting a current sweep timestamp for a current sweep; determining a number of threads to run simultaneously on each computing node of the one or more computing nodes to perform deletion for the current sweep based on a plurality of factors, including historical performance of each computing node or historical performance of local storage associated with each computing node; and causing the number of threads to run simultaneously on each computing node of the one or more computing nodes, a first thread running on a specific computing node of the one or more computing nodes performing: identifying, from metadata entries on a shard associated with the specific computing node, a first set of cells of a certain size with versions written in the first database prior to the current sweep timestamp; and deleting metadata entries corresponding to the first set of cells and having timestamps prior to the current sweep timestamp from the second database in one database transaction, wherein the method is performed by one or more processors. 2 . The method of claim 1 , the current sweep timestamp being a lower of a start timestamp of a longest running write transaction and an earliest readable time. 3 . The method of claim 1 , the determining starting with dividing a given total number of threads by a number of live computing nodes. 4 . The method of claim 1 , the determining being further based on a quantity of resources currently accessible to each computing node for sweep purposes, a total number of live computing nodes, or a total amount of deletion yet to be performed for the current sweep. 5 . The method of claim 1 , the certain size being a predetermined number of metadata entries or a number of metadata entries having timestamps falling in a period of a predetermined length. 6 . The method of claim 1 , the identifying comprising counting a specific cell only once towards the first set of cells when at least one metadata entry for the specific cell each having a timestamp less than the current sweep timestamp is encountered. 7 . The method of claim 1 , a portion of the shard associated with the specific computing node being divided into one or more fine partitions, each fine partition being represented as a row in a table having a first column for an index of the fine partition and a second column for a list of metadata entries, the identifying comprising retrieving a specific row from the table and reviewing columns of the row. 8 . The method of claim 7 , further comprising: receiving information regarding the first set of cells provided by the first thread; recording coverage of a fine partition of the one or more fine partitions by the first set of cells based on the information, a second thread running on the specific computing node identifying, from the shard associated with the specific computing node, a second set of cells with versions written in the first database prior to the current sweep timestamp based on the coverage. 9 . The method of claim 1 , further comprising receiving a periodic performance update related to a computing node or an associated local storage or a periodic resource availability update indicating a quantity of resources accessible from the computing node. 10 . The method of claim 1 , further comprising: determining that a first computing node of the one or more computing nodes is no longer alive; adjusting the number of threads to run on a second computing node of the one or more computing nodes that is still alive to perform deletion for the current sweep without waiting for the first computing node to be alive again. 11 . The method of claim 1 , the deleting further comprising removing versions of a cell written prior to the current sweep timestamp from the first database before removing corresponding metadata entries from the second database. 12 . The method of claim 1 , a second thread running on a certain computing node of the one or more computing nodes performing identifying a second set of cells with versions written in the first database prior to the current sweep timestamp, the causing comprising: evaluating all sets of cells collected by the respective numbers of threads running on the one or more computing nodes; determining the first set of cells intersecting with the second set of cells; and instructing the second thread not to perform deletion based on the second set of cells. 13 . A system for scaling parallelism in sweeping stale data associated with multi-version data management, comprising: a memory; one or more processors coupled to the memory and configured to perform: managing versions of a cell of a collection of cells in a first database and metadata entries each indicating a timestamp of a write transaction leading to a new version of the cell in a second database, the metadata entries for the collection of cells being stored on a plurality of shards associated with one or more computing nodes; selecting a current sweep timestamp for a current sweep; determining a number of threads to run simultaneously on each computing node of the one or more computing nodes to perform deletion for the current sweep based on a plurality of factors, including historical performance of each computing node or historical performance of local storage associated with each computing node; and causing the number of threads to run simultaneously on each computing node of the one or more computing nodes, a first thread running on a specific computing node of the one or more computing nodes performing: identifying, from metadata entries on a shard associated with the specific computing node, a first set of cells of a certain size with versions written in the first database prior to the current sweep timestamp; and deleting metadata entries corresponding to the first set of cells and having timestamps prior to the current sweep timestamp from the second database in one database transaction. 14 . The system of claim 13 , the determining starting with dividing a given total number of threads by a number of live computing nodes. 15 . The system of claim 13 , the determining being further based on a quantity of resources currently accessible to each computing node for sweep purposes, a total number of live computing nodes, or a total amount of deletion yet to be performed for the current sweep. 16 . The system of claim 13 , the identifying comprising counting a specific cell only once towards the first set of cells when at least one metadata entry for the specific cell each having a timestamp less than the current sweep timestamp is encountered. 17 . The system of claim 13 , a portion of the shard associated with the specific computing node being divided into one or more fine partitions, each fine partition being represented as a row in a table having a first column for an index of the fine partition and a second column for a list of metadata entries, the identifying comprising retrieving a specific row from the table and reviewing columns of the row. 18 . The system of claim 13 , the one
Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title
using timestamps · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.