System and method for rapid fault detection and repair in a shared nothing distributed database
US-2022114192-A1 · Apr 14, 2022 · US
US11514029B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11514029-B2 |
| Application number | US-202017136568-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 29, 2020 |
| Priority date | Oct 14, 2020 |
| Publication date | Nov 29, 2022 |
| Grant date | Nov 29, 2022 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
A shared-nothing database system is provided in which parallelism and workload balancing are increased by assigning the rows of each table to “slices”, and storing multiple copies (“duplicas”) of each slice across the persistent storage of multiple nodes of the shared-nothing database system. When the data for a table is distributed among the nodes of a shared-nothing system in this manner, requests to read data from a particular row of the table may be handled by any node that stores a duplica of the slice to which the row is assigned. For each slice, a single duplica of the slice is designated as the “primary duplica”. All DML operations (e.g. inserts, deletes, updates, etc.) that target a particular row of the table are performed by the node that has the primary duplica of the slice to which the particular row is assigned. The changes made by the DML operations are then propagated from the primary duplica to the other duplicas (“secondary duplicas”) of the same slice.
Opening claim text (preview).
What is claimed is: 1. A method comprising: storing a plurality of duplicas of a slice in a set of hosts that belong to a distributed database system; wherein the plurality of duplicas include at least: a primary duplica that resides on persistent storage local to a first host of the set of hosts; and a secondary duplica that resides on persistent storage local to a second host of the set of hosts; executing a transaction that performs an update to a particular data item in the slice; using a coordinating engine instance to coordinate execution of the transaction; wherein, during execution of the transaction, the coordinating engine instance causes: the first host to store the update in the primary duplica; and the second host to store the update in the secondary duplica; causing the transaction to enter a preparing state, including: the coordinating engine instance storing data that indicates the transaction is in a preparing state; and the coordinating engine instance sending a prepare message to the second host; in response to the prepare message, the second host storing data to indicate that the transaction is in a preparing state and determining a prepare timestamp for the update to the secondary duplica; while the transaction is in the preparing state, the second host receiving a read request to read the particular data item as of a particular snapshot time; if the particular snapshot time is less than the prepare timestamp for the update, then the second host allowing the read request to read a pre-update version of the particular data item that existed before the update to the secondary duplica; if the particular snapshot time is greater than the prepare timestamp, then: the second host sending an increase-clock message to the coordinating engine instance to cause a first logical clock used by the coordinating engine instance to be set to an updated value that is at least as high as the particular snapshot time; and after sending the increase-clock message, the second host allowing the read request to read the pre-update version of the particular data item; and when the transaction commits, assigning the transaction a commit time that is as least a high as the updated value. 2. The method of claim 1 wherein: the second host has a local logical clock; and determining a prepare timestamp for the update includes selecting a prepare timestamp that is significantly higher than a current value of the local logical clock to increase likelihood that any read operations received by the second host during the preparing state of the transaction will be assigned snapshot times that are less than the prepare timestamp. 3. The method of claim 1 wherein, if the particular snapshot time is greater than the prepare timestamp, then the second host allows the read request to read the pre-update version of the particular data item only after receiving confirmation that the increase-clock message was successfully processed. 4. The method of claim 1 wherein causing the first logical clock used by the coordinating engine instance to be set to an updated value that is at least as high as the particular snapshot time comprises: changing a global prepare time for the transaction to the updated value; and when committing the transaction, setting the first logical clock used by the coordinating engine instance to a value that is at least as high as the global prepare time. 5. A method comprising: storing a first plurality of duplicas of a first slice in a first set of hosts that belong to a distributed database system; wherein the first plurality of duplicas include at least: a first primary duplica that resides on persistent storage local to a first host of the first set of hosts; and a first secondary duplica that resides on persistent storage local to a second host of the first set of hosts; wherein the second host is different than the first host; storing a second plurality of duplicas of a second slice in a second set of hosts that belong to the distributed database system; wherein the second plurality of duplicas include at least: a second primary duplica that resides on persistent storage local to a third host of the second set of hosts; and a second secondary duplica that resides on persistent storage local to a fourth host of the second set of hosts; wherein the fourth host is different than the third host; executing a transaction that: performs a first update to a first version of a first data item in the first slice; and performs a second update to a first version of a second data item in the second slice; wherein executing the transaction includes: the first host storing the first update in the first primary duplica; the first host causing the second host to store the first update in the second primary duplica; the third host storing the second update in the second primary duplica; and the third host causing the fourth host to store the second update in the second secondary duplica. 6. The method of claim 5 wherein: the transaction is submitted to the first host by a client application; the method further comprises, prior to submitting the transaction to the first host, the client application determining whether the first slice or the second slice is to be a controlling slice of the transaction; and the transaction is submitted to the first host by the client application upon determining that the first slice is to be the controlling slice and that the first host hosts the first primary duplica of the first slice. 7. The method of claim 6 wherein the first slice is determined to be the controlling slice based on at least one of: the transaction performing more work on the first slice than the second slice; an amount of data that is local to the first slice; the first slice being touched first by the transaction; or the first slice qualifying as a most reliable slice relative to the slices touched by the transaction. 8. The method of claim 6 further comprising the first host causing the third host to store the second update in the second primary duplica by the first host sending to the third host a DML fragment which, when executed by the third host, causes the third host to store the second update in the second primary duplica. 9. The method of claim 6 further comprising establishing the second host as a backup coordinator for the transaction based on the second host having a secondary duplica of the slice selected as the controlling slice. 10. The method of claim 5 further comprising: causing the transaction to enter a preparing state by storing transaction status data at the first host that indicates that the transaction is in the preparing state; the first host receiving, directly or indirectly, prepare acknowledgements from all hosts that are hosting primary replicas of slices that were updated by the transaction and all hosts that are hosting secondary replicas of slices that were updated by the transaction; in response to receiving, directly or indirectly, prepare acknowledgements from all hosts that are hosting primary replicas of slices that were updated by the transaction and all hosts that are hosting secondary replicas of slices that were updated by the transaction, causing the transaction to enter a committing state by changing the transaction status data at the first host to indicate that the transaction is in the committing state; and during the committing state, the first host sending to the second host a candidate commit time that is at least as high as a current value of a first logical clock at the first host. 11. The method of claim 10 wherein: the prepare acknowledgements include prepare time
Data partitioning, e.g. horizontal or vertical partitioning · CPC title
Updates performed during online database operations; commit processing · CPC title
Managing data history or versioning (querying versioned data G06F16/2474; querying temporal data G06F16/2477) · CPC title
Query execution · CPC title
Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.