Clustered Fault Tolerance Systems and Methods Using Load-Based Failover
US-2017123929-A1 · May 4, 2017 · US
US10860411B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10860411-B2 |
| Application number | US-201815938841-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 28, 2018 |
| Priority date | Mar 28, 2018 |
| Publication date | Dec 8, 2020 |
| Grant date | Dec 8, 2020 |
Official abstract text for this publication.
A method implemented by a network element (NE) in a distributed system, the method comprising tracing an execution of a program in the distributed system to produce a record of the execution of the program, wherein the record indicates states of shared resources at various times during the execution of the program, identifying a vulnerable operation that occurred during the program execution based on the record, wherein the record indicates that a first shared resource of the shared resources is in a flawed state after a node that caused the first shared resource to be in the flawed state crashed, and determining that the vulnerable operation results in a time of fault (TOF) bug based on performing a fault-tolerance mechanism.
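As an illustration of the abstract, the following is a minimal sketch of how a record of shared-resource states might be scanned for a resource that remains in a flawed state after the node that caused it has crashed. The `Event` type, the event kinds (`set_flawed`, `correct`, `crash`), and the function name are hypothetical and do not come from the patent text; the patent does not specify a record format.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class Event:
    """One entry of a hypothetical execution record (format assumed)."""
    time: int
    node: str
    kind: str                      # "set_flawed", "correct", or "crash"
    resource: Optional[str] = None


def flawed_after_crash(record: List[Event]) -> List[str]:
    """Return resources left flawed by a node that later crashed
    and that no later state correction action restored."""
    flawed_by: Dict[str, str] = {}   # resource -> node that made it flawed
    orphaned: Set[str] = set()       # flawed resources whose causing node crashed
    for ev in sorted(record, key=lambda e: e.time):
        if ev.kind == "set_flawed":
            flawed_by[ev.resource] = ev.node
        elif ev.kind == "correct":
            # A state correction action restores the accessible state.
            flawed_by.pop(ev.resource, None)
            orphaned.discard(ev.resource)
        elif ev.kind == "crash":
            orphaned.update(r for r, n in flawed_by.items() if n == ev.node)
    return sorted(orphaned)


record = [
    Event(1, "n1", "set_flawed", "lock"),
    Event(2, "n1", "set_flawed", "tmp"),
    Event(3, "n1", "correct", "tmp"),
    Event(4, "n1", "crash"),
]
print(flawed_after_crash(record))
```

In this sketch only `lock` is reported, because `tmp` was corrected before the crash. Whether such a resource actually leads to a TOF bug would still have to be determined separately, as the abstract describes.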
Opening claim text (preview).
What is claimed is:
1. A method implemented by a network element (NE) in a distributed system, the method comprising: tracing, by a processor of the NE, an execution of a program in the distributed system to produce a record of the execution of the program, the record indicating a state of a shared resource at various times during the execution of the program, the state of the shared resource indicating whether data stored at the shared resource is in an accessible state or in a flawed state based on whether the data is accessible for performing a task by other NEs in the distributed system; identifying, by the processor of the NE, a vulnerable operation that occurred during the program execution based on the record indicating that the state of the shared resource changed from the accessible state to the flawed state, the vulnerable operation comprising a sequence of actions excluding a state correction action that restores a state of the shared resource to the accessible state, the vulnerable operation occurring in response to detecting that a shared resource of a plurality of shared resources is in a flawed state and in response to detecting a crash in a node in the distributed system, the crash causing the shared resource to be in the flawed state; determining, by the processor of the NE, that the vulnerable operation results in a time of fault (TOF) bug based on performing a fault-tolerance mechanism; causing, by the processor of the NE, the fault-tolerance mechanism to be implemented for the program.
2. The method of claim 1, wherein the record indicates that the vulnerable operation is protected by the fault-tolerance mechanism.
3. The method of claim 1, wherein the vulnerable operation occurs when the node in the distributed system leaves the shared resource in the flawed state before the node crashes.
4. The method of claim 1, wherein the vulnerable operation occurs when the node in the distributed system recovers as a recovery node after maintaining the shared resource in the flawed state.
5. The method of claim 1, wherein the fault-tolerance mechanism is a timeout mechanism.
6. The method of claim 1, wherein the record indicates that the vulnerable operation is not protected by the fault-tolerance mechanism.
7. An apparatus implemented as a network element (NE), comprising: a memory storage comprising instructions; and one or more processors in communication with the memory storage, wherein the one or more processors execute the instructions to: trace an execution of a program in the distributed system to produce a record of the execution of the program, the record indicating a state of a shared resource at various times during the execution of the program, the state of the shared resource indicating whether data stored at the shared resource is in an accessible state or in a flawed state based on whether the data is accessible for performing a task by other NEs in the distributed system; identify a vulnerable operation that occurred during the program execution based on the record indicating that the state of the shared resource changed from the accessible state to the flawed state, the vulnerable operation comprising a sequence of actions excluding a state correction action that restores a state of the shared resource to the accessible state, the vulnerable operation occurring in response to detecting that a shared resource of a plurality of shared resources is in a flawed state and in response to detecting a crash in a node in the distributed system, the crash causing the shared resource to be in the flawed state; and determine that the vulnerable operation results in a time of fault (TOF) bug based on performing a fault-tolerance mechanism.
8. The apparatus of claim 7, wherein the record indicates that the vulnerable operation is protected by the fault-tolerance mechanism.
9. The apparatus of claim 7, wherein the vulnerable operation occurs when the node in the distributed system leaves the shared resource in the flawed state before the node crashes.
10. The apparatus of claim 7, wherein the vulnerable operation occurs when the node in the distributed system recovers as a recovery node after maintaining the shared resource in the flawed state.
11. The apparatus of claim 7, wherein the vulnerable operation comprises executing a write command on the shared resource followed by a read command performed on the shared resource, wherein the write command is performed by a first node in the distributed system, wherein the read command is performed by a second node in the distributed system, and wherein the first node and the second node are different nodes in the distributed system.
12. The apparatus of claim 7, wherein the vulnerable operation comprises executing a write command performed on the shared resource followed by a read command performed on the shared resource, wherein the write command is performed by the node in the distributed system, and wherein the read command is performed by the node after restarting the node.
13. A non-transitory medium configured to store a computer program product comprising computer executable instructions that when executed by a processor of a network element (NE) cause the processor to: trace an execution of a program in the distributed system to produce a record of the execution of the program, the record indicating states of a state of a shared resource at various times during the execution of the program, the state of the shared resource indicating whether data stored at the shared resource is in an accessible state or in a flawed state based on whether the data is accessible for performing a task by other NEs in the distributed system; identify a vulnerable operation that occurred during the program execution based on the record indicating that the state of the shared resource changed from the accessible state to the flawed state, the vulnerable operation comprising a sequence of actions excluding a state correction action that restores a state of the shared resource to the accessible state, the vulnerable operation occurring in response to detecting that a shared resource of a plurality of shared resources is in a flawed state and in response to detecting a crash in a node in the distributed system, the crash causing the shared resource to be in the flawed state; and determine that the vulnerable operation results in a time of fault (TOF) bug based on performing a fault-tolerance mechanism.
14. The non-transitory medium of claim 13, wherein the record indicates that the vulnerable operation is protected by the fault-tolerance mechanism.
15. The non-transitory medium of claim 13, wherein the vulnerable operation occurs when the node in the distributed system leaves the shared resource in the flawed state before the node crashes.
16. The non-transitory medium of claim 13, wherein the vulnerable operation occurs when the node in the distributed system recovers as a recovery node after maintaining the shared resource in the flawed state.
17. The non-transitory medium of claim 13, wherein the vulnerable operation comprises executing a write command on the shared resource followed by a read command performed on the shared resource.
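The write-then-read patterns in claims 11 and 12, together with the protected and unprotected cases in claims 2 and 6, can be sketched roughly as below. This is an illustrative reading only, not the patented implementation: the `Op` and `Candidate` types, the `timeout_guarded` flag, and the choice to treat any later write as a state correction are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, NamedTuple, Set


@dataclass(frozen=True)
class Op:
    """One traced operation (hypothetical trace format)."""
    time: int
    node: str
    kind: str                      # "write", "read", "crash", or "restart"
    resource: str = ""
    timeout_guarded: bool = False  # assumed flag: the read is wrapped in a timeout


class Candidate(NamedTuple):
    resource: str
    writer: str
    reader: str
    pattern: str     # "different-node" or "same-node-after-restart"
    protected: bool  # True if a timeout mechanism guards the read


def find_candidates(trace: List[Op]) -> List[Candidate]:
    """Find reads of a resource whose last writer crashed after writing it."""
    last_write: Dict[str, str] = {}   # resource -> most recent writer
    orphaned: Dict[str, str] = {}     # resource -> writer that crashed after writing
    restarted: Set[str] = set()
    found: List[Candidate] = []
    for op in sorted(trace, key=lambda o: o.time):
        if op.kind == "write":
            last_write[op.resource] = op.node
            orphaned.pop(op.resource, None)  # assumption: a new write corrects state
        elif op.kind == "crash":
            restarted.discard(op.node)
            for res, writer in last_write.items():
                if writer == op.node:
                    orphaned[res] = writer
        elif op.kind == "restart":
            restarted.add(op.node)
        elif op.kind == "read" and op.resource in orphaned:
            writer = orphaned[op.resource]
            if op.node != writer:
                pattern = "different-node"            # compare claim 11
            elif op.node in restarted:
                pattern = "same-node-after-restart"   # compare claim 12
            else:
                continue
            found.append(Candidate(op.resource, writer, op.node,
                                   pattern, op.timeout_guarded))
    return found


trace = [
    Op(1, "A", "write", "r1"),
    Op(2, "A", "crash"),
    Op(3, "B", "read", "r1"),
    Op(4, "A", "restart"),
    Op(5, "A", "read", "r1", timeout_guarded=True),
]
candidates = find_candidates(trace)
for c in candidates:
    print(c)
```

Candidates with `protected` set to `False` correspond loosely to the unprotected case of claim 6 and would be the ones worth examining further for a TOF bug. Those with a timeout guard correspond to the protected case of claims 2 and 5.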
for systems · CPC title
Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor · CPC title
Assessing vulnerabilities and evaluating computer system security · CPC title
in which an application is distributed across nodes in the network (software deployment G06F8/60; multiprogramming arrangements G06F9/46) · CPC title
by exceeding a time limit, i.e. time-out, e.g. watchdogs · CPC title