Automatically detecting time-of-fault bugs in cloud systems

US10860411B2 · US · B2

Patent metadata
Field: Value
Publication number: US-10860411-B2
Application number: US-201815938841-A
Country: US
Kind code: B2
Filing date: Mar 28, 2018
Priority date: Mar 28, 2018
Publication date: Dec 8, 2020
Grant date: Dec 8, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method implemented by a network element (NE) in a distributed system, the method comprising tracing an execution of a program in the distributed system to produce a record of the execution of the program, wherein the record indicates states of shared resources at various times during the execution of the program, identifying a vulnerable operation that occurred during the program execution based on the record, wherein the record indicates that a first shared resource of the shared resources is in a flawed state after a node that caused the first shared resource to be in the flawed state crashed, and determining that the vulnerable operation results in a time of fault (TOF) bug based on performing a fault-tolerance mechanism.
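The identification step in the abstract can be illustrated with a small sketch. It is not the patent's implementation: the `Event` type, the action names (`write_flawed`, `correct`, `crash`), and the function `find_vulnerable_operations` are all hypothetical, and the record is assumed to be a simple list of timestamped events. The sketch flags any shared resource that a node left in a flawed state, with no state correction action, at the moment that node crashed.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Event:
    """One entry in a hypothetical execution record."""
    time: int
    node: str
    action: str                      # "write_flawed", "correct", or "crash" (assumed names)
    resource: Optional[str] = None   # None for crash events


def find_vulnerable_operations(record: List[Event]) -> List[Tuple[str, str, int]]:
    """Return (resource, node, crash_time) for each resource that was
    still flawed when the node that made it flawed crashed."""
    flawed_by: Dict[str, str] = {}   # resource -> node that left it flawed
    findings: List[Tuple[str, str, int]] = []
    for ev in sorted(record, key=lambda e: e.time):
        if ev.action == "write_flawed":
            flawed_by[ev.resource] = ev.node
        elif ev.action == "correct":
            # A state correction action restores the accessible state.
            flawed_by.pop(ev.resource, None)
        elif ev.action == "crash":
            for resource, node in flawed_by.items():
                if node == ev.node:
                    findings.append((resource, node, ev.time))
    return findings


record = [
    Event(1, "n1", "write_flawed", "lock"),
    Event(2, "n1", "crash"),                 # n1 crashes while "lock" is flawed
    Event(3, "n2", "write_flawed", "job"),
    Event(4, "n2", "correct", "job"),        # n2 repairs "job" before crashing
    Event(5, "n2", "crash"),
]
findings = find_vulnerable_operations(record)
```

Here only the `lock` resource is reported, because `job` was restored to the accessible state before its writer crashed.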

First claim

Opening claim text (preview).

What is claimed is:

1. A method implemented by a network element (NE) in a distributed system, the method comprising: tracing, by a processor of the NE, an execution of a program in the distributed system to produce a record of the execution of the program, the record indicating a state of a shared resource at various times during the execution of the program, the state of the shared resource indicating whether data stored at the shared resource is in an accessible state or in a flawed state based on whether the data is accessible for performing a task by other NEs in the distributed system; identifying, by the processor of the NE, a vulnerable operation that occurred during the program execution based on the record indicating that the state of the shared resource changed from the accessible state to the flawed state, the vulnerable operation comprising a sequence of actions excluding a state correction action that restores a state of the shared resource to the accessible state, the vulnerable operation occurring in response to detecting that a shared resource of a plurality of shared resources is in a flawed state and in response to detecting a crash in a node in the distributed system, the crash causing the shared resource to be in the flawed state; determining, by the processor of the NE, that the vulnerable operation results in a time of fault (TOF) bug based on performing a fault-tolerance mechanism; causing, by the processor of the NE, the fault-tolerance mechanism to be implemented for the program.

2. The method of claim 1, wherein the record indicates that the vulnerable operation is protected by the fault-tolerance mechanism.

3. The method of claim 1, wherein the vulnerable operation occurs when the node in the distributed system leaves the shared resource in the flawed state before the node crashes.

4. The method of claim 1, wherein the vulnerable operation occurs when the node in the distributed system recovers as a recovery node after maintaining the shared resource in the flawed state.

5. The method of claim 1, wherein the fault-tolerance mechanism is a timeout mechanism.

6. The method of claim 1, wherein the record indicates that the vulnerable operation is not protected by the fault-tolerance mechanism.

7. An apparatus implemented as a network element (NE), comprising: a memory storage comprising instructions; and one or more processors in communication with the memory storage, wherein the one or more processors execute the instructions to: trace an execution of a program in the distributed system to produce a record of the execution of the program, the record indicating a state of a shared resource at various times during the execution of the program, the state of the shared resource indicating whether data stored at the shared resource is in an accessible state or in a flawed state based on whether the data is accessible for performing a task by other NEs in the distributed system; identify a vulnerable operation that occurred during the program execution based on the record indicating that the state of the shared resource changed from the accessible state to the flawed state, the vulnerable operation comprising a sequence of actions excluding a state correction action that restores a state of the shared resource to the accessible state, the vulnerable operation occurring in response to detecting that a shared resource of a plurality of shared resources is in a flawed state and in response to detecting a crash in a node in the distributed system, the crash causing the shared resource to be in the flawed state; and determine that the vulnerable operation results in a time of fault (TOF) bug based on performing a fault-tolerance mechanism.

8. The apparatus of claim 7, wherein the record indicates that the vulnerable operation is protected by the fault-tolerance mechanism.

9. The apparatus of claim 7, wherein the vulnerable operation occurs when the node in the distributed system leaves the shared resource in the flawed state before the node crashes.

10. The apparatus of claim 7, wherein the vulnerable operation occurs when the node in the distributed system recovers as a recovery node after maintaining the shared resource in the flawed state.

11. The apparatus of claim 7, wherein the vulnerable operation comprises executing a write command on the shared resource followed by a read command performed on the shared resource, wherein the write command is performed by a first node in the distributed system, wherein the read command is performed by a second node in the distributed system, and wherein the first node and the second node are different nodes in the distributed system.

12. The apparatus of claim 7, wherein the vulnerable operation comprises executing a write command performed on the shared resource followed by a read command performed on the shared resource, wherein the write command is performed by the node in the distributed system, and wherein the read command is performed by the node after restarting the node.

13. A non-transitory medium configured to store a computer program product comprising computer executable instructions that when executed by a processor of a network element (NE) cause the processor to: trace an execution of a program in the distributed system to produce a record of the execution of the program, the record indicating states of a state of a shared resource at various times during the execution of the program, the state of the shared resource indicating whether data stored at the shared resource is in an accessible state or in a flawed state based on whether the data is accessible for performing a task by other NEs in the distributed system; identify a vulnerable operation that occurred during the program execution based on the record indicating that the state of the shared resource changed from the accessible state to the flawed state, the vulnerable operation comprising a sequence of actions excluding a state correction action that restores a state of the shared resource to the accessible state, the vulnerable operation occurring in response to detecting that a shared resource of a plurality of shared resources is in a flawed state and in response to detecting a crash in a node in the distributed system, the crash causing the shared resource to be in the flawed state; and determine that the vulnerable operation results in a time of fault (TOF) bug based on performing a fault-tolerance mechanism.

14. The non-transitory medium of claim 13, wherein the record indicates that the vulnerable operation is protected by the fault-tolerance mechanism.

15. The non-transitory medium of claim 13, wherein the vulnerable operation occurs when the node in the distributed system leaves the shared resource in the flawed state before the node crashes.

16. The non-transitory medium of claim 13, wherein the vulnerable operation occurs when the node in the distributed system recovers as a recovery node after maintaining the shared resource in the flawed state.

17. The non-transitory medium of claim 13, wherein the vulnerable operation comprises executing a write command on the shared resource followed by a read command performed on the shared resource.
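Claims 11 and 12 describe two write-then-read shapes for a vulnerable operation, and claims 2, 5, and 6 concern whether the operation is protected by a fault-tolerance mechanism such as a timeout. The sketch below is one informal reading, not the claimed method and not a legal interpretation. The labels `cross_node` and `after_restart`, both function names, and the rule that an unprotected pattern counts as a TOF bug candidate are assumptions made for illustration.

```python
from typing import Optional


def classify_write_read(writer: str, reader: str, reader_restarted: bool) -> Optional[str]:
    """Label a write followed by a read on the same shared resource.

    "cross_node":    written by one node, read by a different node (cf. claim 11).
    "after_restart": written by a node, read by that node after it restarts (cf. claim 12).
    None:            neither shape applies.
    """
    if writer != reader:
        return "cross_node"
    if reader_restarted:
        return "after_restart"
    return None


def is_tof_bug_candidate(pattern: Optional[str], protected_by_timeout: bool) -> bool:
    """Assumed rule: a matched pattern with no timeout protection is a candidate."""
    return pattern is not None and not protected_by_timeout


pattern = classify_write_read(writer="n1", reader="n2", reader_restarted=False)
candidate = is_tof_bug_candidate(pattern, protected_by_timeout=False)
```

With a timeout guarding the read (`protected_by_timeout=True`), the same pattern would not be reported under this assumed rule.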

Assignees

Futurewei Technologies Inc

Inventors

Classifications

  • for systems · CPC title

  • G06F16/27 · Primary

    Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor · CPC title

  • Assessing vulnerabilities and evaluating computer system security · CPC title

  • in which an application is distributed across nodes in the network (software deployment G06F8/60; multiprogramming arrangements G06F9/46) · CPC title

  • by exceeding a time limit, i.e. time-out, e.g. watchdogs · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10860411B2 cover?
A method implemented by a network element (NE) in a distributed system, the method comprising tracing an execution of a program in the distributed system to produce a record of the execution of the program, wherein the record indicates states of shared resources at various times during the execution of the program, identifying a vulnerable operation that occurred during the program execution ba…
Who is the assignee on this patent?
Futurewei Technologies Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/27. Mapped technology areas include Physics.
When was this patent published?
Publication date Dec 8, 2020 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).