Clustered fault tolerance systems and methods using load-based failover

US9952932B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-9952932-B2
Application numberUS-201514930165-A
CountryUS
Kind codeB2
Filing dateNov 2, 2015
Priority dateNov 2, 2015
Publication dateApr 24, 2018
Grant dateApr 24, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer implemented method for providing fault tolerance to a plurality of instances in a system including a plurality of surviving instances includes: determining, for each of the surviving instances, an aggregate load by: retrieving a job load of each job assigned to the respective surviving instance; and summing the job loads of all of the jobs assigned to the respective surviving instance; and selecting to recover and perform, by one of the surviving instances, an orphaned job based upon the aggregate loads of the surviving instances.

First claim

Opening claim text (preview).

What is claimed is: 1. A computer implemented method for providing fault tolerance to a plurality of instances in a system including a plurality of surviving instances, the method comprising: determining, for each of the surviving instances, an aggregate load by: retrieving a job load of each job assigned to the respective surviving instance; and summing the job loads of all of the jobs assigned to the respective surviving instance; and selecting to recover and perform, by one of the surviving instances, one of a plurality of orphaned jobs based upon the aggregate loads of the surviving instances, the selecting further comprising: retrieving a job load of each orphaned job; calculating, for each surviving instance, a load discount or penalty of each orphaned job if it were to be accepted by the respective surviving instance; and selecting, by a surviving instance, an orphaned job of the plurality of orphaned jobs based upon: (i) the job loads of the orphaned jobs; (ii) the load discounts or penalties of the orphaned jobs calculated for the respective surviving instance; and (iii) the aggregate loads of the surviving instances until all of the orphaned jobs have been selected to be recovered and performed by the surviving instances. 2. The method of claim 1 , wherein the system includes a plurality of orphaned jobs, the method including: retrieving a job load of each orphaned job; and selecting, by a surviving instance, the orphaned job with the highest job load based upon the aggregate loads of the surviving instances until all of the orphaned jobs have been selected by the surviving instances. 3. The method of claim 2 , wherein, if a plurality of orphaned jobs have the same highest job load, a surviving instance selects an orphaned job from the plurality of orphaned jobs having the same highest job load based upon a tiebreaker. 4. The method of claim 3 , wherein a surviving instance selects an orphaned job from the plurality of orphaned jobs having the same highest job load based upon an alphabetical order of the names of the plurality of orphaned jobs having the same highest job load. 5. The method of claim 1 , which includes selecting the orphaned job by the surviving instance with the lowest aggregate load. 6. The method of claim 5 , wherein, if a plurality of surviving instances have the same lowest aggregate load, a surviving instance from the plurality of surviving instances having the same lowest aggregate load selects the orphaned job based upon a tiebreaker. 7. The method of claim 6 , wherein a surviving instance from the plurality of surviving instances having the same lowest aggregate load selects the orphaned job based upon an alphabetical order of the names of the plurality of surviving instances having the same lowest aggregate load. 8. The method of claim 1 , which includes: determining, for each of the surviving instances, an available capacity of the respective surviving instance as a difference between a total capacity of the respective surviving instance and the aggregate load of the respective surviving instance; and selecting the orphaned job by the surviving instance with the highest available capacity. 9. The method of claim 8 , wherein, if a plurality of surviving instances have the same highest available capacity, a surviving instance from the plurality of surviving instances having the same highest available capacity selects the orphaned job based upon a tiebreaker. 10. The method of claim 9 , wherein a surviving instance from the plurality of surviving instances having the same highest available capacity selects the orphaned job based upon an alphabetical order of the names of the plurality of surviving instances having the same highest available capacity. 11. The method of claim 1 , which includes recovering, by the surviving instance that selects the orphaned job, job recovery data associated with the orphaned job. 12. The method of claim 11 , wherein the job recovery data is stored in one of: a synchronization device in communication with each of the instances in the plurality of instances; one or more of the instances in the plurality of instances; or an external data store. 13. The method of claim 11 , wherein recovering the job recovery data includes one or more of: replaying recovery messages; accessing a snapshot; or accessing state data of the orphaned job. 14. The method of claim 1 , wherein the system is an exchange system. 15. The method of claim 1 , wherein the orphaned job results from a terminating event; and wherein the terminating event is one of: an instance failure; a job failure; a manual termination of an instance by a user of the system; or a manual termination of a job by a user of the system. 16. The method of claim 1 , wherein the orphaned job and the surviving instance that selects the orphaned job both run on one machine. 17. The method of claim 1 , wherein each instance in the plurality of instances is an instance of a same application. 18. The method of claim 17 , wherein each of the jobs assigned to an instance relates to the application. 19. The method of claim 17 , wherein the application is one of: a market data generator application for generating market data from a match engine; or a trade match report generator application for reporting trade data from a match engine. 20. The method of claim 1 , wherein the job load of a job is one or more of: calculated during a duration of the job; or predetermined before the job begins. 21. The method of claim 20 , wherein the job load of a job is calculated based upon at least one of: computational resources needed to perform the job or a throughput of the job. 22. The method of claim 1 , wherein each of the jobs is a long-term job that generates or processes more than one transaction. 23. The method of claim 22 , wherein the transaction includes saving or writing data to a memory or database or sending a message to a component in a different system. 24. The method of claim 1 , wherein the job load of each job fluctuates between full capacity and idle during a duration of the job. 25. The method of claim 1 , which includes: calculating a recovery load of the orphaned job; and selecting, by one of the surviving instances, the orphaned job based upon (i) the aggregate loads of the surviving instances; (ii) the job load of the orphaned job; and (iii) the recovery load of the orphaned job. 26. The method of claim 25 , wherein the recovery load of the orphaned job is different than the job load of the orphaned job. 27. The method of claim 25 , wherein the recovery load of the orphaned job is calculated after the occurrence of a terminating event that results in the orphaned job. 28. A computer implemented method for providing fault tolerance to a plurality of machines in a system including at least two orphaned jobs and at least two surviving machines, the method comprising: retrieving a job load of each orphaned job; determining, for each of the surviving machines, an aggregate load by: retrieving a job load of each job assigned to the respective surviving machine; and summing the job loads of all of the jobs assigned to the respective surviving machine; and calculating, for each surviving machine, a load discount or penalty of each orphaned job if it were to be accepted by the respective surviving machine; and selecting to recover and

Assignees

Inventors

Classifications

  • Techniques for rebalancing the load in a distributed system · CPC title

  • involving task migration · CPC title

  • Active fault masking without idle spares · CPC title

  • considering the load · CPC title

  • Resetting or repowering · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9952932B2 cover?
A computer implemented method for providing fault tolerance to a plurality of instances in a system including a plurality of surviving instances includes: determining, for each of the surviving instances, an aggregate load by: retrieving a job load of each job assigned to the respective surviving instance; and summing the job loads of all of the jobs assigned to the respective surviving instanc…
Who is the assignee on this patent?
Chicago Mercantile Exchange Inc
What technology area does this patent fall under?
Primary CPC classification G06F11/1441. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Apr 24 2018 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).