Method for checking the integrity of a compute node
US-2024303346-A1 · Sep 12, 2024 · US
US9952932B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9952932-B2 |
| Application number | US-201514930165-A |
| Country | US |
| Kind code | B2 |
| Filing date | Nov 2, 2015 |
| Priority date | Nov 2, 2015 |
| Publication date | Apr 24, 2018 |
| Grant date | Apr 24, 2018 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
A computer implemented method for providing fault tolerance to a plurality of instances in a system including a plurality of surviving instances includes: determining, for each of the surviving instances, an aggregate load by: retrieving a job load of each job assigned to the respective surviving instance; and summing the job loads of all of the jobs assigned to the respective surviving instance; and selecting to recover and perform, by one of the surviving instances, an orphaned job based upon the aggregate loads of the surviving instances.
Opening claim text (preview).
What is claimed is: 1. A computer implemented method for providing fault tolerance to a plurality of instances in a system including a plurality of surviving instances, the method comprising: determining, for each of the surviving instances, an aggregate load by: retrieving a job load of each job assigned to the respective surviving instance; and summing the job loads of all of the jobs assigned to the respective surviving instance; and selecting to recover and perform, by one of the surviving instances, one of a plurality of orphaned jobs based upon the aggregate loads of the surviving instances, the selecting further comprising: retrieving a job load of each orphaned job; calculating, for each surviving instance, a load discount or penalty of each orphaned job if it were to be accepted by the respective surviving instance; and selecting, by a surviving instance, an orphaned job of the plurality of orphaned jobs based upon: (i) the job loads of the orphaned jobs; (ii) the load discounts or penalties of the orphaned jobs calculated for the respective surviving instance; and (iii) the aggregate loads of the surviving instances until all of the orphaned jobs have been selected to be recovered and performed by the surviving instances. 2. The method of claim 1 , wherein the system includes a plurality of orphaned jobs, the method including: retrieving a job load of each orphaned job; and selecting, by a surviving instance, the orphaned job with the highest job load based upon the aggregate loads of the surviving instances until all of the orphaned jobs have been selected by the surviving instances. 3. The method of claim 2 , wherein, if a plurality of orphaned jobs have the same highest job load, a surviving instance selects an orphaned job from the plurality of orphaned jobs having the same highest job load based upon a tiebreaker. 4. The method of claim 3 , wherein a surviving instance selects an orphaned job from the plurality of orphaned jobs having the same highest job load based upon an alphabetical order of the names of the plurality of orphaned jobs having the same highest job load. 5. The method of claim 1 , which includes selecting the orphaned job by the surviving instance with the lowest aggregate load. 6. The method of claim 5 , wherein, if a plurality of surviving instances have the same lowest aggregate load, a surviving instance from the plurality of surviving instances having the same lowest aggregate load selects the orphaned job based upon a tiebreaker. 7. The method of claim 6 , wherein a surviving instance from the plurality of surviving instances having the same lowest aggregate load selects the orphaned job based upon an alphabetical order of the names of the plurality of surviving instances having the same lowest aggregate load. 8. The method of claim 1 , which includes: determining, for each of the surviving instances, an available capacity of the respective surviving instance as a difference between a total capacity of the respective surviving instance and the aggregate load of the respective surviving instance; and selecting the orphaned job by the surviving instance with the highest available capacity. 9. The method of claim 8 , wherein, if a plurality of surviving instances have the same highest available capacity, a surviving instance from the plurality of surviving instances having the same highest available capacity selects the orphaned job based upon a tiebreaker. 10. The method of claim 9 , wherein a surviving instance from the plurality of surviving instances having the same highest available capacity selects the orphaned job based upon an alphabetical order of the names of the plurality of surviving instances having the same highest available capacity. 11. The method of claim 1 , which includes recovering, by the surviving instance that selects the orphaned job, job recovery data associated with the orphaned job. 12. The method of claim 11 , wherein the job recovery data is stored in one of: a synchronization device in communication with each of the instances in the plurality of instances; one or more of the instances in the plurality of instances; or an external data store. 13. The method of claim 11 , wherein recovering the job recovery data includes one or more of: replaying recovery messages; accessing a snapshot; or accessing state data of the orphaned job. 14. The method of claim 1 , wherein the system is an exchange system. 15. The method of claim 1 , wherein the orphaned job results from a terminating event; and wherein the terminating event is one of: an instance failure; a job failure; a manual termination of an instance by a user of the system; or a manual termination of a job by a user of the system. 16. The method of claim 1 , wherein the orphaned job and the surviving instance that selects the orphaned job both run on one machine. 17. The method of claim 1 , wherein each instance in the plurality of instances is an instance of a same application. 18. The method of claim 17 , wherein each of the jobs assigned to an instance relates to the application. 19. The method of claim 17 , wherein the application is one of: a market data generator application for generating market data from a match engine; or a trade match report generator application for reporting trade data from a match engine. 20. The method of claim 1 , wherein the job load of a job is one or more of: calculated during a duration of the job; or predetermined before the job begins. 21. The method of claim 20 , wherein the job load of a job is calculated based upon at least one of: computational resources needed to perform the job or a throughput of the job. 22. The method of claim 1 , wherein each of the jobs is a long-term job that generates or processes more than one transaction. 23. The method of claim 22 , wherein the transaction includes saving or writing data to a memory or database or sending a message to a component in a different system. 24. The method of claim 1 , wherein the job load of each job fluctuates between full capacity and idle during a duration of the job. 25. The method of claim 1 , which includes: calculating a recovery load of the orphaned job; and selecting, by one of the surviving instances, the orphaned job based upon (i) the aggregate loads of the surviving instances; (ii) the job load of the orphaned job; and (iii) the recovery load of the orphaned job. 26. The method of claim 25 , wherein the recovery load of the orphaned job is different than the job load of the orphaned job. 27. The method of claim 25 , wherein the recovery load of the orphaned job is calculated after the occurrence of a terminating event that results in the orphaned job. 28. A computer implemented method for providing fault tolerance to a plurality of machines in a system including at least two orphaned jobs and at least two surviving machines, the method comprising: retrieving a job load of each orphaned job; determining, for each of the surviving machines, an aggregate load by: retrieving a job load of each job assigned to the respective surviving machine; and summing the job loads of all of the jobs assigned to the respective surviving machine; and calculating, for each surviving machine, a load discount or penalty of each orphaned job if it were to be accepted by the respective surviving machine; and selecting to recover and
Techniques for rebalancing the load in a distributed system · CPC title
involving task migration · CPC title
Active fault masking without idle spares · CPC title
considering the load · CPC title
Resetting or repowering · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.