Checkpointing

US11263081B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11263081-B2
Application number: US-201916419361-A
Country: US
Kind code: B2
Filing date: May 22, 2019
Priority date: Apr 2, 2019
Publication date: Mar 1, 2022
Grant date: Mar 1, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system comprising: a first subsystem comprising at least one first processor, and a second subsystem comprising one or more second processors. A first program is arranged to run on the at least one first processor, the first program being configured to send data from the first subsystem to the second subsystem. A second program is arranged to run on the one or more second processors, the second program being configured to operate on the data content from the first subsystem. The first program is configured to set a checkpoint at successive points in time. At each checkpoint it records in memory of the first subsystem i) a program state of the second program, comprising a state of one or more registers on each of the second processors at the time of the checkpoint, and ii) a copy of the data content sent to the second subsystem since the respective checkpoint.
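The recording step described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the patent: the names `Checkpoint`, `HostRecorder`, `set_checkpoint`, and `send` are hypothetical, the second subsystem is reduced to a plain list acting as an inbox, and register state is modelled as a dictionary per second processor.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """What the first (host) program records for one checkpoint."""
    registers: dict                                 # register state per second processor
    sent_data: list = field(default_factory=list)   # data sent since this checkpoint


class HostRecorder:
    """Hypothetical host-side recorder; all names are illustrative assumptions."""

    def __init__(self):
        self.checkpoints = []

    def set_checkpoint(self, registers):
        # Copy the registers so later changes do not alter the saved state.
        snapshot = {proc: dict(regs) for proc, regs in registers.items()}
        self.checkpoints.append(Checkpoint(registers=snapshot))

    def send(self, accelerator_inbox, item):
        # Deliver the data and log a copy against the most recent checkpoint.
        accelerator_inbox.append(item)
        if self.checkpoints:
            self.checkpoints[-1].sent_data.append(item)


recorder = HostRecorder()
inbox = []
regs = {"proc0": {"pc": 0, "acc": 0}}

recorder.set_checkpoint(regs)
recorder.send(inbox, "batch-0")

regs["proc0"]["pc"] = 8          # the second program has advanced
recorder.set_checkpoint(regs)
recorder.send(inbox, "batch-1")
recorder.send(inbox, "batch-2")
```

In this sketch each checkpoint owns the log of data sent after it, so the most recent checkpoint always holds exactly the data that would need to be re-sent.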

First claim

Opening claim text (preview).

What is claimed is:

1. A processing system comprising: a first subsystem comprising one or more first processors, and a second subsystem comprising one or more second processors; wherein a first program is arranged to run on the one or more first processors, the first program being configured to send data from the first subsystem to the second subsystem; wherein a second program is arranged to run on the one or more second processors, the second program being configured to operate on the data sent from the first subsystem; wherein the first program is configured to set a respective checkpoint at each of a plurality of points in time, whereby at each checkpoint the first program records in a memory of the first subsystem i) a respective program state of the second program, comprising at least a state of one or more registers on each of the one or more second processors at a point in time of the respective checkpoint, and ii) a copy of the data sent to the second subsystem in a time since the respective checkpoint; and wherein the first program is further configured so as, upon detection of a replay event, to re-send from the first subsystem to the second subsystem the data recorded since a most recent checkpoint, and to control the second subsystem to replay at least part of the second program on at least one of the one or more second processors from the most recent checkpoint, starting with the respective recorded program state and operating on the re-sent data.

2. The processing system of claim 1, wherein the first program is configured to perform the sending and re-sending by sending the data from the one or more first processors.

3. The processing system of claim 1, wherein the first subsystem further comprises one or more storage devices and/or gateway processors; and the first program is configured to perform the sending by controlling at least one of the one or more storage devices and/or gateway processors to send the data to the second subsystem, and is further configured to control the one or more storage devices and/or gateway processors to send a copy of the data to the one or more first processors; the first program being arranged to perform the recording of the data by recording the copy received from the one or more storage devices and/or gateway processors, and to perform the re-sending by sending from the one or more first processors.

4. The processing system of claim 1, wherein: the second program is arranged to operate in a series of phases, wherein each phase comprises a respective one or more codelets; and the first program is configured to set each checkpoint between an end of a respective one of the phases and a start of a next phase in the series.

5. The processing system of claim 1, wherein the one or more second processors comprise a plurality of second processors, and a respective part of the second program is arranged to run on each of the second processors; the replaying of the second program comprising replaying at least the respective part of the second program arranged to run on at least one of the plurality of second processors.

6. The processing system of claim 5, wherein the first program is configured so as, upon detection of the replay event, to control the second subsystem to replay the second program across all of the second processors from the most recent checkpoint.

7. The processing system of claim 5, wherein the first program is configured so as, upon detection of the replay event, to control the second subsystem to replay only the respective part or parts of the second program on a selected subset of one or more of the second processors from the most recent checkpoint.

8. The processing system of claim 5, wherein: the respective part of the second program arranged to run on each second processor comprises one or more respective codelets, and the second program is arranged to operate in a series of phases with a barrier synchronization between at least two of the phases, the barrier synchronization preventing the second program advancing to a next phase until each of the one or more codelets on each of the plurality of second processors running codelets in a current phase have completed; and the first program is configured to set each checkpoint between a respective barrier synchronization and the next phase immediately following the respective barrier synchronization.

9. The processing system of claim 5, wherein each of the second processors comprises a plurality of tiles, each tile comprising a separate processing unit and memory, and each arranged to run a respective portion of the respective part of the second program.

10. The processing system of claim 9, wherein the second program is arranged to operate in a series of Bulk Synchronous Parallel (BSP) supersteps, each superstep comprising an exchange phase and a compute phase following the exchange phase, whereby in each superstep: in the compute phase the second processors perform only respective computations or internal exchanges between tiles but not exchanges between the second processors, and in the exchange phase the second processors exchange computation results between one another, wherein the compute phase is separated from the exchange phase of a next superstep by a barrier synchronization, whereby all the second processors must complete their respective computations of the compute phase before any of the second processors is allowed to proceed to the exchange phase of the next superstep, or on each second processor, in the compute phase the tiles on the second processor perform only respective computations but not exchanges between tiles, and in the exchange phase the tiles on the second processor exchange computation results between one another, wherein the compute phase is separated from the exchange phase of a next superstep by a barrier synchronization, whereby all the tiles on any given one of the second processors must complete their respective computations of the compute phase before any of those tiles on the given second processor is allowed to proceed to the exchange phase of the next superstep; wherein the first program is configured, in setting each checkpoint, to record which in the series of BSP supersteps the second program has reached at the point in time of the respective checkpoint; and the first program is configured to set each of the checkpoints between the barrier synchronization and a following compute phase in a respective one of the supersteps, the replaying comprising replaying from a start of the compute phase of a most recent recorded BSP superstep.

11. The processing system of claim 1, wherein the second subsystem comprises an error detection mechanism configured to detect an error in the second subsystem; and wherein the replay event comprises an error, the detection of the replay event comprising detection of the error by the error detection mechanism.

12. The processing system of claim 11, wherein each of the one or more second processors comprises memory used by at least part of the second program, and the error detection mechanism comprises a memory error detection mechanism for detecting errors in the memory of each of the one or more second processors; and wherein the replay event comprises a memory error in a memory of one of the one or more second processors, the detection of the error being by the error detection mechanism.

13. The processing system of claim 12, wherein the error detection mechanism comprises a parity check mechanism configured to detect the memory error based on a parity check of redundant parity bits included in the memory.

14. The processing system of claim 12
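The replay behaviour recited in claims 1 and 11 to 13 can be sketched in Python as follows. This is a toy model under stated assumptions, not the patented implementation: `Accelerator` stands in for a single second processor that simply sums the values it receives, each call to `compute` is treated as one superstep, memory is reduced to one integer word protected by a single even-parity bit, and the names `Host`, `checkpoint`, `send`, and `replay_if_error` are hypothetical.

```python
import copy


def parity_bit(word):
    """Even-parity bit for a non-negative integer word."""
    return bin(word).count("1") % 2


class Accelerator:
    """Toy stand-in for one second processor: sums the data it receives."""

    def __init__(self):
        self.registers = {"acc": 0, "superstep": 0}
        self.stored_parity = parity_bit(0)

    def compute(self, value):
        self.registers["acc"] += value
        self.registers["superstep"] += 1
        self.stored_parity = parity_bit(self.registers["acc"])

    def memory_ok(self):
        # A single flipped bit changes the parity and is detected here.
        return parity_bit(self.registers["acc"]) == self.stored_parity


class Host:
    """Hypothetical first program: records state and data, replays on error."""

    def __init__(self, accel):
        self.accel = accel
        self.saved_registers = None
        self.saved_parity = None
        self.sent_since_checkpoint = []

    def checkpoint(self):
        # Record the accelerator state and start a fresh data log.
        self.saved_registers = copy.deepcopy(self.accel.registers)
        self.saved_parity = self.accel.stored_parity
        self.sent_since_checkpoint = []

    def send(self, value):
        self.sent_since_checkpoint.append(value)
        self.accel.compute(value)

    def replay_if_error(self):
        if self.accel.memory_ok():
            return False
        # Restore the recorded state, then re-send the recorded data in order.
        self.accel.registers = copy.deepcopy(self.saved_registers)
        self.accel.stored_parity = self.saved_parity
        for value in self.sent_since_checkpoint:
            self.accel.compute(value)
        return True


accel = Accelerator()
host = Host(accel)
host.send(3)
host.checkpoint()
host.send(4)
host.send(5)

accel.registers["acc"] ^= 1        # simulate a single-bit memory error
replayed = host.replay_if_error()
```

After the simulated bit flip, the parity check fails, the host restores the state saved at the checkpoint, and re-sending the two logged values brings the accumulator back to the value it held before the error.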

Assignees

Inventors

Classifications

  • Checkpointing the instruction stream

  • using instruction pipelines

  • Barrier synchronisation

  • Pseudo-random number generators

  • Means for acting in the event of power-supply failure or interruption, e.g. power-supply fluctuations (for resetting only G06F1/24)

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11263081B2 cover?
A system comprising: a first subsystem comprising at least one first processor, and a second subsystem comprising one or more second processors. A first program is arranged to run on the at least one first processor, the first program being configured to send data from the first subsystem to the second subsystem. A second program is arranged to run on the one or more second processors, the second …
Who is the assignee on this patent?
Graphcore Ltd
What technology area does this patent fall under?
Primary CPC classification G06F11/1407. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 1, 2022 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).