Checkpointing

US11768735B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11768735-B2
Application number  US-202217651012-A
Country             US
Kind code           B2
Filing date         Feb 14, 2022
Priority date       Apr 2, 2019
Publication date    Sep 26, 2023
Grant date          Sep 26, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system comprising: a first subsystem comprising at least one first processor, and a second subsystem comprising one or more second processors. A first program is arranged to run on the at least one first processor, the first program being configured to send data from the first subsystem to the second subsystem. A second program is arranged to run on the one or more second processors, the second program being configured to operate on the data content from the first subsystem. The first program is configured to set a checkpoint at one or more points in time. At each checkpoint it records in memory of the first subsystem i) a program state of the second program, comprising a state of one or more registers on each of the second processors at the time of the checkpoint, and ii) a copy of the data content sent to the second subsystem since the respective checkpoint.
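The record-and-replay idea in the abstract can be illustrated with a small sketch. This is not the patented implementation: the class names `Accelerator`, `Checkpoint` and `Host`, the single `acc` register, and the running-sum workload are all hypothetical stand-ins chosen for illustration. The sketch only shows the general pattern of a first program saving the second program's register state at a checkpoint, logging the data sent afterwards, and re-sending that data after restoring the saved state.

```python
from copy import deepcopy
from dataclasses import dataclass, field


class Accelerator:
    """Hypothetical second subsystem: one register holding a running sum."""

    def __init__(self):
        self.registers = {"acc": 0}

    def consume(self, value):
        self.registers["acc"] += value


@dataclass
class Checkpoint:
    registers: dict                                  # register state at checkpoint time
    sent_since: list = field(default_factory=list)   # data sent after this checkpoint


class Host:
    """Hypothetical first program: sends data, sets checkpoints, replays."""

    def __init__(self, accelerator):
        self.accelerator = accelerator
        self.checkpoints = []

    def set_checkpoint(self):
        # Record the second program's state in host-side memory.
        self.checkpoints.append(Checkpoint(deepcopy(self.accelerator.registers)))

    def send(self, value):
        # Keep a copy of everything sent since the most recent checkpoint.
        if self.checkpoints:
            self.checkpoints[-1].sent_since.append(value)
        self.accelerator.consume(value)

    def replay(self, index=-1):
        # Restore the recorded state, then re-send the data recorded since
        # the selected (default: most recent) checkpoint.
        start = self.checkpoints[index]
        self.accelerator.registers = deepcopy(start.registers)
        for checkpoint in self.checkpoints[index:]:
            for value in checkpoint.sent_since:
                self.accelerator.consume(value)


if __name__ == "__main__":
    accelerator = Accelerator()
    host = Host(accelerator)
    host.send(1)
    host.set_checkpoint()
    host.send(2)
    host.send(3)
    accelerator.registers["acc"] = 999   # simulate a fault on the accelerator
    host.replay()                        # the "replay event" handler
    print(accelerator.registers["acc"])
```

Because each checkpoint stores only the data sent after it, replaying from an earlier checkpoint in this sketch walks forward through every later checkpoint's log as well.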

First claim

Opening claim text (preview).

The invention claimed is:

1. A processing system comprising: a first subsystem comprising one or more first processors, and a second subsystem comprising one or more second processors; wherein a first program is arranged to run on the one or more first processors, the first program being configured to send data from the first subsystem to the second subsystem; wherein a second program is arranged to run on the one or more second processors, the second program being configured to operate on the data sent from the first subsystem; wherein the first program is configured to set a respective checkpoint at each of one or more points in time, whereby at each checkpoint the first program records in a memory of the first subsystem i) a respective program state of the second program, comprising at least a state of one or more registers on each of the one or more second processors at the point in time of the respective checkpoint, and ii) a copy of the data sent to the second subsystem in a time since the respective checkpoint; and wherein the first program is further configured so as, upon detection of a replay event, to re-send from the first subsystem to the second subsystem the data recorded since a selected or most recent one of the one or more checkpoints, and to control the second subsystem to replay at least part of the second program on at least one of the one or more second processors from the selected or most recent checkpoint, starting with the respective recorded program state and operating on the re-sent data.

2. The processing system of claim 1, wherein the first program is configured to set a respective checkpoint at each of an ongoing series of points in time up to a current checkpoint at a current point in time, and wherein the first program is further configured to discard from the memory the respective program state and data for checkpoints prior to a predetermined window running backward from the current point in time or current checkpoint, wherein the predetermined window is defined as a predetermined length of time prior to the current point in time or a predetermined number of checkpoints prior to the current checkpoint.

3. The processing system of claim 1, wherein the second subsystem is arranged to operate over a sequence of steps, wherein between the steps of the sequence the second program has a deterministic memory state; and wherein the points in time at which the checkpoints are placed are between the steps of said sequence.

4. The processing system of claim 1, wherein the one or more second processors are a plurality of second processors; wherein the second subsystem is arranged to operate over a sequence of Bulk Synchronous Parallel (BSP) supersteps of a BSP synchronization scheme for synchronizing between the plurality of second processors; and wherein the points in time at which the checkpoints are placed are between the BSP supersteps.

5. The processing system of claim 1, wherein the one or more second processors comprises a plurality of tiles; wherein the second subsystem is arranged to operate over a sequence of Bulk Synchronous Parallel (BSP) supersteps of a BSP synchronization scheme for synchronizing between the plurality of tiles; and wherein the points in time at which the checkpoints are placed are between the BSP supersteps.

6. The processing system of claim 1, wherein the first subsystem further comprises one or more storage devices and/or gateway processors; and the first program is configured to perform the sending by controlling at least one of the one or more storage devices and/or gateway processors to send the data to the second subsystem, and is further configured to control the one or more storage devices and/or gateway processors to send a copy of the data to the one or more first processors; the first program being arranged to perform the recording of the data by recording the copy received from the one or more storage devices and/or gateway processors, and to perform the re-sending by sending from the one or more first processors.

7. The processing system of claim 1, wherein: the second program is arranged to operate in a series of phases, wherein each phase comprises a respective one or more codelets; and the first program is configured to set each checkpoint between an end of a respective one of the phases and a start of a next phase in the series.

8. The processing system of claim 1, wherein the one or more second processors comprise a plurality of second processors, and a respective part of the second program is arranged to run on each of the second processors; the replaying of the second program comprising replaying at least the respective part of the second program arranged to run on at least one of the plurality of second processors.

9. The processing system of claim 8, wherein the first program is configured so as, upon detection of the replay event, to control the second subsystem to replay the second program across all of the second processors from the selected or most recent checkpoint.

10. The processing system of claim 8, wherein the first program is configured so as, upon detection of the replay event, to control the second subsystem to replay only the respective part or parts of the second program on a selected subset of one or more of the second processors from the selected or most recent checkpoint.

11. The processing system of claim 8, wherein: the respective part of the second program arranged to run on each second processor comprises one or more respective codelets, and the second program is arranged to operate in a series of phases with a barrier synchronization between at least two of the phases, the barrier synchronization preventing the second program advancing to a next phase until each of the one or more codelets on each of the plurality of second processors running codelets in a current phase have completed; and the first program is configured to set each checkpoint between a respective barrier synchronization and the next phase immediately following the respective barrier synchronization.

12. The processing system of claim 8, wherein each of the second processors comprises a plurality of tiles, each tile comprising a separate processing unit and memory, and each arranged to run a respective portion of the respective part of the second program.

13. The processing system of claim 12, wherein the second program is arranged to operate in a series of Bulk Synchronous Parallel (BSP) supersteps, each superstep comprising an exchange phase and a compute phase following the exchange phase, whereby in each superstep: in the compute phase the second processors perform only respective computations or internal exchanges between tiles but not exchanges between the second processors, and in the exchange phase the second processors exchange computation results between one another, wherein the compute phase is separated from the exchange phase of a next superstep by a barrier synchronization, whereby all the second processors must complete their respective computations of the compute phase before any of the second processors is allowed to proceed to the exchange phase of the next superstep, or on each second processor, in the compute phase the tiles on said each second processor perform only respective computations but not exchanges between tiles, and in the exchange phase the tiles on said each second processor exchange computation results between one another, wherein the compute phase is separated from the exchange phase of a next superstep by a barrier synchronization, whereby all the tiles on any given one of the second processors must complete their respective computations of the compute
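Two ideas from the dependent claims lend themselves to a short illustration: placing checkpoints at barrier synchronizations between phases (claims 7 and 11) and discarding checkpoints that fall outside a window of a predetermined number of checkpoints (claim 2). The sketch below is an assumption-laden toy, not the claimed system: Python threads stand in for tiles, the function name `run_phases`, its parameters, and the per-phase arithmetic are hypothetical, and the standard-library `threading.Barrier` `action` callback is used as the checkpoint hook because it runs once, after every thread has arrived and before any is released into the next phase.

```python
import threading
from collections import deque


def run_phases(num_tiles=4, num_phases=3, window=2):
    """Run a toy phased program; checkpoint at each barrier, keep `window` checkpoints."""
    results = [0] * num_tiles
    # A bounded deque drops the oldest checkpoint once the window is full.
    checkpoints = deque(maxlen=window)

    def take_checkpoint():
        # Called by exactly one thread once all tiles have reached the barrier,
        # so the snapshot sees every tile's completed work for the phase.
        checkpoints.append(list(results))

    barrier = threading.Barrier(num_tiles, action=take_checkpoint)

    def tile(tile_id):
        for phase in range(num_phases):
            results[tile_id] += tile_id + phase   # hypothetical compute step
            barrier.wait()                        # no tile starts the next phase early

    threads = [threading.Thread(target=tile, args=(i,)) for i in range(num_tiles)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return list(checkpoints)


if __name__ == "__main__":
    for snapshot in run_phases():
        print(snapshot)
```

Taking the snapshot inside the barrier callback is what makes the recorded state well defined here: no tile is mid-computation when the copy is made. The claims also allow a window defined by elapsed time, which this sketch does not model.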

Assignees

Graphcore Ltd

Inventors

Classifications

  • Checkpointing the instruction stream · CPC title

  • Barrier synchronisation · CPC title

  • for bus or memory accesses · CPC title

  • of specific synchronisation aspects · CPC title

  • by tracing the execution of the program · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11768735B2 cover?
A system comprising: a first subsystem comprising at least one first processor, and a second subsystem comprising one or more second processors. A first program is arranged to run on the at least one first processor, the first program being configured to send data from the first subsystem to the second subsystem. A second program is arranged to run on the one or more second processors, the second …
Who is the assignee on this patent?
Graphcore Ltd
What technology area does this patent fall under?
Primary CPC classification G06F11/1407. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 26, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).