System, method and apparatus for improving the performance of collective operations in high performance computing

US9391845B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-9391845-B2
Application numberUS-201414495190-A
CountryUS
Kind codeB2
Filing dateSep 24, 2014
Priority dateSep 24, 2014
Publication dateJul 12, 2016
Grant dateJul 12, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

System, method, and apparatus for improving the performance of collective operations in High Performance Computing (HPC). Compute nodes in a networked HPC environment form collective groups to perform collective operations. A spanning tree is formed including the compute nodes and switches and links used to interconnect the compute nodes, wherein the spanning tree is configured such that there is only a single route between any pair of nodes in the tree. The compute nodes implement processes for performing the collective operations, which includes exchanging messages between processes executing on other compute nodes, wherein the messages contain indicia identifying collective operations they belong to. Each switch is configured to implement message forwarding operations for its portion of the spanning tree. Each of the nodes in the spanning tree implements a ratcheted cyclical state machine that is used for synchronizing collective operations, along with status messages that are exchanged between nodes. Transaction IDs are also used to detect out-of-order and lost messages.

First claim

Opening claim text (preview).

What is claimed is: 1. A method implemented in a high performance computing (HPC) environment including a plurality of compute nodes interconnected via a plurality of switches and links, the method comprising: identifying a set of compute nodes from among the plurality of compute nodes to participate in a collective group to perform collective operations; configuring a spanning tree comprising a plurality of nodes including the set of compute nodes and a set of switches interconnected by a plurality of links, the set of switches including edge switches connected to compute nodes and one or more levels of core switches including a core switch at the root of the spanning tree, wherein the spanning tree is configured such that each node in the spanning tree is enabled to communicate with each of the other nodes via a single respective specified route comprising at least one link segment; configuring each switch in the spanning tree to be aware of specified routes involving that switch and one or more message identifiers to be included in collective operation messages used to perform the collective operations; and at each switch, identifying collective operation messages and their destinations and forwarding the collective operations messages along link segments of the specified routes connected to that switch. 2. The method of claim 1 , further comprising employing the switches in the spanning tree to monitor the progress of collective operations and ensure the collective operations remain synchronized. 3. The method of claim 1 , further comprising: implementing a state machine at each node in the spanning tree; exchanging state machine status messages between the nodes in the spanning tree; and employing the state machine status messages and state machines to synchronize collective operations performed by the collective group. 4. The method of claim 3 , wherein the state machine at each node is implemented as a cyclical ratchet under which states may only advance one state at a time, the states including a first state, one or more middle states, and a last state, and wherein the state advances from the last state back to the first state. 5. The method of claim 4 , wherein the state machine states includes an idle state, a filling state, a full state, and an exiting state. 6. The method of claim 1 , further comprising: implementing a transaction Identifying (ID) mechanism at each node in the collective group; and employing transaction IDs to detect out-of-order or lost messages. 7. The method of claim 1 , wherein the collective operation comprises a Barrier operation. 8. The method of claim 1 , further comprising: initiating at least one application at each compute node in the collective group; identifying a master process at each compute node; and notifying a subnet manager (SM) that the compute node is joining a collective group. 9. The method of claim 1 , further comprising: initiating collective operations via individual processes at the compute nodes participating in the collective group; forwarding messages between processes running on the compute nodes via the switches in the spanning tree; detecting that the collective operation has been completed; and notifying the participating compute nodes the operation has been completed. 10. A system comprising: a plurality of compute nodes, each including a processor, at least one network port, and memory in which instructions are stored for implementing one or more processes for facilitating a collective operation; a plurality of switches, each including a plurality of network ports, each switch linked in communication with at least one other switch via a respective link; at least a portion of the switches linked in communication with a compute node, each switch including logic for implementing a forwarding table; a subnet manager (SM), having a processor, memory in which a subnet management application configured to be executed on the processor is stored, and a network port linked in communication with a switch; wherein execution of the instructions in the compute nodes and the subnet management application by the SM performs operations including, sending information from compute nodes to the SM notifying the SM that the compute nodes are joining a collective group; configuring, via the SM, a spanning tree comprising a plurality of nodes including the compute nodes in the collective group and a set of switches including edge switches connected to compute nodes and one or more levels of core switches including a core switch at the root of the spanning tree, wherein the spanning tree is configured such that each node in the spanning tree is enabled to communicate with each of the other nodes via a single respective specified route comprising at least one link segment; configuring each switch in the spanning tree to be aware of specified routes involving that switch and one or more message identifiers to be included in collective operation messages used to perform the collective operations; and at each switch, identifying collective operation messages and their destinations and forwarding the collective operations messages along link segments of the specified routes connected to that switch. 11. The system of claim 10 , wherein the switches are configured to monitor the progress of collective operations and ensure the collective operations remain synchronized. 12. The system of claim 10 , wherein each node in the spanning tree is configured to: implement a state machine; exchange state machine status messages with adjacent nodes; and wherein the nodes are configured to collectively employ the state machine status messages and state machines to synchronize collective operations performed by the collective group. 13. The system of claim 12 , wherein the state machine at each node is implemented as a cyclical ratchet under which states may only advance one state at a time, the states including a first state, one or more middle states, and a last state, and wherein the state advances from the last state back to the first state. 14. The system of claim 13 , wherein the state machine states includes an idle state, a filling state, a full state, and an exiting state. 15. The system of claim 10 , wherein each node in the spanning tree is configured to implement a transaction Identifying (ID) mechanism, and wherein the nodes collectively are configured to employ transaction IDs to detect out-of-order or lost messages. 16. The system of claim 10 , wherein each compute node in the spanning tree is configured to: initiate an application including one or more processes; identify a master process; and notify the (SM) that the compute node is joining a collective group. 17. The system of claim 10 , wherein the compute nodes in the spanning tree are configured to initiate collective operations via individual processes executing on the compute nodes and send messages to processes running on other computer nodes, and wherein the switches are configured to: forward messages between processes executing on the compute nodes; detect that the collective operation has been completed; and notify the participating compute nodes the collective operation has been completed. 18. An apparatus configured to be implemented as a subnet manager in a network environment including a plurality of compute nodes interconnected via a plurality of switches and links, the apparatus comprising: a processor; a network adapter, operatively coupled to the processor having at least one port; and memory in

Assignees

Inventors

Classifications

  • Standardised network management protocols, e.g. simple network management protocol [SNMP] · CPC title

  • comprising network management agents or mobile agents therefor · CPC title

  • H04L45/48Primary

    Routing tree calculation · CPC title

  • Learning-based routing, e.g. using neural networks or artificial intelligence · CPC title

  • Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs (mappping at compile time, see G06F8/451) · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9391845B2 cover?
System, method, and apparatus for improving the performance of collective operations in High Performance Computing (HPC). Compute nodes in a networked HPC environment form collective groups to perform collective operations. A spanning tree is formed including the compute nodes and switches and links used to interconnect the compute nodes, wherein the spanning tree is configured such that there …
Who is the assignee on this patent?
Heinz Michael, Rimmer Todd, Kunz James, and 2 more
What technology area does this patent fall under?
Primary CPC classification H04L45/48. Mapped technology areas include Electricity.
When was this patent published?
Publication date Tue Jul 12 2016 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).