US2021271536A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2021271536-A1 |
| Application number | US-202017133559-A |
| Country | US |
| Kind code | A1 |
| Filing date | Dec 23, 2020 |
| Priority date | Dec 23, 2020 |
| Publication date | Sep 2, 2021 |
| Grant date | — |
Official abstract text for this publication.
Algorithms for optimizing small message collectives with hardware supported triggered operations and associated methods, apparatus, and systems. The algorithms are implemented in a distributed compute environment comprising a plurality of ranks including a root, a plurality of intermediate nodes, and a plurality of leaf nodes, where each of the plurality of ranks comprises a compute platform having a communication interface including embedded logic for implementing the algorithms. Collectives are employed to transfer data between parent ranks and child ranks. In connection with the collectives, control messages are sent from children of a collective to the parent of the collective informing the parent that the children of the collective have free buffers ready to receive data. The parent employs a counter to determine that a control message has been received from each of its children indicating each child has a free buffer prior to sending data to the children in the collective.
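As a rough illustration of the counter-gated send described in the abstract, the following Python sketch simulates the behavior in software. It is not taken from the patent: the names `Rank`, `send_rtr`, `post_data`, `build_tree`, and `leaves` are hypothetical, and the real logic is described as embedded in the communication interface rather than running as host code. The sketch assumes a three-level tree and that each parent's counter threshold equals its number of children.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Rank:
    """One rank in the tree; models only the counter logic (hypothetical names)."""
    name: str
    parent: Optional["Rank"] = None
    children: List["Rank"] = field(default_factory=list)
    rtr_count: int = 0             # incremented once per RTR control message
    pending: Optional[str] = None  # data staged until the counter hits its threshold
    log: List[str] = field(default_factory=list)  # data this rank has received

    def add_child(self, child: "Rank") -> None:
        child.parent = self
        self.children.append(child)

    def send_rtr(self) -> None:
        """Child side: tell the parent a free buffer is ready to receive."""
        if self.parent is not None:
            self.parent.on_rtr()

    def on_rtr(self) -> None:
        """Parent side: count the RTR, then check whether the send can fire."""
        self.rtr_count += 1
        self._try_send()

    def post_data(self, data: str) -> None:
        """Stage data; it goes out only after every child has sent an RTR."""
        self.pending = data
        self._try_send()

    def _try_send(self) -> None:
        threshold = len(self.children)  # assumption: one RTR expected per child
        if self.pending is None or self.rtr_count < threshold:
            return
        data, self.pending = self.pending, None
        self.rtr_count = 0  # re-arm the counter for the next collective
        for child in self.children:
            child.receive(data)

    def receive(self, data: str) -> None:
        self.log.append(data)
        if self.children:
            # An intermediate node forwards the data in a second collective,
            # gated by its own counter over its child leaf nodes.
            self.post_data(data)


def build_tree(m: int, n: int) -> Rank:
    """Root with m intermediate nodes, each with n leaf nodes."""
    root = Rank("root")
    for i in range(m):
        mid = Rank(f"mid{i}")
        root.add_child(mid)
        for j in range(n):
            mid.add_child(Rank(f"leaf{i}.{j}"))
    return root


def leaves(root: Rank) -> List[Rank]:
    return [leaf for mid in root.children for leaf in mid.children]
```

In this sketch, calling `root.post_data(...)` before any RTR arrives delivers nothing; the data moves to the intermediate nodes only after each of them has called `send_rtr()`, and on to the leaves only after every leaf under a given intermediate node has done the same.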
Opening claim text (preview).
What is claimed is:
1. A method implemented in a distributed compute environment comprising a plurality of ranks including a root, a plurality of intermediate nodes, and a plurality of leaf nodes, each of the plurality of ranks comprising a process executing on a compute platform having a communication interface, the method comprising: employing collectives to transfer data between parent ranks and child ranks; sending control messages from children of a collective to the parent of the collective informing the parent that the children of the collective have free buffers ready to receive data; and for a collective, determining at the parent of the collective that a control message has been received from each child in the collective indicating the child has a free buffer to trigger sending data from the parent to the children in the collective, wherein the method is implemented in one or more communication interfaces.
2. The method of claim 1, further comprising: at a communication interface for multiple child intermediate nodes comprising children in a collective having the root as a parent, pre-posting one or more buffers; sending a Ready To Receive (RTR) message to the root indicating the intermediate node has a free buffer ready to receive data; at a communication interface for the root, detecting an RTR message has been received from each child intermediate node; and, in response thereto, sending data to each of the child intermediate nodes using a collective.
3. The method of claim 2, wherein the plurality of intermediate nodes that are children of the root is m, further comprising: at the communication interface for the root, pre-posting m receives; implementing a counter with a threshold set to m; incrementing the counter for each RTR message received from an intermediate node; when the counter=m, sending the data to the child intermediate nodes using the collective.
4. The method of claim 2, further comprising: at a communication interface for each a plurality of child leaf nodes that are children of an intermediate node that is a parent for a collective, pre-posting one or more memory buffers; sending an RTR message to the intermediate node indicating a buffer is available to receive data; at a communication interface for the intermediate node, detecting an RTR message has been received from each of the plurality of child leaf nodes; and, in response thereto, sending data to each of the plurality of child leaf nodes using the collective.
5. The method of claim 4, wherein the plurality of child leaf nodes comprises n leaf nodes, further comprising: at a communication interface for the intermediate node, posting n receives; implementing a counter with a threshold set to n; receiving data from the root as part of a first collective; incrementing the counter for each RTR message received from a child leaf node; when the counter=n, sending at least a portion of the data received from the root to the plurality of child leaf nodes as part of a second collective.
6. The method of claim 1, wherein the collectives are Message Passing Interface (MPI) collectives.
7. The method of claim 1, wherein following an initial transfer of data from the root to an intermediate node using a first collective, data are transferred from the root to the intermediate node using subsequent collectives under which control messages are sent from intermediate node to the root at the end of a collective, so that a next time a collective is called, the control messages do not appear on the critical path of the collective operation.
8. The method of claim 1, wherein following an initial transfer of data from a parent intermediate node to a plurality of child leaf nodes using a first collective, data are transferred from the intermediate nodes to the child nodes using subsequent collectives under which control messages are sent from child nodes to the intermediate node at the end of a collective, so that a next time a collective is called, the control messages do not appear on the critical path of the collective.
9. The method of claim 1, wherein the communication interface comprises one of a network adaptor, network interface controller, host fabric interface, or host controller adaptor.
10. A communication interface, configured to be implemented in an intermediate node in a distributed compute environment comprising a plurality of ranks including a root, a plurality of intermediate nodes including the intermediate node, and a plurality of leaf nodes, comprising: at least one input/output (I/O) port configured to be coupled to one of a network or fabric to which the root is coupled or configured to be coupled to a peer-to-peer link to which the root is coupled; memory; and embedded logic configured to: allocate a buffer in the memory; assign a counter to the buffer; and send a first Ready To Receive (RTR) message to the root indicating the intermediate node has a free buffer ready to receive data.
11. The communication interface of 10, wherein embedded logic is further configured to: receive data from the root; and copy the data to the free buffer, wherein the data received from the root is part of a first collective for which the root is a parent and the intermediate node is one of a plurality of children.
12. The communication interface of 11, wherein embedded logic is further configured to: using a second collective, send data copied to the free buffer from the root to one or more child nodes; and send a second RTR message to the root indicating the intermediate node has a free buffer ready to receive data.
13. The communication interface of 10, wherein the at least one network port is coupled to a network to which a plurality of child nodes are coupled or includes at least two ports coupled to respective child nodes via respective peer-to-peer links, and where the embedded logic is further configured to: for a collective for which the intermediate node is a parent and data is to be sent from the intermediate node to multiple child nodes that are children for the collective, detect child nodes that are children of the collective; and post a receive for each child node that is a child of the collective.
14. The communication interface of claim 13, wherein the embedded logic is further configured to: receive, from the child nodes, RTR messages indicating the child nodes have free buffers available to receive data; detect when an RTR message has been received from each node belonging to the collective; and use the collective to send data to each of the child nodes.
15. The communication interface of claim 10, wherein the wherein the communication interface comprises one of a network adaptor, network interface controller, host fabric interface, or host controller adaptor.
16. The network interface of claim 10, wherein the embedded logic comprises one or more of: firmware instructions executed on at least one embedded processor or processing element; one or more pre-programmed logic devices or circuitry; and one or more programmable logic devices or circuitry.
17. A system comprising a plurality of compute platforms coupled in communication in a distributed compute environment, each of the plurality of compute platforms executing one or more ranks and including a communication interface, the ranks including a root, a plurality of intermediate nodes, and a plurality of leaf nodes, wherein the communication interfaces are configured to: employ collectives to
Logical partitioning of resources; Management or configuration of virtualized resources (specific details on emulation or internal functioning of virtual machines G06F9/455) · CPC title
Message passing systems or structures, e.g. queues · CPC title
Buffers; Shared memory; Pipes · CPC title
Grid computing · CPC title