Algorithms for optimizing small message collectives with hardware supported triggered operations

US2021271536A1 · US · A1

Patent metadata
Field               Value
Publication number  US-2021271536-A1
Application number  US-202017133559-A
Country             US
Kind code           A1
Filing date         Dec 23, 2020
Priority date       Dec 23, 2020
Publication date    Sep 2, 2021
Grant date

Abstract

Official abstract text for this publication.

Algorithms for optimizing small message collectives with hardware supported triggered operations and associated methods, apparatus, and systems. The algorithms are implemented in a distributed compute environment comprising a plurality of ranks including a root, a plurality of intermediate nodes, and a plurality of leaf nodes, where each of the plurality of ranks comprises a compute platform having a communication interface including embedded logic for implementing the algorithms. Collectives are employed to transfer data between parent ranks and child ranks. In connection with the collectives, control messages are sent from children of a collective to the parent of the collective informing the parent that the children of the collective have free buffers ready to receive data. The parent employs a counter to determine that a control message has been received from each of its children indicating each child has a free buffer prior to sending data to the children in the collective.
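The counter mechanism in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the patented implementation: the class `ParentInterface` and its methods are hypothetical names, the "network" is a Python list, and each child is assumed to send exactly one control message per collective.

```python
class ParentInterface:
    """Hypothetical model of a parent's communication interface.

    A queued send fires only after one control message has been
    counted for every child in the collective.
    """

    def __init__(self, children):
        self.children = list(children)
        self.threshold = len(self.children)  # one control message per child
        self.counter = 0                     # control messages counted so far
        self.pending = None                  # data waiting for the trigger
        self.sent = []                       # (child, data) pairs, stands in for the wire

    def post_triggered_send(self, data):
        """Queue data; it is sent once the counter reaches the threshold."""
        self.pending = data
        self._maybe_fire()

    def receive_control_message(self, child):
        """Count a 'free buffer ready' message from one child."""
        if child not in self.children:
            raise ValueError(f"{child!r} is not a child in this collective")
        self.counter += 1
        self._maybe_fire()

    def _maybe_fire(self):
        # Hold the data until every child has reported a free buffer.
        if self.pending is None or self.counter < self.threshold:
            return
        for child in self.children:
            self.sent.append((child, self.pending))
        self.pending = None
        self.counter -= self.threshold       # re-arm for the next collective
```

With three children, posting a send and then delivering control messages from only two of them leaves `sent` empty; the third message brings the counter to the threshold and the data goes to all three children at once.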

First claim

Opening claim text (preview).

What is claimed is:

1. A method implemented in a distributed compute environment comprising a plurality of ranks including a root, a plurality of intermediate nodes, and a plurality of leaf nodes, each of the plurality of ranks comprising a process executing on a compute platform having a communication interface, the method comprising: employing collectives to transfer data between parent ranks and child ranks; sending control messages from children of a collective to the parent of the collective informing the parent that the children of the collective have free buffers ready to receive data; and for a collective, determining at the parent of the collective that a control message has been received from each child in the collective indicating the child has a free buffer to trigger sending data from the parent to the children in the collective, wherein the method is implemented in one or more communication interfaces.

2. The method of claim 1, further comprising: at a communication interface for multiple child intermediate nodes comprising children in a collective having the root as a parent, pre-posting one or more buffers; sending a Ready To Receive (RTR) message to the root indicating the intermediate node has a free buffer ready to receive data; at a communication interface for the root, detecting an RTR message has been received from each child intermediate node; and, in response thereto, sending data to each of the child intermediate nodes using a collective.

3. The method of claim 2, wherein the plurality of intermediate nodes that are children of the root is m, further comprising: at the communication interface for the root, pre-posting m receives; implementing a counter with a threshold set to m; incrementing the counter for each RTR message received from an intermediate node; when the counter=m, sending the data to the child intermediate nodes using the collective.

4. The method of claim 2, further comprising: at a communication interface for each of a plurality of child leaf nodes that are children of an intermediate node that is a parent for a collective, pre-posting one or more memory buffers; sending an RTR message to the intermediate node indicating a buffer is available to receive data; at a communication interface for the intermediate node, detecting an RTR message has been received from each of the plurality of child leaf nodes; and, in response thereto, sending data to each of the plurality of child leaf nodes using the collective.

5. The method of claim 4, wherein the plurality of child leaf nodes comprises n leaf nodes, further comprising: at a communication interface for the intermediate node, posting n receives; implementing a counter with a threshold set to n; receiving data from the root as part of a first collective; incrementing the counter for each RTR message received from a child leaf node; when the counter=n, sending at least a portion of the data received from the root to the plurality of child leaf nodes as part of a second collective.

6. The method of claim 1, wherein the collectives are Message Passing Interface (MPI) collectives.

7. The method of claim 1, wherein following an initial transfer of data from the root to an intermediate node using a first collective, data are transferred from the root to the intermediate node using subsequent collectives under which control messages are sent from intermediate node to the root at the end of a collective, so that a next time a collective is called, the control messages do not appear on the critical path of the collective operation.

8. The method of claim 1, wherein following an initial transfer of data from a parent intermediate node to a plurality of child leaf nodes using a first collective, data are transferred from the intermediate nodes to the child nodes using subsequent collectives under which control messages are sent from child nodes to the intermediate node at the end of a collective, so that a next time a collective is called, the control messages do not appear on the critical path of the collective.

9. The method of claim 1, wherein the communication interface comprises one of a network adaptor, network interface controller, host fabric interface, or host controller adaptor.

10. A communication interface, configured to be implemented in an intermediate node in a distributed compute environment comprising a plurality of ranks including a root, a plurality of intermediate nodes including the intermediate node, and a plurality of leaf nodes, comprising: at least one input/output (I/O) port configured to be coupled to one of a network or fabric to which the root is coupled or configured to be coupled to a peer-to-peer link to which the root is coupled; memory; and embedded logic configured to: allocate a buffer in the memory; assign a counter to the buffer; and send a first Ready To Receive (RTR) message to the root indicating the intermediate node has a free buffer ready to receive data.

11. The communication interface of claim 10, wherein embedded logic is further configured to: receive data from the root; and copy the data to the free buffer, wherein the data received from the root is part of a first collective for which the root is a parent and the intermediate node is one of a plurality of children.

12. The communication interface of claim 11, wherein embedded logic is further configured to: using a second collective, send data copied to the free buffer from the root to one or more child nodes; and send a second RTR message to the root indicating the intermediate node has a free buffer ready to receive data.

13. The communication interface of claim 10, wherein the at least one network port is coupled to a network to which a plurality of child nodes are coupled or includes at least two ports coupled to respective child nodes via respective peer-to-peer links, and where the embedded logic is further configured to: for a collective for which the intermediate node is a parent and data is to be sent from the intermediate node to multiple child nodes that are children for the collective, detect child nodes that are children of the collective; and post a receive for each child node that is a child of the collective.

14. The communication interface of claim 13, wherein the embedded logic is further configured to: receive, from the child nodes, RTR messages indicating the child nodes have free buffers available to receive data; detect when an RTR message has been received from each node belonging to the collective; and use the collective to send data to each of the child nodes.

15. The communication interface of claim 10, wherein the communication interface comprises one of a network adaptor, network interface controller, host fabric interface, or host controller adaptor.

16. The network interface of claim 10, wherein the embedded logic comprises one or more of: firmware instructions executed on at least one embedded processor or processing element; one or more pre-programmed logic devices or circuitry; and one or more programmable logic devices or circuitry.

17. A system comprising a plurality of compute platforms coupled in communication in a distributed compute environment, each of the plurality of compute platforms executing one or more ranks and including a communication interface, the ranks including a root, a plurality of intermediate nodes, and a plurality of leaf nodes, wherein the communication interfaces are configured to: employ collectives to
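To make the flow in claims 2 through 8 easier to follow, here is a hedged, single-process sketch of a root, m intermediate nodes, and n leaf nodes per intermediate node. It is an illustration under assumptions, not the claimed hardware logic: `Rank`, `build_tree`, and `broadcast` are hypothetical names, message delivery is a direct method call, and each rank pre-posts a single buffer. Each rank sends its RTR at the end of a collective, so the next call finds the parent's counter already at its threshold.

```python
class Rank:
    """Hypothetical rank in a root / intermediate / leaf tree."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self
        self.counter = 0     # RTR messages counted from children
        self.buffer = None   # single pre-posted buffer (None means free)
        self.pending = None  # data held until the counter reaches the threshold
        self.received = []   # data this rank has received from its parent

    def send_rtr(self):
        """Tell the parent that this rank has a free buffer."""
        if self.parent is not None:
            self.parent.on_rtr()

    def on_rtr(self):
        self.counter += 1
        self.try_forward()

    def deliver(self, data):
        """Data from the parent lands in the pre-posted buffer."""
        assert self.buffer is None, "parent sent without a free buffer"
        self.buffer = data
        self.received.append(data)
        self.pending = data
        self.try_forward()

    def try_forward(self):
        threshold = len(self.children)
        if self.pending is None or self.counter < threshold:
            return                       # still waiting on data or on RTRs
        data, self.pending = self.pending, None
        self.counter -= threshold        # re-arm the counter
        for child in self.children:
            child.deliver(data)
        self.buffer = None               # buffer is free again
        self.send_rtr()                  # RTR sent at the end of the collective


def build_tree(m, n):
    """Build a root with m intermediates, each with n leaves, and send initial RTRs."""
    inters = [Rank(f"i{i}", [Rank(f"l{i}.{j}") for j in range(n)]) for i in range(m)]
    root = Rank("root", inters)
    for inter in inters:
        for leaf in inter.children:
            leaf.send_rtr()
        inter.send_rtr()
    return root


def broadcast(root, data):
    """Start a collective from the root."""
    root.pending = data
    root.try_forward()
```

After `root = build_tree(2, 2)`, calling `broadcast(root, "a")` and then `broadcast(root, "b")` delivers both payloads to every leaf. The second call does not wait for any control message, because the RTRs were already counted at the end of the first collective.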

Assignees

Intel Corp

Inventors

Classifications

  • G06F9/5077 (Primary)

    Logical partitioning of resources; Management or configuration of virtualized resources (specific details on emulation or internal functioning of virtual machines G06F9/455)

  • G06F9/546 (Primary)

    Message passing systems or structures, e.g. queues

  • Buffers; Shared memory; Pipes

  • Grid computing

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2021271536A1 cover?
Algorithms for optimizing small message collectives with hardware supported triggered operations and associated methods, apparatus, and systems. The algorithms are implemented in a distributed compute environment comprising a plurality of ranks including a root, a plurality of intermediate nodes, and a plurality of leaf nodes, where each of the plurality of ranks comprising a compute platform h…
Who is the assignee on this patent?
Intel Corp
What technology area does this patent fall under?
Primary CPC classification G06F9/5077. Mapped technology areas include Physics.
When was this patent published?
Publication date: Sep 2, 2021 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).