Collective network for computer structures

US10069599B2 · US · B2

Patent metadata
Publication number: US-10069599-B2
Application number: US-201514972945-A
Country: US
Kind code: B2
Filing date: Dec 17, 2015
Priority date: Feb 25, 2002
Publication date: Sep 4, 2018
Grant date: Sep 4, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system and method for enabling high-speed, low-latency global collective communications among interconnected processing nodes. The global collective network optimally enables collective reduction operations to be performed during parallel algorithm operations executing in a computer structure having a plurality of the interconnected processing nodes. Router devices are included that interconnect the nodes of the network via links to facilitate performance of low-latency global processing operations at nodes of the virtual network and class structures. The global collective network may be configured to provide global barrier and interrupt functionality in asynchronous or synchronized manner. When implemented in a massively-parallel supercomputing structure, the global collective network is physically and logically partitionable according to needs of a processing algorithm.
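As a rough illustration of what a "collective reduction operation" means here, the sketch below combines one operand per node into a single result and then hands that result back to every node. The tree shape, the `CHILDREN` table, and the function names are assumptions made for the example; the abstract does not specify a topology or an API, and real hardware would perform the combining in the routers rather than in a recursive function.

```python
import operator

# Hypothetical topology: node -> list of child nodes; node 0 is the root.
CHILDREN = {0: [1, 2], 1: [3, 4], 2: [5, 6], 3: [], 4: [], 5: [], 6: []}


def reduce_up(node, operands, op):
    """Combine this node's operand with the reduced results of its subtrees."""
    result = operands[node]
    for child in CHILDREN[node]:
        result = op(result, reduce_up(child, operands, op))
    return result


def all_reduce(operands, op):
    """Reduce toward the root, then broadcast the result to every node."""
    total = reduce_up(0, operands, op)
    return {node: total for node in CHILDREN}


if __name__ == "__main__":
    values = {node: node + 1 for node in CHILDREN}   # operands 1..7
    print(all_reduce(values, operator.add)[3])       # every node sees the global sum
    print(all_reduce(values, max)[3])                # every node sees the global max
```

Any associative, commutative operator (sum, max, bitwise AND, and so on) can be passed as `op`; the order in which subtrees are combined then does not affect the result.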

First claim

Opening claim text (preview).

What is claimed is:

1. A method of correcting reduction operations in a presence of packet corruption in a multi-node computer network, the method comprising: a first node of the network sending an operand packet to a second node of the network over a link between the first and second nodes, said packet including one or more operands; the first node keeping a copy of the packet on a sending end of the link until the first node receives an acknowledgment from the second node that the packet was received by the second node without error; the second node transmitting the packet to one or more additional nodes of the network; the second node testing the packet to determine if the packet is error free; and if the second node determines that the packet is not error free, then the second node setting a flag in the packet to mark the packet as corrupt, transmitting the set flag in the packet to said one or more additional nodes, returning an acknowledgement to the first node which specifies that the packet was received with error, and requesting retransmission of the packet; returning the link to a known state, and the first node retransmitting the packet to the second node; and when the re-transmitted packet arrives at the second node, repeating a reduction computation utilizing stored copies of uncorrupted packets.

2. The method according to claim 1, wherein the nodes of the network are connected together by links, and a multitude of packet types are transmitted between the nodes of the network.

3. The method according to claim 2, wherein said multitude of packet types includes: data packets for transmitting data and control information between the nodes; empty packets to pass control information back from one of the nodes to another of the nodes when data packets are not available to said one of the nodes; and sync packets to maintain logic level transitions on each of the links when said each of the links is idle, and to pass control information back from one of the nodes to another of the nodes when data packets are not available to said one of the nodes.

4. A method according to claim 1, comprising the further steps of if the packet is corrupt, the second node and the one or more additional nodes discarding the packet.

5. A method according to claim 1, wherein the packet includes control information to communicate information between the first and second nodes.

6. The method according to claim 5, wherein said control information includes information about internal states of the first and second nodes.

7. A method according to claim 6, wherein the control information includes an acknowledgement to specify whether a packet transmitted previously was received correctly or not.

8. A method according to claim 7, comprising the further step of the second node using said acknowledgement to request the retransmission of the packet.

9. The method according to claim 1, further comprising the second node performing an operation using said packet and sending a result of the operation in a further packet to a third node.

10. The method according to claim 9, wherein when the second node determines that the packet has an error, the second node sending a flag to the third node indicating that the further packet has an error.

11. A system for correcting reduction operations in a presence of packet corruption in a multi-node computer network, the system comprising: a first node of the network for sending an operand packet to a second node of the network over a link between the first and second nodes, said packet including one or more operands; the first node keeping a copy of the packet on a sending end of the link until the first node receives an acknowledgment from the second node that the packet was received by the second node without error; the second node transmitting the packet to one or more additional nodes of the network; the second node testing the packet to determine if the packet is error free; and if the second node determines that the packet is not error free, then the second node setting a flag in the packet to mark the packet as corrupt, transmitting the set flag in the packet to said one or more additional nodes, returning an acknowledgement to the first node which specifies that the packet was received with error, and requesting retransmission of the packet; the first node retransmitting the packet to the second node; and when the re-transmitted packet arrives at the second node, the second node repeating a reduction computation utilizing stored copies of uncorrupted packets.

12. The system according to claim 11, wherein a multitude of packet types are transmitted between the first and second nodes.

13. The system according to claim 12, wherein said multitude of packet types includes: data packets for transmitting data and control information between the nodes; empty packets to pass control information back from one of the nodes to another of the nodes when data packets are not available to said one of the nodes; and sync packets to maintain logic level transitions on each of the links when said each of the links is idle, and to pass control information back from one of the nodes to another of the nodes when data packets are not available to said one of the nodes.

14. The system according to claim 11, wherein the second node performs an operation using said packet and sends a result of the operation in a further packet to a third node.

15. The method according to claim 14, wherein when the second node determines that the packet has an error, the second node sends a flag to the third node indicating that the further packet has an error.

16. A computer readable program storage device readable by a computer, tangibly embodying a program of instructions executable by the computer to perform method steps for correcting reduction operations in a presence of packet corruption in a multi-node computer network, the method steps comprising: a first node of the network sending an operand packet to a second node of the network over a link between the first and second nodes, said packet including one or more operands; the first node keeping a copy of the packet on a sending end of the link until the first node receives an acknowledgment from the second node that the packet was received by the second node without error; the second node transmitting the packet to one or more additional nodes of the network; the second node testing the packet to determine if the packet is error free; and if the second node determines that the packet is not error free, then the second node setting a flag in the packet to mark the packet as corrupt, transmitting the set flag in the packet to said one or more additional nodes, returning an acknowledgement to the first node which specifies that the packet was received with error, and requesting retransmission of the packet; the first node retransmitting the packet to the second node; and when the re-transmitted packet arrives at the second node, the second node repeating a reduction computation utilizing stored copies of uncorrupted packets.

17. The computer readable program storage device according to claim 16, wherein a multitude of packet types are transmitted between the first and second nodes.

18. The computer readable program storage device according to claim 17, wherein said multitude of packet types includes: data packets for transmitting data and control information between the nodes; empty packets to pass control information back from one of the nodes to another of the nodes when data pack
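The error-handling steps recited in claim 1 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `Packet`, `Sender`, and `Receiver` names are hypothetical, CRC-32 via `zlib.crc32` stands in for whatever error test the second node applies, integer addition stands in for the reduction, and the "returning the link to a known state" step is not modeled.

```python
import zlib
from dataclasses import dataclass, replace


@dataclass
class Packet:
    seq: int
    operands: tuple
    crc: int               # checksum computed by the sender (assumed CRC-32)
    corrupt: bool = False  # flag set by the receiver and forwarded downstream


def crc_of(operands):
    return zlib.crc32(repr(tuple(operands)).encode())


class Sender:
    """First node: keeps a copy of each packet until it is acknowledged clean."""

    def __init__(self):
        self.kept = {}

    def send(self, seq, operands):
        pkt = Packet(seq, tuple(operands), crc_of(operands))
        self.kept[seq] = pkt
        return replace(pkt)  # a separate copy travels over the link

    def on_ack(self, seq, received_ok):
        if received_ok:
            self.kept.pop(seq, None)   # clean receipt: the copy can be dropped
            return None
        return replace(self.kept[seq])  # error reported: retransmit the kept copy


class Receiver:
    """Second node: tests, flags, forwards, and reduces over clean packets."""

    def __init__(self):
        self.stored = {}      # link name -> uncorrupted packet
        self.downstream = []  # packets passed on to additional nodes

    def receive(self, link, pkt):
        ok = crc_of(pkt.operands) == pkt.crc
        pkt.corrupt = not ok          # mark a bad packet before forwarding it
        self.downstream.append(pkt)
        if ok:
            self.stored[link] = pkt
        return ok                     # acknowledgement; False requests a resend

    def reduce(self):
        return sum(sum(p.operands) for p in self.stored.values())


def demo():
    a, b, rx = Sender(), Sender(), Receiver()
    a.on_ack(0, rx.receive("a", a.send(0, [1, 2])))        # clean packet on link a
    damaged = replace(b.send(0, [3, 4]), operands=(3, 5))  # simulated link error
    ok = rx.receive("b", damaged)                          # fails the CRC test
    flagged = rx.downstream[-1].corrupt
    retry = b.on_ack(0, ok)                                # negative ack -> resend
    b.on_ack(0, rx.receive("b", retry))
    return ok, flagged, rx.reduce()


if __name__ == "__main__":
    print(demo())  # (False, True, 10)
```

Because the clean packet from link `a` stays in `stored`, only the damaged input has to cross its link again before the reduction is repeated.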

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10069599B2 cover?
A system and method for enabling high-speed, low-latency global collective communications among interconnected processing nodes. The global collective network optimally enables collective reduction operations to be performed during parallel algorithm operations executing in a computer structure having a plurality of the interconnected processing nodes. Router devices are included that interconn…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification H03M13/09. Mapped technology areas include Electricity.
When was this patent published?
Publication date Sep 4, 2018 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).