Multi-petascale highly efficient parallel supercomputer

US2016011996A1 · US · A1

Patent metadata
Publication number: US-2016011996-A1
Application number: US-201514701371-A
Country: US
Kind code: A1
Filing date: Apr 30, 2015
Priority date: Jan 8, 2010
Publication date: Jan 14, 2016
Grant date: (not listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A Multi-Petascale Highly Efficient Parallel Supercomputer of 100 petaflop-scale includes node architectures based upon System-On-a-Chip technology, where each processing node comprises a single Application Specific Integrated Circuit (ASIC). The ASIC nodes are interconnected by a five dimensional torus network that optimally maximize the throughput of packet communications between nodes and minimize latency. The network implements collective network and a global asynchronous network that provides global barrier and notification functions. Integrated in the node design include a list-based prefetcher. The memory system implements transaction memory, thread level speculation, and multiversioning cache that improves soft error rate at the same time and supports DMA functionality allowing for parallel processing message-passing.
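
To make the five-dimensional torus described in the abstract concrete, the following is a minimal sketch of how neighbor addressing and hop distance work in a torus with wraparound links. It is an illustration only: the function names and the extents `(4, 4, 4, 4, 2)` are hypothetical and do not come from the patent, which does not state dimension sizes in the text shown here.

```python
from typing import List, Tuple

Coord = Tuple[int, ...]


def torus_neighbors(coord: Coord, dims: Coord) -> List[Coord]:
    """Return the distinct direct neighbors of `coord` in a torus.

    Each axis contributes a -1 and a +1 neighbor, with wraparound at the
    edges. Duplicates (possible when an axis has extent 2) and the node
    itself (extent 1) are dropped.
    """
    if len(coord) != len(dims):
        raise ValueError("coord and dims must have the same number of axes")
    neighbors: List[Coord] = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            moved = list(coord)
            moved[axis] = (coord[axis] + step) % size  # wraparound link
            candidate = tuple(moved)
            if candidate != coord and candidate not in neighbors:
                neighbors.append(candidate)
    return neighbors


def torus_hops(a: Coord, b: Coord, dims: Coord) -> int:
    """Minimal hop count between two nodes, taking the shorter way
    around the ring on every axis."""
    return sum(min((x - y) % s, (y - x) % s) for x, y, s in zip(a, b, dims))


if __name__ == "__main__":
    dims = (4, 4, 4, 4, 2)  # hypothetical extents, chosen for illustration
    origin = (0, 0, 0, 0, 0)
    print(torus_neighbors(origin, dims))
    print(torus_hops(origin, (3, 2, 0, 0, 1), dims))
```

The wraparound is what distinguishes a torus from a mesh: a node at coordinate 0 on an axis is one hop from the node at the last coordinate on that axis, which shortens worst-case paths.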

First claim

Opening claim text (preview).

1. A massively parallel computing structure comprising: a plurality of processing nodes interconnected by multiple independent networks, each node including a plurality of processing elements for performing computation or communication activity as required when performing parallel algorithm operations, a first of said networks includes an n-dimensional torus network, n is an integer equal to or greater than 5, including communication links interconnecting said nodes for providing high-speed, low latency point-to-point and multicast packet communications among said nodes or independent partitioned subsets thereof; said n-dimensional torus network for enabling point-to-point, all-to-all, collective (broadcast, reduce) and global barrier and notification functions among said nodes or independent partitioned subsets thereof, wherein combinations of said networks interconnecting said nodes are collaboratively or independently utilized according to bandwidth and latency requirements of an algorithm for optimizing algorithm processing performance; wherein each said processing element is multi-way hardware threaded supporting transactional memory execution and thread level speculation, wherein said plurality of processing elements are configured to run speculative threads in parallel; a cache memory associated with each said processing element at each node, said associated cache memory including a second level (L2) cache supporting thread-level speculative operations (TLS), said TLS operations handling multiple versions of data, and a DMA (direct memory access) network interface for transferring data to/from a cache memory, said DMA interface enabling internode communications that overlap with computations running concurrently on the nodes, wherein a processing element retrieves data by issuing a command and passing the command to each of a stream prefetch engine and a list prefetch engine, the stream prefetch engine and the list prefetch engine for prefetching data to be needed in subsequent clock cycles in the processor in response to the passed command.

2. The massively parallel computing structure as claimed in claim 1, wherein n is 5, said 5-D torus network is utilized to enable simultaneous computing and message communication activities among individual nodes and partitioned subsets of nodes according to bandwidth and latency requirements of an algorithm being performed.

3. The massively parallel computing structure as claimed in claim 2, wherein said 5-D network is utilized to enable simultaneous computing and message communication activities among individual nodes and independent parallel processing among one or more partitioned subsets of said plurality of nodes according to needs of a parallel algorithm.

4. The massively parallel computing structure as claimed in claim 3, wherein said 5-D network is utilized to enable dynamic switching between computing and message communication activities among individual nodes according to needs of a parallel algorithm.

5. The massively parallel computing structure as claimed in claim 3, wherein the stream prefetch engine is configured to: determine a slowest data or instruction stream and a fastest data or instruction stream, based on speeds of data or instruction streams processed by the processor; decrease a prefetching depth of the slowest data or instruction stream, the prefetching depth referring to a specific amount of data or instructions to be prefetched; and increase the prefetching depth of the fastest data or instruction stream by the decreased prefetching depth of the slowest data or instruction stream.

6. The massively parallel computing structure as claimed in claim 5, further comprising a look-up engine for determining whether data requested in the command has been prefetched, said look-up engine comprising: a comparator for comparing an address in the command and addresses for which prefetch requests have been issued.

7. The massively parallel computing structure as claimed in claim 6, wherein the stream prefetch engine issues a load command for the requested data to a memory system in response to determining that the requested data has not been prefetched, wherein the stream prefetch engine and the list prefetch engine work simultaneously.

8. The massively parallel computing structure as claimed in claim 3, further comprising: a messaging system associated with a node, said messaging system comprising: a plurality of network transmit devices for transmitting message packets over a network; a network injection queue associated with a network transmit device, each said network injection queue adapted to buffer a packet to be transmitted; injection control unit for receiving and processing requests from processor units at a node for transmitting messages over a network via one or more network transmit devices; a plurality of parallel distributed injection messaging engine units (iMEs) each providing a multi-channel DMA function, each injection messaging engine unit operatively connected with said injection control unit and configured to read data in said associated memory system via said interconnect device, and forming a packet belonging to said message, said packet including a packet header and said read data, an interconnect interface device having one or more ports for coupling each injection message engine unit of said distributed plurality with said interconnect device, each port adapted for forwarding data content read from specified locations in associated memory system to at least one requesting injection messaging engine unit in parallel, said associated memory system including a plurality of injection memory buffers, each injection memory buffer adapted to receive, from a processor, a descriptor associated with a message to be transmitted over a network, said descriptor including a specified target address having said data to be included in said message, one of said injection messaging engine units accessing said descriptor data for reading said data to be included in said message from said memory system, wherein a network transmit device provides a signal to indicate to a corresponding said injection messaging engine unit whether or not there is space in a corresponding network injection queue for writing packet data to the network injection queue, wherein, at said node, two or more packets associated with two or more different messages may be simultaneously formed by a respective two or more injection messaging engine units, in parallel, for simultaneous transmission over said network.

9. The massively parallel computing structure as claimed in claim 8, wherein said messaging system further comprises: a plurality of receiver devices for receiving message packets from a network, a network reception queue associated with a receiver device, each network reception queue adapted to buffer said received packet, a reception control unit for receiving information from a processor at a node for handling of packets received over a network; and, a plurality of parallel distributed reception messaging engine units (rMEs) each providing a multi-channel direct memory access (DMA) function, a reception messaging engine unit operatively connected with the reception control unit, said reception messaging engine unit initiates transfer of the received packet directly to a location in the associated memory system, wherein each associated reception message engine unit is coupled with an interconnect device having ports adapted for providing a connection to said interconnect device,

10. The massively parallel computing structure as claimed in claim 8, wherein said messaging system transfers blocks via one or more switch master por
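
The prefetch-depth adjustment recited in claim 5 can be sketched in software as follows. This is a hedged illustration, not the claimed hardware: the function name `rebalance_depths`, the `step` parameter, and the use of a per-stream numeric speed are assumptions, since the claim text does not say how stream speed is measured or how much depth is moved at a time.

```python
from typing import Dict


def rebalance_depths(
    speeds: Dict[str, float],
    depths: Dict[str, int],
    step: int = 1,  # hypothetical: amount of depth moved per adjustment
) -> Dict[str, int]:
    """Shift prefetch depth from the slowest stream to the fastest one.

    The total depth across all streams is conserved: whatever is taken
    from the slowest stream is added to the fastest stream.
    """
    if set(speeds) != set(depths):
        raise ValueError("speeds and depths must describe the same streams")
    new_depths = dict(depths)
    if len(speeds) < 2:
        return new_depths  # nothing to rebalance

    slowest = min(speeds, key=speeds.get)
    fastest = max(speeds, key=speeds.get)
    if slowest == fastest:
        return new_depths  # all streams run at the same speed

    # Never drive a stream's depth below zero.
    moved = min(step, new_depths[slowest])
    new_depths[slowest] -= moved
    new_depths[fastest] += moved
    return new_depths


if __name__ == "__main__":
    speeds = {"a": 1.0, "b": 5.0, "c": 3.0}  # illustrative values only
    depths = {"a": 4, "b": 4, "c": 4}
    print(rebalance_depths(speeds, depths, step=2))
```

Conserving the total depth reflects the wording "by the decreased prefetching depth of the slowest data or instruction stream": the fastest stream gains exactly what the slowest one gives up, so the overall prefetch budget stays fixed.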

Assignees

Inventors

Classifications

  • with prefetch · CPC title

  • using pseudo-associative means, e.g. set-associative or hashing · CPC title

  • Details relating to cache prefetching · CPC title

  • using a cache · CPC title

  • G06F13/287 (Primary)

    Multiplexed DMA (G06F13/30 takes precedence) · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2016011996A1 cover?
A Multi-Petascale Highly Efficient Parallel Supercomputer of 100 petaflop-scale includes node architectures based upon System-On-a-Chip technology, where each processing node comprises a single Application Specific Integrated Circuit (ASIC). The ASIC nodes are interconnected by a five dimensional torus network that optimally maximize the throughput of packet communications between nodes and min…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F13/287. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 14, 2016 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).