Multi-petascale highly efficient parallel supercomputer

US9971713B2 · US · B2

Patent metadata
Field                Value
Publication number   US-9971713-B2
Application number   US-201514701371-A
Country              US
Kind code            B2
Filing date          Apr 30, 2015
Priority date        Jan 8, 2010
Publication date     May 15, 2018
Grant date           May 15, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A Multi-Petascale Highly Efficient Parallel Supercomputer of 100 petaflop-scale includes node architectures based upon System-On-a-Chip technology, where each processing node comprises a single Application Specific Integrated Circuit (ASIC). The ASIC nodes are interconnected by a five dimensional torus network that optimally maximize the throughput of packet communications between nodes and minimize latency. The network implements collective network and a global asynchronous network that provides global barrier and notification functions. Integrated in the node design include a list-based prefetcher. The memory system implements transaction memory, thread level speculation, and multiversioning cache that improves soft error rate at the same time and supports DMA functionality allowing for parallel processing message-passing.
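To make the five-dimensional torus in the abstract concrete, the following is a minimal sketch of how neighbor links and hop counts work on a torus with wraparound in every dimension. The function names `torus_neighbors` and `torus_hops` and the 4x4x4x4x4 shape are hypothetical, chosen only for illustration; the patent does not specify this code or these dimensions.

```python
from typing import List, Tuple

Coord = Tuple[int, ...]


def torus_neighbors(coord: Coord, dims: Coord) -> List[Coord]:
    """Return the distinct nodes one hop away from `coord` on a torus of shape `dims`."""
    if len(coord) != len(dims):
        raise ValueError("coord and dims must have the same number of dimensions")
    neighbors: List[Coord] = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            hop = list(coord)
            # The modulo models the wraparound link that closes each ring.
            hop[axis] = (coord[axis] + step) % size
            candidate = tuple(hop)
            # Rings of size 1 or 2 would otherwise yield self-links or duplicates.
            if candidate != coord and candidate not in neighbors:
                neighbors.append(candidate)
    return neighbors


def torus_hops(a: Coord, b: Coord, dims: Coord) -> int:
    """Minimum hop count between `a` and `b`, taking the shorter way around each ring."""
    return sum(min((x - y) % s, (y - x) % s) for x, y, s in zip(a, b, dims))


if __name__ == "__main__":
    shape = (4, 4, 4, 4, 4)  # hypothetical partition shape, not from the patent
    origin = (0, 0, 0, 0, 0)
    print(len(torus_neighbors(origin, shape)))        # 10
    print(torus_hops(origin, (3, 2, 0, 0, 1), shape)) # 4
```

With n = 5 and every ring longer than two nodes, each node has 2n = 10 direct neighbors, and the wraparound links roughly halve the worst-case distance along each dimension compared with a mesh of the same shape.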

First claim

Opening claim text (preview).

The invention claimed is:

1. A parallel computing structure comprising: a plurality of processing nodes interconnected by multiple independent networks, each node including a plurality of processing elements for performing computation or communication activity as required when performing parallel algorithm operations, a first of said networks includes an n-dimensional torus network, n is an integer equal to or greater than 5, including communication links interconnecting said nodes for providing point-to-point and multicast packet communications among said nodes or independent partitioned subsets thereof; said n-dimensional torus network for enabling point-to-point, all-to-all, collective and global barrier and notification functions among said nodes or independent partitioned subsets thereof, wherein combinations of said networks interconnecting said nodes are collaboratively or independently utilized according to bandwidth and latency requirements of an algorithm for optimizing algorithm processing performance; wherein each said processing element is multi-way hardware threaded supporting transactional memory execution and thread level speculation, wherein said plurality of processing elements are configured to run speculative threads in parallel; a cache memory associated with each said processing element at each node, said associated cache memory including a second level (L2) cache supporting thread-level speculative operations (TLS), said TLS operations handling multiple versions of data, and a DMA (direct memory access) network interface for transferring data to/from a cache memory, said DMA interface enabling internode communications that overlap with computations running concurrently on the nodes, wherein a processing element retrieves data by issuing a command and passing the command to each of a stream prefetch engine and a list prefetch engine, the stream prefetch engine and the list prefetch engine for prefetching data to be needed in subsequent clock cycles in a hardware processor of a processing element in response to the passed command, wherein the stream prefetch engine and the list prefetch engine work simultaneously; and wherein the stream prefetch engine is configured to: store the addresses associated with prefetch requests that have been previously issued by the one or more simultaneously operating prefetch engines in a single prefetch data array; determine a slowest data or instruction stream and a fastest data or instruction stream, based on speeds of data or instruction streams processed by the hardware processor, wherein a fast data stream includes data which is requested by the hardware processor but not resident in said single prefetch data array; decrease a prefetching depth of the slowest data or instruction stream, the prefetching depth referring to a specific amount of data or instructions to be prefetched; increase the prefetching depth of the fastest data or instruction stream by the decreased prefetching depth of the slowest data or instruction stream.

2. The parallel computing structure as claimed in claim 1, wherein n is 5, said 5-D torus network is utilized to enable simultaneous computing and message communication activities among individual nodes and partitioned subsets of nodes according to bandwidth and latency requirements of an algorithm being performed.

3. The parallel computing structure as claimed in claim 2, wherein said 5-D network is utilized to enable simultaneous computing and message communication activities among individual nodes and independent parallel processing among one or more partitioned subsets of said plurality of nodes according to needs of a parallel algorithm.

4. The parallel computing structure as claimed in claim 3, wherein said 5-D network is utilized to enable dynamic switching between computing and message communication activities among individual nodes according to needs of a parallel algorithm.

5. The parallel computing structure as claimed in claim 1, further comprising a look-up engine for determining whether data requested in the command has been prefetched, said look-up engine comprising: a comparator for comparing an address in the command and addresses for which prefetch requests have been issued.

6. The parallel computing structure as claimed in claim 5, wherein the stream prefetch engine issues a load command for the requested data to a memory system in response to determining that the requested data has not been prefetched, wherein the stream prefetch engine and the list prefetch engine work simultaneously.

7. The parallel computing structure as claimed in claim 1, wherein each node includes 16 or more processing elements each capable of individually or simultaneously working on any combination of computation or communication activity as required when performing particular classes of parallel algorithms.

8. The parallel computing structure as claimed in claim 1, wherein each processing element (core) includes a central processing unit (CPU) and one or more floating point processing units, a processing node further comprising a local embedded multi-level cache memory, and said prefetch engines, each said prefetch engine incorporated into a lower level cache for prefetching data for a higher level cache, said prefetch engine performing list-based prefetches.

9. A scalable, parallel computing system comprising: a plurality of processing nodes interconnected by independent networks, each processing node including one or more processing elements, said elements including one or more processor cores, and a direct memory access (DMA) for performing computation or communication activity as required when performing parallel algorithm operations; a first independent network comprising an n-dimensional torus network, where n is an integer greater than or equal to 5, including communication links interconnecting said processing nodes in a manner optimized for providing point-to-point and multicast packet communications among said processing nodes or sub-sets of processing nodes of said network; a plurality of Input/Output (I/O) nodes, a second independent network including an external network connecting each I/O node to other processing nodes; wherein sub-sets of processing nodes are interconnected by divisible portions of said first and second networks for dynamically configuring one or more combinations of independent processing networks according to needs of one or more algorithms, wherein each of said configured independent processing networks is utilized to enable simultaneous collaborative processing for optimizing algorithm processing performance, and wherein each said processing element is multi-way hardware threaded supporting transactional memory execution and thread level speculation, wherein said plurality of processing elements are configured to run speculative threads in parallel, a cache memory associated with each said processing element at each node, said associated cache memory including a second level (L2) cache supporting thread-level speculative operations (TLS), said TLS operations handling multiple versions of data, and a DMA (direct memory access) network interface for transferring data to/from a cache memory, said DMA interface enabling internode communications that overlap with computations running concurrently on the nodes, wherein a processing element retrieves data by issuing a command and passing the command to each of a stream prefetch engine and a list prefetch engine, the stream prefetch engine and the list prefetch engine for prefetching data to be needed in subsequent clock cycles in the processor in response to the passed command, wherein the stream prefetch engine and the list prefetch engine work simultaneously; and
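The depth adjustment recited at the end of claim 1 (shrink the prefetching depth of the slowest stream and grow the fastest stream's depth by the same amount) can be sketched in software as follows. This is an illustrative model only, not the patented hardware: the function name `rebalance_depths`, the `step` parameter, the stream labels, and the use of a numeric "speed" per stream (for example, a count of recent demand misses) are all assumptions introduced here.

```python
from typing import Dict


def rebalance_depths(
    depths: Dict[str, int],
    speeds: Dict[str, float],
    step: int = 1,
) -> Dict[str, int]:
    """Move up to `step` units of prefetch depth from the slowest stream to the fastest.

    `depths` maps a stream id to its current prefetch depth.
    `speeds` maps the same ids to a hypothetical speed metric (higher is faster).
    The total depth across all streams is left unchanged.
    """
    if set(depths) != set(speeds):
        raise ValueError("depths and speeds must describe the same streams")
    if len(depths) < 2:
        return dict(depths)

    slowest = min(speeds, key=speeds.get)
    fastest = max(speeds, key=speeds.get)
    if slowest == fastest:
        # All streams advance at the same speed, so there is nothing to shift.
        return dict(depths)

    # Never take more depth than the slowest stream currently holds.
    moved = min(step, depths[slowest])
    updated = dict(depths)
    updated[slowest] -= moved
    updated[fastest] += moved
    return updated


if __name__ == "__main__":
    depths = {"a": 4, "b": 4, "c": 4}        # hypothetical starting depths
    speeds = {"a": 0.5, "b": 3.0, "c": 1.0}  # hypothetical speed metric
    print(rebalance_depths(depths, speeds))  # {'a': 3, 'b': 5, 'c': 4}
```

Because the amount added to the fastest stream equals the amount removed from the slowest, the combined depth stays constant, which matches the claim's framing of a fixed prefetch budget shared through a single prefetch data array.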

Assignees

Globalfoundries Inc

Inventors

Classifications

  • Architectures of general purpose stored program computers (with program plugboard G06F15/08; multicomputers G06F15/16) · CPC title

  • using associative or pseudo-associative address translation means, e.g. translation look-aside buffer [TLB] · CPC title

  • Using a prefetch buffer or dedicated prefetch cache · CPC title

  • History based prefetching · CPC title

  • with prefetch · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9971713B2 cover?
A Multi-Petascale Highly Efficient Parallel Supercomputer of 100 petaflop-scale includes node architectures based upon System-On-a-Chip technology, where each processing node comprises a single Application Specific Integrated Circuit (ASIC). The ASIC nodes are interconnected by a five dimensional torus network that optimally maximize the throughput of packet communications between nodes and min…
Who is the assignee on this patent?
Globalfoundries Inc
What technology area does this patent fall under?
Primary CPC classification G06F13/287. Mapped technology areas include Physics.
When was this patent published?
Publication date May 15, 2018 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).