What technology area does this patent fall under?

Primary CPC classification G06F9/30043. Mapped technology areas include Physics.

When was this patent published?

Publication date: May 4, 2017 (kind code A1). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Tightly coupled processor arrays using coarse grained reconfigurable architecture with iteration level commits

US2017123795A1 · US · A1

Patent metadata
Field	Value
Publication number	US-2017123795-A1
Application number	US-201514932672-A
Country	US
Kind code	A1
Filing date	Nov 4, 2015
Priority date	Nov 4, 2015
Publication date	May 4, 2017
Grant date	—

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An apparatus and method for supporting simultaneous multiple iterations (SMI) in a coarse grained reconfigurable architecture (CGRA). In support of SMI, the apparatus includes: hardware structures that connect all of multiple processing engines (PEs) to a load-store unit (LSU) configured to keep track of which compiled program code iterations have completed, which ones are in flight and which are yet to begin, and a control unit including hardware structures that are used to maintain synchronization and initiate and terminate loops within the PEs. SMI permits execution of the next instruction within any iteration (in flight). If instructions from multiple iterations are ready for execution (and are pre-decoded), then the hardware selects the lowest iteration number ready for execution. If in a particular clock cycle, a loop iteration with a lower iteration number is stalled (i.e., is waiting for data), the instruction from the next highest iteration number that is ready will be executed automatically, thereby allowing the CGRA to have high ILP by overlapping concurrent loop iterations.
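The selection rule in the abstract (run the lowest-numbered iteration whose next instruction is ready, and fall through to a younger iteration when an older one is stalled) can be illustrated with a small software model. This is only a sketch: the patent describes hardware, and the names `IterationSlot` and `select_iteration`, as well as the two readiness flags, are assumptions made for illustration rather than terms from the publication.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IterationSlot:
    """Hypothetical per-iteration state inside one PE."""
    iteration: int        # loop iteration number (lower means older)
    predecoded: bool      # next instruction has been pre-decoded
    operands_ready: bool  # its inputs are available (not waiting for data)

    @property
    def ready(self) -> bool:
        return self.predecoded and self.operands_ready


def select_iteration(in_flight: List[IterationSlot]) -> Optional[int]:
    """Pick the lowest-numbered ready iteration for this clock cycle.

    Returns None when every in-flight iteration is stalled.
    """
    ready = [slot.iteration for slot in in_flight if slot.ready]
    return min(ready) if ready else None
```

For example, with iterations 4, 5 and 6 in flight and iteration 4 waiting for data, `select_iteration` returns 5; once iteration 4 has its operands, it is chosen again ahead of the younger ones.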

First claim

Opening claim text (preview).

What is claimed is:
1. An apparatus comprising: a plurality of processing elements (PE), each element comprising a hardware device providing a runtime mechanism for executing program code instructions including a loop, each PE running multiple concurrent iterations of the same loop; and a load and storage unit (LSU) including multiple banks of load storage queues (LSQ) for storing load instructions and store instructions issued by the PEs associated with the multiple concurrent iterations of the same loop and enabling completion of iterations in order.
2. The apparatus of claim 1, wherein the plurality of hardware devices comprise an application specific integrated circuit (ASIC) or a Field-Programmable Gate Array.
3. The apparatus of claim 1, further comprising: an execution control unit (ECU) for synchronizing operations performed at each said PE and the LSU including tracking of the iterations that have completed, which iterations are already running, and which iterations are yet to begin, the ECU for communicating signals to and receiving signals from each PE and LSU to synchronize initiating and completing of said multiple concurrent iterations on all or a sub-set of the plurality of PEs.
4. The apparatus of claim 3, wherein each PE comprises: an instruction buffer for storing a plurality of instructions, each said buffer storing one or more instructions, the one or more instructions being re-used as a program executes loop iterations; and a program counter associated with each iteration, for selecting an instruction from said instruction buffer to run on the PE.
5. The apparatus of claim 3, wherein each PE comprises: structure to receive from a program code compiler device a plurality of instructions for storage in said instruction buffer, the instructions corresponding to a particular program code portion having concurrent iterative operations.
6. The apparatus of claim 3, wherein each PE comprises: a plurality of register files, each register file for storing temporary results of a computation or results of a load and/or store operation, wherein said plurality of register files comprises: local register files for storing variable data that is passed across a commit boundary, a commit boundary being defined a loop entry, iteration boundary, and loop exit points; and output register files for storing register data that is consumed within a commit boundary.
7. The apparatus of claim 6, wherein for a PE, said local register files are organized according to multiple logical banks, wherein one logical bank of said multiple banks is configured to hold data used by all iterations, n logical banks of said multiple banks configured to hold data for n concurrent iterations in flight, and one bank is configured for storing data of a last committed iteration, said PE implementing a rotating head pointer to point to a first bank associated with an oldest iteration of an innermost loop, wherein upon committing the oldest iteration, said PE starting a new iteration, and for a new iteration started, a new oldest iteration corresponds to a second bank, and moving said pointer to point to said second bank, wherein said pointer rotates to always point to a bank associated with an oldest iteration amongst said n logical banks.
8. The apparatus of claim 3, further comprising: communications structure for communicating signals between each PE and said ECU, wherein, a first communication comprises: signals for synchronizing operations performed at each said PE and the ECU including: a first LSYNC signal issued by one or more PEs to indicate to the ECU that a new loop is ready to begin execution in the PE; and a global GSYNC signal issued by the ECU to PEs when a loop execution may commence, wherein the loop to be executed is an inner loop, wherein each of the one or more PEs receive the GSYNC to begin execution, responsive to receiving an LSYNC signal from each PE.
9. The apparatus of claim 8, wherein said ECU includes a Global loop counter (GLCR) for maintaining values of a current loop count for each of said iterations in flight, said GLCR counter associated with each iteration and maintaining a state of a respective iteration of said multiple concurrent iterations, a PE communicating signals to said global loop counter to one or more of: indicate a beginning of an individual loop iteration; indicate an end of an individual loop iteration; and communicate loop related parameters, said loop related parameters including a specification of a maximum number of concurrent iterations that can be executed in parallel.
10. The apparatus of claim 8, wherein each of the PEs run operations associated with the loop, said ECU monitoring each of said PEs, wherein responsive to a PE reaching an end point of the loop for a given operation, each PE issuing a LCRINC signal for receipt at the ECU, said ECU receiving issued LCRINC signals from all PEs executing a loop iteration, said ECU responsively incrementing said GLCR counter by 1.
11. The apparatus of claim 8, wherein for a loop with data dependent loop exits, said PE signaling to the ECU that a loop execution terminates after a completion of a current iteration; and said ECU communicating a LOOPCOMPLETE signal to all PEs upon receipt of LCRINC signal from all PEs for the said iteration to indicate that the loop has been terminated; and the PEs resume executing a next instruction after finishing the loop.
12. The apparatus of claim 3, wherein the LSU comprises: a plurality of Iteration-interleaved load-store queues banks (LSQs), each LSQ bank receiving load/store instructions of an iteration executed on the PEs, and holding said instructions, said LSQ banks organized as a circular queue, an oldest iteration being held in the bank at a head of the queue; an iteration-aware arbiter assigning a received load/store instruction to an appropriate queue in a LSQ bank; wherein a PE assigns a unique load/store ID (LSID) to each load/store instruction of an iteration, with each load/store instruction having a dedicated slot in a given LSQ bank based on the LSID; and wherein the LSU is configured to: hold, for all iterations in flight, all stores for any one iteration until an iteration endpoint is reached; and release Loads/Stores of an iteration from the corresponding LSQ only when all the instructions of the iteration are complete.
13. The apparatus of claim 3, wherein said PE further selects an iteration independently of other PE, said PE further comprising: a pre-decoder circuit for predecoding a “next” instruction in the instruction buffer, the pre-decoding circuit determining an instruction type as one of a load operation or store operation or a computation operation, said pre-decoder circuit sending out input requests based on an instruction-type, wherein said pre-decoder performs one or more of: permitting execution of the next instruction within any selected iteration (in flight) if it has finished pre-decoding; determining if instructions from multiple iterations are ready for execution by being pre-decoded and each having inputs for these instructions located in associated operand buffers, and if determined, picking an oldest iteration for execution; and determining a stalled condition in a loop iteration with a lower iteration number in a particular clock cycle in which the iteration is waiting for data, and if determined, automatically executing the instruction from a next younger iteration.
14. A method for running multiple simultaneous instructions in a coarse grained reconfigurable architecture having a
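Claim 12 describes load-store queue banks arranged as a circular queue, one bank per in-flight iteration, with stores held in LSID-indexed slots until the iteration finishes and banks released from the head so that iterations complete in order. The toy model below sketches that behaviour for stores only. It is an illustration under stated assumptions, not the patented design: the class `IterationLSQ`, its method names, and the dictionary used as backing memory are all hypothetical, and loads, the arbiter, and dependence checking are left out.

```python
from collections import deque


class IterationLSQ:
    """Toy model: one LSQ bank per in-flight iteration, released in order."""

    def __init__(self, num_banks, slots_per_bank):
        self.num_banks = num_banks
        self.slots_per_bank = slots_per_bank
        self.banks = deque()  # leftmost bank holds the oldest iteration (queue head)
        self.memory = {}      # hypothetical backing store: address -> value

    def begin_iteration(self, iteration):
        if len(self.banks) >= self.num_banks:
            raise RuntimeError("no free LSQ bank; the oldest iteration must commit first")
        self.banks.append(
            {"iteration": iteration, "slots": [None] * self.slots_per_bank, "done": False}
        )

    def _bank(self, iteration):
        for bank in self.banks:
            if bank["iteration"] == iteration:
                return bank
        raise KeyError(iteration)

    def store(self, iteration, lsid, address, value):
        # Each store occupies the slot chosen by its LSID; it is held, not written.
        self._bank(iteration)["slots"][lsid] = (address, value)

    def end_iteration(self, iteration):
        self._bank(iteration)["done"] = True
        self._commit_in_order()

    def _commit_in_order(self):
        # Release banks only from the head, so iterations complete in order.
        while self.banks and self.banks[0]["done"]:
            bank = self.banks.popleft()
            for entry in bank["slots"]:
                if entry is not None:
                    address, value = entry
                    self.memory[address] = value
```

In this model, if iteration 1 finishes before iteration 0, its stores stay in the queue and `memory` is unchanged; when iteration 0 ends, both banks drain in iteration order, so a later iteration's store to the same address is the one that remains visible.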

Assignees

IBM

Inventors

Classifications

G06F9/3016
Decoding the operand specifier, e.g. specifier format · CPC title
G06F9/30043 (Primary)
LOAD or STORE instructions; Clear instruction · CPC title
G06F9/3836 (Primary)
Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution · CPC title
G06F9/3854
Instruction completion, e.g. retiring, committing or graduating · CPC title
G06F9/3851
from multiple instruction streams, e.g. multistreaming · CPC title

Patent family

Related publications grouped by family.

View patent family 58635438

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2017123795A1 cover?: An apparatus and method for supporting simultaneous multiple iterations (SMI) in a course grained reconfigurable architecture (CGRA). In support of SMI, the apparatus includes: Hardware structures that connect all of multiple processing engines (PEs) to a load-store unit (LSU) configured to keep track of which compiled program code iterations have completed, which ones are in flight and which a…
Who is the assignee on this patent?: IBM