Methods and apparatus for parallel processing
US-2016313991-A1 · Oct 27, 2016 · US
US2017123795A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2017123795-A1 |
| Application number | US-201514932672-A |
| Country | US |
| Kind code | A1 |
| Filing date | Nov 4, 2015 |
| Priority date | Nov 4, 2015 |
| Publication date | May 4, 2017 |
| Grant date | — |
Official abstract text for this publication.
An apparatus and method for supporting simultaneous multiple iterations (SMI) in a coarse-grained reconfigurable architecture (CGRA). In support of SMI, the apparatus includes: hardware structures that connect all of multiple processing engines (PEs) to a load-store unit (LSU) configured to keep track of which compiled program code iterations have completed, which ones are in flight and which are yet to begin; and a control unit including hardware structures that are used to maintain synchronization and initiate and terminate loops within the PEs. SMI permits execution of the next instruction within any iteration in flight. If instructions from multiple iterations are ready for execution (and are pre-decoded), then the hardware selects the lowest iteration number ready for execution. If, in a particular clock cycle, a loop iteration with a lower iteration number is stalled (i.e., is waiting for data), the instruction from the next highest iteration number that is ready is executed automatically, allowing the CGRA to achieve high instruction-level parallelism (ILP) by overlapping concurrent loop iterations.
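The selection rule in the abstract (run the lowest-numbered iteration whose next instruction is ready, and fall through to a later iteration when an earlier one is stalled) can be sketched in a few lines. This is only an illustrative software model, not the patented hardware: the `IterationSlot` type, its field names, and `select_iteration` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IterationSlot:
    """Hypothetical per-iteration state inside one PE (illustrative only)."""
    iteration: int        # loop iteration number
    predecoded: bool      # next instruction has finished pre-decoding
    operands_ready: bool  # inputs have arrived; False models a stall on data


def select_iteration(in_flight: List[IterationSlot]) -> Optional[int]:
    """Return the lowest iteration number that is ready, or None if all stall."""
    ready = [s.iteration for s in in_flight if s.predecoded and s.operands_ready]
    return min(ready) if ready else None


# Iteration 4 is waiting for data, so iteration 5 issues in this cycle.
slots = [
    IterationSlot(iteration=4, predecoded=True, operands_ready=False),
    IterationSlot(iteration=5, predecoded=True, operands_ready=True),
    IterationSlot(iteration=6, predecoded=True, operands_ready=True),
]
print(select_iteration(slots))  # 5
```

Under this model a stalled older iteration never blocks a younger ready one, which is the overlap of concurrent loop iterations that the abstract credits for high ILP.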
Opening claim text (preview).
What is claimed is:
1. An apparatus comprising: a plurality of processing elements (PEs), each element comprising a hardware device providing a runtime mechanism for executing program code instructions including a loop, each PE running multiple concurrent iterations of the same loop; and a load and storage unit (LSU) including multiple banks of load storage queues (LSQ) for storing load instructions and store instructions issued by the PEs associated with the multiple concurrent iterations of the same loop and enabling completion of iterations in order.
2. The apparatus of claim 1, wherein the plurality of hardware devices comprise an application specific integrated circuit (ASIC) or a Field-Programmable Gate Array.
3. The apparatus of claim 1, further comprising: an execution control unit (ECU) for synchronizing operations performed at each said PE and the LSU including tracking of the iterations that have completed, which iterations are already running, and which iterations are yet to begin, the ECU for communicating signals to and receiving signals from each PE and LSU to synchronize initiating and completing of said multiple concurrent iterations on all or a sub-set of the plurality of PEs.
4. The apparatus of claim 3, wherein each PE comprises: an instruction buffer for storing a plurality of instructions, each said buffer storing one or more instructions, the one or more instructions being re-used as a program executes loop iterations; and a program counter associated with each iteration, for selecting an instruction from said instruction buffer to run on the PE.
5. The apparatus of claim 3, wherein each PE comprises: structure to receive from a program code compiler device a plurality of instructions for storage in said instruction buffer, the instructions corresponding to a particular program code portion having concurrent iterative operations.
6. The apparatus of claim 3, wherein each PE comprises: a plurality of register files, each register file for storing temporary results of a computation or results of a load and/or store operation, wherein said plurality of register files comprises: local register files for storing variable data that is passed across a commit boundary, a commit boundary being defined as a loop entry, iteration boundary, and loop exit points; and output register files for storing register data that is consumed within a commit boundary.
7. The apparatus of claim 6, wherein for a PE, said local register files are organized according to multiple logical banks, wherein one logical bank of said multiple banks is configured to hold data used by all iterations, n logical banks of said multiple banks configured to hold data for n concurrent iterations in flight, and one bank is configured for storing data of a last committed iteration, said PE implementing a rotating head pointer to point to a first bank associated with an oldest iteration of an innermost loop, wherein upon committing the oldest iteration, said PE starting a new iteration, and for a new iteration started, a new oldest iteration corresponds to a second bank, and moving said pointer to point to said second bank, wherein said pointer rotates to always point to a bank associated with an oldest iteration amongst said n logical banks.
8. The apparatus of claim 3, further comprising: communications structure for communicating signals between each PE and said ECU, wherein a first communication comprises: signals for synchronizing operations performed at each said PE and the ECU including: a first LSYNC signal issued by one or more PEs to indicate to the ECU that a new loop is ready to begin execution in the PE; and a global GSYNC signal issued by the ECU to PEs when a loop execution may commence, wherein the loop to be executed is an inner loop, wherein each of the one or more PEs receives the GSYNC to begin execution, responsive to receiving an LSYNC signal from each PE.
9. The apparatus of claim 8, wherein said ECU includes a global loop counter (GLCR) for maintaining values of a current loop count for each of said iterations in flight, said GLCR counter associated with each iteration and maintaining a state of a respective iteration of said multiple concurrent iterations, a PE communicating signals to said global loop counter to one or more of: indicate a beginning of an individual loop iteration; indicate an end of an individual loop iteration; and communicate loop related parameters, said loop related parameters including a specification of a maximum number of concurrent iterations that can be executed in parallel.
10. The apparatus of claim 8, wherein each of the PEs runs operations associated with the loop, said ECU monitoring each of said PEs, wherein responsive to a PE reaching an end point of the loop for a given operation, each PE issuing an LCRINC signal for receipt at the ECU, said ECU receiving issued LCRINC signals from all PEs executing a loop iteration, said ECU responsively incrementing said GLCR counter by 1.
11. The apparatus of claim 8, wherein for a loop with data dependent loop exits, said PE signaling to the ECU that a loop execution terminates after a completion of a current iteration; and said ECU communicating a LOOPCOMPLETE signal to all PEs upon receipt of an LCRINC signal from all PEs for said iteration to indicate that the loop has been terminated; and the PEs resume executing a next instruction after finishing the loop.
12. The apparatus of claim 3, wherein the LSU comprises: a plurality of iteration-interleaved load-store queue (LSQ) banks, each LSQ bank receiving load/store instructions of an iteration executed on the PEs, and holding said instructions, said LSQ banks organized as a circular queue, an oldest iteration being held in the bank at a head of the queue; an iteration-aware arbiter assigning a received load/store instruction to an appropriate queue in an LSQ bank; wherein a PE assigns a unique load/store ID (LSID) to each load/store instruction of an iteration, with each load/store instruction having a dedicated slot in a given LSQ bank based on the LSID; and wherein the LSU is configured to: hold, for all iterations in flight, all stores for any one iteration until an iteration endpoint is reached; and release loads/stores of an iteration from the corresponding LSQ only when all the instructions of the iteration are complete.
13. The apparatus of claim 3, wherein said PE further selects an iteration independently of other PEs, said PE further comprising: a pre-decoder circuit for pre-decoding a “next” instruction in the instruction buffer, the pre-decoder circuit determining an instruction type as one of a load operation or store operation or a computation operation, said pre-decoder circuit sending out input requests based on an instruction type, wherein said pre-decoder performs one or more of: permitting execution of the next instruction within any selected iteration (in flight) if it has finished pre-decoding; determining if instructions from multiple iterations are ready for execution by being pre-decoded and each having inputs for these instructions located in associated operand buffers, and if determined, picking an oldest iteration for execution; and determining a stalled condition in a loop iteration with a lower iteration number in a particular clock cycle in which the iteration is waiting for data, and if determined, automatically executing the instruction from a next younger iteration.
14. A method for running multiple simultaneous instructions in a coarse-grained reconfigurable architecture having a
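Claims 1 and 12 describe LSQ banks organized as a circular queue, with the oldest iteration at the head, one slot per LSID, and release only once an iteration has finished. A minimal software sketch of that bookkeeping follows. It is a toy model under stated assumptions, not the claimed circuit: the `IterationLSQ` class, its method names, and the `iteration % num_banks` bank mapping are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple


class IterationLSQ:
    """Toy model: one bank per in-flight iteration, used as a circular queue."""

    def __init__(self, num_banks: int, first_iteration: int = 0) -> None:
        self.num_banks = num_banks
        self.head_iteration = first_iteration  # oldest iteration in flight
        # Each bank maps LSID -> (operation, address): one slot per LSID.
        self.banks: List[Dict[int, Tuple[str, int]]] = [{} for _ in range(num_banks)]
        self.done: List[bool] = [False] * num_banks

    def _bank(self, iteration: int) -> int:
        # Assumed mapping: iterations are interleaved across banks by modulo.
        if not self.head_iteration <= iteration < self.head_iteration + self.num_banks:
            raise ValueError("iteration is not in flight")
        return iteration % self.num_banks

    def issue(self, iteration: int, lsid: int, op: str, addr: int) -> None:
        """Hold a load/store in the slot reserved for its LSID."""
        self.banks[self._bank(iteration)][lsid] = (op, addr)

    def mark_complete(self, iteration: int) -> None:
        """Record that every instruction of this iteration has finished."""
        self.done[self._bank(iteration)] = True

    def commit(self) -> List[int]:
        """Release finished iterations from the head only, so commits stay in order."""
        committed = []
        while self.done[self.head_iteration % self.num_banks]:
            bank = self.head_iteration % self.num_banks
            self.banks[bank].clear()
            self.done[bank] = False
            committed.append(self.head_iteration)
            self.head_iteration += 1
        return committed


lsq = IterationLSQ(num_banks=3)
lsq.issue(iteration=0, lsid=0, op="store", addr=0x10)
lsq.issue(iteration=1, lsid=0, op="load", addr=0x20)
lsq.mark_complete(1)
print(lsq.commit())  # [] because iteration 0, at the head, is still running
lsq.mark_complete(0)
print(lsq.commit())  # [0, 1]
```

Iteration 1 finishes first but is not released until iteration 0 commits, which mirrors the in-order completion the claims require even though execution overlaps.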
Decoding the operand specifier, e.g. specifier format · CPC title
LOAD or STORE instructions; Clear instruction · CPC title
Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution · CPC title
Instruction completion, e.g. retiring, committing or graduating · CPC title
from multiple instruction streams, e.g. multistreaming · CPC title