Tightly coupled processor arrays using coarse grained reconfigurable architecture with iteration level commits
US-10120685-B2 · Nov 6, 2018 · US
US2017123794A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2017123794-A1 |
| Application number | US-201514932629-A |
| Country | US |
| Kind code | A1 |
| Filing date | Nov 4, 2015 |
| Priority date | Nov 4, 2015 |
| Publication date | May 4, 2017 |
| Grant date | — |
Official abstract text for this publication.
An apparatus and method for supporting simultaneous multiple iterations (SMI) and iteration level commits (ILC) in a coarse grained reconfigurable architecture (CGRA). The apparatus includes: hardware structures that connect all of multiple processing engines (PEs) to a load-store unit (LSU) configured to keep track of which compiled program code iterations have completed, which ones are in flight and which are yet to begin; and a control unit including hardware structures that are used to maintain synchronization and to initiate and terminate loops within the PEs. The processing elements, LSU and control unit are configured to commit instructions, and to save and restore context, at loop iteration boundaries. In doing so, the apparatus tracks and buffers the state of in-flight iterations, and detects conditions that prevent an iteration from completing. In support of ILC functions, the LSU is iteration aware and includes: iteration-interleaved LSQ banks; a Bloom filter for filtering instructions; and a load coalescing buffer.
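The abstract's combination of iteration-interleaved LSQ banks, a Bloom filter, and iteration level commits can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the patented hardware: the names `LsqBank`, `LoadStoreUnit`, `NUM_FILTER_BITS`, and the two hash functions are hypothetical, one bank is assumed per in-flight iteration, and a load simply forwards a buffered value from its own or an earlier iteration instead of waiting or triggering a flush.

```python
NUM_FILTER_BITS = 64  # hypothetical Bloom filter width


def _hash_positions(address):
    """Two illustrative hash functions mapping an address to filter bit positions."""
    return (address % NUM_FILTER_BITS,
            (address // NUM_FILTER_BITS * 7 + 3) % NUM_FILTER_BITS)


class LsqBank:
    """Buffers the stores of one in-flight iteration behind a Bloom filter."""

    def __init__(self, iteration):
        self.iteration = iteration
        self.filter_bits = 0
        self.stores = {}  # address -> buffered value

    def buffer_store(self, address, value):
        for pos in _hash_positions(address):
            self.filter_bits |= 1 << pos
        self.stores[address] = value

    def find_store(self, address):
        """Return (hit, value); the full search runs only on a filter match."""
        if not all((self.filter_bits >> pos) & 1 for pos in _hash_positions(address)):
            return False, None
        if address in self.stores:
            return True, self.stores[address]
        return False, None  # Bloom filter false positive


class LoadStoreUnit:
    """Assumed model: iteration i uses bank i % num_banks; stores drain at commit."""

    def __init__(self, memory, num_banks=4):
        self.memory = memory
        self.banks = [None] * num_banks
        self.oldest = 0        # oldest iteration not yet committed
        self.finished = set()  # iterations that completed but have not committed

    def _slot(self, iteration):
        if not self.oldest <= iteration < self.oldest + len(self.banks):
            raise ValueError("iteration is outside the in-flight window")
        return iteration % len(self.banks)

    def store(self, iteration, address, value):
        slot = self._slot(iteration)
        if self.banks[slot] is None:
            self.banks[slot] = LsqBank(iteration)
        self.banks[slot].buffer_store(address, value)

    def load(self, iteration, address):
        self._slot(iteration)  # validate the iteration ID
        # Search this iteration first, then earlier ones, youngest to oldest.
        for it in range(iteration, self.oldest - 1, -1):
            bank = self.banks[it % len(self.banks)]
            if bank is not None:
                hit, value = bank.find_store(address)
                if hit:
                    return value
        return self.memory.get(address, 0)

    def complete(self, iteration):
        """Mark an iteration finished and commit every iteration that is now in order."""
        self._slot(iteration)
        self.finished.add(iteration)
        committed = []
        while self.oldest in self.finished:
            slot = self.oldest % len(self.banks)
            bank = self.banks[slot]
            if bank is not None:
                self.memory.update(bank.stores)  # stores reach memory only at commit
                self.banks[slot] = None
            self.finished.remove(self.oldest)
            committed.append(self.oldest)
            self.oldest += 1
        return committed
```

In this model, an iteration that finishes out of order leaves memory untouched until every earlier iteration has also finished, which is the in-order completion behavior the abstract attributes to ILC.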
Opening claim text (preview).
What is claimed is:
1. An apparatus comprising: a plurality of processing elements (PE), each element employing hardware providing a runtime mechanism for executing program code instructions including a loop, each PE running multiple concurrent iterations of the same loop; a load and storage unit (LSU) including multiple banks of load storage queues (LSQ) for storing load instructions and store instructions associated with the multiple concurrent iterations and enabling completion of iterations in order; an execution control unit (ECU) for synchronizing operations performed at each said PE and the LSU including tracking of the iterations that have completed, which iterations are already running, and which iterations are yet to begin, the ECU for communicating signals to and receiving signals from each PE and LSU to synchronize initiating and completing of said multiple concurrent iterations on all or a sub-set of the plurality of PEs, such that all instructions are committed at loop iteration boundaries.
2. The apparatus of claim 1, wherein to track iterations, said LSU comprises: a filter device for tracking all the in-flight instruction's memory address in the LSQ bank, searching, for each memory address of an in-flight instruction, all elements of all load and store queues in parallel, and determining a memory dependency of all in-flight memory instructions across different LSQ banks; and a buffer device accessible by said plurality of said LSQ banks that enable store forwarding to a load instruction by collecting data for a load instruction upon determining multiple dependent store instructions across iterations and/or memory that contribute to the data requested by a load instruction.
3. The apparatus of claim 2, wherein to track iterations, a PE issues an associated LD/ST identifier for a respective LOAD DATA/STORE DATA instruction of an iteration received at the LSU, each load/store instruction having a dedicated storage slot in a given LSQ bank based on the LSID, said LD/ST identifier for keeping track of the issued LD request or ST request; and the PE issuing an associated iteration ID field for each iteration of said multiple in-flight iterations, said iteration ID used for ordering loads and stores within and across iterations.
4. The apparatus of claim 3, wherein to track iterations, wherein said LSU further comprises: an iteration-aware arbiter device configured to use said associated iteration ID to assign loads/stores instructions to an appropriate LSQ bank; and a dependence predictor module for tracking a violation history of a received input instructions using said associated LSID, a violation history comprising a determination that a current input instruction is younger than the load or older than the load in program order, and determining whether a LD instruction should be deferred or not based on its violation history with dependent ST instruction.
5. The apparatus of claim 3, wherein said filter device comprises a bloom filter, said LSU bank using an in-flight instruction's memory address as a hash into the bloom filter to check for matching a dependent load/store, and upon detecting a match in the bloom filter, searching the full LSQ bank associatively for the matching load/store instruction.
6. The apparatus of claim 5, wherein said LSU bank further performs: holding for all iterations in flight, all stores for any one iteration until an iteration endpoint is reached; and releasing Loads/Stores of an iteration from the corresponding LSQ only when all the instructions of the iteration are complete.
7. The apparatus of claim 6, wherein said LSU bank further: detects, for a load instruction, a collision with a store instruction at a same address. checks all the stores of LSQ unit of earlier iterations to ensure that there are no stores that go to the same address; and upon determining that no store belonging to an earlier iteration goes to the same memory address, commencing the load instruction; and upon determining that a store goes to the same address, waiting until the store at the same memory location and belonging to an earlier iteration executes receives a correct data value into that same memory address.
8. The apparatus of claim 6, wherein said LSU bank further: determines whether there are multiple stores at the same address of the younger iterations, selects an iteration closest in time to the current load operation, and waits until that store writes to the same memory address.
9. The apparatus of claim 5, wherein said LSU bank further: accesses, for a store instruction, a load table of only bloom filters of LSQ banks having younger iterations; and upon detecting a bloom filter match by associative lookup of the LSQ banks of younger iterations; conducting a flush operation for the iteration of the matching load instruction.
10. The apparatus of claim 3, wherein said buffer device accessible by said plurality of said LSQ banks that enable store forwarding further comprises: data storage entries for storing coalesced data; index fields of each byte of the data, the indices including an iteration ID and LSID of the matched store instruction for each byte of matched data, and a bit that indicates whether the byte is sourced from memory or from a forwarding store; and a linked list structure having a pointer pointing to a next available entry in said buffer.
11. A method for running multiple simultaneous instructions in a course grained reconfigurable architecture having a plurality of processing elements (PEs), the method comprising: providing, at each PE, a runtime mechanism for executing program code instructions including a loop, each PE running multiple concurrent iterations of the same loop; storing, at a load and storage unit (LSU) having multiple banks of load storage queues (LSQ), load instructions and store instructions associated with the multiple concurrent iterations and enabling completion of iterations in order; synchronizing, at an execution control unit (ECU), operations performed at each said PE and the LSU including tracking of the iterations that have completed, which iterations are already running, and which iterations are yet to begin, said synchronizing including communicating signals from the ECU to and receiving signals from each PE and LSU for initiating and completing of said multiple concurrent iterations on all or a sub-set of the plurality of PEs, such that all instructions are committed at loop iteration boundaries.
12. The method of claim 11, wherein said tracking of the iterations by said LSU comprises: using a filter device for tracking all the in-flight instruction's memory address in the LSQ bank by searching, for each memory address of an in-flight instruction, all elements of all load and store queues in parallel, and determining a memory dependency of all in-flight memory instructions across different LSQ banks; and providing, at a buffer device accessible by said plurality of said LSQ banks, a store forwarding to a load instruction by collecting data for a load instruction upon determining multiple dependent store instructions across iterations and/or memory that contribute to the data requested by a load instruction.
13. The method of claim 12, further comprising: issuing, by a PE, an associated LD/ST identifier for a respective LOAD DATA/STORE DATA instruction of an iteration received at the LSU, each load/store instruction having a dedicated storage slot in a given LSQ bank based on the LSID, said LD/ST identifier for keeping track of the issued LD request or ST request; and issuing, by the PE, an asso
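The buffer device of claims 2 and 10, which collects the bytes of one load from several forwarding stores and from memory, can be sketched in software as follows. This is an illustrative model only: `Store`, `ByteSource`, and `coalesce_load` are hypothetical names, ordering stores by the pair (iteration ID, LSID) is an assumption about how age is compared, and the linked list of free buffer entries mentioned in claim 10 is not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Store:
    iteration: int  # iteration ID issued by the PE
    lsid: int       # LD/ST identifier within the iteration
    address: int
    data: bytes


@dataclass
class ByteSource:
    """One byte of a coalesced entry plus its index fields."""
    value: int
    from_memory: bool                # the memory-versus-forwarding-store bit
    iteration: Optional[int] = None  # set only when a store supplied the byte
    lsid: Optional[int] = None


def coalesce_load(load_iteration, load_lsid, address, size, stores, memory):
    """Build one coalesced entry for a load of `size` bytes at `address`.

    Assumption: a store is older than the load when its (iteration, lsid)
    pair compares lower, and the youngest older store wins for each byte.
    """
    load_age = (load_iteration, load_lsid)
    older = sorted(
        (s for s in stores if (s.iteration, s.lsid) < load_age),
        key=lambda s: (s.iteration, s.lsid),
    )
    # Start with every byte sourced from memory.
    entry = [ByteSource(memory[address + i], True) for i in range(size)]
    # Apply stores oldest to youngest so that younger stores overwrite older ones.
    for s in older:
        for offset, byte in enumerate(s.data):
            i = s.address + offset - address
            if 0 <= i < size:
                entry[i] = ByteSource(byte, False, s.iteration, s.lsid)
    return entry
```

Keeping the iteration ID and LSID next to each byte means a later flush of one iteration could identify exactly which bytes of a forwarded load are stale, which is one plausible reason for the per-byte index fields.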
for loops, e.g. loop detection or loop counter · CPC title
Dependency mechanisms, e.g. register scoreboarding · CPC title
Recovery, e.g. branch miss-prediction, exception handling (error detection or correction G06F11/00) · CPC title
controlled by multiple instructions, e.g. MIMD, decoupled access or execute · CPC title
with reconfigurable architecture · CPC title