Tightly coupled processor arrays using coarse grained reconfigurable architecture with iteration level commits

US2017123794A1 · US · A1

Patent metadata
Field: Value
Publication number: US-2017123794-A1
Application number: US-201514932629-A
Country: US
Kind code: A1
Filing date: Nov 4, 2015
Priority date: Nov 4, 2015
Publication date: May 4, 2017
Grant date: (not shown)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An apparatus and method for supporting simultaneous multiple iterations (SMI) and iteration level commits (ILC) in a coarse-grained reconfigurable architecture (CGRA). The apparatus includes: hardware structures that connect all of multiple processing engines (PEs) to a load-store unit (LSU) configured to keep track of which compiled program code iterations have completed, which ones are in flight and which are yet to begin, and a control unit including hardware structures that are used to maintain synchronization and initiate and terminate loops within the PEs. The processing elements, LSU and control unit are configured to commit instructions, and save and restore context, at loop iteration boundaries. In doing so, the apparatus tracks and buffers the state of in-flight iterations, and detects conditions that prevent an iteration from completing. In support of ILC functions, the LSU is iteration aware and includes: iteration-interleaved LSQ banks; a Bloom filter for filtering instructions; and a load coalescing buffer.
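The iteration-level commit idea in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the patented hardware: the class `IterationTracker`, its `window` limit, and all method names are hypothetical. It buffers each in-flight iteration's stores, commits them to memory strictly in iteration order, and discards buffered state when an iteration is flushed.

```python
from dataclasses import dataclass, field


@dataclass
class IterationTracker:
    """Toy model of iteration-level commit (all names are hypothetical)."""
    total: int                                    # loop trip count
    window: int                                   # assumed cap on in-flight iterations
    memory: dict = field(default_factory=dict)    # committed state only
    next_to_start: int = 0
    next_to_commit: int = 0
    pending: dict = field(default_factory=dict)   # iteration -> buffered stores
    done: set = field(default_factory=set)        # finished but not yet committed

    def start(self):
        """Begin the next iteration if the in-flight window has room."""
        if self.next_to_start >= self.total or len(self.pending) >= self.window:
            return None
        it = self.next_to_start
        self.pending[it] = {}
        self.next_to_start += 1
        return it

    def store(self, it, addr, value):
        """Buffer a store; memory is untouched until the iteration commits."""
        self.pending[it][addr] = value

    def finish(self, it):
        """Mark an iteration complete, then commit in order from the oldest."""
        self.done.add(it)
        while self.next_to_commit in self.done:
            self.memory.update(self.pending.pop(self.next_to_commit))
            self.done.discard(self.next_to_commit)
            self.next_to_commit += 1

    def flush(self, it):
        """Drop buffered state of iteration `it` and all younger ones."""
        for k in [k for k in self.pending if k >= it]:
            del self.pending[k]
        self.done = {k for k in self.done if k < it}
        self.next_to_start = it                   # restart from the flushed iteration

    def status(self):
        """Report completed, in-flight, and not-yet-started iterations."""
        return {
            "completed": self.next_to_commit,
            "in_flight": sorted(self.pending),
            "not_started": self.total - self.next_to_start,
        }
```

In this model an iteration that finishes early (say iteration 1 before iteration 0) leaves memory unchanged until every older iteration has also finished, which is the in-order, boundary-aligned commit behaviour the abstract describes.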

First claim

Opening claim text (preview).

What is claimed is:

1. An apparatus comprising: a plurality of processing elements (PE), each element employing hardware providing a runtime mechanism for executing program code instructions including a loop, each PE running multiple concurrent iterations of the same loop; a load and storage unit (LSU) including multiple banks of load storage queues (LSQ) for storing load instructions and store instructions associated with the multiple concurrent iterations and enabling completion of iterations in order; an execution control unit (ECU) for synchronizing operations performed at each said PE and the LSU including tracking of the iterations that have completed, which iterations are already running, and which iterations are yet to begin, the ECU for communicating signals to and receiving signals from each PE and LSU to synchronize initiating and completing of said multiple concurrent iterations on all or a sub-set of the plurality of PEs, such that all instructions are committed at loop iteration boundaries.

2. The apparatus of claim 1, wherein to track iterations, said LSU comprises: a filter device for tracking all the in-flight instruction's memory address in the LSQ bank, searching, for each memory address of an in-flight instruction, all elements of all load and store queues in parallel, and determining a memory dependency of all in-flight memory instructions across different LSQ banks; and a buffer device accessible by said plurality of said LSQ banks that enable store forwarding to a load instruction by collecting data for a load instruction upon determining multiple dependent store instructions across iterations and/or memory that contribute to the data requested by a load instruction.

3. The apparatus of claim 2, wherein to track iterations, a PE issues an associated LD/ST identifier for a respective LOAD DATA/STORE DATA instruction of an iteration received at the LSU, each load/store instruction having a dedicated storage slot in a given LSQ bank based on the LSID, said LD/ST identifier for keeping track of the issued LD request or ST request; and the PE issuing an associated iteration ID field for each iteration of said multiple in-flight iterations, said iteration ID used for ordering loads and stores within and across iterations.

4. The apparatus of claim 3, wherein to track iterations, wherein said LSU further comprises: an iteration-aware arbiter device configured to use said associated iteration ID to assign loads/stores instructions to an appropriate LSQ bank; and a dependence predictor module for tracking a violation history of a received input instructions using said associated LSID, a violation history comprising a determination that a current input instruction is younger than the load or older than the load in program order, and determining whether a LD instruction should be deferred or not based on its violation history with dependent ST instruction.

5. The apparatus of claim 3, wherein said filter device comprises a bloom filter, said LSU bank using an in-flight instruction's memory address as a hash into the bloom filter to check for matching a dependent load/store, and upon detecting a match in the bloom filter, searching the full LSQ bank associatively for the matching load/store instruction.

6. The apparatus of claim 5, wherein said LSU bank further performs: holding for all iterations in flight, all stores for any one iteration until an iteration endpoint is reached; and releasing Loads/Stores of an iteration from the corresponding LSQ only when all the instructions of the iteration are complete.

7. The apparatus of claim 6, wherein said LSU bank further: detects, for a load instruction, a collision with a store instruction at a same address; checks all the stores of LSQ unit of earlier iterations to ensure that there are no stores that go to the same address; and upon determining that no store belonging to an earlier iteration goes to the same memory address, commencing the load instruction; and upon determining that a store goes to the same address, waiting until the store at the same memory location and belonging to an earlier iteration executes receives a correct data value into that same memory address.

8. The apparatus of claim 6, wherein said LSU bank further: determines whether there are multiple stores at the same address of the younger iterations, selects an iteration closest in time to the current load operation, and waits until that store writes to the same memory address.

9. The apparatus of claim 5, wherein said LSU bank further: accesses, for a store instruction, a load table of only bloom filters of LSQ banks having younger iterations; and upon detecting a bloom filter match by associative lookup of the LSQ banks of younger iterations; conducting a flush operation for the iteration of the matching load instruction.

10. The apparatus of claim 3, wherein said buffer device accessible by said plurality of said LSQ banks that enable store forwarding further comprises: data storage entries for storing coalesced data; index fields of each byte of the data, the indices including an iteration ID and LSID of the matched store instruction for each byte of matched data, and a bit that indicates whether the byte is sourced from memory or from a forwarding store; and a linked list structure having a pointer pointing to a next available entry in said buffer.

11. A method for running multiple simultaneous instructions in a coarse grained reconfigurable architecture having a plurality of processing elements (PEs), the method comprising: providing, at each PE, a runtime mechanism for executing program code instructions including a loop, each PE running multiple concurrent iterations of the same loop; storing, at a load and storage unit (LSU) having multiple banks of load storage queues (LSQ), load instructions and store instructions associated with the multiple concurrent iterations and enabling completion of iterations in order; synchronizing, at an execution control unit (ECU), operations performed at each said PE and the LSU including tracking of the iterations that have completed, which iterations are already running, and which iterations are yet to begin, said synchronizing including communicating signals from the ECU to and receiving signals from each PE and LSU for initiating and completing of said multiple concurrent iterations on all or a sub-set of the plurality of PEs, such that all instructions are committed at loop iteration boundaries.

12. The method of claim 11, wherein said tracking of the iterations by said LSU comprises: using a filter device for tracking all the in-flight instruction's memory address in the LSQ bank by searching, for each memory address of an in-flight instruction, all elements of all load and store queues in parallel, and determining a memory dependency of all in-flight memory instructions across different LSQ banks; and providing, at a buffer device accessible by said plurality of said LSQ banks, a store forwarding to a load instruction by collecting data for a load instruction upon determining multiple dependent store instructions across iterations and/or memory that contribute to the data requested by a load instruction.

13. The method of claim 12, further comprising: issuing, by a PE, an associated LD/ST identifier for a respective LOAD DATA/STORE DATA instruction of an iteration received at the LSU, each load/store instruction having a dedicated storage slot in a given LSQ bank based on the LSID, said LD/ST identifier for keeping track of the issued LD request or ST request; and issuing, by the PE, an asso
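The Bloom-filter pre-check of claim 5 and the earlier-iteration store check of claim 7 can be sketched in software as follows. This is a minimal illustration under assumptions, not the claimed hardware: the classes `BloomFilter`, `LSQBank`, and `LSU`, the modulo bank mapping, the filter size, and the choice to forward from the closest earlier iteration are all hypothetical. Stores within the loading iteration itself are ignored for brevity.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter over memory addresses (sizes are assumptions)."""

    def __init__(self, bits=64, hashes=2):
        self.bits, self.hashes, self.word = bits, hashes, 0

    def _positions(self, addr):
        for i in range(self.hashes):
            digest = hashlib.blake2b(f"{i}:{addr}".encode(), digest_size=4).digest()
            yield int.from_bytes(digest, "little") % self.bits

    def add(self, addr):
        for p in self._positions(addr):
            self.word |= 1 << p

    def might_contain(self, addr):
        # False means "definitely absent"; True may be a false positive.
        return all((self.word >> p) & 1 for p in self._positions(addr))


class LSQBank:
    """One load/store queue bank holding buffered stores."""

    def __init__(self):
        self.stores = []              # (iteration_id, lsid, addr, value)
        self.filter = BloomFilter()

    def add_store(self, iteration_id, lsid, addr, value):
        self.stores.append((iteration_id, lsid, addr, value))
        self.filter.add(addr)

    def find_stores(self, addr):
        # Cheap filter check first; full search only on a filter hit.
        if not self.filter.might_contain(addr):
            return []
        return [s for s in self.stores if s[2] == addr]


class LSU:
    """Iteration-interleaved banks: iteration i maps to bank i % n (assumed)."""

    def __init__(self, n_banks=4):
        self.banks = [LSQBank() for _ in range(n_banks)]

    def store(self, iteration_id, lsid, addr, value):
        self.banks[iteration_id % len(self.banks)].add_store(
            iteration_id, lsid, addr, value)

    def load(self, iteration_id, addr, memory):
        # Consider only stores from earlier iterations to the same address.
        older = [s for bank in self.banks for s in bank.find_stores(addr)
                 if s[0] < iteration_id]
        if not older:
            return memory.get(addr)   # no conflict: read committed memory
        # Forward from the closest earlier iteration (latest LSID on ties).
        return max(older, key=lambda s: (s[0], s[1]))[3]
```

Because a Bloom filter never reports a false negative, skipping the full search on a filter miss cannot hide a real dependency; a false positive only costs one unnecessary associative search.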

Assignees

Inventors

Classifications

  • for loops, e.g. loop detection or loop counter

  • Dependency mechanisms, e.g. register scoreboarding

  • Recovery, e.g. branch miss-prediction, exception handling (error detection or correction G06F11/00)

  • controlled by multiple instructions, e.g. MIMD, decoupled access or execute

  • with reconfigurable architecture

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2017123794A1 cover?
An apparatus and method for supporting simultaneous multiple iterations (SMI) and iteration level commits (ILC) in a coarse-grained reconfigurable architecture (CGRA). The apparatus includes: hardware structures that connect all of multiple processing engines (PEs) to a load-store unit (LSU) configured to keep track of which compiled program code iterations have completed, which ones are in fli…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F9/3836. Mapped technology areas include Physics.
When was this patent published?
Publication date: May 4, 2017 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).