Hybrid and efficient approach to accelerate complicated loops on coarse-grained reconfigurable arrays (cgra) accelerators
US-2020133672-A1 · Apr 30, 2020 · US
US12039335B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12039335-B2 |
| Application number | US-202217705112-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 25, 2022 |
| Priority date | Mar 25, 2022 |
| Publication date | Jul 16, 2024 |
| Grant date | Jul 16, 2024 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
Schedule instructions of a program for execution on a coarse grained reconfigurable array having a plurality of tiles operable in parallel. The program identifies data flows through memory locations represented by memory variables and identifies instructions configured to transform data in the data flows. Based on a hardware profile identifying features of the coarse grained reconfigurable array, a scheduler is configured to generate a memory map. The memory map identifies, for each respective memory variable in the program, one of the tiles that contains a memory location represented by the respective memory variable. Based on the memory map reducing possible choices for a brute force search, the scheduler assigns the instructions to the tiles for execution, and determines timing of execution of the instructions in the tiles.
Opening claim text (preview).
What is claimed is: 1. A method, comprising: receiving an assembly language program identifying data flows through memory locations represented by memory variables and identifying instructions configured to transform data in the data flows, wherein the assembly language program is configured to identify the memory variables as the memory locations through which the data is configured to flow through, and the assembly language program is further configured to identify the instructions configured to transform the data as the data flows through the memory locations represented by the memory variables; receiving a hardware profile identifying features of a coarse grained reconfigurable array having a plurality of tiles operable in parallel; generating a memory map identifying, for each respective memory variable in the assembly language program, one of the tiles that contains a memory location represented by the respective memory variable; assigning, based on the memory map, the instructions to the tiles for execution; and providing an instruction execution schedule via identifying timing of execution of the instructions in the tiles. 2. The method of claim 1 , wherein the generating of the memory map includes: partitioning the memory variables and the instructions into a plurality of groups, each of the groups configured to be implemented on one of the tiles. 3. The method of claim 2 , wherein the partitioning is performed to balance a number of instructions implemented per tile. 4. The method of claim 2 , wherein the partitioning is performed to balance a number of memory variables implemented per tile. 5. The method of claim 2 , wherein the partitioning is performed to balance an amount of memory usage implemented per tile. 6. The method of claim 2 , wherein each respective instruction among the instructions is assigned to a tile containing memory variables having data to be operated upon by the respective instruction. 7. The method of claim 6 , wherein the timing of execution of the instructions is determined by: selecting a current instruction for scheduling; identifying a slot in a first tile containing memory variables used by the current instruction; and searching for a clock cycle for execution of the current instruction in the slot. 8. The method of claim 7 , wherein the timing of execution of the instructions is determined further by: selecting a next instruction for scheduling, in response to a determination that a valid spoke RAM slot is found for execution of the current instruction. 9. The method of claim 7 , wherein the timing of execution of the instructions is determined further by: searching, in response to a determination that no valid spoke RAM slot is found for execution of the current instruction in the slot, for an available slot in the first tile; and searching, in response to the available slot being found, for a clock cycle for execution of the current instruction in the available slot. 10. The method of claim 9 , further comprising, in response to no available slot being found: determining a prior instruction scheduled before the current instruction; identifying a first schedule previously determined for the prior instruction as invalid; and starting determination of a second valid schedule for the prior instruction. 11. A computing device, comprising: a memory; and a microprocessor coupled with the memory and configured to: receive an assembly language program identifying data flows through memory locations represented by memory variables and identifying instructions configured to transform data in the data flows, wherein the assembly language program is configured to describe a dispatch interface, at least one memory interface, and the memory locations to be mapped to tile memories, and the assembly language program is further configured to identify processing of the data in the data flows through the at least one memory interface and the memory locations via execution of the instructions; receive a hardware profile identifying features of a coarse grained reconfigurable array having a plurality of tiles operable in parallel; generate a memory map identifying, for each respective memory variable in the assembly language program, one of the tiles that contains a memory location represented by the respective memory variable; assign, based on the memory map, the instructions to the tiles for execution; and provide an instruction execution schedule identifying timing of execution of the instructions in the tiles. 12. The computing device of claim 11 , wherein to generate the memory map, the microprocessor is further configured to: partition the memory variables and the instructions into a plurality of groups, each of the groups configured to be implemented on one of the tiles; wherein each respective instruction among the instructions is assigned to a tile containing memory variables having data to be operated upon by the respective instruction. 13. The computing device of claim 12 , wherein the groups are generated via: balancing a number of instructions implemented per tile; balancing a number of memory variables implemented per tile; or balancing an amount of memory usage implemented per tile; or any combination thereof. 14. The computing device of claim 12 , wherein the timing of execution of the instructions is determined by: selecting a current instruction for scheduling; identifying a slot in a first tile containing memory variables used by the current instruction; and searching for a clock cycle for execution of the current instruction in the slot. 15. The computing device of claim 14 , wherein the timing of execution of the instructions is determined further by: selecting a next instruction for scheduling, in response to a determination that a valid spoke RAM slot is found for execution of the current instruction. 16. The computing device of claim 14 , wherein the timing of execution of the instructions is determined further by: searching, in response to a determination that no valid spoke RAM slot is found for execution of the current instruction in the slot, for an available slot in the first tile; and searching, in response to the available slot being found, for a clock cycle for execution of the current instruction in the available slot. 17. The computing device of claim 16 , further comprising, in response to no available slot being found: determining a prior instruction scheduled before the current instruction; identifying a first schedule previously determined for the prior instruction as invalid; and starting determination of a second valid schedule for the prior instruction. 18. A non-transitory computer storage medium storing instructions which, when executed by a computing device, cause the computing device to perform a method, comprising: receiving an assembly language program identifying data flows through memory locations represented by memory variables to be mapped to tile memories and identifying opcodes configured to transform data in the data flows; receiving a hardware profile identifying features of a coarse grained reconfigurable array having a plurality of tiles operable in parallel; generating a memory map identifying, for each respective memory variable in the assembly language program, one of the tiles that contains a memory location represented by the respective memory variable; assigning, based on the memory map, the opcodes to the tiles for execution; and providing an opcode execution schedule identifying timing of execution of the o
LOAD or STORE instructions; Clear instruction · CPC title
controlled by a single instruction for multiple data lanes [SIMD] · CPC title
Instruction operation extension or modification · CPC title
Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.