US9830133B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-9830133-B1 |
| Application number | US-201213712659-A |
| Country | US |
| Kind code | B1 |
| Filing date | Dec 12, 2012 |
| Priority date | Dec 12, 2011 |
| Publication date | Nov 28, 2017 |
| Grant date | Nov 28, 2017 |
Official abstract text for this publication.
Methods, apparatus and computer software product for source code optimization are provided. In an exemplary embodiment, a first custom computing apparatus is used to optimize the execution of source code on a second computing apparatus. In this embodiment, the first custom computing apparatus contains a memory, a storage medium and at least one processor with at least one multi-stage execution unit. The second computing apparatus contains at least one local memory unit that allows for data reuse opportunities. The first custom computing apparatus optimizes the code for reduced communication execution on the second computing apparatus. This Abstract is provided for the sole purpose of complying with the Abstract requirement rules. This Abstract is submitted with the explicit understanding that it will not be used to interpret or to limit the scope or the meaning of the claims.
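As a rough illustration of what "data reuse opportunities" in a local memory can mean for communication, the sketch below runs a tiled 3-point stencil twice: once reloading every tile (plus its halo cells) from main memory, and once copying the overlapping cells from one location in the local buffer to another. The stencil, the function names (`stencil_naive`, `stencil_reuse`), and the load counters are assumptions made for this example only; they do not come from the patent, and the real transformation is performed by a compiler rather than at run time.

```python
def stencil_naive(main, tile):
    """Reload each tile and its two halo cells from main memory."""
    n = len(main)
    out = [0.0] * n
    loads = 0  # number of elements moved from main to local memory
    for start in range(1, n - 1, tile):
        end = min(start + tile, n - 1)
        local = main[start - 1:end + 1]  # main -> local transfer
        loads += len(local)
        for i in range(start, end):
            j = i - start + 1  # position of element i inside the local buffer
            out[i] = local[j - 1] + local[j] + local[j + 1]
    return out, loads


def stencil_reuse(main, tile):
    """Reuse the two overlapping cells already held in local memory."""
    n = len(main)
    out = [0.0] * n
    loads = 0
    local = [0.0] * (tile + 2)
    prev_len = 0  # number of valid cells left in the buffer by the previous tile
    for start in range(1, n - 1, tile):
        end = min(start + tile, n - 1)
        width = end - start + 2  # tile cells plus two halo cells
        if prev_len == 0:
            local[:width] = main[start - 1:end + 1]
            loads += width
        else:
            # Local-to-local copy: the last two cells of the previous tile's
            # buffer are the first two cells this tile needs.
            local[0] = local[prev_len - 2]
            local[1] = local[prev_len - 1]
            local[2:width] = main[start + 1:end + 1]  # only the new cells
            loads += width - 2
        prev_len = width
        for i in range(start, end):
            j = i - start + 1
            out[i] = local[j - 1] + local[j] + local[j + 1]
    return out, loads


data = [float(i) for i in range(18)]
naive_out, naive_loads = stencil_naive(data, 4)
reuse_out, reuse_loads = stencil_reuse(data, 4)
print(naive_loads, reuse_loads)
```

Both variants compute the same result; the second simply replaces two main-memory loads per tile with copies inside the local buffer, which is the kind of trade the abstract describes as reduced communication.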
Opening claim text (preview).
What is claimed is:

1. A method of orchestrating data movement of a program on a multi-execution unit computing apparatus, the method comprising: receiving in memory on a first computing apparatus, a computer program comprising a set of operations, the first computing apparatus comprising the memory and a processor; transforming the computer program for execution on a second computing apparatus, the second computing apparatus comprising at least one main memory, at least one local memory, and at least one computation unit, each computation unit comprising at least one private memory region, the transformation comprising: producing a tiled variant of the program; generating operations to perform data movements, for elements produced and consumed by tiles according to the tiled variant, between the at least one main memory and the at least one local memory; optimizing the data movement operations using at least one data reuse transform that eliminates at least one of the generated operations for data movement between the at least one main memory and the at least one local memory by copying data from a first location within the at least one local memory to a second location within at least one local memory, to reduce communication cost and memory traffic; and producing an optimized computer program for execution on the second computing apparatus.

2. The method of claim 1, wherein the step of transforming the computer program is automatically performed by an optimizing compiler using a polyhedral representation.

3. The method of claim 2, wherein producing a tiled variant of the program distinguishes between inter-tile dimensions and intra-tile dimensions.

4. The method of claim 3, wherein a placement function determines assignment of a tile of inter-tile loops to processing elements.

5. The method of claim 4, further comprising detecting opportunities for redundant transfer elimination.

6. The method of claim 5, further comprising eliminating redundant transfers based on, at least in part, the placement function and dependence information of operations within the tile.

7. The method of claim 3, wherein a grain of communication representing a data movement of the data movement operations is parameterized by the intra-tile dimensions.

8. The method of claim 7, wherein redundant transfers are hoisted by at least one level in the loop nest.

9. The method of claim 1, wherein a value stored in a local memory location addressable by at least two processing elements is reused to replace a transfer of that value from the main memory to the local memory.

10. The method of claim 1, wherein read-after-read dependences carried by enclosing loops are computed to determine which values in local memory exhibit reuse opportunities.

11. The method of claim 1, further comprising ordering the addresses accessed by transfers from main memory to increase the amount of reuse from local memory.

12. The method of claim 11, further comprising introducing redundant communications between main and local memories when the redundant communications increase the amount of memory reuse within local memories.

13. The method of claim 1, wherein values stored in private memory locations addressable by a single processing element are reused to replace transfers from main memory to local memory.

14. The method of claim 1, wherein placement functions are embedded into the optimized code as parameters that represent an id of a processing element on which a portion of the optimized program is to execute.

15. The method of claim 14, wherein rotation of values in registers is performed for values that are reused within the same processing elements.

16. The method of claim 15, wherein rotation of code that performs memory transfers is performed for values that are reused by different processing elements with different ids.

17. The method of claim 1, further comprising interchanging loops in data transfer code whose induction variables depend on selected processing element ids to reduce control flow overhead of the optimized program.

18. A custom computing apparatus comprising: at least one processor; a memory coupled to the at least one processor; and a storage medium coupled to the memory and the at least one processor the storage medium comprising a set of processor executable instructions sufficient that when executed by the at least one processor configure the custom computing apparatus to optimize a computer program for execution on a second computing apparatus, the computer program comprising a set of operations, the second computing apparatus comprising at least one main memory, at least one local memory, and at least one computation unit, each computation unit comprising at least one private memory region, the configuration comprising a configuration to: produce a tiled variant of the program; generate operations to perform data movements for elements produced and consumed by tiles according to the tiled variant, between the at least one main memory and the at least one local memory; optimize the data movement operations using at least one data reuse transform that eliminates at least one of the generated operations for data movement between the at least one main memory and the at least one local memory by copying data from a first location within the at least one local memory to a second location within at least one local memory, to reduce communication cost and memory traffic; and produce an optimized computer program for execution on the second computing apparatus.

19. The custom apparatus of claim 18, wherein the optimization of the program is based on, at least in part, a polyhedral representation.

20. The custom apparatus of claim 19, wherein the configuration to produce the tiled variant of the program distinguishes between inter-tile dimensions and intra-tile dimensions.

21. The custom apparatus of claim 20, wherein the configuration is further configured to select placement function that determines assignment of a tile of inter-tile loops to processing elements.

22. The custom apparatus of claim 21, wherein the configuration is further configured to detect opportunities for redundant transfer elimination.

23. The custom apparatus of claim 22, wherein the configuration is further configured to eliminate redundant transfers based on, at least in part, the placement function and dependence information of operations within the tile.

24. The custom apparatus of claim 20, wherein a grain of communication representing a data movement of the data movement operations is parameterized by the intra-tile dimensions.

25. The custom apparatus of claim 24, wherein the configuration is further configured to hoist redundant transfers by at least one level in the loop nest.

26. The custom apparatus of claim 18, wherein a value stored in a local memory location is addressable by at least two processing elements, and the configuration is further configured to optimize the program such that the value stored in the local memory is reused to replace a transfer of that value from the main memory to the local memory.

27. The custom apparatus of claim 18, wherein the configuration is further configured to compute read-after-read dependences carried by enclosing loops to determine which values in local memory exhibit reuse opportunities.

28. The custom apparatus of claim 18, wherein
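The interplay between a placement function and redundant transfer elimination (claims 4 to 6, with the hoisting of claim 8) can be sketched with a small counting model. Everything here is an assumption made for illustration: the two-dimensional inter-tile space `(i, j)`, the `placement` rule that maps row-block `i` to a processing element, and the idea that every tile `(i, j)` needs data block `i` in local memory. The patent describes a compile-time analysis on a polyhedral representation; this sketch only simulates its effect at run time.

```python
def placement(i, j, num_pes):
    """Hypothetical placement function: row-block i selects the processing element."""
    return i % num_pes


def count_transfers(n_i, n_j, num_pes, eliminate_redundant):
    """Count main-to-local transfers of block i over all tiles (i, j)."""
    resident = {}  # processing element id -> block currently in its local memory
    transfers = 0
    for i in range(n_i):          # inter-tile loop over row-blocks
        for j in range(n_j):      # inter-tile loop over column-blocks
            pe = placement(i, j, num_pes)
            if eliminate_redundant and resident.get(pe) == i:
                # Block i is already in this element's local memory, so the
                # transfer is redundant; in effect it is hoisted out of the j loop.
                continue
            resident[pe] = i
            transfers += 1
    return transfers


before = count_transfers(4, 3, 2, eliminate_redundant=False)
after = count_transfers(4, 3, 2, eliminate_redundant=True)
print(before, after)
```

Because the assumed placement keeps all tiles of one row-block on the same processing element, the block is fetched once per `i` rather than once per `(i, j)` pair. A different placement function would expose different reuse, which is why the claims tie the elimination to the placement function and to dependence information.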