Methods and apparatus for automatic communication optimizations in a compiler based on a polyhedral representation

US9830133B1 · US · B1

Patent metadata
Publication number: US-9830133-B1
Application number: US-201213712659-A
Country: US
Kind code: B1
Filing date: Dec 12, 2012
Priority date: Dec 12, 2011
Publication date: Nov 28, 2017
Grant date: Nov 28, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, apparatus and computer software product for source code optimization are provided. In an exemplary embodiment, a first custom computing apparatus is used to optimize the execution of source code on a second computing apparatus. In this embodiment, the first custom computing apparatus contains a memory, a storage medium and at least one processor with at least one multi-stage execution unit. The second computing apparatus contains at least one local memory unit that allows for data reuse opportunities. The first custom computing apparatus optimizes the code for reduced communication execution on the second computing apparatus. This Abstract is provided for the sole purpose of complying with the Abstract requirement rules. This Abstract is submitted with the explicit understanding that it will not be used to interpret or to limit the scope or the meaning of the claims.
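To make the idea of "data reuse opportunities" in local memory concrete, here is a minimal sketch. It is not taken from the patent; it assumes a one-dimensional windowed sum, models main and local memory as Python lists, and uses hypothetical names (`stencil_reference`, `stencil_tiled`, `tile`, `halo`). Consecutive tiles need overlapping input windows, so the overlap can be copied within local memory instead of being fetched again from main memory.

```python
def stencil_reference(main, halo):
    """Untiled reference: out[i] is the sum of main[i-halo .. i+halo], clipped at the ends."""
    n = len(main)
    return [sum(main[max(0, i - halo):min(n, i + halo + 1)]) for i in range(n)]


def stencil_tiled(main, tile, halo, reuse):
    """Tiled version that stages each tile's input window in a modeled local memory.

    Returns (output, main_to_local_loads, local_to_local_copies).
    With reuse=False every tile reloads its full window from main memory.
    With reuse=True the part of the window already resident in local memory
    is copied locally and only the new elements are loaded from main memory.
    """
    n = len(main)
    out = [0] * n
    local = []        # modeled local memory buffer
    local_lo = 0      # main-memory index held in local[0]
    main_loads = 0
    local_copies = 0
    for start in range(0, n, tile):
        lo = max(0, start - halo)
        hi = min(n, start + tile + halo)
        if reuse and local:
            local_hi = local_lo + len(local)
            kept = local[lo - local_lo:]   # overlap, moved local to local
            fresh = main[local_hi:hi]      # only the elements not yet resident
            local_copies += len(kept)
            main_loads += len(fresh)
            local = kept + fresh
        else:
            local = main[lo:hi]            # full window from main memory
            main_loads += hi - lo
        local_lo = lo
        # Compute the tile using local memory only.
        for i in range(start, min(n, start + tile)):
            a = max(0, i - halo) - local_lo
            b = min(n, i + halo + 1) - local_lo
            out[i] = sum(local[a:b])
    return out, main_loads, local_copies
```

For 16 elements with `tile=4` and `halo=1`, the version without reuse performs 22 main-memory loads, while the version with reuse performs 16 loads plus 6 local copies and produces the same output. A real compiler would derive the overlap from a polyhedral description of the accesses rather than from hand-written index arithmetic.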

First claim

Opening claim text (preview).

What is claimed is:

1. A method of orchestrating data movement of a program on a multi-execution unit computing apparatus, the method comprising: receiving in memory on a first computing apparatus, a computer program comprising a set of operations, the first computing apparatus comprising the memory and a processor; transforming the computer program for execution on a second computing apparatus, the second computing apparatus comprising at least one main memory, at least one local memory, and at least one computation unit, each computation unit comprising at least one private memory region, the transformation comprising: producing a tiled variant of the program; generating operations to perform data movements, for elements produced and consumed by tiles according to the tiled variant, between the at least one main memory and the at least one local memory; optimizing the data movement operations using at least one data reuse transform that eliminates at least one of the generated operations for data movement between the at least one main memory and the at least one local memory by copying data from a first location within the at least one local memory to a second location within at least one local memory, to reduce communication cost and memory traffic; and producing an optimized computer program for execution on the second computing apparatus.
2. The method of claim 1, wherein the step of transforming the computer program is automatically performed by an optimizing compiler using a polyhedral representation.
3. The method of claim 2, wherein producing a tiled variant of the program distinguishes between inter-tile dimensions and intra-tile dimensions.
4. The method of claim 3, wherein a placement function determines assignment of a tile of inter-tile loops to processing elements.
5. The method of claim 4, further comprising detecting opportunities for redundant transfer elimination.
6. The method of claim 5, further comprising eliminating redundant transfers based on, at least in part, the placement function and dependence information of operations within the tile.
7. The method of claim 3, wherein a grain of communication representing a data movement of the data movement operations is parameterized by the intra-tile dimensions.
8. The method of claim 7, wherein redundant transfers are hoisted by at least one level in the loop nest.
9. The method of claim 1, wherein a value stored in a local memory location addressable by at least two processing elements is reused to replace a transfer of that value from the main memory to the local memory.
10. The method of claim 1, wherein read-after-read dependences carried by enclosing loops are computed to determine which values in local memory exhibit reuse opportunities.
11. The method of claim 1, further comprising ordering the addresses accessed by transfers from main memory to increase the amount of reuse from local memory.
12. The method of claim 11, further comprising introducing redundant communications between main and local memories when the redundant communications increase the amount of memory reuse within local memories.
13. The method of claim 1, wherein values stored in private memory locations addressable by a single processing element are reused to replace transfers from main memory to local memory.
14. The method of claim 1, wherein placement functions are embedded into the optimized code as parameters that represent an id of a processing element on which a portion of the optimized program is to execute.
15. The method of claim 14, wherein rotation of values in registers is performed for values that are reused within the same processing elements.
16. The method of claim 15, wherein rotation of code that performs memory transfers is performed for values that are reused by different processing elements with different ids.
17. The method of claim 1, further comprising interchanging loops in data transfer code whose induction variables depend on selected processing element ids to reduce control flow overhead of the optimized program.
18. A custom computing apparatus comprising: at least one processor; a memory coupled to the at least one processor; and a storage medium coupled to the memory and the at least one processor the storage medium comprising a set of processor executable instructions sufficient that when executed by the at least one processor configure the custom computing apparatus to optimize a computer program for execution on a second computing apparatus, the computer program comprising a set of operations, the second computing apparatus comprising at least one main memory, at least one local memory, and at least one computation unit, each computation unit comprising at least one private memory region, the configuration comprising a configuration to: produce a tiled variant of the program; generate operations to perform data movements for elements produced and consumed by tiles according to the tiled variant, between the at least one main memory and the at least one local memory; optimize the data movement operations using at least one data reuse transform that eliminates at least one of the generated operations for data movement between the at least one main memory and the at least one local memory by copying data from a first location within the at least one local memory to a second location within at least one local memory, to reduce communication cost and memory traffic; and produce an optimized computer program for execution on the second computing apparatus.
19. The custom apparatus of claim 18, wherein the optimization of the program is based on, at least in part, a polyhedral representation.
20. The custom apparatus of claim 19, wherein the configuration to produce the tiled variant of the program distinguishes between inter-tile dimensions and intra-tile dimensions.
21. The custom apparatus of claim 20, wherein the configuration is further configured to select placement function that determines assignment of a tile of inter-tile loops to processing elements.
22. The custom apparatus of claim 21, wherein the configuration is further configured to detect opportunities for redundant transfer elimination.
23. The custom apparatus of claim 22, wherein the configuration is further configured to eliminate redundant transfers based on, at least in part, the placement function and dependence information of operations within the tile.
24. The custom apparatus of claim 20, wherein a grain of communication representing a data movement of the data movement operations is parameterized by the intra-tile dimensions.
25. The custom apparatus of claim 24, wherein the configuration is further configured to hoist redundant transfers by at least one level in the loop nest.
26. The custom apparatus of claim 18, wherein a value stored in a local memory location is addressable by at least two processing elements, and the configuration is further configured to optimize the program such that the value stored in the local memory is reused to replace a transfer of that value from the main memory to the local memory.
27. The custom apparatus of claim 18, wherein the configuration is further configured to compute read-after-read dependences carried by enclosing loops to determine which values in local memory exhibit reuse opportunities.
28. The custom apparatus of claim 18, wherein
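The claims refer to inter-tile and intra-tile dimensions, a placement function, and the elimination or hoisting of redundant transfers. The following is a rough sketch of how these pieces could fit together, not the patented method. The cyclic placement rule, the function names, and the assumption that every tile reads the same read-only table (so no write invalidates a resident copy) are all hypothetical simplifications standing in for real dependence information.

```python
def placement(tile_index, num_pes):
    """Hypothetical placement function: assign tiles to processing elements cyclically."""
    return tile_index % num_pes


def tiled_schedule(n, tile, num_pes):
    """Split a loop of n iterations into an inter-tile and an intra-tile dimension.

    Returns a list of (pe_id, inter_tile_index, intra_tile_index, original_index).
    """
    schedule = []
    num_tiles = (n + tile - 1) // tile
    for t in range(num_tiles):            # inter-tile dimension
        pe = placement(t, num_pes)
        for j in range(tile):             # intra-tile dimension
            i = t * tile + j
            if i < n:                     # guard the partial last tile
                schedule.append((pe, t, j, i))
    return schedule


def plan_transfers(num_tiles, num_pes):
    """Plan transfers of one shared read-only table needed by every tile.

    naive: one transfer per tile, as generated before optimization.
    kept:  transfers remaining once a copy already resident on a processing
           element is reused, which amounts to hoisting the transfer out of
           the inter-tile loop for that processing element.
    """
    naive = [(t, placement(t, num_pes)) for t in range(num_tiles)]
    resident = set()                      # processing elements already holding the table
    kept = []
    for t, pe in naive:
        if pe not in resident:
            resident.add(pe)
            kept.append((t, pe))
    return naive, kept
```

With 6 tiles and 2 processing elements, the naive plan contains 6 transfers and the reduced plan keeps only the first transfer for each processing element. The reduction is valid here only because the table is assumed to be read-only; with writes between tiles, the dependence information mentioned in the claims would decide which transfers can be dropped.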

Assignees

Reservoir Labs Inc, Significs And Elements Llc

Inventors

Classifications

  • G06F8/453 · Primary

    Data distribution · CPC title

  • G06F8/41 · Primary

    Compilation · CPC title

  • Communication (intertask communication G06F9/54) · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9830133B1 cover?
Methods, apparatus and computer software product for source code optimization are provided. In an exemplary embodiment, a first custom computing apparatus is used to optimize the execution of source code on a second computing apparatus. In this embodiment, the first custom computing apparatus contains a memory, a storage medium and at least one processor with at least one multi-stage execution …
Who is the assignee on this patent?
Reservoir Labs Inc, Significs And Elements Llc
What technology area does this patent fall under?
Primary CPC classification G06F8/453. Mapped technology areas include Physics.
When was this patent published?
Publication date Nov 28, 2017 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).