Automatic Detection of Optimal Compute Unit Partitioning
US-2018232260-A1 · Aug 16, 2018 · US
US11410027B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11410027-B2 |
| Application number | US-201916572527-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 16, 2019 |
| Priority date | Sep 16, 2019 |
| Publication date | Aug 9, 2022 |
| Grant date | Aug 9, 2022 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
The technology disclosed relates to allocating available physical compute units (PCUs) and/or physical memory units (PMUs) of a reconfigurable data processor to operation units of an operation unit graph for execution thereof. In particular, it relates to selecting, for evaluation, an intermediate stage compute processing time between lower and upper search bounds of a generic stage compute processing time, determining a pipeline number of the PCUs and/or the PMUs required to process the operation unit graph, and iteratively, initializing new lower and upper search bounds of the generic stage compute processing time and selecting, for evaluation in a next iteration, a new intermediate stage compute processing time taking into account whether the pipeline number of the PCUs and/or the PMUs produced for a prior intermediate stage compute processing time in a previous iteration is lower or higher than the available PCUs and/or PMUs.
Opening claim text (preview).
What is claimed is: 1. A computer-implemented method of allocating available physical compute units and/or physical memory units of a reconfigurable data processor to operation units of an operation unit graph for execution thereof, the method including: initializing lower and upper search bounds of generic stage compute processing time required for executing an operation unit of the operation unit graph; selecting, for evaluation, an intermediate stage compute processing time between the lower and upper search bounds of the generic stage compute processing time; determining a pipeline number of the physical compute units and/or the physical memory units required to process a pipeline compute load of the operation unit graph on the reconfigurable data processor, including for each of the operation units of the operation unit graph, determining a specific stage compute processing time required to process a stage compute load of a respective one of the operation units using only one physical compute unit and/or only one physical memory unit, and determining a stage number of the physical compute units and/or the physical memory units required to process the stage compute load of the respective one of the operation units by dividing the specific stage compute processing time with the intermediate stage compute processing time; and summing the stage number of the physical compute units and/or the physical memory units for each of the operation units and producing the pipeline number of the physical compute units and/or the physical memory units; iteratively, initializing new lower and upper search bounds of the generic stage compute processing time and selecting, for evaluation in a next iteration, a new intermediate stage compute processing time between the new lower and upper search bounds of the generic stage compute processing time, taking into account whether the pipeline number of the physical compute units and/or the physical memory units produced for a prior intermediate stage compute processing time in a previous iteration is lower or higher than the available physical compute units and/or physical memory units; terminating the iterative initializing and selecting when the pipeline number of the physical compute units and/or the physical memory units produced for a current intermediate stage compute processing time in a current iteration meets a convergence criteria; allocating the available physical compute units and/or physical memory units to the operation units of the operation unit graph based on the current intermediate stage compute processing time; and executing the operation units of the operation unit graph on the reconfigurable data processor based on the allocation. 2. The computer-implemented method of claim 1 , wherein the convergence criteria is that a difference between the upper search bound and the lower search bound is below a threshold. 3. The computer-implemented method of claim 1 , wherein the lower search bound of the generic stage compute processing time is based on maximum utilization of the reconfigurable data processor and determined by dividing the pipeline compute load of the operation unit graph with total processing capacity of the reconfigurable data processor. 4. The computer-implemented method of claim 3 , wherein the pipeline compute load of the operation unit graph is determined by a total number of floating point operations (FLOP) required to execute the operation unit graph. 5. The computer-implemented method of claim 3 , wherein the total processing capacity of the reconfigurable data processor is determined by a maximum number of FLOP executable by the reconfigurable data processor per second (FLOP/s). 6. The computer-implemented method of claim 3 , wherein the upper search bound of the generic stage compute processing time is based on multiplying the lower search bound of the generic stage compute processing time with a minimum utilization factor. 7. The computer-implemented method of claim 6 , wherein the minimum utilization factor is one hundred. 8. The computer-implemented method of claim 2 , further including continuing the iterative initializing and selecting as long as the difference between the upper search bound and the lower search bound is above a threshold. 9. The computer-implemented method of claim 1 , wherein the intermediate stage compute processing time is an average of the lower and upper search bounds of the generic stage compute processing time. 10. The computer-implemented method of claim 9 , when the pipeline number of the physical compute units and/or the physical memory units produced for the prior intermediate stage compute processing time in the previous iteration is lower than the available physical compute units and/or physical memory units, further including: setting the new upper search bound for the next iteration as the prior intermediate stage compute processing time. 11. The computer-implemented method of claim 9 , when the pipeline number of the physical compute units and/or the physical memory units produced for the prior intermediate stage compute processing time in the previous iteration is higher than the available physical compute units and/or physical memory units, further including: setting the new lower search bound for the next iteration as the prior intermediate stage compute processing time. 12. The computer-implemented method of claim 1 , wherein the stage compute load of the respective one of the operation units is specified as a total number of FLOP required to execute the respective one of the operation units and is determined based on an operation type. 13. The computer-implemented method of claim 1 , wherein the stage compute load of the respective one of the operation units is specified as a total number of FLOP required to execute the respective one of the operation units and is determined based on input dimensionality, and output dimensionality. 14. The computer-implemented method of claim 1 , further including determining the stage number of the physical compute units and/or the physical memory units required to process the stage compute load by rounding up to an integer a result of dividing the stage compute processing time with the intermediate stage compute processing time. 15. The computer-implemented method of claim 1 , further including determining a throughput value based on the current intermediate stage compute processing time. 16. The computer-implemented method of claim 1 , further including determining a pipeline compute processing time required for executing the operation unit graph based on multiplying a number of the operation units of the operation unit graph with the current intermediate stage compute processing time. 17. The computer-implemented method of claim 1 , further including: selecting those operation units of the operation unit graph whose stage compute processing time is relatively greater than most other operation units of the operation unit graph; and allocating additional available physical compute units and/or the physical memory units to the selected operation units. 18. The computer-implemented method of claim 17 , wherein the allocation results in each of the operation units of the operation unit graph having substantially matching stage compute processing time. 19. The computer-implemented method of claim 1 , wherein the operation unit graph is a fused operation unit graph with at least one fused operation unit. 20. The computer-implemented method
Combinations of networks · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions · CPC title
using electronic means · CPC title
Browsing; Visualisation therefor (for navigating the web G06F16/954; browsing optimisation for the web G06F16/957) · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.