Apparatus and method for low-latency invocation of accelerators
US-2016246597-A1 · Aug 25, 2016 · US
US10664284B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10664284-B2 |
| Application number | US-201916289075-A |
| Country | US |
| Kind code | B2 |
| Filing date | Feb 28, 2019 |
| Priority date | Dec 28, 2012 |
| Publication date | May 26, 2020 |
| Grant date | May 26, 2020 |
Official abstract text for this publication.
An apparatus and method are described for executing both latency-optimized execution logic and throughput-optimized execution logic on a processing device. For example, a processor according to one embodiment comprises: latency-optimized execution logic to execute a first type of program code; throughput-optimized execution logic to execute a second type of program code, wherein the first type of program code and the second type of program code are designed for the same instruction set architecture; logic to identify the first type of program code and the second type of program code within a process and to distribute the first type of program code for execution on the latency-optimized execution logic and the second type of program code for execution on the throughput-optimized execution logic.
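The routing idea in the abstract can be illustrated with a minimal sketch. It assumes each region of program code already carries a tag saying whether it is latency or throughput code; the abstract does not say how the identification is done, so the `Segment` type, the `CodeKind` tags, and the `distribute` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class CodeKind(Enum):
    LATENCY = "latency"
    THROUGHPUT = "throughput"


@dataclass
class Segment:
    name: str       # hypothetical label for a region of program code
    kind: CodeKind  # assumed to be known up front, e.g. tagged by a compiler


def distribute(process):
    """Route each segment to the execution logic that matches its kind."""
    routed = {CodeKind.LATENCY: [], CodeKind.THROUGHPUT: []}
    for segment in process:
        routed[segment.kind].append(segment.name)
    return routed


# One process mixing both kinds of code for the same instruction set.
process = [
    Segment("parse_input", CodeKind.LATENCY),
    Segment("vector_sum", CodeKind.THROUGHPUT),
    Segment("write_output", CodeKind.LATENCY),
]
routed = distribute(process)
```

Both kinds of code live in one process here, which mirrors the abstract's point that they target the same instruction set architecture and differ only in where they execute.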
Opening claim text (preview).
What is claimed is:

1. An apparatus comprising: a set of latency clusters to execute a main program code, the main program code comprising both latency program code and throughput program code, wherein a current point of execution in the main program code is identified by a primary instruction pointer; a set of throughput clusters comprising one or more processing elements to execute the throughput program code in the main program code; and wherein upon detecting the current point of execution for the main program code reaching a first throughput program code in the main program code, a front end unit of the set of throughput clusters is to distribute the first throughput program code to the one or more processing elements in the set of throughput clusters for execution.

2. The apparatus as in claim 1, wherein upon detecting the current point of execution for the main program code reaching a first throughput program code in the main program code, an XCALL instruction is executed by the set of latency clusters to trigger the front end unit to distribute the first throughput program code.

3. The apparatus as in claim 2, wherein the XCALL instruction is to identify a result register to store results of executing the XCALL instruction, a command register to store one or more commands from the first throughput program code to be executed, and a parameter register to store parameters for executing the one or more commands.

4. The apparatus as in claim 3, wherein responsive to executing of the XCALL instruction, the front end unit is to retrieve a first command of the one or more commands from the command register and associated parameters from the parameter register and distribute the first command and the associated parameters for execution by the one or more processing elements.

5. The apparatus as in claim 1, wherein one or more processing elements in the set of throughput clusters are capable of simultaneously multithreading by simultaneously executing multiple micro-threads.

6. The apparatus as in claim 5, wherein each of the micro-threads includes a respective micro-instruction pointer used by the processing elements to maintain a current point of micro-thread execution.

7. The apparatus as in claim 5, wherein one or more processing elements in the set of throughput clusters are homogeneous processing elements, each capable of executing any one of the micro-threads.

8. The apparatus as in claim 5, wherein one or more processing elements in the set of throughput clusters are heterogeneous processing elements, each designed to execute specific types of micro-threads.

9. A method comprising: executing a main program code on a set of latency clusters, the main program code comprising both latency program code and throughput program code, wherein a current point of execution in the main program code is identified by a primary instruction pointer; and detecting the current point of execution for the main program code reaching a first throughput program code in the main program code and responsively distributing the first throughput program code to one or more processing elements in a set of throughput clusters for execution.

10. The method as in claim 9, wherein responsively distributing the first throughput program code to one or more processing elements in the set of throughput clusters for execution further comprises executing an XCALL instruction by the set of latency clusters.

11. The method as in claim 10, wherein the XCALL instruction is to identify a result register to store results of executing of the XCALL instruction, a command register to store one or more commands from the first throughput program code to be executed, and a parameter register to store parameters for executing the one or more commands.

12. The method as in claim 11, wherein executing the XCALL instruction further comprises: retrieving a first command of the one or more commands from the command register and associated parameters from the parameter register; and distributing the first command and the associated parameters for execution by the one or more processing elements of the set of throughput clusters.

13. The method as in claim 9, wherein one or more processing elements in the set of throughput clusters are capable of simultaneously multithreading by simultaneously executing multiple micro-threads.

14. The method as in claim 13, wherein each of the micro-threads includes a respective micro-instruction pointer used by the processing elements to maintain a current point of micro-thread execution.

15. The method as in claim 13, wherein one or more processing elements in the set of throughput clusters are homogeneous processing elements, each capable of executing any one of the micro-threads.

16. The method as in claim 13, wherein one or more processing elements in the set of throughput clusters are heterogeneous processing elements, each designed to execute specific types of micro-threads.
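The XCALL flow recited in the claims can be pictured with a small software model. This is only a sketch under stated assumptions, not the claimed hardware: registers are modeled as dictionary entries, commands as plain Python callables, and the front end hands out work round-robin and synchronously. The names `ProcessingElement`, `ThroughputFrontEnd`, `xcall`, and `run_main`, and the register names `r_res`, `r_cmd`, and `r_par`, are all hypothetical. The model also loops over every stored command, whereas the claims speak only of retrieving a first command.

```python
from dataclasses import dataclass


@dataclass
class ProcessingElement:
    pe_id: int

    def run(self, command, params):
        # A command is a plain callable standing in for throughput program code.
        return command(*params)


@dataclass
class ThroughputFrontEnd:
    elements: list

    def dispatch(self, command_reg, param_reg):
        """Pair each command with its parameters and hand it to a processing element."""
        results = []
        for i, (command, params) in enumerate(zip(command_reg, param_reg)):
            pe = self.elements[i % len(self.elements)]  # round-robin is an assumption
            results.append(pe.run(command, params))
        return results


def xcall(front_end, registers, result, command, param):
    """Model XCALL: its operands name the result, command, and parameter registers."""
    registers[result] = front_end.dispatch(registers[command], registers[param])


def run_main(program, front_end, registers):
    """Step through the main program; ip plays the role of the primary instruction pointer."""
    ip = 0
    trace = []
    while ip < len(program):
        op = program[ip]
        if op[0] == "XCALL":
            # Throughput code reached: trigger the front end of the throughput clusters.
            xcall(front_end, registers, *op[1:])
        trace.append((ip, op[0]))
        ip += 1
    return trace


front_end = ThroughputFrontEnd([ProcessingElement(0), ProcessingElement(1)])
registers = {
    "r_cmd": [sum, max, min],                             # commands to execute
    "r_par": [([1, 2, 3],), ([4, 9, 2],), ([7, 5, 6],)],  # one parameter tuple per command
    "r_res": None,                                        # filled in by XCALL
}
program = [("LATENCY_OP",), ("XCALL", "r_res", "r_cmd", "r_par"), ("LATENCY_OP",)]
trace = run_main(program, front_end, registers)
```

After the run, `registers["r_res"]` holds one result per command and the primary instruction pointer has advanced past the XCALL like any other instruction. Micro-threads and their micro-instruction pointers (claims 5 to 8 and 13 to 16) are left out of this model.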
using a secondary processor, e.g. coprocessor (peripheral processor G06F13/12) · CPC title
Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution · CPC title
with reconfigurable architecture · CPC title
Recovery, e.g. branch miss-prediction, exception handling (error detection or correction G06F11/00) · CPC title
Reconfigurable logic embedded in CPU, e.g. reconfigurable unit · CPC title