Hardware accelerated dynamic work creation on a graphics processing unit
US-2020089528-A1 · Mar 19, 2020 · US
US10915328B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10915328-B2 |
| Application number | US-201816220528-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 14, 2018 |
| Priority date | Dec 14, 2018 |
| Publication date | Feb 9, 2021 |
| Grant date | Feb 9, 2021 |
Official abstract text for this publication.
An apparatus and method for offloading iterative, parallel work to a data parallel cluster. For example, one embodiment of a processor comprises: a host processor to execute a primary thread; a data parallel cluster coupled to the host processor over a high speed interconnect, the data parallel cluster comprising a plurality of execution lanes to perform parallel execution of one or more secondary threads related to the primary thread; and a data parallel cluster controller integral to the host processor to offload processing of the one or more secondary threads to the data parallel cluster in response to one of the cores executing a parallel processing call instruction from the primary thread.
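The offload flow described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the patented hardware: the names `OffloadRequest`, `DataParallelCluster`, and `parallel_call` are hypothetical, execution lanes are modeled as worker threads, the designated memory region is modeled as a dictionary, and the return notification is modeled as a key that points at the stored results.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class OffloadRequest:
    """Initial execution values sent on a parallel processing call (hypothetical layout)."""
    context_id: int
    loop_iterations: int


@dataclass
class DataParallelCluster:
    """Toy stand-in for the data parallel cluster; lanes are modeled as worker threads."""
    num_lanes: int
    # Models the designated region of memory that holds results (an assumption).
    results_region: dict = field(default_factory=dict)

    def run(self, request, body):
        # Each loop iteration is treated as one secondary thread scheduled onto a lane.
        with ThreadPoolExecutor(max_workers=self.num_lanes) as lanes:
            outputs = list(lanes.map(body, range(request.loop_iterations)))
        self.results_region[request.context_id] = outputs
        # Models the return notification: a "pointer" (here, a key) to the results.
        return request.context_id


def parallel_call(cluster, context_id, loop_iterations, body):
    """Models the host-side controller offloading work, then reading the results back."""
    request = OffloadRequest(context_id, loop_iterations)
    result_key = cluster.run(request, body)
    return cluster.results_region[result_key]


def square(i):
    """Example loop body executed once per iteration."""
    return i * i


if __name__ == "__main__":
    cluster = DataParallelCluster(num_lanes=4)
    print(parallel_call(cluster, context_id=7, loop_iterations=8, body=square))
```

In this model the primary thread simply blocks until the cluster finishes. The patent text does not say whether the host waits or continues, so that choice is an assumption made for brevity.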
Opening claim text (preview).
What is claimed is:
1. An apparatus comprising: a host processor comprising a plurality of cores, a first one or more of the plurality of cores to execute a primary thread comprising a sequence of instructions; a data parallel cluster coupled to the host processor over a high speed interconnect, the data parallel cluster comprising a plurality of execution lanes to perform parallel execution of one or more secondary threads related to the primary thread, the data parallel cluster including a scheduler to evaluate variables associated with the one or more secondary threads to schedule execution of the one or more secondary threads across the plurality of execution lanes; and a data parallel cluster controller integral to the host processor to offload processing of the one or more secondary threads to the data parallel cluster in response to one of the cores executing a parallel processing call instruction from the primary thread, wherein responsive to the parallel processing call instruction, the data parallel cluster controller is to transmit initial execution values to the data parallel cluster including a context identifier and a number of loop iterations to be performed during execution of the secondary threads on the execution lanes.
2. The apparatus of claim 1 wherein at least one of the execution lanes are to execute a parallel processing return instruction, wherein responsive to the parallel processing return instruction, the data parallel cluster is to transmit a notification to the host processor over the high speed interconnect.
3. The apparatus of claim 2 wherein the data parallel cluster further comprises a memory controller to access a designated region of system memory allocated by the host processor, the data parallel cluster to store results of the secondary thread in the designated region.
4. The apparatus of claim 3 wherein the notification to the host processor includes a pointer identifying a location of the results in the designated region.
5. The apparatus of claim 1 wherein the data parallel cluster controller is to transmit a pointer to the data parallel cluster over the high speed interconnect identifying a location in system memory from which the data parallel cluster is to fetch instructions of the secondary threads.
6. The apparatus of claim 1 wherein the secondary threads comprise microthreads comprising sequences of microoperations.
7. The apparatus of claim 1 wherein the secondary threads comprise sequences of macroinstructions, the data parallel cluster further comprising an instruction fetch unit to fetch the macroinstructions and a decoder to decode the macroinstructions into a plurality of microoperations.
8. The apparatus of claim 1 wherein parallel work resulting from the secondary threads is to be subdivided according to how the instructions of the secondary threads express parallel execution of a loop.
9. The apparatus of claim 1 wherein the secondary threads are ganged into fragments based on instruction pointer values to induce convergence.
10. The apparatus of claim 9 wherein a fragment comprises a collection of associated threads.
11. The apparatus of claim 9 wherein an order in which to execute the fragments is determined based on variables associated with each fragment.
12. A method comprising: executing a sequence of instructions of a primary thread on execution resources of a host processor; executing a parallel processing call executed on the execution resources and responsively offloading execution of one or more secondary threads to a data parallel cluster coupled to the host processor over a high speed interconnect; passing initialization values to the data parallel cluster including a thread context identifier and a number of loop iterations; scheduling execution of the secondary threads on a plurality of lanes of the data parallel cluster; executing the secondary threads on the data parallel cluster, implementing the loop iterations; and storing results of the execution of the secondary threads in a designated of region of memory configured by the host processor.
13. The method of claim 12 wherein at least one of the execution lanes are to execute a parallel processing return instruction, wherein responsive to the parallel processing return instruction, the data parallel cluster is to transmit a notification to the host processor over the high speed interconnect.
14. The method of claim 13 wherein the data parallel cluster further comprises a memory controller to access a designated region of system memory allocated by the host processor, the data parallel cluster to store results of the secondary thread in the designated region.
15. The method of claim 14 wherein the notification to the host processor includes a pointer identifying a location of the results in the designated region.
16. The method of claim 12 wherein a pointer to the data parallel cluster is to be transmitted over the high speed interconnect identifying a location in system memory from which the data parallel cluster is to fetch instructions of the secondary threads.
17. The method of claim 12 wherein the secondary threads comprise microthreads comprising sequences of microoperations.
18. The method of claim 12 wherein the secondary threads comprise sequences of macroinstructions, the data parallel cluster further comprising an instruction fetch unit to fetch the macroinstructions and a decoder to decode the macroinstructions into a plurality of microoperations.
19. The method of claim 12 wherein parallel work resulting from the secondary threads is subdivided according to how the instructions of the secondary threads express parallel execution of a loop.
20. The method of claim 12 wherein the secondary threads are ganged into fragments based on instruction pointer values to induce convergence.
21. The method of claim 20 wherein a fragment comprises a collection of associated threads.
22. The method of claim 20 wherein an order in which to execute the fragments is to be based on variables associated with each fragment.
23. A non-transitory machine-readable medium having program code stored thereon which, when executed by a machine, causes the machine to perform the operations of: executing a sequence of instructions of a primary thread on execution resources of a host processor; executing a parallel processing call executed on the execution resources and responsively offloading execution of one or more secondary threads to a data parallel cluster; passing initialization values to the data parallel cluster including a thread context identifier and a number of loop iterations; scheduling execution of the secondary threads on a plurality of lanes of the data parallel cluster coupled to the host processor over a high speed interconnect; executing the secondary threads on the data parallel cluster, implementing the loop iterations; and storing results of the execution of the secondary threads in a designated of region of memory configured by the host processor.
24. The non-transitory machine-readable medium of claim 23 wherein at least one of the execution lanes are to execute a parallel processing return instruction, wherein responsive to the parallel processing return instruction, the data parallel cluster is to transmit a notification to the host processor over the high speed interconnect.
25. The non-transitory machine-readable medium of claim 24 wherein the data parall
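Claims 9 to 11 (and their method counterparts, claims 20 to 22) describe ganging secondary threads into fragments by instruction pointer value and ordering the fragments by per-fragment variables. A minimal sketch of that grouping follows. The `Thread` type and both function names are hypothetical, and the ordering policy (lowest instruction pointer first, so lagging threads can catch up) is an assumption chosen for illustration; the claims do not name the variables or the policy.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Thread:
    """Hypothetical record for one secondary thread."""
    thread_id: int
    instruction_pointer: int


def gang_into_fragments(threads):
    """Group threads that share an instruction pointer value into one fragment."""
    fragments = defaultdict(list)
    for thread in threads:
        fragments[thread.instruction_pointer].append(thread.thread_id)
    return dict(fragments)


def execution_order(fragments):
    """Return fragment keys in execution order.

    Assumption: the fragment with the lowest instruction pointer runs first.
    The claims only say the order depends on variables associated with each fragment.
    """
    return sorted(fragments)


if __name__ == "__main__":
    threads = [
        Thread(0, 0x10),
        Thread(1, 0x20),
        Thread(2, 0x10),
        Thread(3, 0x20),
        Thread(4, 0x30),
    ]
    fragments = gang_into_fragments(threads)
    for ip in execution_order(fragments):
        print(hex(ip), fragments[ip])
```

Threads in the same fragment can then be issued together across the execution lanes, which is the sense in which grouping by instruction pointer encourages divergent threads to reconverge.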
controlled by a single instruction for multiple data lanes [SIMD] · CPC title
controlled by a single instruction for multiple threads [SIMT] in parallel · CPC title
Divergence aspects · CPC title
from multiple instruction streams, e.g. multistreaming · CPC title
using a secondary processor, e.g. coprocessor (peripheral processor G06F13/12) · CPC title