Hardware accelerated dynamic work creation on a graphics processing unit
US-2020089528-A1 · Mar 19, 2020 · US
US11093250B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11093250-B2 |
| Application number | US-201816147694-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 29, 2018 |
| Priority date | Sep 29, 2018 |
| Publication date | Aug 17, 2021 |
| Grant date | Aug 17, 2021 |
Official abstract text for this publication.
An apparatus and method for efficiently processing invariant operations on a parallel execution engine. For example, one embodiment of a processor comprises: a plurality of parallel execution lanes comprising execution circuitry and registers to concurrently execute a plurality of threads; front end circuitry coupled to the plurality of parallel execution lanes, the front end circuitry to arrange the threads into parallel execution groups and schedule operations of the threads to be executed across the parallel execution lanes, wherein the front end circuitry is to dynamically evaluate one or more variables associated with the operations to determine if one or more conditionally invariant operations will be invariant across threads of a parallel execution group and/or across the parallel execution lanes; a scheduler of the front end circuitry to responsively schedule a shared thread upon a determination that a conditionally invariant operation will be invariant across threads of a parallel execution group and/or across the parallel execution lanes.
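The abstract's core idea (check at run time whether an operation's inputs are identical across the threads of a group and, if so, run one shared thread and reuse its result) can be illustrated with a small software sketch. This is a software analogy only, not the patented hardware: the function name `run_group`, the plain equality test on inputs, and the execution counter are all assumptions introduced for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def run_group(op: Callable[[int], int],
              inputs: Sequence[int]) -> Tuple[List[int], int]:
    """Run `op` for each thread of one parallel execution group.

    Hypothetical helper: returns (per-thread results, number of times `op`
    was actually executed). If every thread has the same input, the
    operation is treated as invariant and executed once by a "shared
    thread"; otherwise each thread executes it separately.
    """
    # Dynamic evaluation: are the input values identical across threads?
    if inputs and all(x == inputs[0] for x in inputs):
        shared_result = op(inputs[0])               # one shared execution
        return [shared_result] * len(inputs), 1     # result shared by all
    # Not invariant: fall back to one execution per thread.
    return [op(x) for x in inputs], len(inputs)


square = lambda x: x * x
uniform_results, uniform_runs = run_group(square, [3, 3, 3, 3])
mixed_results, mixed_runs = run_group(square, [1, 2, 3])
```

With identical inputs the operation runs once for the whole group; with differing inputs it runs once per thread, which is the saving the abstract attributes to scheduling a shared thread.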
Opening claim text (preview).
What is claimed is:

1. A processor comprising: a plurality of parallel execution lanes comprising execution circuitry and registers to concurrently execute a plurality of threads; front end circuitry coupled to the plurality of parallel execution lanes, the front end circuitry to arrange the threads into parallel execution groups and schedule operations of the threads to be executed across the parallel execution lanes, wherein the front end circuitry is to dynamically evaluate one or more variables associated with the operations to determine if one or more of the operations are conditionally invariant, and the determination comprises determining whether the one or more of the operations will be invariant across threads of a plurality of parallel execution groups within a same lane but not across threads of the plurality of parallel execution groups in multiple lanes, to produce a same variable value across the threads of the plurality of parallel execution groups; a scheduler of the front end circuitry to responsively schedule a shared thread upon a determination that a conditionally invariant operation will be invariant across the threads of the plurality of parallel execution groups; and a first parallel execution lane to execute the shared thread to generate execution results and to share the execution results across other threads of the plurality of parallel execution groups.

2. The processor of claim 1 further comprising: a first set of registers in a first parallel execution lane to store the execution results; and data distribution circuitry to broadcast one or more of the execution results to additional sets of registers within the first parallel execution lane.

3. The processor of claim 1 wherein dynamically evaluating the one or more variables comprises determining whether input values to the conditionally invariant operation will be identical across the threads of the plurality of parallel execution groups.

4. The processor of claim 1 wherein the scheduler is to cause one or more threads to wait for the execution of the shared thread to complete.

5. The processor of claim 1 wherein the threads are microthreads comprising a plurality of microoperations.

6. The processor of claim 5 wherein the front end circuitry further comprises a decoder to generate the microthreads responsive to decoding a plurality of macroinstructions.

7. The processor of claim 5 wherein the front end circuitry is to arrange the microthreads into the parallel execution groups based on instruction pointer values to induce microthread convergence.

8. The processor of claim 1 further comprising: mask storage to store an execution mask having at least one value associated with each parallel execution lane, wherein the front end circuitry is to enable or disable one or more of the parallel execution lanes based on the values associated with the lanes.

9. A method comprising: arranging a plurality of threads into parallel execution groups for execution on a plurality of parallel execution lanes, the threads comprising operations to be executed by execution circuitry within each of the parallel execution lanes; dynamically evaluating one or more variables associated with the operations to determine if one or more of the operations are conditionally invariant, and the determination comprises determining whether the one or more of the operations will be invariant across threads of a plurality of parallel execution groups within a same lane but not across threads of the plurality of parallel execution groups in multiple lanes to produce a same variable value across the threads of the plurality of parallel execution groups; scheduling a shared thread upon a determination that a conditionally invariant operation will be invariant across the threads of the plurality of parallel execution groups; and executing the shared thread to generate execution results and to share the execution results across other threads of the plurality of parallel execution groups.

10. The method of claim 9 further comprising: storing the execution results in a first set of registers in a first parallel execution lane; and broadcasting one or more of the execution results to additional sets of registers within the first parallel execution lane.

11. The method of claim 9 wherein dynamically evaluating one or more variables comprises determining whether input values to the conditionally invariant operation will be identical across the threads of the plurality of parallel execution groups.

12. The method of claim 9 further comprising: causing one or more threads to wait for the execution of the shared thread to complete.

13. The method of claim 9 wherein the threads are microthreads comprising a plurality of microoperations.

14. The method of claim 13 further comprising: generating the microthreads responsive to decoding a plurality of macroinstructions.

15. The method of claim 13 further comprising: arranging the microthreads into the parallel execution groups based on instruction pointer values to induce microthread convergence.

16. The method of claim 9 further comprising: storing an execution mask having at least one value associated with each parallel execution lane; and enabling or disabling one or more of the parallel execution lanes based on the values associated with the lanes in the execution mask.

17. A non-transitory machine-readable medium having program code stored thereon which, when executed by a machine, causes the machine to perform the operations of: arranging a plurality of threads into parallel execution groups for execution on a plurality of parallel execution lanes, the threads comprising operations to be executed by execution circuitry within each of the parallel execution lanes; dynamically evaluating one or more variables associated with the operations to determine if one or more of the operations are conditionally invariant, and the determination comprises determining whether the one or more of the operations will be invariant across threads of a plurality of parallel execution groups within a same lane but not across threads of the plurality of parallel execution groups in multiple lanes to produce a same variable value across the threads of the plurality of parallel execution groups; scheduling a shared thread upon a determination that a conditionally invariant operation will be invariant across the threads of the plurality of parallel execution groups; and executing the shared thread to generate execution results and to share the execution results across other threads of the plurality of parallel execution groups.

18. The non-transitory machine-readable medium of claim 17 further comprising program code to cause the machine to perform the operations of: storing the execution results in a first set of registers in a first parallel execution lane; and broadcasting one or more of the execution results to additional sets of registers within the first parallel execution lane.

19. The non-transitory machine-readable medium of claim 17 wherein dynamically evaluating one or more variables comprises determining whether input values to the conditionally invariant operation will be identical across the threads of the plurality of parallel execution groups.

20. The non-transitory machine-readable medium of claim 17 further comprising program code to cause the machine to perform the operations of: causing one or more threads to wait for the execution of the shared thread to complete.

21.
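The independent claims narrow the abstract to a specific case: the operation is invariant across the threads of several parallel execution groups within the same lane, but not across lanes. The sketch below models that case in software, together with the in-lane broadcast of claims 2, 10 and 18 and the per-lane execution mask of claims 8 and 16. It is an illustrative model under stated assumptions, not the claimed circuitry: the grid layout `inputs[group][lane]`, the function name `schedule_per_lane`, and the use of `None` for disabled lanes are all hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def schedule_per_lane(
    op: Callable[[int], int],
    inputs: Sequence[Sequence[int]],   # inputs[group][lane] (assumed layout)
    mask: Sequence[bool],              # one enable bit per lane
) -> Tuple[List[List[Optional[int]]], List[int]]:
    """Return (results[group][lane], lanes that used a shared thread)."""
    num_groups = len(inputs)
    num_lanes = len(mask)
    results: List[List[Optional[int]]] = [
        [None] * num_lanes for _ in range(num_groups)
    ]
    shared_lanes: List[int] = []
    for lane in range(num_lanes):
        if not mask[lane]:
            continue                   # lane disabled by the execution mask
        column = [inputs[g][lane] for g in range(num_groups)]
        # Invariant within this lane: every group supplies the same input.
        if column and all(v == column[0] for v in column):
            value = op(column[0])      # shared thread runs once on this lane
            for g in range(num_groups):
                results[g][lane] = value   # broadcast to each group's registers
            shared_lanes.append(lane)
        else:
            for g in range(num_groups):
                results[g][lane] = op(column[g])   # per-thread execution
    return results, shared_lanes


# Lane 0 sees the same input in every group; lane 1 does not.
grid = [[1, 5], [1, 6], [1, 7]]
results, shared = schedule_per_lane(lambda x: x + 10, grid, [True, True])
```

In this example only lane 0 qualifies for a shared thread, because its input is the same in all three groups, while lane 1 still executes once per group. The lanes differ from each other, which matches the "within a same lane but not across ... multiple lanes" wording of claim 1.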
controlled by a single instruction for multiple threads [SIMT] in parallel · CPC title
from multiple instruction streams, e.g. multistreaming · CPC title
by program, e.g. task dispatcher, supervisor, operating system · CPC title
organised in groups of units sharing resources, e.g. clusters · CPC title
with global bypass, e.g. between pipelines, between clusters · CPC title