US11934867B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11934867-B2 |
| Application number | US-202117184420-A |
| Country | US |
| Kind code | B2 |
| Filing date | Feb 24, 2021 |
| Priority date | Jul 23, 2020 |
| Publication date | Mar 19, 2024 |
| Grant date | Mar 19, 2024 |
Official abstract text for this publication.
Warp sharding techniques to switch execution between divergent shards on instructions that trigger a long stall, thereby interleaving execution between diverged threads within a warp instead of across warps. The technique may be applied to mitigate pipeline stalls in applications with low warp occupancy and high divergence. Warp data cache locality may also be improved by concentrating memory accesses within a warp rather than spreading them across warps.
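The scheduling idea in the abstract can be illustrated with a toy cycle-count model. This is a minimal sketch under stated assumptions, not the patented implementation: the instruction kinds `LOAD`, `HINT`, `USE`, and `ALU`, the `Shard` class, the one-cycle issue cost, and the 20-cycle load latency are all hypothetical choices made for illustration. Two divergent shards of one warp each issue a long-latency load, reach a hint placed before the consumer, and either wait (serial) or hand control to the other shard (interleaved).

```python
from dataclasses import dataclass

# Hypothetical instruction kinds for this toy model (not a real ISA).
LOAD, HINT, USE, ALU = "LOAD", "HINT", "USE", "ALU"


@dataclass
class Shard:
    """A group of threads of one warp that follow the same control path."""
    name: str
    program: list          # list of (kind, latency) tuples
    pc: int = 0
    ready_at: int = 0      # cycle at which the pending load result arrives

    def done(self):
        return self.pc >= len(self.program)


def make_warp(load_latency=20):
    """Two divergent shards, each with a producer, a hint, and a consumer."""
    body = [(LOAD, load_latency), (HINT, 0), (USE, 0), (ALU, 0)]
    return [Shard("taken", list(body)), Shard("not_taken", list(body))]


def run(shards, switch_on_hint):
    """Return the cycle count needed to finish every shard of the warp."""
    cycle, current = 0, 0
    while not all(s.done() for s in shards):
        s = shards[current]
        if s.done():
            # Current shard finished: fall through to any unfinished shard.
            current = next(i for i, t in enumerate(shards) if not t.done())
            continue
        kind, latency = s.program[s.pc]
        s.pc += 1
        if kind == LOAD:
            s.ready_at = cycle + latency   # result arrives later
            cycle += 1
        elif kind == HINT:
            # Run-time test: is the producer's result still outstanding?
            stalled = s.ready_at > cycle
            cycle += 1
            others = [i for i, t in enumerate(shards)
                      if i != current and not t.done()]
            if switch_on_hint and stalled and others:
                current = others[0]        # relinquish control to another shard
        elif kind == USE:
            cycle = max(cycle, s.ready_at) + 1   # block until the result is ready
        else:
            cycle += 1
    return cycle


serial = run(make_warp(), switch_on_hint=False)
interleaved = run(make_warp(), switch_on_hint=True)
print(serial, interleaved)   # 44 24
```

In this model the second shard's load is issued while the first shard's load is still in flight, so the two latencies overlap instead of adding up. A real scheduler would also have to handle reconvergence and barriers, which the sketch ignores.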
Opening claim text (preview).
What is claimed is:
1. A system comprising: a memory to store an application; and an execution scheduler configured to: generate at least one warp for the application, the at least one warp comprising a plurality of threads; and relinquish control from a first thread of the warp to a second thread of the warp on condition that: the first thread encounters a long stall hint instruction inserted in the first thread, triggering a run-time test; the run-time test detects a long stall; the first thread and the second thread are divergent threads; and wherein the long stall hint instruction is inserted in the first thread between a long-latency producer instruction and a consumer instruction that utilizes a result of the long-latency producer instruction.
2. The system of claim 1, wherein the long stall is at a consumer location in the first thread for a result of an instruction generating a memory stall.
3. The system of claim 1, wherein the control is relinquished via a YIELD instruction.
4. The system of claim 1, wherein the control is relinquished via a JPC instruction.
5. The system of claim 1, further comprising: the at least one warp comprising a different thread for each ray cast by a ray tracing application; and the execution scheduler generating a plurality of shards of the at least one warp, each shard consisting of threads of the plurality of threads executing a same shader module.
6. A system comprising: at least one graphics processing unit configured to sequentially execute different shards of a warp in a Single Instruction Multiple Thread (SIMT) manner, each shard comprising one or more threads of the warp; and logic to: detect a long stall hint instruction in a first shard of the different shards; and in response to detecting the long stall hint instruction: perform a runtime test for a long stall in the first shard; on condition that the long stall is detected: suspend execution of the first shard; select a second shard of the different shards for execution, the second shard selected at least in part for being divergent from the first shard; initiate execution of the second shard; wherein the long stall hint instruction is detected in the first shard between a long-latency producer instruction and a consumer instruction that utilizes a result of the long-latency producer instruction.
7. The system of claim 6, wherein the long stall hint instruction comprises a pre-execution hint inserted at a consumer location of a result of a long-latency instruction.
8. The system of claim 7, further comprising logic to test whether an instruction generates the long stall upon encountering the long stall hint instruction during execution of the first shard.
9. The system of claim 6, further comprising: a JPC instruction configured to cause the at least one graphics processing unit to suspend execution of the first shard and initiate execution of the second shard.
10. The system of claim 6, further comprising: a YIELD instruction configured to cause the at least one graphics processing unit to suspend execution of the first shard and initiate execution of the second shard.
11. The system of claim 10, wherein the YIELD instruction is configured to leave configured execution barriers unaltered.
12. A method comprising: executing a single instruction multi-thread (SIMT) application as a plurality of threads; splitting the execution of the plurality of threads into sequentially executed groups of threads in response to execution divergence among the plurality of threads, each of the groups executed in a SIMT manner, wherein each of the groups is a shard; detecting a long stall in a first one of the shards, the detecting triggered by execution of a compiler-generated hint instruction in the first one of the shards; in response to detecting the long stall in the first one of the shards, switching execution from the first one of the shards to a second one of the shards; and wherein the compiler-generated hint instruction is detected in the first one of the shards between a long-latency producer instruction and a consumer instruction that utilizes a result of the long-latency producer instruction.
13. The method of claim 12, wherein the second one of the shards to execute is selected by the first one of the shards.
14. The method of claim 12, wherein the second one of the shards to execute is selected from a worklist by a hardware scheduler.
15. The method of claim 12, wherein splitting the execution of the plurality of threads into the sequentially executed groups of threads further comprises copying shared state values of the plurality of threads to local memory locations for each of the groups of threads.
16. A system comprising: a memory; a processing cluster comprising a plurality of hardware units; and a graphics processing unit to: execute an application stored in the memory in parallel on the plurality of hardware units, wherein on each of the hardware units the application is executed in a single instruction multi-threaded (SIMT) manner as a plurality of threads; on each of the hardware units, split the execution of the plurality of threads into sequentially executed groups of threads upon the occurrence of thread divergence, each of the groups being a shard; detect a long stall in any one of the shards as a result of encountering a long stall hint instruction in the any one of the shards; in response to detecting the long stall in any one of the shards, switch execution to a different one of the shards on a same hardware unit of the plurality of hardware units; wherein the long stall hint instruction is encountered between a long-latency producer instruction and a consumer instruction that utilizes a result of the long-latency producer instruction.
17. The system of claim 16, wherein the long stall hint comprises a pre-execution hint inserted at a location of a consumer of a long-latency instruction.
18. The system of claim 17, further comprising logic to test whether a memory read instruction generates the long stall upon encountering the long stall hint instruction.
19. The system of claim 16, further comprising: a JPC instruction configured to cause the processing cluster to switch the execution.
20. The system of claim 16, further comprising: a YIELD instruction configured to cause the processing cluster to switch the execution of the shards.
21. A method comprising: executing a single instruction multi-thread (SIMT) application as a plurality of threads; splitting the execution of the plurality of threads into sequentially executed groups of threads in response to execution divergence among the plurality of threads, each of the groups executed in a SIMT manner, wherein each of the groups is a shard; detecting a long stall in a first one of the shards, the detecting triggered by execution of a compiler-generated hint instruction in the first one of the shards; in response to detecting the long stall in the first one of the shards, switching execution from the first one of the shards to a second one of the shards; and wherein the compiler-generated hint instruction is executed in the first one of the shards between a long-latency producer instruction and a consumer instruction that utilizes a result of the long-latency producer instruction.
22. The method of claim 21, wherein the second one of the shards to execute is selected by the first one of the shards.
23. The method
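The independent claims all place the hint between a long-latency producer and a consumer of its result. A compiler pass of that general shape might look like the sketch below. Everything named here is an assumption for illustration: the `Instr` record, the `STALL_HINT` mnemonic, the set of long-latency opcodes, and the choice to put the hint immediately before the first consumer. The claims do not specify any of these.

```python
from dataclasses import dataclass

# Hypothetical set of long-latency producer opcodes (assumption).
LONG_LATENCY_OPS = {"LOAD", "TEX"}


@dataclass
class Instr:
    op: str
    dst: str = ""          # destination register, if any
    srcs: tuple = ()       # source registers


def insert_stall_hints(block):
    """Insert a STALL_HINT before the first consumer of each long-latency result."""
    out = []
    pending = set()        # registers written by producers and not yet consumed
    for ins in block:
        hit = pending & set(ins.srcs)
        if hit:
            out.append(Instr("STALL_HINT"))   # hypothetical hint mnemonic
            pending -= hit                    # later consumers need no new hint
        out.append(ins)
        if ins.op in LONG_LATENCY_OPS:
            pending.add(ins.dst)
        else:
            pending.discard(ins.dst)          # register overwritten by a fast op
    return out


block = [
    Instr("LOAD", "r1", ("r0",)),        # long-latency producer
    Instr("ADD", "r2", ("r0", "r0")),    # independent work
    Instr("ADD", "r3", ("r1", "r2")),    # first consumer of r1
    Instr("MUL", "r4", ("r1", "r3")),    # second consumer, no extra hint
]
hinted = insert_stall_hints(block)
print([i.op for i in hinted])
# ['LOAD', 'ADD', 'STALL_HINT', 'ADD', 'MUL']
```

Placing the hint directly before the consumer leaves the independent `ADD` free to run first, so the run-time test only fires when the result is still outstanding at the point where it is needed.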
from multiple instruction streams, e.g. multistreaming · CPC title
controlled by a single instruction for multiple threads [SIMT] in parallel · CPC title
Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues · CPC title
Thread control instructions · CPC title
Barrier synchronisation · CPC title