What technology area does this patent fall under?

The primary CPC classification is G06F9/3851. Mapped technology areas include Physics.

When was this patent published?

Publication date: Mar 19, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Techniques for divergent thread group execution scheduling

US11934867B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11934867-B2
Application number	US-202117184420-A
Country	US
Kind code	B2
Filing date	Feb 24, 2021
Priority date	Jul 23, 2020
Publication date	Mar 19, 2024
Grant date	Mar 19, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Warp sharding techniques to switch execution between divergent shards on instructions that trigger a long stall, thereby interleaving execution between diverged threads within a warp instead of across warps. The technique may be applied to mitigate pipeline stalls in applications with low warp occupancy and high divergence. Warp data cache locality may also be improved by concentrating memory accesses within a warp rather than spreading them across warps.
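The effect described in the abstract can be illustrated with a toy cycle-count model. This is only a sketch under stated assumptions, not the patented scheduler: the op names (`load`, `hint`, `use`, `alu`), the 20-cycle latency, the one-cycle cost per instruction, and the round-robin choice of the next shard are all hypothetical. It compares running the divergent shards of one warp back to back against switching shards when a hint instruction finds that a load is still outstanding.

```python
from dataclasses import dataclass

LOAD_LATENCY = 20  # assumed long-latency cost in cycles (illustrative value)


@dataclass
class Shard:
    """A group of threads of one warp that follow the same control path."""
    name: str
    program: list
    pc: int = 0
    ready_at: int = 0  # cycle at which the pending load result arrives

    def done(self):
        return self.pc >= len(self.program)


def pick_other(shards, current):
    """Round-robin choice of another unfinished shard (hypothetical policy)."""
    n = len(shards)
    for step in range(1, n):
        i = (current + step) % n
        if not shards[i].done():
            return i
    return current


def run_warp(programs, switch_on_stall):
    """Return the number of cycles needed to finish every shard of one warp."""
    shards = [Shard(f"shard{i}", list(p)) for i, p in enumerate(programs)]
    cycle, current = 0, 0
    while not all(s.done() for s in shards):
        s = shards[current]
        if s.done():
            current = pick_other(shards, current)
            continue
        op = s.program[s.pc]
        s.pc += 1
        if op == "load":
            s.ready_at = cycle + LOAD_LATENCY   # long-latency producer issued
        elif op == "use":
            cycle = max(cycle, s.ready_at)      # consumer waits for the result
        cycle += 1                              # every instruction costs one cycle
        # Run-time test triggered by the hint: switch only if the load is
        # still outstanding, i.e. a long stall would actually occur.
        if op == "hint" and switch_on_stall and cycle < s.ready_at:
            current = pick_other(shards, current)
    return cycle


# Two divergent shards, each with the hint between producer and consumer.
programs = [["load", "hint", "use", "alu"], ["load", "hint", "use", "alu"]]
print(run_warp(programs, switch_on_stall=False))  # 44
print(run_warp(programs, switch_on_stall=True))   # 24
```

In this model the second shard issues its own load while the first is waiting, so the two latencies overlap. If the load has already completed when the hint is reached, the run-time test fails and no switch takes place.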

First claim

Opening claim text (preview).

What is claimed is:
1. A system comprising: a memory to store an application; and an execution scheduler configured to: generate at least one warp for the application, the at least one warp comprising a plurality of threads; and relinquish control from a first thread of the warp to a second thread of the warp on condition that: the first thread encounters a long stall hint instruction inserted in the first thread, triggering a run-time test; the run-time test detects a long stall; the first thread and the second thread are divergent threads; and wherein the long stall hint instruction is inserted in the first thread between a long-latency producer instruction and a consumer instruction that utilizes a result of the long-latency producer instruction.
2. The system of claim 1, wherein the long stall is at a consumer location in the first thread for a result of an instruction generating a memory stall.
3. The system of claim 1, wherein the control is relinquished via a YIELD instruction.
4. The system of claim 1, wherein the control is relinquished via a JPC instruction.
5. The system of claim 1, further comprising: the at least one warp comprising a different thread for each ray cast by a ray tracing application; and the execution scheduler generating a plurality of shards of the at least one warp, each shard consisting of threads of the plurality of threads executing a same shader module.
6. A system comprising: at least one graphics processing unit configured to sequentially execute different shards of a warp in a Single Instruction Multiple Thread (SIMT) manner, each shard comprising one or more threads of the warp; and logic to: detect a long stall hint instruction in a first shard of the different shards; and in response to detecting the long stall hint instruction: perform a runtime test for a long stall in the first shard; on condition that the long stall is detected: suspend execution of the first shard; select a second shard of the different shards for execution, the second shard selected at least in part for being divergent from the first shard; initiate execution of the second shard; wherein the long stall hint instruction is detected in the first shard between a long-latency producer instruction and a consumer instruction that utilizes a result of the long-latency producer instruction.
7. The system of claim 6, wherein the long stall hint instruction comprises a pre-execution hint inserted at a consumer location of a result of a long-latency instruction.
8. The system of claim 7, further comprising logic to test whether an instruction generates the long stall upon encountering the long stall hint instruction during execution of the first shard.
9. The system of claim 6, further comprising: a JPC instruction configured to cause the at least one graphics processing unit to suspend execution of the first shard and initiate execution of the second shard.
10. The system of claim 6, further comprising: a YIELD instruction configured to cause the at least one graphics processing unit to suspend execution of the first shard and initiate execution of the second shard.
11. The system of claim 10, wherein the YIELD instruction is configured to leave configured execution barriers unaltered.
12. A method comprising: executing a single instruction multi-thread (SIMT) application as a plurality of threads; splitting the execution of the plurality of threads into sequentially executed groups of threads in response to execution divergence among the plurality of threads, each of the groups executed in a SIMT manner, wherein each of the groups is a shard; detecting a long stall in a first one of the shards, the detecting triggered by execution of a compiler-generated hint instruction in the first one of the shards; in response to detecting the long stall in the first one of the shards, switching execution from the first one of the shards to a second one of the shards; and wherein the compiler-generated hint instruction is detected in the first one of the shards between a long-latency producer instruction and a consumer instruction that utilizes a result of the long-latency producer instruction.
13. The method of claim 12, wherein the second one of the shards to execute is selected by the first one of the shards.
14. The method of claim 12, wherein the second one of the shards to execute is selected from a worklist by a hardware scheduler.
15. The method of claim 12, wherein splitting the execution of the plurality of threads into the sequentially executed groups of threads further comprises copying shared state values of the plurality of threads to local memory locations for each of the groups of threads.
16. A system comprising: a memory; a processing cluster comprising a plurality of hardware units; and a graphics processing unit to: execute an application stored in the memory in parallel on the plurality of hardware units, wherein on each of the hardware units the application is executed in a single instruction multi-threaded (SIMT) manner as a plurality of threads; on each of the hardware units, split the execution of the plurality of threads into sequentially executed groups of threads upon the occurrence of thread divergence, each of the groups being a shard; detect a long stall in any one of the shards as a result of encountering a long stall hint instruction in the any one of the shards; in response to detecting the long stall in any one of the shards, switch execution to a different one of the shards on a same hardware unit of the plurality of hardware units; wherein the long stall hint instruction is encountered between a long-latency producer instruction and a consumer instruction that utilizes a result of the long-latency producer instruction.
17. The system of claim 16, wherein the long stall hint comprises a pre-execution hint inserted at a location of a consumer of a long-latency instruction.
18. The system of claim 17, further comprising logic to test whether a memory read instruction generates the long stall upon encountering the long stall hint instruction.
19. The system of claim 16, further comprising: a JPC instruction configured to cause the processing cluster to switch the execution.
20. The system of claim 16, further comprising: a YIELD instruction configured to cause the processing cluster to switch the execution of the shards.
21. A method comprising: executing a single instruction multi-thread (SIMT) application as a plurality of threads; splitting the execution of the plurality of threads into sequentially executed groups of threads in response to execution divergence among the plurality of threads, each of the groups executed in a SIMT manner, wherein each of the groups is a shard; detecting a long stall in a first one of the shards, the detecting triggered by execution of a compiler-generated hint instruction in the first one of the shards; in response to detecting the long stall in the first one of the shards, switching execution from the first one of the shards to a second one of the shards; and wherein the compiler-generated hint instruction is executed in the first one of the shards between a long-latency producer instruction and a consumer instruction that utilizes a result of the long-latency producer instruction.
22. The method of claim 21, wherein the second one of the shards to execute is selected by the first one of the shards.
23. The method
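The independent claims place the hint instruction between a long-latency producer and the consumer of its result, and claims 7 and 17 describe it as inserted at the consumer location. A compiler-side pass of that shape might look like the sketch below. It is a hypothetical illustration only: the tuple encoding `(opcode, dest, sources)`, the `STALL_HINT` opcode name, and the set of opcodes treated as long-latency are assumptions and do not come from the patent text.

```python
LONG_LATENCY_OPS = {"LOAD", "TEX"}  # assumed set of long-latency producers


def insert_stall_hints(instrs):
    """Insert a hint immediately before the first consumer of each
    long-latency result. Each instruction is (opcode, dest, sources)."""
    out = []
    pending = set()  # registers written by long-latency ops, not yet consumed
    for op, dest, srcs in instrs:
        hit = pending & set(srcs)
        if hit:
            # This instruction consumes a long-latency result: place the
            # hint at the consumer location, after the producer.
            out.append(("STALL_HINT", None, ()))
            pending -= hit
        if op in LONG_LATENCY_OPS:
            pending.add(dest)
        else:
            pending.discard(dest)  # register overwritten by a short-latency op
        out.append((op, dest, srcs))
    return out


program = [
    ("LOAD", "r1", ("r0",)),       # long-latency producer
    ("ADD", "r2", ("r3", "r4")),   # independent work
    ("MUL", "r5", ("r1", "r2")),   # first consumer of r1
    ("ADD", "r6", ("r1", "r5")),   # later consumer, no second hint
]
print([op for op, _, _ in insert_stall_hints(program)])
# ['LOAD', 'ADD', 'STALL_HINT', 'MUL', 'ADD']
```

Placing the hint at the consumer rather than directly after the producer leaves any independent instructions in between free to run before the run-time stall test is made.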

Assignees

Nvidia Corp

Inventors

Classifications

G06F9/3851 · Primary
from multiple instruction streams, e.g. multistreaming · CPC title
G06F9/3888
controlled by a single instruction for multiple threads [SIMT] in parallel · CPC title
G06F9/4881 · Primary
Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues · CPC title
G06F9/3009
Thread control instructions · CPC title
G06F9/522
Barrier synchronisation · CPC title

Patent family

Related publications grouped by family.

View patent family 79689354

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11934867B2 cover?: Warp sharding techniques to switch execution between divergent shards on instructions that trigger a long stall, thereby interleaving execution between diverged threads within a warp instead of across warps. The technique may be applied to mitigate pipeline stalls in applications with low warp occupancy and high divergence. Warp data cache locality may also be improved by concentrating memory a…
Who is the assignee on this patent?: Nvidia Corp