Cooperative group arrays

US12333311B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12333311-B2
Application number  US-202217691621-A
Country             US
Kind code           B2
Filing date         Mar 10, 2022
Priority date       Mar 10, 2022
Publication date    Jun 17, 2025
Grant date          Jun 17, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A new level(s) of hierarchy, Cooperative Group Arrays (CGAs), and an associated new hardware-based work distribution/execution model are described. A CGA is a grid of thread blocks (also referred to as cooperative thread arrays (CTAs)). CGAs provide co-scheduling, e.g., control over where CTAs are placed/executed in a processor (such as a GPU), relative to the memory required by an application and relative to each other. Hardware support for such CGAs guarantees concurrency and enables applications to see more data locality, reduced latency, and better synchronization between all the threads in tightly cooperating collections of CTAs programmably distributed across different (e.g., hierarchical) hardware domains or partitions.
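The hierarchy the abstract describes (a grid made of CGAs, each CGA made of CTAs) can be pictured with a small indexing sketch. This is only an illustration, not the patent's hardware or any real API: the `CgaGrid` class, its field names, and the rule that the grid must be a whole multiple of the CGA shape are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import product

Dim3 = tuple[int, int, int]


@dataclass(frozen=True)
class CgaGrid:
    """Hypothetical launch shape: a grid measured in CTAs, tiled by CGAs."""
    grid_ctas: Dim3  # CTAs along x, y, z in the whole grid
    cga_ctas: Dim3   # CTAs along x, y, z inside one CGA

    def __post_init__(self):
        # Assumption for this sketch: CGAs tile the grid exactly.
        for g, c in zip(self.grid_ctas, self.cga_ctas):
            if c <= 0 or g % c != 0:
                raise ValueError("grid must be a whole multiple of the CGA shape")

    def cgas_per_grid(self) -> Dim3:
        """Number of CGAs along each axis."""
        return tuple(g // c for g, c in zip(self.grid_ctas, self.cga_ctas))

    def locate(self, cta: Dim3) -> tuple[Dim3, Dim3]:
        """Split a grid-wide CTA coordinate into (CGA coordinate, coordinate inside that CGA)."""
        cga = tuple(i // c for i, c in zip(cta, self.cga_ctas))
        local = tuple(i % c for i, c in zip(cta, self.cga_ctas))
        return cga, local

    def members(self, cga: Dim3) -> list[Dim3]:
        """Grid-wide coordinates of every CTA in one CGA, i.e. the co-scheduled unit."""
        ranges = [range(k * c, (k + 1) * c) for k, c in zip(cga, self.cga_ctas)]
        return [tuple(p) for p in product(*ranges)]


shape = CgaGrid(grid_ctas=(8, 4, 1), cga_ctas=(2, 2, 1))
print(shape.cgas_per_grid())          # (4, 2, 1)
print(shape.locate((5, 3, 0)))        # ((2, 1, 0), (1, 1, 0))
print(len(shape.members((2, 1, 0))))  # 4
```

In this picture, the CTAs returned by `members` are the ones that would have to be resident at the same time for the CGA to run.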

First claim

Opening claim text (preview).

The invention claimed is:

1. A processing system comprising: a work distributor hardware circuit configured to launch a collection of thread groups on a set of plural processors while providing a hardware-based guarantee that all thread groups of the collection can be launched at the same time; the work distributor hardware circuit being further configured to speculatively launch the thread groups in the collection to confirm that the thread groups are able to launch and/or run concurrently on the set of plural processors before launching any of the thread groups in the collection.

2. The processing system of claim 1 wherein the work distributor hardware circuit comprises a multilevel hardware circuit configured to distribute the collection of thread groups to processors in a predefined hardware cluster.

3. The processing system of claim 1 wherein the set of processors comprise a predefined hardware domain and the work distributor hardware circuit is configured to launch the thread groups on any processor(s) within the predefined hardware domain.

4. The processing system of claim 3 wherein the predefined hardware domain comprises a GPU, a μGPU, a GPC or a TPC.

5. The processing system of claim 3 wherein the predefined hardware domain comprises a nested hierarchy of processors, and the work distributor hardware circuit is configured to schedule the thread groups to execute concurrently on processors in different levels of the nested hierarchy of processors.

6. The processing system of claim 1 wherein the collection of thread groups comprises a cooperative group array representable as a multidimensional grid.

7. The processing system of claim 1 wherein the work distributor hardware circuit is further configured to broadcast a grid launch packet to the plural processors.

8. A method of executing instructions on at least one processing system comprising: determining a cooperative group array of plural thread blocks, each thread block comprising plural threads; speculatively launching the cooperative group array of thread blocks to determine whether the plural thread blocks will be able to execute concurrently on plural parallel processors; and when the speculative launching reveals the cooperative group array of thread blocks will be able to execute concurrently on plural parallel processors, launching the cooperative group array of thread blocks on the plural parallel processors.

9. The method of claim 8 wherein the launching throttles, at a hardware level, shared memory usage of the concurrently launched cooperative group array of thread blocks.

10. The method of claim 8 wherein the plural parallel processors comprise streaming multiprocessors.

11. The method of claim 8 wherein the cooperative group array is determined based on a grid.

12. The method of claim 8 wherein the respective plural processors are all within a same hardware domain associated with the cooperative group array.

13. The method of claim 12 wherein the hardware domain comprises a GPC, a μGPU, a GPC, or a TPC.

14. A processing system comprising: a memory storing a cooperative group array (CGA) comprising plural cooperative thread arrays; and a work distributor hardware circuit configured to provide a hardware-based guarantee that the plural cooperative thread arrays of the CGA can be launched concurrently, the work distributor hardware circuit being configured to speculatively launch the cooperative group array across multiple processing cores to determine whether the cooperative group array can launch concurrently, the work distributor hardware circuit actually launching the cooperative group array across the multiple processing cores only if the speculative launch determines that all cooperative thread arrays can launch concurrently.

15. The processing system of claim 14 wherein the processing cores comprise streaming multiprocessors.

16. The processing system of claim 14 wherein the work distributor hardware circuit comprises registers, combinatorial logic and a hardware state machine.

17. The processing system of claim 14 wherein the work distributor hardware circuit comprises a multi-level work distribution architecture to provide CGA launch on associated hardware affinity/domain and support nesting of multiple levels of CGAs.

18. The processing system of claim 14 wherein the work distributor hardware circuit comprises a load balancer, resource trackers, a TPC enable table, a local memory (LMEM) block index table, credit counters, a task table, and a priority-sorted task table.

19. The processing system of claim 14 wherein the work distributor hardware circuit is configured to receive a launch command specifying a CGA grid, including an enumeration of various dimensions of composite thread blocks and CGAs within the specified CGA grid.

20. The processing system of claim 14 wherein the work distributor hardware circuit is configured to query and launch CTAs from multiple CGAs but works on one CGA at a time.
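The speculative launch recited in claims 1, 8 and 14 amounts to an all-or-nothing placement: try to place every CTA against the current resource state, and launch for real only if the whole collection fits. The sketch below models that idea in software under loud assumptions. The `Processor` and `CtaRequest` types, the two tracked resources (CTA slots and shared memory), and the first-fit placement order are all hypothetical; the claimed mechanism is a hardware circuit whose actual resource model is not given on this page.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    """Hypothetical resource tracker for one processor (e.g. one SM)."""
    free_cta_slots: int
    free_shared_mem: int  # bytes


@dataclass(frozen=True)
class CtaRequest:
    """Hypothetical per-CTA resource demand."""
    shared_mem: int  # bytes


def speculative_launch(procs, ctas):
    """Trial placement on scratch copies of the trackers.

    Returns a list giving a processor index for each CTA, or None if the
    whole collection cannot be resident at once. Real state is untouched.
    """
    scratch = [[p.free_cta_slots, p.free_shared_mem] for p in procs]
    placement = []
    for cta in ctas:
        # First-fit is an assumption of this sketch, not the patent's policy.
        for idx, (slots, smem) in enumerate(scratch):
            if slots >= 1 and smem >= cta.shared_mem:
                scratch[idx][0] -= 1
                scratch[idx][1] -= cta.shared_mem
                placement.append(idx)
                break
        else:
            return None  # one CTA does not fit, so nothing launches
    return placement


def launch_cga(procs, ctas):
    """Commit resources only after the speculative pass succeeds."""
    placement = speculative_launch(procs, ctas)
    if placement is None:
        return None
    for idx, cta in zip(placement, ctas):
        procs[idx].free_cta_slots -= 1
        procs[idx].free_shared_mem -= cta.shared_mem
    return placement


sms = [Processor(2, 48_000), Processor(1, 48_000)]
cga = [CtaRequest(20_000)] * 3
print(launch_cga(sms, cga))  # [0, 0, 1]
print(launch_cga(sms, cga))  # None: no free slots remain, nothing is committed
```

Because the trial runs on scratch copies, a failed attempt leaves the trackers unchanged, which is the property that prevents a partially launched CGA. With CTAs of differing sizes, first-fit can reject a collection that some other placement would accept, so treat the policy here as a placeholder.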

Assignees

Nvidia Corp

Inventors

Classifications

  • from multiple instruction streams, e.g. multistreaming

  • Thread control instructions

  • Buffers; Shared memory; Pipes

  • Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

  • Offload

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12333311B2 cover?
A new level(s) of hierarchy, Cooperative Group Arrays (CGAs), and an associated new hardware-based work distribution/execution model are described. A CGA is a grid of thread blocks (also referred to as cooperative thread arrays (CTAs)). CGAs provide co-scheduling, e.g., control over where CTAs are placed/executed in a processor (such as a GPU), relative to the memory required by an application and r…
Who is the assignee on this patent?
Nvidia Corp
What technology area does this patent fall under?
Primary CPC classification G06F9/3888. Mapped technology areas include Physics.
When was this patent published?
Publication date Jun 17, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).