Programmatically controlled data multicasting across multiple compute engines

US12020035B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12020035-B2
Application number: US-202217691288-A
Country: US
Kind code: B2
Filing date: Mar 10, 2022
Priority date: Mar 10, 2022
Publication date: Jun 25, 2024
Grant date: Jun 25, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

This specification describes a programmatic multicast technique enabling one thread (for example, in a cooperative group array (CGA) on a GPU) to request data on behalf of one or more other threads (for example, executing on respective processor cores of the GPU). The multicast is supported by tracking circuitry that interfaces between multicast requests received from processor cores and the available memory. The multicast is designed to reduce cache (for example, layer 2 cache) bandwidth utilization enabling strong scaling and smaller tile sizes.
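
The bandwidth argument in the abstract can be illustrated with a toy model. This is only a sketch under stated assumptions, not the patented circuitry: `L2Cache`, `unicast_load`, and `multicast_load` are hypothetical names invented here, and the "cache" merely counts read requests. It compares the case where every core fetches the same tile itself with the case where one thread requests the tile on behalf of the others.

```python
from dataclasses import dataclass


@dataclass
class L2Cache:
    """Toy cache (hypothetical) that only counts the read requests it serves."""
    data: dict
    reads: int = 0

    def read(self, address):
        self.reads += 1
        return self.data[address]


def unicast_load(cache, address, num_cores):
    """Baseline: every core issues its own request for the same tile."""
    return {core: cache.read(address) for core in range(num_cores)}


def multicast_load(cache, address, receivers):
    """One thread requests the tile; the single response is copied to all receivers."""
    payload = cache.read(address)  # one cache read serves the whole group
    return {core: payload for core in receivers}


cache = L2Cache(data={0x100: "tile A"})

unicast = unicast_load(cache, 0x100, num_cores=4)
unicast_reads = cache.reads

cache.reads = 0
multicast = multicast_load(cache, 0x100, receivers=[0, 1, 2, 3])
multicast_reads = cache.reads

print(unicast_reads, multicast_reads)  # 4 1
```

In this model every core ends up with the same data either way, but the multicast path touches the cache once instead of once per core, which is the kind of saving the abstract attributes to the technique.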

First claim

Opening claim text (preview).

The invention claimed is:

1. A processing system comprising: a plurality of processors; a distributed shared memory comprising a plurality of distributed shared memory areas, wherein each processor of the plurality of processors is locally connected to a respective one of the plurality of distributed shared memory areas, wherein the plurality of processors are configured to simultaneously execute a plurality of threads, one of the threads executing on a first of the plurality of processors generating a memory access request for data for one or more other threads of the threads executing on one or more second ones of the plurality of processors; and packet distribution circuitry configured to route to each processor of the one or more second ones of the plurality of processors, for storage in the distributed shared memory locally connected to said each processor, a respective portion of response data received in response to the memory access request.

2. The processing system according to claim 1 further comprising a memory interface circuitry, wherein the memory interface circuitry is configured to transmit the memory access request to a memory hierarchy including a cache memory.

3. The processing system according to claim 1, wherein the packet distribution circuitry includes tracking circuitry and is further configured to, in response to receiving the memory access request, storing metadata from the memory access request in the tracking circuitry and generating a modified memory access request for the requested data, and, in response to receiving the response data, forming a multicast response packet including the metadata and transmitting the multicast response packet to at least the one or more second ones of said plurality of processors.

4. The processing system according to claim 3, wherein the stored metadata includes identifying information of the one or more other threads, and the modified memory access request is devoid of the identifying information of the one or more other threads.

5. The processing system according to claim 3, wherein the packet distribution circuitry further includes packet generation circuitry that, in response to receiving the multicast response packet, generates a first response packet and a second response packet each routed to a respective one of the plurality of processors.

6. The processing system according to claim 5, wherein the packet distribution circuitry is configured to transport the multicast response packet in a portion of the packet distribution circuitry before generating said first response packet and said second response packet.

7. The processing system according to claim 1, wherein the plurality of threads comprise a plurality of cooperative thread arrays (CTAs) launched as a cooperative group array (CGA), wherein a respective one of the CTAs is launched on each processor of the plurality of processors.

8. The processing system according to claim 1, wherein the memory access request comprises requester information, receiver information for each of a plurality of receivers, and requested data information.

9. The processing system according to claim 8, wherein the receiver information further includes, for each receiving cooperative thread arrays (CTA), a receiver identifier and an offset in the corresponding shared memory area.

10. The processing system according to claim 9, wherein the offset for each of the receiving CTAs is identical.

11. The processing system according to claim 9, wherein all receiver identifiers are specified in a list.

12. The processing system according to claim 9, wherein all receiver identifiers are specified in a bitmask.

13. The processing system according to claim 8, wherein the memory access request further comprises a synchronization barrier offset in the distributed shared memory.

14. The processing system according to claim 8, wherein the memory access request further comprises one or more operation to be performed by a receiver prior to or during writing the response data to the distributed shared memory area of the receiver.

15. The processing system according to claim 1, wherein at least one of the respective processors is configured to, in response to receiving multicast data, transmit an acknowledgment to another one of the processors.

16. The processing system according to claim 1, comprising a first set of counters and a second set of counters for each of the plurality of processors, the first set of counters at the first processor including a respective counter representing multicast data received from each of a plurality of other said processors, and the second set of counters at the first processor including a counter representing multicast data requested by the first processor on behalf of others of the plurality of processors.

17. The processing system according to claim 1, wherein the packet distribution circuitry comprises a plurality of crossbar switches, each crossbar switch connecting a cache portion to the plurality of processors.

18. The processing system according to claim 1, wherein the packet distribution circuitry is configured to select a crossbar switch for transporting a packet including the requested data based on a destination identifier of the packet.

19. The processing system according to claim 1, wherein the packet distribution circuitry is configured to write respective portions of the response data to the distributed shared memory of at least the one or more second ones of the plurality of processors.

20. The processing system according to claim 1, wherein the processing system comprises a graphics processing unit (GPU).

21. A system comprising at least one central processing unit (CPU) and at least one processing system according to claim 1.
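
The request and response handling recited in claims 3, 4, 8 to 10, and 12 can be sketched in software. This is a loose illustration under stated assumptions, not the claimed hardware: the names `MulticastRequest`, `Tracker`, `receivers_from_mask`, and the `tag` field are hypothetical, and the sketch folds the multicast response packet of claim 3 and the per-receiver packets of claim 5 into a single step.

```python
from dataclasses import dataclass


@dataclass
class MulticastRequest:
    """Hypothetical request layout loosely following claims 8 to 10 and 12."""
    requester: int      # processor issuing the request
    receiver_mask: int  # bit i set means processor i receives the data (bitmask form)
    smem_offset: int    # one shared-memory offset used by every receiver
    address: int        # location of the requested data


def receivers_from_mask(mask):
    """Decode a receiver bitmask into a list of processor IDs."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


class Tracker:
    """Hypothetical software stand-in for the tracking circuitry."""

    def __init__(self):
        self.entries = {}
        self.next_tag = 0

    def on_request(self, req):
        """Store the metadata; return a modified request with no receiver info."""
        tag = self.next_tag
        self.next_tag += 1
        self.entries[tag] = req
        return {"tag": tag, "address": req.address}

    def on_response(self, tag, data):
        """Re-attach the stored metadata and emit one packet per receiver."""
        req = self.entries.pop(tag)
        return [
            {"dest": r, "smem_offset": req.smem_offset, "data": data}
            for r in receivers_from_mask(req.receiver_mask)
        ]


memory = {0x40: b"tile"}                         # stand-in for the memory hierarchy
shared_memory = {core: {} for core in range(4)}  # one shared memory area per processor

tracker = Tracker()
request = MulticastRequest(requester=0, receiver_mask=0b1101, smem_offset=16, address=0x40)
modified = tracker.on_request(request)
packets = tracker.on_response(modified["tag"], memory[modified["address"]])

for packet in packets:
    shared_memory[packet["dest"]][packet["smem_offset"]] = packet["data"]

receivers = [packet["dest"] for packet in packets]
```

With the mask `0b1101`, processors 0, 2, and 3 each receive the data at offset 16 of their own shared memory area, while processor 1 receives nothing. The modified request that reaches memory carries only a tag and an address, mirroring the idea in claim 4 that receiver identities stay in the tracker.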

Assignees

Nvidia Corp

Inventors

Classifications

  • G06F9/3887 (Primary)

    controlled by a single instruction for multiple data lanes [SIMD] · CPC title

  • Synchronisation and timing concerns (synchronisation on a memory bus G06F13/4234) · CPC title

  • G06T1/20 (Primary)

    Processor architectures; Processor configuration, e.g. pipelining · CPC title

  • using crossbar or matrix · CPC title

  • Memory management · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12020035B2 cover?
This specification describes a programmatic multicast technique enabling one thread (for example, in a cooperative group array (CGA) on a GPU) to request data on behalf of one or more other threads (for example, executing on respective processor cores of the GPU). The multicast is supported by tracking circuitry that interfaces between multicast requests received from processor cores and the av…
Who is the assignee on this patent?
Nvidia Corp
What technology area does this patent fall under?
Primary CPC classification G06F9/3887. Mapped technology areas include Physics.
When was this patent published?
Publication date Jun 25, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).