What technology area does this patent fall under?

Primary CPC classification G06F9/3887. Mapped technology areas include Physics.

When was this patent published?

Publication date: Jun 25, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Programmatically controlled data multicasting across multiple compute engines

US12020035B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12020035-B2
Application number	US-202217691288-A
Country	US
Kind code	B2
Filing date	Mar 10, 2022
Priority date	Mar 10, 2022
Publication date	Jun 25, 2024
Grant date	Jun 25, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

This specification describes a programmatic multicast technique enabling one thread (for example, in a cooperative group array (CGA) on a GPU) to request data on behalf of one or more other threads (for example, executing on respective processor cores of the GPU). The multicast is supported by tracking circuitry that interfaces between multicast requests received from processor cores and the available memory. The multicast is designed to reduce cache (for example, layer 2 cache) bandwidth utilization enabling strong scaling and smaller tile sizes.
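
The bandwidth claim in the abstract can be illustrated with a toy count. The Python sketch below is not taken from the patent. It assumes a hypothetical model in which `num_ctas` thread arrays all need the same tile of `tile_bytes` bytes, and it compares L2 read traffic when every thread array fetches its own copy with the case where one thread requests the tile on behalf of all of them. The function name and parameters are illustrative only.

```python
def l2_read_bytes(num_ctas: int, tile_bytes: int, multicast: bool) -> int:
    """Bytes read from L2 when num_ctas thread arrays need the same tile.

    Hypothetical model (an assumption, not the patent's analysis):
    without multicast each thread array issues its own read; with
    multicast a single read is issued and the response is fanned out.
    """
    if num_ctas < 1 or tile_bytes < 0:
        raise ValueError("num_ctas must be >= 1 and tile_bytes >= 0")
    reads = 1 if multicast else num_ctas
    return reads * tile_bytes


unicast = l2_read_bytes(num_ctas=4, tile_bytes=16 * 1024, multicast=False)
shared = l2_read_bytes(num_ctas=4, tile_bytes=16 * 1024, multicast=True)
print(unicast, shared, unicast // shared)  # 65536 16384 4
```

Under this model the saving grows linearly with the number of receivers, which is consistent with the abstract's point that lower cache bandwidth per tile makes smaller tiles practical.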

First claim

Opening claim text (preview).

The invention claimed is:
1. A processing system comprising: a plurality of processors; a distributed shared memory comprising a plurality of distributed shared memory areas, wherein each processor of the plurality of processors is locally connected to a respective one of the plurality of distributed shared memory areas, wherein the plurality of processors are configured to simultaneously execute a plurality of threads, one of the threads executing on a first of the plurality of processors generating a memory access request for data for one or more other threads of the threads executing on one or more second ones of the plurality of processors; and packet distribution circuitry configured to route to each processor of the one or more second ones of the plurality of processors, for storage in the distributed shared memory locally connected to said each processor, a respective portion of response data received in response to the memory access request.
2. The processing system according to claim 1 further comprising a memory interface circuitry, wherein the memory interface circuitry is configured to transmit the memory access request to a memory hierarchy including a cache memory.
3. The processing system according to claim 1, wherein the packet distribution circuitry includes tracking circuitry and is further configured to, in response to receiving the memory access request, storing metadata from the memory access request in the tracking circuitry and generating a modified memory access request for the requested data, and, in response to receiving the response data, forming a multicast response packet including the metadata and transmitting the multicast response packet to at least the one or more second ones of said plurality of processors.
4. The processing system according to claim 3, wherein the stored metadata includes identifying information of the one or more other threads, and the modified memory access request is devoid of the identifying information of the one or more other threads.
5. The processing system according to claim 3, wherein the packet distribution circuitry further includes packet generation circuitry that, in response to receiving the multicast response packet, generates a first response packet and a second response packet each routed to a respective one of the plurality of processors.
6. The processing system according to claim 5, wherein the packet distribution circuitry is configured to transport the multicast response packet in a portion of the packet distribution circuitry before generating said first response packet and said second response packet.
7. The processing system according to claim 1, wherein the plurality of threads comprise a plurality of cooperative thread arrays (CTAs) launched as a cooperative group array (CGA), wherein a respective one of the CTAs is launched on each processor of the plurality of processors.
8. The processing system according to claim 1, wherein the memory access request comprises requester information, receiver information for each of a plurality of receivers, and requested data information.
9. The processing system according to claim 8, wherein the receiver information further includes, for each receiving cooperative thread arrays (CTA), a receiver identifier and an offset in the corresponding shared memory area.
10. The processing system according to claim 9, wherein the offset for each of the receiving CTAs is identical.
11. The processing system according to claim 9, wherein all receiver identifiers are specified in a list.
12. The processing system according to claim 9, wherein all receiver identifiers are specified in a bitmask.
13. The processing system according to claim 8, wherein the memory access request further comprises a synchronization barrier offset in the distributed shared memory.
14. The processing system according to claim 8, wherein the memory access request further comprises one or more operation to be performed by a receiver prior to or during writing the response data to the distributed shared memory area of the receiver.
15. The processing system according to claim 1, wherein at least one of the respective processors is configured to, in response to receiving multicast data, transmit an acknowledgment to another one of the processors.
16. The processing system according to claim 1, comprising a first set of counters and a second set of counters for each of the plurality of processors, the first set of counters at the first processor including a respective counter representing multicast data received from each of a plurality of other said processors, and the second set of counters at the first processor including a counter representing multicast data requested by the first processor on behalf of others of the plurality of processors.
17. The processing system according to claim 1, wherein the packet distribution circuitry comprises a plurality of crossbar switches, each crossbar switch connecting a cache portion to the plurality of processors.
18. The processing system according to claim 1, wherein the packet distribution circuitry is configured to select a crossbar switch for transporting a packet including the requested data based on a destination identifier of the packet.
19. The processing system according to claim 1, wherein the packet distribution circuitry is configured to write respective portions of the response data to the distributed shared memory of at least the one or more second ones of the plurality of processors.
20. The processing system according to claim 1, wherein the processing system comprises a graphics processing unit (GPU).
21. A system comprising at least one central processing unit (CPU) and at least one processing system according to claim 1.
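
The request and response flow recited in claims 3 to 5 and 8 to 12 can be pictured with a small software model. The Python sketch below is an illustration under stated assumptions, not the claimed hardware: the names `MulticastRequest`, `Tracker`, and `run` are hypothetical, receivers are encoded as a bitmask with one shared offset (as in claims 10 and 12), and every receiver gets the full payload, whereas claim 1 speaks of a "respective portion" of the response data.

```python
from dataclasses import dataclass, field


@dataclass
class MulticastRequest:
    requester: int      # processor issuing the request (claim 8)
    receiver_mask: int  # bit i set => processor i receives data (claim 12)
    smem_offset: int    # same offset in every receiver's shared memory (claim 10)
    address: int        # location of the requested data in backing memory
    length: int         # number of bytes requested


@dataclass
class Tracker:
    """Toy stand-in for the tracking circuitry of claim 3 (hypothetical)."""
    pending: dict = field(default_factory=dict)
    next_id: int = 0

    def on_request(self, req: MulticastRequest):
        """Store receiver metadata; forward a request without it (claim 4)."""
        tid = self.next_id
        self.next_id += 1
        self.pending[tid] = (req.receiver_mask, req.smem_offset)
        return tid, req.address, req.length

    def on_response(self, tid: int, data: bytes):
        """Re-attach metadata and emit one packet per receiver (claim 5)."""
        mask, offset = self.pending.pop(tid)
        receivers = [i for i in range(mask.bit_length()) if (mask >> i) & 1]
        return [(r, offset, data) for r in receivers]


def run(req: MulticastRequest, memory: bytes, num_processors: int, smem_size: int = 64):
    """Serve one multicast request and return each processor's shared memory."""
    smem = [bytearray(smem_size) for _ in range(num_processors)]
    tracker = Tracker()
    tid, addr, length = tracker.on_request(req)
    data = memory[addr:addr + length]  # a single read from the memory hierarchy
    for proc, offset, payload in tracker.on_response(tid, data):
        smem[proc][offset:offset + len(payload)] = payload
    return smem


memory = bytes(range(32))
req = MulticastRequest(requester=0, receiver_mask=0b0110,
                       smem_offset=8, address=4, length=4)
smem = run(req, memory, num_processors=4)
print(bytes(smem[1][8:12]), bytes(smem[2][8:12]))
```

In this model the backing memory is read once, and the tracker is the only component that knows which processors receive the data, mirroring the claim 4 language that the forwarded request is devoid of receiver information.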

Assignees

Nvidia Corp

Inventors

Classifications

G06F9/3887 (Primary) · controlled by a single instruction for multiple data lanes [SIMD]
G06F13/1689 (Primary) · Synchronisation and timing concerns (synchronisation on a memory bus G06F13/4234)
G06T1/20 (Primary) · Processor architectures; Processor configuration, e.g. pipelining
H04L49/101 · using crossbar or matrix
G06T1/60 · Memory management

Patent family

Related publications grouped by family.

View patent family 87760123

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12020035B2 cover? This specification describes a programmatic multicast technique enabling one thread (for example, in a cooperative group array (CGA) on a GPU) to request data on behalf of one or more other threads (for example, executing on respective processor cores of the GPU). The multicast is supported by tracking circuitry that interfaces between multicast requests received from processor cores and the av…
Who is the assignee on this patent? Nvidia Corp