US12367540B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12367540-B2 |
| Application number | US-202016951217-A |
| Country | US |
| Kind code | B2 |
| Filing date | Nov 18, 2020 |
| Priority date | Nov 18, 2020 |
| Publication date | Jul 22, 2025 |
| Grant date | Jul 22, 2025 |
Official abstract text for this publication.
An apparatus to facilitate processing in a multi-tile device is disclosed. The apparatus comprises a plurality of processing tiles, each including a memory device and a plurality of processing resources, coupled to the device memory, and a memory management unit to manage the memory devices in each of the plurality of tiles to perform allocation of memory resources among the memory devices for execution by the plurality of processing resources.
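The abstract only says that the memory management unit allocates memory among the per-tile memory devices. Claims 1 and 2 in the claim preview on this page add that a shared resource is replicated into each tile's memory, with each tile's page table mapping the same virtual address to a different physical address. The following is a minimal Python sketch of that idea, not the patented implementation: the names `Tile`, `MemoryManagementUnit`, `replicate`, and `read`, the bump allocator, and the address values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Tile:
    """Hypothetical model of one processing tile with its own memory device."""
    tile_id: int
    next_free: int = 0                               # next unused physical address
    memory: dict = field(default_factory=dict)       # physical address -> stored bytes
    page_table: dict = field(default_factory=dict)   # virtual address -> physical address

    def allocate(self, size: int) -> int:
        """Bump allocator: hand out the next free physical address range."""
        base = self.next_free
        self.next_free += size
        return base


class MemoryManagementUnit:
    """Hypothetical MMU that replicates a shared resource into every tile."""

    def __init__(self, tiles):
        self.tiles = tiles

    def replicate(self, virtual_address: int, data: bytes) -> None:
        # Copy the resource into each tile's local memory and record the
        # tile-specific physical address under the same virtual address.
        for tile in self.tiles:
            physical = tile.allocate(len(data))
            tile.memory[physical] = bytes(data)
            tile.page_table[virtual_address] = physical

    def read(self, tile_id: int, virtual_address: int) -> bytes:
        # A work item on a given tile translates through that tile's page
        # table, so it reaches the local copy.
        tile = self.tiles[tile_id]
        return tile.memory[tile.page_table[virtual_address]]


# Assumed layout: each tile's physical address space starts at a different base.
tiles = [Tile(0, next_free=0x00000), Tile(1, next_free=0x10000)]
mmu = MemoryManagementUnit(tiles)
mmu.replicate(0x4000, b"weights")
```

In this sketch, work items on tile 0 and tile 1 both read virtual address `0x4000` and get the same contents, while the two page tables hold different physical addresses for it.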
Opening claim text (preview).
What is claimed is:
1. An apparatus to facilitate processing in a multi-tile device, comprising: a plurality of distinct chiplets and a plurality of interconnect structures, respective chiplets of the plurality of distinct chiplets including a processing tile of a plurality of processing tiles, the plurality of distinct chiplets having a 2.5-dimensional (2.5D) arrangement, and respective processing tiles of the plurality of processing tiles include: a memory device; a plurality of processing resources, coupled to the memory device; and a memory management unit to manage the memory device in each of the plurality of processing tiles to perform allocation of memory resources of a workload among respective memory devices of the plurality of processing tiles to facilitate execution of the workload by the plurality of processing resources of the plurality of processing tiles of the plurality of distinct chiplets, wherein the memory management unit is configured to replicate a shared memory resource having a virtual address to respective memory devices of the plurality of processing tiles and enable work items of the workload executed at different processing tiles to access respective copies of the shared memory resource at different physical addresses via the virtual address.
2. The apparatus of claim 1, wherein the memory management unit includes a page table associated with the respective memory devices of the plurality of processing tiles, each page table to store a different physical address associated with the virtual address of the shared memory resource.
3. The apparatus of claim 2, wherein the workload is partitioned into a plurality of virtual partitions and the plurality of processing tiles are to respectively retrieve a virtual partition for execution based on a counter value that indicates a virtual partition identifier, the counter value to be incremented after retrieval of a virtual partition by a processing tile.
4. The apparatus of claim 3, wherein a first processing tile is configured to retrieve a first virtual partition based on a first virtual partition identifier indicated by the counter value and a second processing tile is configured to retrieve a second virtual partition based on a second virtual partition identifier indicated by the counter value.
5. The apparatus of claim 4, wherein the plurality of virtual partitions have associated dispatch parameters associated with work items to be dispatched on behalf of the virtual partition, the dispatch parameters including a global work size, a local work size and a work group count.
6. The apparatus of claim 1, wherein the memory management unit is to distribute the memory resources among respective memory devices of the plurality of processing tiles using one of a plurality of memory distributions based on memory access characteristics of the workload.
7. The apparatus of claim 1, wherein the memory management unit is to distribute the memory resources of the workload via assignment of a contiguous virtual address range to the memory resources across a portion of respective memory devices of the plurality of processing tiles.
8. The apparatus of claim 1, wherein the plurality of processing resources are synchronized.
9. A method to facilitate processing in a multi-tile device, comprising: receiving a workload to be processed at a graphics processing unit including a plurality of distinct chiplets and a plurality of interconnect structures, respective chiplets of the plurality of distinct chiplets including a processing tile of a plurality of processing tiles, the plurality of distinct chiplets having a 2.5-dimensional (2.5D) arrangement; generating a plurality of virtual partitions to process the workload; retrieving virtual partitions of the plurality of virtual partitions by respective processing tiles of the plurality of processing tiles based on a virtual partition identifier provided by circuitry configured to indicate a virtual partition available for retrieval; and scheduling the plurality of virtual partitions for execution at a plurality of processing resources included in the plurality of processing tiles of the plurality of distinct chiplets.
10. The method of claim 9, wherein a first virtual partition is executed at a first plurality of resources at a first processing tile and a second virtual partition is executed at a second plurality of resources at a second processing tile.
11. The method of claim 10, further comprising synchronizing the first virtual partition and the second virtual partition.
12. The method of claim 11, further comprising generating a command buffer upon receiving the workload.
13. The method of claim 12, wherein the plurality of virtual partitions comprises a plurality of dispatch parameters.
14. The method of claim 13, wherein the plurality of dispatch parameters comprise a global work size, a local work size and a work group count.
15. At least one non-transitory computer readable medium having instructions, which when executed by one or more processors, causes the one or more processors to: receive a workload to be processed at a graphics processing unit including a plurality of distinct chiplets and a plurality of interconnect structures, respective chiplets of the plurality of distinct chiplets including a processing tile of a plurality of processing tiles, the plurality of distinct chiplets having a 2.5-dimensional (2.5D) arrangement; generate a plurality of virtual partitions to process the workload; retrieving virtual partitions of the plurality of virtual partitions by respective processing tiles of the plurality of processing tiles based on a virtual partition identifier provided by circuitry configured to indicate a virtual partition available for retrieval; and schedule the plurality of virtual partitions for execution at a plurality of processing resources included in the plurality of processing tiles of the plurality of distinct chiplets.
16. The at least one non-transitory computer readable medium of claim 15, wherein a first virtual partition is executed at a first plurality of resources at a first processing tile and a second virtual partition is executed at a second plurality of resources at a second processing tile.
17. The at least one non-transitory computer readable medium of claim 16, having instructions, which when executed by one or more processors, further causes the one or more processors to synchronize the first virtual partition and the second virtual partition.
18. A graphics processing unit (GPU), comprising: a plurality of distinct chiplets and a plurality of interconnect structures, each of the plurality of distinct chiplets including a processing tile of a plurality of processing tiles, the plurality of distinct chiplets having a 2.5-dimensional (2.5D) arrangement, and respective processing tiles of the plurality of processing tiles include: a memory device; a plurality of processing resources, coupled to the memory device; an interface coupled between the plurality of processing tiles; and a memory management unit to manage the memory device in each of the plurality of processing tiles to perform allocation of memory resources among respective memory devices of the plurality of processing tiles to facilitate execution of a workload by the plurality of processing resources of the plurality of processing tiles of the plurality of distinct chiplets, wherein the memory management unit is configured to replicate a shared memory resource having a virtual address to respective memory devices of the plurality of
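Claims 3 to 5 and 9 to 14 describe a workload split into virtual partitions, each carrying dispatch parameters, with tiles retrieving partitions according to a counter that is incremented after every retrieval. A rough Python sketch of that scheme follows, using threads to stand in for tiles. Everything here is an assumption made for illustration: `VirtualPartition`, `make_partitions`, `PartitionCounter`, and `run_tiles` are hypothetical names, the even split of work items is one possible partitioning policy, and the claims describe the counter as circuitry rather than a software lock.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualPartition:
    """Hypothetical partition record holding the dispatch parameters of claim 5."""
    partition_id: int
    global_work_size: int
    local_work_size: int
    work_group_count: int


def make_partitions(total_work_items, local_work_size, partition_count):
    """Split a workload as evenly as possible (an assumed policy)."""
    base, extra = divmod(total_work_items, partition_count)
    partitions = []
    for pid in range(partition_count):
        size = base + (1 if pid < extra else 0)
        groups = -(-size // local_work_size)  # ceiling division
        partitions.append(VirtualPartition(pid, size, local_work_size, groups))
    return partitions


class PartitionCounter:
    """Shared counter whose value names the next partition available for retrieval."""

    def __init__(self, partition_count):
        self._next = 0
        self._count = partition_count
        self._lock = threading.Lock()

    def fetch(self):
        # Read the current identifier and increment it as one atomic step,
        # so no two tiles retrieve the same partition.
        with self._lock:
            if self._next >= self._count:
                return None
            pid = self._next
            self._next += 1
            return pid


def run_tiles(partitions, tile_count):
    """Let each simulated tile pull partitions until the counter is exhausted."""
    counter = PartitionCounter(len(partitions))
    executed = {tile: [] for tile in range(tile_count)}

    def tile_loop(tile_id):
        while (pid := counter.fetch()) is not None:
            executed[tile_id].append(partitions[pid].partition_id)

    threads = [threading.Thread(target=tile_loop, args=(t,)) for t in range(tile_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait for every tile before reporting results
    return executed
```

Which tile ends up with which partition depends on thread timing, so the sketch only guarantees that every partition is retrieved exactly once. The final `join` loop is a loose stand-in for the synchronization mentioned in claims 8, 11, and 17.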
Improving or facilitating administration, e.g. storage management · CPC title
Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices · CPC title
Single storage device · CPC title
Processor architectures; Processor configuration, e.g. pipelining · CPC title
Allocation or management of cache space · CPC title