US11409658B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11409658-B2 |
| Application number | US-202117161465-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 28, 2021 |
| Priority date | Mar 15, 2019 |
| Publication date | Aug 9, 2022 |
| Grant date | Aug 9, 2022 |
Official abstract text for this publication.
Embodiments are generally directed to data prefetching for graphics data processing. An embodiment of an apparatus includes one or more processors including one or more graphics processing units (GPUs); and a plurality of caches to provide storage for the one or more GPUs, the plurality of caches including at least an L1 cache and an L3 cache, wherein the apparatus to provide intelligent prefetching of data by a prefetcher of a first GPU of the one or more GPUs including measuring a hit rate for the L1 cache; upon determining that the hit rate for the L1 cache is equal to or greater than a threshold value, limiting a prefetch of data to storage in the L3 cache, and upon determining that the hit rate for the L1 cache is less than a threshold value, allowing the prefetch of data to the L1 cache.
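The threshold decision described in the abstract can be illustrated with a minimal sketch. This is not the patented implementation: the names `HitRateSampler` and `prefetch_targets`, the representation of the hit rate as a fraction, and the string labels for cache levels are all assumptions made for illustration. The abstract says a low hit rate allows prefetch to the L1 cache, while claim 1 words this as both the lower and the higher level cache; the sketch follows the claim wording.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class HitRateSampler:
    """Hypothetical helper: counts L1 accesses and hits over one sampling period."""
    hits: int = 0
    accesses: int = 0

    def record(self, hit: bool) -> None:
        # Count every access; count hits separately.
        self.accesses += 1
        if hit:
            self.hits += 1

    def hit_rate(self) -> float:
        # Fraction of accesses that hit; 0.0 when nothing was sampled.
        return self.hits / self.accesses if self.accesses else 0.0

    def reset(self) -> None:
        # Start a new sampling period.
        self.hits = 0
        self.accesses = 0


def prefetch_targets(l1_hit_rate: float, threshold: float) -> Tuple[str, ...]:
    """Pick destination caches for a prefetch (illustrative policy only)."""
    if l1_hit_rate >= threshold:
        # L1 is already hitting well: limit the prefetch to the L3 cache.
        return ("L3",)
    # L1 hit rate is below the threshold: allow the prefetch into L1 as well.
    return ("L1", "L3")
```

In this sketch a caller would record each L1 access during the sampling period, pass `hit_rate()` to `prefetch_targets`, and then call `reset()` before the next period. The threshold value itself is left to the caller because the abstract does not specify one.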
Opening claim text (preview).
What is claimed is:
1. An apparatus comprising: one or more processors including one or more graphics processing units (GPUs); and a plurality of caches to provide storage for the one or more GPUs, the plurality of caches including at least a lower level cache and a higher level cache; and wherein the apparatus to provide intelligent prefetching of data by a prefetcher of a first GPU of the one or more GPUs including: measuring a hit rate for the lower level cache over a sampling period, comparing the hit rate for the lower level cache to a threshold value number of hits; upon determining that the hit rate for the lower level cache is equal to or greater than the threshold value, limiting a prefetch of data to storage in the higher level cache, and upon determining that the hit rate for the lower level cache is less than the threshold value, allowing the prefetch of data to both the lower level cache and the higher level cache; and wherein, upon a compute operation operating out of the higher level cache, the apparatus is further to utilize a memory link during the operation of the higher level cache to maintain activity of memory bandwidth.
2. The apparatus of claim 1, wherein the one or more processors are further to determine higher level cache and memory activity at least in part utilizing the memory bandwidth.
3. The apparatus of claim 2, wherein the one or more processors are further to trigger prefetching and memory scrubbing activities based at least in part on the determined higher level cache and memory activity.
4. The apparatus of claim 1, wherein the apparatus further includes an interface to receive prefetch instructions from prefetchers of the one or more GPUs, and wherein the apparatus is to detect and eliminate unnecessary prefetches, including: upon the apparatus detecting two or more prefetches having a duplicate address, the apparatus is to eliminate one or more of the prefetches having the duplicate address; or upon the apparatus detecting a prefetch that relates to data that is uncacheable, the apparatus is to eliminate the prefetch.
5. The apparatus of claim 1, further comprising an execution unit of the one or more GPUs, the execution unit including a hardware preprocessor, the hardware preprocessor to have access to a table of IP addresses that a kernel is using, wherein the hardware preprocessor is to commence prefetching of IP addresses from the table of IP addresses ahead of execution of a thread.
6. The apparatus of claim 1, wherein a prefetcher of a GPU of the one or more GPUs is to prefetch an instruction directly into an instruction cache (I-cache), and wherein the prefetch of the instruction directly into the I-cache is to occur upon an application driver being aware of a next kernel, and the prefetch being issued for the next kernel when starting execution of a current kernel.
7. One or more non-transitory computer-readable storage mediums having stored thereon executable computer program instructions that, when executed by one or more processors, cause the one or more processors to perform operations comprising: measuring a hit rate for an lower level cache over a sampling period for a first graphics processing unit (GPU) of one or more GPUs of a computing system, the computing system further including a higher level cache; receiving a prefetch of data for the first GPU; comparing the hit rate for the lower level cache to a threshold value number of hits; upon determining that the hit rate for the lower level cache is equal to or greater than the threshold value, limiting the prefetch of the data to storage in the higher level cache; upon determining that the hit rate for the lower level cache is less than the threshold value, allowing the prefetch of the data to both the lower level cache and the higher level cache; and upon a compute operation operating out of the higher level cache, utilizing a memory link during the operation of the higher level cache to maintain activity of memory bandwidth.
8. The one or more computer-readable storage mediums of claim 7, further comprising instructions for determining higher level cache and memory activity at least in part utilizing the memory bandwidth.
9. The one or more computer-readable storage mediums of claim 8, further comprising instructions for triggering prefetching and memory scrubbing activities based at least in part on the determined higher level cache and memory activity.
10. The one or more computer-readable storage mediums of claim 7, further comprising instructions for detecting and eliminating unnecessary prefetches, including: upon detecting two or more prefetches having a duplicate address, eliminating one or more of the prefetches having the duplicate address; or upon detecting a prefetch that relates to data that is uncacheable, eliminating the prefetch.
11. The one or more computer-readable storage mediums of claim 7, further comprising instructions for commencing prefetching of IP addresses ahead of execution of a thread from a table of IP addresses that a kernel is using, wherein an execution unit of the one or more GPUs includes a hardware preprocessor, the hardware preprocessor having access to the table of IP addresses.
12. The one or more computer-readable storage mediums of claim 7, further comprising instructions for prefetching an instruction directly into an instruction cache (I-cache), wherein the prefetch of the instruction directly into the I-cache is to occur upon an application driver being aware of a next kernel, and wherein the prefetch being issued for the next kernel when starting execution of a current kernel.
13. A method comprising: measuring a hit rate for an lower level cache over a sampling period for a first graphics processing unit (GPU) of one or more GPUs of a computing system, the computing system further including a higher level cache; receiving a prefetch of data for the first GPU; comparing the hit rate for the lower level cache to a threshold value number of hits; upon determining that the hit rate for the lower level cache is equal to or greater than the threshold value, limiting the prefetch of the data to storage in the higher level cache; upon determining that the hit rate for the lower level cache is less than the threshold value, allowing the prefetch of the data to both the lower level cache and the higher level cache; and upon a compute operation operating out of the higher level cache, utilizing a memory link during the operation of the higher level cache to maintain activity of memory bandwidth.
14. The method of claim 13, further comprising determining higher level cache and memory activity at least in part utilizing the memory bandwidth.
15. The method of claim 14, further comprising triggering prefetching and memory scrubbing activities based at least in part on the determined higher level cache and memory activity.
16. The method of claim 13, further comprising detecting and eliminating unnecessary prefetches, including: upon detecting two or more prefetches having a duplicate address, eliminating one or more of the prefetches having the duplicate address; or upon detecting a prefetch that relates to data that is uncacheable, eliminating the prefetch.
17. The method of claim 13, further comprising commencing prefetching of IP addresses ahead of execution of a thread from a table of IP addresses that a kernel is using, wherein an execution unit of the one or more GPUs includes a hardware preprocessor, the hardware preprocessor having access to the table of IP addresses.
18
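The elimination of unnecessary prefetches recited in claims 4, 10, and 16 can be sketched as a simple filter. This is an illustration under stated assumptions, not the claimed hardware: the function name `filter_prefetches`, the use of integer addresses, and the caller-supplied `is_cacheable` predicate are hypothetical, and keeping the first of several duplicates is only one way to "eliminate one or more" of them.

```python
from typing import Callable, Iterable, List


def filter_prefetches(
    addresses: Iterable[int],
    is_cacheable: Callable[[int], bool],
) -> List[int]:
    """Drop duplicate-address and uncacheable prefetches (illustrative only)."""
    seen = set()
    kept = []
    for addr in addresses:
        if not is_cacheable(addr):
            # Prefetch relates to uncacheable data: eliminate it.
            continue
        if addr in seen:
            # A prefetch for this address was already kept: eliminate the duplicate.
            continue
        seen.add(addr)
        kept.append(addr)
    return kept
```

For example, with requests for addresses `0x100`, `0x200`, `0x100`, `0x300` and a predicate that marks `0x200` as uncacheable, only `0x100` and `0x300` remain, in their original order.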
with two or more cache hierarchy levels (with multilevel cache hierarchies G06F12/0811) · CPC title
using selective caching, e.g. bypass · CPC title
Details relating to cache prefetching · CPC title
Details relating to cache mapping · CPC title
with prefetch · CPC title