Network-aware cache coherence protocol enhancement
US-10402327-B2 · Sep 3, 2019 · US
US11275688B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11275688-B2 |
| Application number | US-201916700671-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 2, 2019 |
| Priority date | Dec 2, 2019 |
| Publication date | Mar 15, 2022 |
| Grant date | Mar 15, 2022 |
Official abstract text for this publication.
A processing system includes a plurality of compute units, with each compute unit having an associated first cache of a plurality of first caches, and a second cache shared by the plurality of compute units. The second cache operates to manage transfers of cachelines between the first caches of the plurality of first caches such that when multiple candidate first caches contain a valid copy of a requested cacheline, the second cache selects the candidate first cache having the shortest total path from the second cache to the candidate first cache and from the candidate first cache to the compute unit issuing a request for the requested cacheline.
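The selection rule described in the abstract can be illustrated with a minimal sketch. The function name `select_source_cache`, the two distance tables, and the example hop counts are hypothetical and do not come from the patent; the sketch only assumes that both path lengths are known in advance and can be added together.

```python
from typing import Dict, Iterable, Tuple


def select_source_cache(
    candidates: Iterable[str],
    requester: str,
    dist_from_shared: Dict[str, int],
    dist_between: Dict[Tuple[str, str], int],
) -> str:
    """Return the candidate first cache with the shortest total path.

    Total path = (second cache -> candidate) + (candidate -> requester).
    All names and distance values here are illustrative assumptions.
    """

    def total_path(candidate: str) -> int:
        return dist_from_shared[candidate] + dist_between[(candidate, requester)]

    # min() keeps the first candidate on a tie and raises ValueError
    # if there are no candidates at all.
    return min(candidates, key=total_path)


# Hypothetical distances: cu2 is closest to the shared cache,
# but cu1 gives the shorter total path to the requester cu0.
example_choice = select_source_cache(
    candidates=["cu1", "cu2"],
    requester="cu0",
    dist_from_shared={"cu1": 2, "cu2": 1},
    dist_between={("cu1", "cu0"): 1, ("cu2", "cu0"): 4},
)
```

In this made-up example the candidate nearest the shared cache is not the one selected, because the second leg of the path (candidate to requester) dominates the total.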
Opening claim text (preview).
What is claimed is:
1. A processing system comprising: a plurality of compute units, each compute unit including at least one processor core and at least one private cache of a plurality of private caches, each private cache configured to store a corresponding set of cachelines; a shared cache that is shared by the plurality of compute units and coupled to the plurality of compute units via one or more interconnects, wherein the shared cache is configured to: in response to receipt of a request for an identified cacheline from a requesting compute unit, identify a subset of the plurality of private caches that has a valid copy of the identified cacheline; identify the private cache of the subset having a lowest transfer cost for providing a valid copy of the identified cacheline to the requesting compute unit; and transmit a probe request to a target compute unit having the identified private cache via at least one interconnect of the one or more interconnects; and wherein, in response to receipt of the probe request, the target compute unit is configured to transfer a valid copy of the identified cacheline to the requesting compute unit via at least one interconnect of the one or more interconnects; and wherein the shared cache is configured to identify which private cache of the subset has the lowest transfer cost by: determining, for each private cache of the subset, a corresponding transfer cost metric based on a first distance metric and a second distance metric, the first distance metric representing a distance between the shared cache and the private cache via the one or more interconnects and the second distance metric representing a distance between the private cache and the requesting compute unit; and identifying the private cache having the lowest corresponding transfer cost metric as the private cache with the lowest transfer cost.
2. The processing system of claim 1, wherein the lowest transfer cost is based on a sum of the first distance metric and the second distance metric.
3. The processing system of claim 2, wherein the first distance metric and the second distance metric are expressed in terms of clock cycles.
4. The processing system of claim 2, wherein: the shared cache is configured to determine the transfer cost metrics for the private caches of the subset based on topology information representing a topology of the compute units, the shared cache, and the one or more interconnects.
5. The processing system of claim 4, wherein the topology information further represents one or more policies regarding transfer of cachelines via the one or more interconnects.
6. The processing system of claim 4, wherein: the topology information is implemented as a look-up table accessible by the shared cache, the look-up table configured to receive as inputs an identifier of the requesting compute unit and an identifier of the compute unit having a corresponding private cache, and to provide as an output a corresponding transfer cost metric.
7. The processing system of claim 4, wherein: the topology information is implemented as hardware logic accessible by the shared cache, the hardware logic configured to receive as inputs an identifier of the requesting compute unit and an identifier of the compute unit having a corresponding private cache, and to provide as an output a corresponding transfer cost metric.
8. The processing system of claim 7, wherein the hardware logic is one of: hard-coded logic or programmable logic.
9. The processing system of claim 4, wherein: the topology information includes information representing at least one of: a representation of a physical topology of paths between the plurality of compute units via the one or more interconnects; characteristics of the one or more interconnects; and at least one policy for transferring cachelines; and the shared cache is configured to determine the transfer cost metrics based on calculations performed using the information.
10. The processing system of claim 1, further comprising: a shadow tag memory accessible by the shared cache, the shadow tag memory comprising a plurality of entries, each entry storing state and address information for a corresponding cacheline of one of the private caches; and wherein the shared cache is to identify the subset of the plurality of private caches that has a valid copy of the identified cacheline using the shadow tag memory.
11. The processing system of claim 1, wherein: the probe request includes at least one of an identifier of the requesting compute unit and an identifier for the request.
12. The processing system of claim 1, wherein: the shared cache is configured to store a separate set of cachelines; and responsive to determining the separate set of cachelines includes a valid copy of the identified cacheline, the shared cache is to transfer a copy of the identified cacheline to the requesting compute unit to satisfy the request for the identified cacheline in place of identifying a subset of the plurality of private caches, identifying a private cache, and transmitting a probe request.
13. A method for cacheline transfers in a system comprising a plurality of compute units and a shared cache, each compute unit including at least one private cache of a plurality of private caches, the method comprising: in response to a request for an identified cacheline from a requesting compute unit, identifying, at the shared cache, a subset of the compute units that have a valid copy of the identified cacheline; identifying, at the shared cache, the private cache of the subset having a lowest transfer cost for providing a valid copy of the identified cacheline to the requesting compute unit; transmitting a probe request from the shared cache to a target compute unit having the identified private cache via at least one interconnect of the one or more interconnects; and in response to receipt of the probe request, transmitting a valid copy of the identified cacheline from the target compute unit to the requesting compute unit via at least one interconnect of the one or more interconnects; and wherein identifying which private cache of the subset has the lowest transfer cost comprises: determining, for each private cache of the subset, a corresponding transfer cost metric based on a first distance metric and a second distance metric, the first distance metric representing a distance between the shared cache and the private cache via the one or more interconnects and the second distance metric representing a distance between the private cache and the requesting compute unit; and identifying the private cache having the lowest corresponding transfer cost metric as the private cache with the lowest transfer cost.
14. The method of claim 13, wherein the lowest transfer cost is based on a sum of the first distance metric and the second distance metric.
15. The method of claim 14, wherein the first distance metric and the second distance metric are expressed in terms of clock cycles.
16. The method of claim 14, wherein: determining a corresponding transfer cost metric comprises determining the corresponding transfer cost metric based on topology information representing a topology of the compute units, the shared cache, and the one or more interconnects.
17. The method of claim 16, wherein: the topology information is implemented as at least one of: a look-up table accessible by the shared cache, the look-up table configured to receive as inputs an identifier of the requesting compute unit and an identifier of the target compu
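As a reading aid for the claims above, the request flow of claims 1, 6, 10, 11, and 12 might be modeled roughly as follows. This is a software sketch under stated assumptions, not the claimed hardware: the class and field names (`SharedCache`, `ShadowTagEntry`, `ProbeRequest`, `cost_lut`) are hypothetical, coherence state is reduced to a single `valid` flag, and the look-up table is assumed to be pre-filled with one transfer cost metric per (requester, holder) pair.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple, Union


@dataclass(frozen=True)
class ShadowTagEntry:
    owner: str     # compute unit whose private cache holds the line
    address: int   # cacheline address
    valid: bool    # simplified stand-in for coherence state


@dataclass(frozen=True)
class ProbeRequest:
    target: str      # compute unit asked to forward the line
    requester: str   # identifier of the requesting compute unit
    address: int
    request_id: int  # identifier for the request


class SharedCache:
    def __init__(
        self,
        shadow_tags: Iterable[ShadowTagEntry],
        cost_lut: Dict[Tuple[str, str], int],
        lines: Optional[Dict[int, bytes]] = None,
    ) -> None:
        self.shadow_tags = list(shadow_tags)
        # (requester id, holder id) -> transfer cost metric (assumed precomputed)
        self.cost_lut = dict(cost_lut)
        # Cachelines held by the shared cache itself
        self.lines = dict(lines or {})

    def holders(self, address: int) -> List[str]:
        """Private caches with a valid copy, according to the shadow tags."""
        return [e.owner for e in self.shadow_tags
                if e.valid and e.address == address]

    def handle_request(
        self, requester: str, address: int, request_id: int
    ) -> Union[bytes, ProbeRequest, None]:
        # Shared cache already holds the line: serve it directly, no probe.
        if address in self.lines:
            return self.lines[address]
        subset = self.holders(address)
        if not subset:
            return None  # miss everywhere; the fallback path is not modeled
        # Pick the holder with the lowest transfer cost metric from the table.
        target = min(subset, key=lambda h: self.cost_lut[(requester, h)])
        return ProbeRequest(target, requester, address, request_id)
```

With hypothetical costs of 7 for `cu1` and 4 for `cu2`, a request from `cu0` for a line valid in both would produce a probe addressed to `cu2`; the target would then forward the line to `cu0`, a step this sketch leaves out.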
using adaptive policy · CPC title
Scalability · CPC title
Resource optimization · CPC title
Hit rate improvement · CPC title
using selective caching, e.g. bypass · CPC title