Independent tuning of multiple hardware prefetchers
US-2019095333-A1 · Mar 28, 2019 · US
US10540288B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10540288-B2 |
| Application number | US-201815949692-A |
| Country | US |
| Kind code | B2 |
| Filing date | Apr 10, 2018 |
| Priority date | Feb 2, 2018 |
| Publication date | Jan 21, 2020 |
| Grant date | Jan 21, 2020 |
Official abstract text for this publication.
Techniques are described in which a system having multiple processing units processes a series of work units in a processing pipeline, where some or all of the work units access or manipulate data stored in non-coherent memory. In one example, this disclosure describes a method that includes identifying, prior to completing processing of a first work unit with a processing unit of a processor having multiple processing units, a second work unit that is expected to be processed by the processing unit after the first work unit. The method also includes processing the first work unit, and prefetching, from non-coherent memory, data associated with the second work unit into a second cache segment of the buffer cache, wherein prefetching the data associated with the second work unit occurs concurrently with at least a portion of the processing of the first work unit by the processing unit.
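The overlap described in the abstract can be pictured as two cache segments that swap roles after each work unit. The following is a minimal sketch, not the patented implementation: it models non-coherent memory as a Python dict, runs the steps sequentially (real hardware would overlap them), and only records the order in which they are issued. The names `run_pipeline`, `handler`, and the log tuple layout are hypothetical and chosen for illustration.

```python
from collections import deque


def run_pipeline(work_units, memory, handler):
    """Process work units using two ping-pong cache segments.

    Assumption: `memory` is a dict standing in for non-coherent memory,
    keyed by work unit name. Prefetch, processing, and flush run one
    after another here; the log only captures their ordering.
    """
    queue = deque(work_units)
    segments = [None, None]  # two segments of one buffer cache
    active = 0
    log = []       # (event, work_unit, segment_index)
    messages = []  # "data is available in memory" notifications

    if not queue:
        return log, messages

    # Initial fill: load the first work unit's data into the active segment.
    first = queue[0]
    segments[active] = (first, memory[first])
    log.append(("prefetch", first, active))

    while queue:
        current = queue.popleft()
        standby = 1 - active

        # Identify the next work unit by queue position and prefetch its
        # data into the standby segment before the current one finishes.
        if queue:
            upcoming = queue[0]
            segments[standby] = (upcoming, memory[upcoming])
            log.append(("prefetch", upcoming, standby))

        # Process the current work unit out of the active segment.
        name, data = segments[active]
        modified = handler(data)
        segments[active] = (name, modified)
        log.append(("process", current, active))

        # Flush the active segment: write modified data back to memory.
        memory[current] = modified
        segments[active] = None
        log.append(("flush", current, active))

        # Signal that the modified data can now be read from memory.
        messages.append(current)

        # Swap roles: the standby segment becomes active.
        active = standby

    return log, messages


memory = {"wu1": 1, "wu2": 2, "wu3": 3}
log, messages = run_pipeline(["wu1", "wu2", "wu3"], memory, lambda x: x * 10)
for entry in log:
    print(entry)
```

In this model the third work unit's data lands in segment 0, the same segment the first work unit just vacated, which mirrors the segment reuse recited in the claims.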
Opening claim text (preview).
What is claimed is:

1. A method comprising:
identifying, prior to completing processing of a first work unit with a processing unit of a processor having multiple processing units, a second work unit that is expected to be processed by the processing unit after the first work unit, each of the first work unit and the second work unit associated with one or more stream fragments, and each of the first work unit and the second work unit specifying a work unit handler for processing the one or more stream fragments;
processing, by the processing unit, the first work unit, wherein processing the first work unit includes accessing first work unit data associated with the first work unit and stored within a first cache segment of a level one (L1) buffer cache for the processing unit and generating, from the first work unit data, modified first work unit data;
prefetching, from non-coherent memory, second work unit data associated with the second work unit into a second cache segment of the L1 buffer cache, wherein prefetching the second work unit data associated with the second work unit occurs concurrently with at least a portion of the processing of the first work unit by the processing unit;
flushing, by the processing unit and after processing the first work unit, the first cache segment of the L1 buffer cache, wherein flushing the first cache segment includes storing, in the non-coherent memory, the modified first work unit data;
generating, by the processing unit, a message indicating that the modified first work unit data can be accessed from the non-coherent memory;
processing, by the processing unit, the second work unit, wherein processing the second work unit includes accessing the second work unit data associated with the second work unit prefetched into the second cache segment of the L1 buffer cache and generating, from the second work unit data, modified second work unit data;
identifying, by the processing unit and prior to completing processing of the second work unit, a third work unit that is expected to be processed by the processing unit after the second work unit; and
prefetching, by the processing unit and from the non-coherent memory, third work unit data associated with the third work unit into the first cache segment of the L1 buffer cache, wherein prefetching the third work unit data associated with the third work unit occurs concurrently with at least a portion of the processing of the second work unit by the processing unit and concurrently with at least a portion of the flushing the first cache segment.

2. The method of claim 1, wherein each of flushing the first cache segment, prefetching third work unit data associated with the third work unit, and processing the second work unit occur concurrently.

3. The method of claim 1, wherein generating the message indicating that the modified first work unit data generated by the first work unit can be accessed from the non-coherent memory occurs prior to completion of the flushing of the first cache segment, the method further comprising: delivering, by the processing unit to a second processing unit, the message, wherein delivering the message is gated by completion of the flushing of the first cache segment.

4. The method of claim 1, wherein generating the message indicating that the modified first work unit data generated by the first work unit can be accessed from the non-coherent memory transfers ownership of at least a portion of non-coherent memory.

5. The method of claim 1, wherein the message specifies lines of data associated with the third work unit to prefetch.

6. The method of claim 1, wherein prefetching second work unit data associated with the second work unit includes masking invalid addresses.

7. The method of claim 1, wherein at least one of the first work unit and the second work unit includes an identifier of a subsequent work unit for further processing the one or more stream fragments upon completion of the work unit.

8. The method of claim 1, wherein at least one of the first work unit and the second work unit includes one or more fields to store input or output arguments for processing the one or more stream fragments.

9. The method of claim 1, wherein at least one of the first work unit and the second work unit includes one or more fields to store auxiliary variables to be used when processing the stream fragment.

10. The method of claim 1, further comprising, prior to processing the first work unit and the second work unit, storing the first work unit and the second work unit in a work unit queue associated with the processing unit, and wherein identifying the second work unit that is expected to be processed by the processing unit after the first work unit comprises identifying the second work based on a position of the second work unit in the work unit queue.

11. The method of claim 1, further comprising: prefetching, from coherent memory, information including at least one of: header information and state information.

12. The method of claim 1, wherein each of the first work unit and the second work unit specify one of the processing units for executing the work unit handler.

13. The method of claim 1, wherein the first cache segment includes a first plurality of logically associated cache lines within the L1 buffer cache, and wherein the second cache segment includes a second plurality of logically associated cache lines within the L1 buffer cache.

14. A device comprising:
a plurality of processing units, each of the processing units configured to execute one or more of a plurality of work unit handlers (WU handlers) for processing stream fragments, and wherein each of the processing units include a level one (L1) buffer cache;
a memory to store the stream fragments;
a plurality of queues configured to hold work units, each of the work units associated with one or more stream fragments, and wherein each of the work units identifies one of the WU handlers for processing the one or more stream fragments; and
a load store unit configured to:
identify, prior to completion of processing of a first work unit by a first processing unit of the plurality of processing units, a second work unit that is expected to be processed by the first processing unit after the first work unit, wherein the first processing unit processes the first work unit by accessing first work unit data associated with the first work unit in an active segment of the L1 buffer cache included within the first processing unit and generating, from the first work unit data, modified first work unit data,
prefetch, from the memory, second work unit data associated with the second work unit into a standby cache segment of the L1 buffer cache included within the first processing unit, wherein prefetching the second work unit data associated with the second work unit occurs concurrently with at least a portion of the processing of the first work unit by the first processing unit,
flush, after the processing of the first work unit is complete, the active cache segment of the buffer cache, wherein flushing the active cache segment includes storing, in the memory, the modified first work unit data, and
generate a message indicating that the modified first work unit data processed by the first work unit can be accessed from the memory.

15. The device of claim 14, wherein the memory is non-coherent memory, and wherein the first processing unit is configured to: process the second work unit, wherein processing the second work unit includes accessing the second work unit data associated with the second work unit prefetched into the standby cac
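Claims 7 through 9 and 12 list fields a work unit may carry, and claim 3 describes a message that is generated early but delivered only once the flush has completed. The sketch below is one possible way to model both ideas in software, purely for illustration. The class names `WorkUnit` and `GatedMessage`, every field name, and the `try_deliver` method are assumptions; the claims do not prescribe any particular layout or API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WorkUnit:
    """Hypothetical software view of a work unit; field names are assumed."""
    handler: str                            # WU handler that processes the fragments
    fragments: list                         # associated stream fragments
    processing_unit: Optional[int] = None   # target processing unit (claim 12)
    continuation: Optional[str] = None      # subsequent work unit identifier (claim 7)
    args: dict = field(default_factory=dict)  # input/output arguments (claim 8)
    aux: dict = field(default_factory=dict)   # auxiliary variables (claim 9)


class GatedMessage:
    """Message created before the flush ends, delivered only afterwards (claim 3)."""

    def __init__(self, payload):
        self.payload = payload
        self.flush_done = False  # set to True when the cache segment flush completes

    def try_deliver(self, inbox):
        # Delivery is gated: refuse until the flush has completed.
        if not self.flush_done:
            return False
        inbox.append(self.payload)
        return True


wu = WorkUnit(handler="parse_headers", fragments=["frag0"], continuation="checksum")
msg = GatedMessage(f"{wu.handler}: modified data available")
inbox = []
print(msg.try_deliver(inbox))  # False, flush still in progress
msg.flush_done = True
print(msg.try_deliver(inbox))  # True, message now reaches the second processing unit
```

Gating delivery rather than generation lets the message be prepared while the flush is still in flight, so the receiving unit never reads stale data from memory.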
Overlapped cache accessing, e.g. pipeline (G06F12/0846 takes precedence) · CPC title
Networked environment · CPC title
with main memory updating (G06F12/0806 takes precedence) · CPC title
Details of cache specific to multiprocessor cache arrangements · CPC title
using clearing, invalidating or resetting means · CPC title