Reducing the number of sequential operations in an application to be performed on a shared memory cell
US-9449360-B2 · Sep 20, 2016 · US
US2016139934A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2016139934-A1 |
| Application number | US-201414543027-A |
| Country | US |
| Kind code | A1 |
| Filing date | Nov 17, 2014 |
| Priority date | Nov 17, 2014 |
| Publication date | May 19, 2016 |
| Grant date | — |
Official abstract text for this publication.
Systems and methods may process a single atomic operation. An instruction set may be generated to replace a plurality of atomic operations with a single atomic operation. The instruction set may include an accumulation instruction to compute a prefix sum for a plurality of initial values associated with a plurality of processing lanes to generate a plurality of accumulated values. The instruction set may also include a broadcast instruction to return a pre-existing value to be added with each of the plurality of accumulated values to generate a plurality of intermediate accumulated values. In one example, a graphics processor may execute the instruction set to process the single atomic operation.
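The abstract's sequence (accumulate, one atomic add, broadcast, subtract) can be illustrated with a minimal Python sketch. This is only a scalar simulation under stated assumptions, not the patented instruction set or any real GPU API. The function names, the one-element list standing in for the shared memory cell, and the choice of an inclusive prefix sum are all hypothetical. The "atomic" step is modelled as a plain read-modify-write that returns the old value.

```python
from itertools import accumulate


def per_lane_atomic_adds(memory, initial):
    """Baseline: every lane performs its own add on the shared cell.

    Returns the value each lane observed just before its own add.
    """
    returned = []
    for value in initial:
        returned.append(memory[0])
        memory[0] += value
    return returned


def single_atomic_add(memory, initial):
    """Hypothetical sketch: one add on the shared cell for all lanes.

    Assumes `initial` holds at least one lane value.
    """
    # Accumulation: inclusive prefix sum across the lanes.
    accumulated = list(accumulate(initial))
    # Move: the last accumulated value is the total for all lanes.
    accumulation_result = accumulated[-1]
    # Single atomic add: returns the pre-existing value, stores the sum.
    pre_existing = memory[0]
    memory[0] = pre_existing + accumulation_result
    # Broadcast: add the pre-existing value to every accumulated value.
    intermediate = [pre_existing + a for a in accumulated]
    # Subtract: remove each lane's own contribution to get its final value.
    return [m - v for m, v in zip(intermediate, initial)]


if __name__ == "__main__":
    lanes = [3, 1, 4, 1, 5, 9, 2, 6]  # eight illustrative lane values
    cell_a, cell_b = [100], [100]
    assert per_lane_atomic_adds(cell_a, lanes) == single_atomic_add(cell_b, lanes)
    assert cell_a == cell_b
```

In this sketch each lane ends up with the same value it would have seen had the lanes performed their adds one after another in lane order, while the shared cell is touched only once.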
Opening claim text (preview).
We claim:

1. A system comprising: an instruction module to generate an instruction set to replace a plurality of atomic operations with a single atomic operation, the instruction module including: an accumulation module to generate an accumulation instruction to compute a prefix sum for a plurality of initial values associated with a plurality of processing lanes to generate a plurality of accumulated values; and a broadcast module to generate a broadcast instruction to return a pre-existing value to be added with each of the plurality of accumulated values to generate a plurality of intermediate accumulated values; and a graphics processor to execute the instruction set to process the single atomic operation.

2. The system of claim 1, wherein the instruction set is to include two or more of a same number of instructions for a uniform source value operation and a non-uniform source value operation, only about 5 instructions to about 10 instructions, and no loops.

3. The system of claim 1, wherein the instruction module further includes two or more of: a move module to generate a move instruction to copy an accumulation result value based on the plurality of accumulated values from an accumulation register to a result register; an atomic instruction module to generate an atomic instruction to add the accumulation result value with the pre-existing value to generate an atomic instruction result value that is to replace the pre-existing value in memory; and a subtraction module to generate a subtract instruction to subtract between each of the plurality of initial values and each of the plurality of intermediate accumulated values to generate a plurality of final values associated with the plurality of processing lanes.

4. The system of claim 3, wherein the instruction module further includes a partition module to generate a partition instruction to logically partition the plurality of processing lanes into two or more subsets, wherein the accumulation module is to generate a first accumulation instruction for a plurality of first initial values associated with a first subset of the plurality of processing lanes to generate a plurality of first accumulated values and a second accumulation instruction for a plurality of second initial values associated with a second subset of the plurality of processing lanes to generate a plurality of second accumulated values.

5. The system of claim 4, wherein the instruction module further includes: a combination module to generate a combination instruction to add a first accumulation result value based on the plurality of first accumulated values with a second accumulation result value based on the plurality of second accumulated values to generate a combined accumulation result value; and a subset value update module to generate an update instruction to add the first accumulation result value with each of the plurality of second accumulated values to generate a plurality of updated accumulated values.

6. The system of claim 5, wherein the atomic instruction module is to generate an atomic instruction to add the combined accumulation result value with the pre-existing value to generate the atomic instruction result value that is to replace the pre-existing value in the memory, and wherein the broadcast module is to generate a first broadcast instruction to return the pre-existing value to be added with each of the plurality of first accumulated values to generate a plurality of first intermediate accumulated values and a second broadcast instruction to return the pre-existing value to be added with each of the plurality of updated accumulated values to generate a plurality of second intermediate accumulated values.

7. The system of claim 6, wherein the subtraction module is to generate a first subtract instruction to subtract between each of the plurality of first initial values and each of the plurality of first intermediate accumulated values and a second subtract instruction to subtract between each of the plurality of second initial values and each of the plurality of second intermediate accumulated values to generate the plurality of final values associated with the plurality of processing lanes.

8. The system of claim 1, further including a compiler to apply the instruction module to generate the instruction set in a graphics hardware machine language, wherein the graphics processor is to include a single instruction multiple data (SIMD) architecture, and wherein the partial prefix sum is to be computed up to an SIMD execution engine length including one or more of eight processing lanes, sixteen processing lanes, and thirty-two processing lanes.

9. A computer implemented method comprising: generating an instruction set to replace a plurality of atomic operations with a single atomic operation including: generating an accumulation instruction to compute a prefix sum for a plurality of initial values associated with a plurality of processing lanes to generate a plurality of accumulated values; and generating a broadcast instruction to return a pre-existing value to be added with each of the plurality of accumulated values to generate a plurality of intermediate accumulated values; and executing the instruction set to process the single atomic operation.

10. The computer implemented method of claim 9, wherein the instruction set includes two or more of a same number of instructions for a uniform source value operation and a non-uniform source value operation, only about 5 instructions to about 10 instructions, and no loops.

11. The computer implemented method of claim 9, further including two or more of: generating a move instruction to copy an accumulation result value based on the plurality of accumulated values from an accumulation register to a result register; generating an atomic instruction to add the accumulation result value with the pre-existing value to generate an atomic instruction result value that is to replace the pre-existing value in memory; and generating a subtract instruction to subtract between each of the plurality of initial values and each of the plurality of intermediate accumulated values to generate a plurality of final values associated with the plurality of processing lanes.

12. The computer implemented method of claim 11, further including: generating a partition instruction to logically partition the plurality of processing lanes into two or more subsets; generating a first accumulation instruction for a plurality of first initial values associated with a first subset of the plurality of processing lanes to generate a plurality of first accumulated values; and generating a second accumulation instruction for a plurality of second initial values associated with a second subset of the plurality of processing lanes to generate a plurality of second accumulated values.

13. The computer implemented method of claim 12, further including: generating a combination instruction to add a first accumulation result value based on the plurality of first accumulated values with a second accumulation result value based on the plurality of second accumulated values to generate a combined accumulation result value; and generating an update instruction to add the first accumulation result value with each of the plurality of second accumulated values to generate a plurality of updated accumulated values.

14. The computer implemented method of claim 13, further including: generating an atomic instruction to add the combined accumulation result value with the pre-existing value to generate the atomic instruction resu
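The partitioned variant described in claims 4 to 7 (and mirrored in claims 12 and 13) can be sketched the same way. The following Python simulation is a hedged illustration only: the function name, the `split` parameter, the two-subset partition, and the one-element list standing in for the shared memory cell are assumptions made for the example, and the "atomic" add is modelled as an ordinary read-modify-write.

```python
from itertools import accumulate


def partitioned_single_atomic_add(memory, initial, split):
    """Hypothetical sketch: two lane subsets, still one add on the cell.

    Assumes 0 < split < len(initial) so both subsets are non-empty.
    """
    # Partition: logically divide the lanes into two subsets.
    first, second = initial[:split], initial[split:]
    # First and second accumulation: inclusive prefix sum per subset.
    first_acc = list(accumulate(first))
    second_acc = list(accumulate(second))
    first_result, second_result = first_acc[-1], second_acc[-1]
    # Combination: total contribution of all lanes.
    combined = first_result + second_result
    # Update: offset the second subset by the first subset's total.
    updated = [first_result + a for a in second_acc]
    # Single atomic add: returns the pre-existing value, stores the sum.
    pre_existing = memory[0]
    memory[0] = pre_existing + combined
    # First and second broadcast: add the pre-existing value per subset.
    first_inter = [pre_existing + a for a in first_acc]
    second_inter = [pre_existing + a for a in updated]
    # First and second subtract: remove each lane's own contribution.
    first_final = [m - v for m, v in zip(first_inter, first)]
    second_final = [m - v for m, v in zip(second_inter, second)]
    return first_final + second_final


if __name__ == "__main__":
    lanes = list(range(1, 17))  # sixteen illustrative lane values
    cell = [10]
    finals = partitioned_single_atomic_add(cell, lanes, split=8)
    # Each lane should see the cell as if earlier lanes had already added.
    assert finals == [10 + sum(lanes[:i]) for i in range(len(lanes))]
```

The update step is what keeps the second subset consistent with the first: without adding the first subset's total, lanes in the second subset would compute values as if the first subset had contributed nothing.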
controlled by a single instruction for multiple data lanes [SIMD] · CPC title
Movement instructions, e.g. MOVE, SHIFT, ROTATE, SHUFFLE · CPC title
using a mask · CPC title
Instructions to perform operations on packed data, e.g. vector, tile or matrix operations · CPC title
Runtime instruction translation, e.g. macros · CPC title