US9606797B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9606797-B2 |
| Application number | US-201213724633-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 21, 2012 |
| Priority date | Dec 21, 2012 |
| Publication date | Mar 28, 2017 |
| Grant date | Mar 28, 2017 |
Official abstract text for this publication.
In one embodiment, the present invention includes a processor with a vector execution unit to execute a vector instruction on a vector having a plurality of individual data elements, where the vector instruction is of a first width and the vector execution unit is of a smaller width. The processor further includes a control logic coupled to the vector execution unit to compress a number of execution cycles consumed in execution of the vector instruction when at least some of the individual data elements are not to be operated on by the vector instruction. Other embodiments are described and claimed.
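As a rough illustration of the idea in the abstract, the sketch below counts the execution cycles needed when a wide vector instruction is issued over a narrower execution unit, first with no compression and then when groups whose elements are all disabled are skipped. The function names, the list-of-bits mask representation, and the 16-wide instruction on a 4-wide unit are assumptions chosen for illustration; none of them come from the patent text.

```python
def cycles_without_compression(simd_width, unit_width):
    """Cycles to push every element group through the unit (ceiling division)."""
    return (simd_width + unit_width - 1) // unit_width


def cycles_skipping_empty_groups(mask, unit_width):
    """Cycles when a group is skipped because none of its elements is enabled.

    `mask` is a list of 0/1 flags, one per data element (hypothetical layout).
    """
    groups = [mask[i:i + unit_width] for i in range(0, len(mask), unit_width)]
    return sum(1 for group in groups if any(group))


# Hypothetical 16-element instruction on a 4-wide unit:
# elements 0-3 and 8-11 are enabled, the rest are disabled.
example_mask = [1] * 4 + [0] * 4 + [1] * 4 + [0] * 4
baseline = cycles_without_compression(len(example_mask), 4)   # 4 cycles
compressed = cycles_skipping_empty_groups(example_mask, 4)    # 2 cycles
```

Skipping whole groups only helps when the disabled elements happen to line up with group boundaries; the claims address the less convenient case by permuting channels between quadrants.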
Opening claim text (preview).
What is claimed is:
1. A processor comprising: an execution unit having a data path including a plurality of lanes, each of the lanes to execute an operation on at least one channel of a plurality of channels of a single instruction multiple data (SIMD) instruction responsive to the SIMD instruction, the execution unit having a plurality of quadrants and to perform the SIMD instruction in a number of execution cycles; and a decode logic including compaction circuitry to calculate a minimum number of execution cycles to execute the SIMD instruction based on an active lane count, compare the minimum number of execution cycles to an active quadrant value, and based on the comparison, compact the number of execution cycles, including permutation of at least some of the plurality of channels of the SIMD instruction, wherein a number of permutations between the quadrants is minimized by the compaction circuitry, to reduce the number of execution cycles for execution of the SIMD instruction based at least in part on the calculation and an execution mask associated with the SIMD instruction, the execution mask based at least in part on an instruction predicate mask, a dispatch mask and a conditional mask.
2. The processor of claim 1, wherein the compaction circuitry is to reduce the number of execution cycles for execution of the SIMD instruction by at least one execution cycle when the execution mask indicates that a set of channels of the SIMD instruction to be issued to the execution unit during the at least one execution cycle are to be unused.
3. The processor of claim 2, wherein the compaction circuitry is to cause a next set of channels of the SIMD instruction to be inserted into the at least one execution cycle.
4. The processor of claim 2, wherein the execution unit is to execute the SIMD instruction in a first number of execution cycles less than the number of execution cycles as a result of reduction of the number of execution cycles by the at least one execution cycle.
5. The processor of claim 1, further comprising permute circuitry coupled to the execution unit to permute at least some of the plurality of channels of the SIMD instruction prior to input to the execution unit, responsive to control information from the compaction circuitry.
6. The processor of claim 5, wherein a first portion of the plurality of channels obtained from the permutation are to be sent to the execution unit, and a second portion of the plurality of channels obtained from the permutation are not to be sent to the execution unit.
7. The processor of claim 1, wherein the SIMD instruction is of a first path of a conditional block.
8. The processor of claim 1, wherein the SIMD instruction is of a variable width SIMD instruction set architecture.
9. The processor of claim 1, further comprising a split register file having a first set of half registers each to store a first plurality of channels of a SIMD instruction and a second set of half registers each to store a second plurality of channels of the SIMD instruction.
10. The processor of claim 1, further comprising: a register file having a plurality of registers each to store a plurality of channels of a SIMD instruction; a latch to receive an operand from a register of the register file; permute circuitry coupled to the latch to receive the operand from the latch and control information from the decode logic and to permute at least portions of the operand; and an output logic coupled to the permute circuitry and including a plurality of switches, wherein a corresponding switch is to be enabled by the compaction circuitry to provide a corresponding portion of the permuted operand to the execution unit.
11. A non-transitory machine-readable medium having stored thereon instructions, which when performed by a machine cause the machine to perform a method comprising: receiving a single instruction multiple data (SIMD) instruction and information associated with the SIMD instruction in a SIMD execution unit of a processor, the SIMD instruction having a plurality of channels that are to consume a first plurality of execution cycles, the SIMD execution unit having a plurality of quadrants; identifying a first portion of the plurality of channels of the SIMD instruction that are to be disabled; calculating a minimum number of execution cycles to execute the SIMD instruction based on an active lane count, comparing the minimum number of execution cycles to an active quadrant value, and based on the comparing, compacting the first plurality of execution cycles, including permuting at least some of the plurality of channels of the SIMD instruction, wherein a number of permutations between the quadrants is minimized; removing one or more execution cycles of the first plurality of execution cycles for executing the SIMD instruction based on the calculating; and after the removing, executing the SIMD instruction in fewer execution cycles than the first plurality of execution cycles.
12. The non-transitory machine-readable medium of claim 11, wherein the method further comprises inserting a second portion of the plurality of channels of the SIMD instruction into a first removed execution cycle.
13. The non-transitory machine-readable medium of claim 11, wherein the method further comprises inserting a second portion of a plurality of channels of a second SIMD instruction into a first removed execution cycle.
14. The non-transitory machine-readable medium of claim 13, wherein the SIMD instruction is of a first branch of a conditional operation and the second SIMD instruction is of a second branch of the conditional operation.
15. The non-transitory machine-readable medium of claim 11, wherein the method further comprises permuting the at least some of the plurality of channels of the SIMD instruction, and thereafter identifying the first portion of the plurality of channels of the SIMD instruction that are to be disabled.
16. A system comprising: a processor comprising: a core domain including a plurality of cores to independently execute instructions; and a graphics domain including a plurality of graphics processors to perform general purpose workloads offloaded by the core domain, each of the graphics processors having a vector execution unit including a plurality of lanes each to execute an operation on at least one data element of a plurality of data elements identified by a vector instruction, the vector execution unit to perform the vector instruction on the plurality of data elements in a first number of execution cycles, and cycle compression circuitry coupled to the vector execution unit to reduce the first number of execution cycles based at least in part on an execution mask associated with the vector instruction, the execution mask based at least in part on an instruction predicate mask, a dispatch mask and a conditional mask, permute circuitry having an output coupled to an input to the vector execution unit to permute at least some of the plurality of data elements prior to input to the vector execution unit, responsive to control information from the cycle compression circuitry, and unpermute circuitry having an input coupled to an output of the vector execution unit to unpermute at least some of the plurality of data elements after output from the vector execution unit, responsive to control information from the cycle compression circuitry; and a dynamic random access memory (DRAM) coupled to the processor.
17. The system of claim 16, wherein the cycle compression circuitry is to cause permutation of a first data element in a
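The steps recited in claims 1, 11 and 16 (forming an execution mask, computing a minimum cycle count from the active lane count, comparing it with the number of active quadrants, permuting channels, and unpermuting results) can be sketched in software as follows. This is a minimal sketch under stated assumptions, not the patented circuit: combining the three masks with a bitwise AND, the 4-channel quadrant width, the greedy "keep the fullest quadrants" rule for limiting cross-quadrant moves, and every function name are hypothetical choices made for illustration.

```python
from math import ceil


def execution_mask(predicate, dispatch, conditional):
    # Assumption: the three masks are ANDed; the claims only say "based on".
    return [p & d & c for p, d, c in zip(predicate, dispatch, conditional)]


def plan_compaction(mask, quad_width=4):
    """Return (cycles, moves); moves maps a source channel to its new slot."""
    quads = [list(range(i, i + quad_width)) for i in range(0, len(mask), quad_width)]
    active = [[ch for ch in quad if mask[ch]] for quad in quads]
    active_quadrants = sum(1 for chans in active if chans)
    min_cycles = ceil(sum(mask) / quad_width)      # from the active lane count
    if min_cycles >= active_quadrants:
        return active_quadrants, {}                # skipping empty quadrants suffices
    # Greedy heuristic (assumption): keep the fullest quadrants in place and
    # drain the sparsest ones into their free slots, so few channels move.
    order = sorted(range(len(quads)), key=lambda q: len(active[q]), reverse=True)
    keep, drain = order[:min_cycles], order[min_cycles:]
    free = [ch for q in keep for ch in quads[q] if not mask[ch]]
    movers = [ch for q in drain for ch in active[q]]
    return min_cycles, dict(zip(movers, free))


def run(values, mask, op, quad_width=4):
    """Permute, apply `op` to enabled channels, then unpermute the results."""
    cycles, moves = plan_compaction(mask, quad_width)
    home = {dst: src for src, dst in moves.items()}    # unpermute map
    packed = list(values)
    for src, dst in moves.items():
        packed[dst] = values[src]                      # permute before execution
    out = list(values)                                 # disabled channels unchanged
    for slot, value in enumerate(packed):
        if slot in home:
            out[home[slot]] = op(value)                # result returns to its channel
        elif mask[slot] and slot not in moves:
            out[slot] = op(value)
    return cycles, out


# Hypothetical 16-channel instruction: five enabled channels spread over
# three of the four quadrants.
ones = [1] * 16
cond = [1, 1, 1, 0,  0, 0, 0, 0,  1, 0, 0, 0,  0, 1, 0, 0]
emask = execution_mask(ones, ones, cond)
plan = plan_compaction(emask)                  # (2, {13: 3})
cycles, result = run(list(range(16)), emask, lambda x: x * 10)
```

In this example the minimum is two cycles (five active lanes over four-wide quadrants) while three quadrants hold active channels, so a single channel (13) is moved into the free slot of the fullest quadrant (slot 3) and the instruction needs two cycles instead of three.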
Bit or string instructions · CPC title
Conditional branch instructions · CPC title
Movement instructions, e.g. MOVE, SHIFT, ROTATE, SHUFFLE · CPC title
to perform conditional operations, e.g. using predicates or guards · CPC title
controlled by a single instruction for multiple data lanes [SIMD] · CPC title