US9619226B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9619226-B2 |
| Application number | US-201113992230-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 23, 2011 |
| Priority date | Dec 23, 2011 |
| Publication date | Apr 11, 2017 |
| Grant date | Apr 11, 2017 |
Official abstract text for this publication.
Embodiments of systems, apparatuses, and methods for performing in a computer processor vector packed horizontal add or subtract of packed data elements in response to a single vector packed horizontal add or subtract instruction that includes a destination vector register operand, a source vector register operand, and an opcode are described.
Claim text.
What is claimed is:

1. A method of performing in a computer processor vector packed horizontal add or subtract of packed data elements in response to a single vector packed horizontal add or subtract instruction that includes a destination vector register operand, a source vector register operand, an immediate, and an opcode, wherein the source vector register comprises a plurality of packed data elements divided into a plurality of data lanes, each data lane corresponds to a destination data element in the destination vector register, the immediate comprises at least a same number of active bits as there are packed data elements in each data lane, and each active bit of the immediate corresponds to one of the plurality of packed data elements in each data lane, the method comprising: executing the single vector packed horizontal add or subtract instruction to, for each data lane of the source vector register, read a value of each active bit position of the immediate to determine whether to negate a value of corresponding data element position of the data lane, responsively negate the values determined to be negated, and sum all negated and unchanged packed data elements in each data lane to create a data lane result; and storing each data lane result in a corresponding destination data element position of the destination register.
2. The method of claim 1, wherein each data lane of the source vector register has four packed data elements.
3. The method of claim 1, wherein a number of data lanes to be processed is dependent upon size of the destination vector register.
4. The method of claim 1, wherein the source and destination vector registers are 128-bit, 256-bit, or 512-bit in size.
5. The method of claim 1, wherein the packed data elements of the source vector register and the destination data elements of the destination vector register are 8-bit, 16-bit, 32-bit, or 64-bit in size.
6. The method of claim 5, wherein the size of the packed data elements of the source vector register is defined by the opcode.
7. The method of claim 1, wherein the immediate is an 8-bit value.
8. An article of manufacture comprising: a non-transitory tangible machine-readable storage medium having stored thereon an occurrence of an instruction, wherein the instruction's format specifies as source operands a vector register and an immediate and specifies as destination a single destination vector register, the source vector register comprises a plurality of packed data elements divided into a plurality of data lanes, each data lane corresponds to a destination data element in the destination vector register, the immediate comprises at least a same number of active bits as there are packed data elements in each data lane, each active bit of the immediate corresponds to one of the plurality of packed data elements in each data lane, and the instruction format includes an opcode which instructs a machine, responsive to a single occurrence of the instruction, to cause for each data lane of the source vector register, a reading of a value of each active bit position of the immediate to determine whether to negate a value of corresponding data element position of the data lane, responsively negate the values determined to be negated, and sum all negated and unchanged packed data elements in each data lane to create a data lane result, and store each data lane result in a corresponding destination data element position of the destination register.
9. The article of manufacture of claim 8, wherein each data lane of the source vector register has four packed data elements.
10. The article of manufacture of claim 8, wherein a number of data lanes to be processed is dependent upon size of the data elements of the destination vector register.
11. The article of manufacture of claim 8, wherein the source and destination vector registers are 128-bit, 256-bit, or 512-bit in size.
12. The article of manufacture of claim 8, wherein the packed data elements of the source vector register and destination data elements of the destination vector registers are 8-bit, 16-bit, 32-bit, or 64-bit in size.
13. The article of manufacture of claim 12, wherein the size of the packed data elements of the source vector registers is defined by the opcode.
14. The article of manufacture of claim 8, wherein the immediate is an 8-bit value.
15. An apparatus comprising; a hardware decoder to decode a single instruction that includes a destination vector register operand, a source vector register operand, an immediate, and an opcode, wherein the source vector register comprises a plurality of packed data elements divided into a plurality of data lanes, each data lane corresponds to a destination data element in the destination vector register, the immediate comprises at least a same number of active bits as there are packed data elements in each data lane, and each active bit of the immediate corresponds to one of the plurality of packed data elements in each data lane; and execution circuitry to execute the decoded instruction to, for each data lane of the source vector register, read a value of each active bit position of the immediate to determine whether to negate a value of corresponding data element position of the data lane, responsively negate the values determined to be negated, and sum all negated and unchanged packed data elements in each data lane to create a data lane result, and store each data lane result in a corresponding destination data element position of the destination register.
16. The apparatus of claim 15, wherein each data lane of the source vector register has four packed data elements.
17. The apparatus of claim 15, wherein a number of data lanes to be processed is dependent upon size of the destination vector register.
18. The apparatus of claim 15, wherein the source and destination vector registers are 128-bit, 256-bit, or 512-bit in size.
19. The apparatus of claim 15, wherein the packed data elements of the source vector register and destination data elements of the destination vector register are 8-bit, 16-bit, 32-bit, or 64-bit in size.
20. The apparatus of claim 15, wherein the immediate is an 8-bit value.
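
To make the behavior recited in claim 1 concrete, here is a minimal software sketch in Python. It is an illustration only, not the claimed hardware or any real instruction's definition. The function name `horizontal_add_sub` and its parameters are hypothetical. The sketch assumes that a set immediate bit means "negate", that bit i of the immediate maps to element i of every lane, and that each lane result wraps to the element width; the claims do not fix these details.

```python
def horizontal_add_sub(src, imm, lane_width=4, elem_bits=32):
    """Software model of a per-lane horizontal add/subtract.

    src        -- list of packed data elements (the source vector register)
    imm        -- immediate; bit i controls element i of every lane
                  (assumption: 1 = negate, 0 = leave unchanged)
    lane_width -- packed data elements per data lane (claim 2 uses four)
    elem_bits  -- element size in bits (assumption: results wrap modulo 2**elem_bits)
    """
    if len(src) % lane_width != 0:
        raise ValueError("source length must be a multiple of lane_width")

    mask = (1 << elem_bits) - 1
    dest = []
    # One destination element is produced per data lane of the source.
    for start in range(0, len(src), lane_width):
        lane = src[start:start + lane_width]
        total = 0
        for pos, value in enumerate(lane):
            # Read the immediate bit for this element position.
            negate = (imm >> pos) & 1
            total += -value if negate else value
        # Store the lane result in the corresponding destination position.
        dest.append(total & mask)
    return dest


# Two lanes of four elements; bits 0 and 2 of the immediate are set,
# so elements 0 and 2 of each lane are negated before summing.
print(horizontal_add_sub([1, 2, 3, 4, 10, 20, 30, 40], 0b0101))  # [2, 20]
```

With an immediate of 0 the same call reduces to a plain horizontal add (`[10, 100]` for the input above), which is why a single instruction can cover both the add and the subtract cases.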
Arithmetic instructions · CPC title
Vector processors · CPC title
with variable precision · CPC title
Logical and Boolean instructions, e.g. XOR, NOT · CPC title
according to one or more bits in the instruction, e.g. prefix, sub-opcode · CPC title