Reducing power consumption in a fused multiply-add (FMA) unit of a processor
US-9778911-B2 · Oct 3, 2017 · US
US11487541B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11487541-B2 |
| Application number | US-202017107134-A |
| Country | US |
| Kind code | B2 |
| Filing date | Nov 30, 2020 |
| Priority date | Oct 20, 2016 |
| Publication date | Nov 1, 2022 |
| Grant date | Nov 1, 2022 |
Official abstract text for this publication.
Embodiments of systems, apparatuses, and methods for chained fused multiply add. In some embodiments, an apparatus includes a decoder to decode a single instruction having an opcode, a destination field representing a destination operand, a first source field representing a plurality of packed data source operands of a first type that have packed data elements of a first size, a second source field representing a plurality of packed data source operands that have packed data elements of a second size, and a field for a memory location that stores a scalar value. A register file having a plurality of packed data registers includes registers for the plurality of packed data source operands that have packed data elements of a first size, the source operands that have packed data elements of a second size, and the destination operand. Execution circuitry executes the decoded single instruction to perform iterations of packed fused multiply accumulate operations by multiplying packed data elements of the sources of the first type by sub-elements of the scalar value, and adding results of these multiplications to an initial value in a first iteration and a result from a previous iteration in subsequent iterations.
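The chained accumulation described in the abstract can be illustrated with a minimal sketch. It assumes plain Python floats, and the names `chained_fma`, `sources`, `scalar_sub_elements`, and `initial` are hypothetical; none of them come from the patent. It models only the data flow (each iteration multiplies one packed source by one sub-element of the scalar and adds the result to the running value), not the instruction encoding, the register file, or the single-rounding behavior of a true fused operation.

```python
from typing import List, Sequence


def chained_fma(sources: Sequence[Sequence[float]],
                scalar_sub_elements: Sequence[float],
                initial: Sequence[float]) -> List[float]:
    """Illustrative model of a chained packed multiply-accumulate.

    All names are hypothetical. One iteration is performed per packed
    source: iteration i multiplies every element of sources[i] by
    scalar_sub_elements[i] and adds the products to the running result.
    """
    if len(sources) != len(scalar_sub_elements):
        raise ValueError("need one scalar sub-element per packed source")

    # The first iteration starts from the initial value.
    acc = list(initial)
    for packed, sub in zip(sources, scalar_sub_elements):
        if len(packed) != len(acc):
            raise ValueError("packed source width must match the accumulator")
        # Later iterations start from the previous iteration's result.
        # Note: a + x * sub rounds twice in Python; hardware FMA rounds once.
        acc = [a + x * sub for a, x in zip(acc, packed)]
    return acc


result = chained_fma(
    sources=[[1.0, 2.0], [3.0, 4.0]],
    scalar_sub_elements=[10.0, 100.0],
    initial=[0.5, 0.5],
)
print(result)  # [310.5, 420.5]
```

With two packed sources, the first iteration yields `[10.5, 20.5]` and the second adds `[300.0, 400.0]` to that intermediate result.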
Opening claim text (preview).
What is claimed is:
1. A processor comprising: a memory controller; an interconnect fabric coupled to the memory controller; an instruction cache to store instructions fetched from a system memory via the memory controller; a data cache to store source data elements to be processed in response to the instructions and result data elements comprising results of the instructions; a next level cache to store the source data elements, the result data elements, and the instructions; and a plurality of data parallel processing circuits coupled to the interconnect fabric, the plurality of data parallel processing circuits to perform parallel operations on a plurality of the source data elements, at least one data parallel processing circuit comprising: operand storage to store a first plurality of the source data elements at a first precision and a second plurality of source data elements at a second precision which is four times the first precision; execution circuitry comprising a plurality of multiply-accumulate circuits to execute a plurality of fused multiply-accumulate (FMA) instructions to perform parallel FMA operations using at least a portion of the first and second plurality of source data elements to generate a plurality of result data elements at the second precision, a multiply-accumulate circuit of the plurality of multiply-accumulate circuits comprising: one or more multipliers to perform four parallel multiplications of source data elements from the first plurality, the multiplications including a first multiplication of a first source data element and a second source data element to generate a first product, a second multiplication of a third source data element and a fourth source data element to generate a second product, a third multiplication of a fifth source data element and a sixth source data element to generate a third product, and a fourth multiplication of a seventh source data element and an eighth source data element to generate a fourth product, and one or more adder circuits to add the first product, second product, third product, fourth product, and a ninth source data element of the second plurality of source data elements to generate a first result data element at the second precision.
2. The apparatus of claim 1 wherein the first plurality of source data elements at the first precision comprise data elements of first and second matrices on which a dot-product is to be performed.
3. The apparatus of claim 1 further comprising: an I/O interface coupled to the interconnect fabric, the I/O interface to couple the plurality of data parallel processing circuits to an I/O device.
4. The apparatus of claim 1 wherein the execution circuitry comprises vector execution circuitry and the operand storage comprises a vector register file.
5. The apparatus of claim 4 further comprising: scalar execution circuitry to execute scalar instructions; and a scalar register file to store operands to be used as source values for the scalar instructions.
6. The apparatus of claim 5 further comprising: direct memory access (DMA) circuitry coupled to the interconnect fabric, the DMA circuitry to provide direct access to the system memory.
7. The apparatus of claim 1 further comprising: cache coherency circuitry to maintain coherency of the data elements across the data cache, the next level cache, and the system memory.
8. The apparatus of claim 1 wherein the plurality of multiply-accumulate circuits comprise a plurality of fused multiply-accumulate (FMA) circuits.
9. A method comprising: fetching instructions from a system memory via a memory controller, the instructions to be stored in an instruction cache; storing source data elements of the instructions in a data cache; performing parallel operations on a plurality of data parallel processing circuits using a plurality of the source data elements in accordance with a first one or more of the instructions, at least a portion of the parallel operations comprising: storing in an operand storage a first plurality of the source data elements at a first precision and a second plurality of source data elements at a second precision which is four times the first precision; performing parallel fused multiply-accumulate (FMA) operations on a plurality multiply-accumulate circuits using at least a portion of the first and second plurality of source data elements to generate a plurality of result data elements at the second precision, a parallel FMA operation performed on one of the plurality of multiply-accumulate circuits comprising: performing four parallel multiplications of source data elements from the first plurality, the multiplications including a first multiplication of a first source data element and a second source data element to generate a first product, a second multiplication of a third source data element and a fourth source data element to generate a second product, a third multiplication of a fifth source data element and a sixth source data element to generate a third product, and a fourth multiplication of a seventh source data element and an eighth source data element to generate a fourth product, and adding the first product, second product, third product, fourth product, and a ninth source data element of the second plurality of source data elements to generate a first result data element at the second precision.
10. The method of claim 9 wherein the first plurality of source data elements at the first precision comprise data elements of first and second matrices on which a dot-product is to be performed.
11. The method of claim 9 wherein the operand storage comprises a vector register file.
12. The method of claim 9 wherein a second one or more of the instructions comprise scalar instructions, the method further comprising: executing the scalar instructions with scalar execution circuitry, the scalar execution circuitry including a scalar register file to store operands to be used as source values for the scalar instructions.
13. The method of claim 12 further comprising: providing direct access to the system memory via direct memory access (DMA) circuitry.
14. The method of claim 9 further comprising: maintaining coherency of the source data elements across the data cache, the system memory, and one or more additional caches.
15. The method of claim 9 wherein the plurality of multiply-accumulate circuits comprise a plurality of fused multiply-accumulate (FMA) circuits.
16. A machine-readable medium having instructions stored thereon which, when executed by a machine, causes the machine to perform the operations of: storing source data elements of the instructions in a data cache; performing parallel operations on a plurality of data parallel processing circuits using a plurality of the source data elements in accordance with a first one or more of the instructions, at least a portion of the parallel operations comprising: storing in an operand storage a first plurality of the source data elements at a first precision and a second plurality of source data elements at a second precision which is four times the first precision; performing parallel fused multiply-accumulate (FMA) operations on a plurality multiply-accumulate circuits using at least a portion of the first and second plurality of source data elements to generate a plurality of result data elements at the second precision, a parallel FMA operation performed on one of the plurality of multiply-accumulate circuits comprising: performing four parallel multiplications of source data elements from the first plurality, the mu
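The multiply-accumulate circuit recited in claims 1 and 9 (four products of first-precision elements summed with one second-precision element) can be sketched as follows. The claims do not state the element types or overflow behavior, so this sketch assumes signed 8-bit inputs, a signed 32-bit accumulator (four times the width), and two's-complement wraparound; those choices, along with the names `dot4_accumulate` and `wrap_int32`, are illustrative assumptions and not taken from the patent.

```python
from typing import Sequence

INT8_MIN, INT8_MAX = -128, 127  # assumed first precision: signed 8-bit


def wrap_int32(value: int) -> int:
    """Wrap an integer to signed 32-bit two's complement (assumed behavior)."""
    return (value + 2**31) % 2**32 - 2**31


def dot4_accumulate(a: Sequence[int], b: Sequence[int], acc: int) -> int:
    """Model one multiply-accumulate circuit as described in claim 1.

    a[i] and b[i] stand for the eight first-precision source elements
    (first/second, third/fourth, fifth/sixth, seventh/eighth); acc stands
    for the ninth source element at the second precision.
    """
    if len(a) != 4 or len(b) != 4:
        raise ValueError("expected four element pairs")
    for value in (*a, *b):
        if not INT8_MIN <= value <= INT8_MAX:
            raise ValueError("source element outside the assumed 8-bit range")

    # Four multiplications; in hardware these would run in parallel.
    products = [x * y for x, y in zip(a, b)]
    # The adder stage sums the four products with the accumulator element.
    return wrap_int32(wrap_int32(acc) + sum(products))


print(dot4_accumulate([1, 2, 3, 4], [5, 6, 7, 8], 100))  # 170
```

Here the four products are 5, 12, 21, and 32, which sum to 70; adding the accumulator value 100 gives 170. Applied across rows and columns, the same pattern corresponds to the matrix dot-product use mentioned in claims 2 and 10.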
having multiple operands in a single register · CPC title
comprising data of variable length · CPC title
controlled in tandem, e.g. multiplier-accumulator · CPC title
Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers {(G06F7/4806, G06F7/4824, G06F7/49, G06F7/491, G06F7/544 take precedence)} · CPC title
Sum of products (for applications thereof, see the relevant places, e.g. G06F17/10, H03H17/00) · CPC title