Systems, apparatuses, and methods for chained fused multiply add

US11487541B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11487541-B2
Application number  US-202017107134-A
Country             US
Kind code           B2
Filing date         Nov 30, 2020
Priority date       Oct 20, 2016
Publication date    Nov 1, 2022
Grant date          Nov 1, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title: what the patent document calls the invention.
  2. Abstract: a short plain-language summary of the technical disclosure.
  3. Assignees and inventors: who owns or filed the patent and who is credited as inventor.
  4. Key dates: filing, priority, publication, and grant dates set the timeline.
  5. First independent claim: the legal scope of protection; read this for what is actually claimed.
  6. CPC / IPC classifications: technology tags used to group this patent with similar filings.
  7. Citations and related patents: prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments of systems, apparatuses, and methods for chained fused multiply add. In some embodiments, an apparatus includes a decoder to decode a single instruction having an opcode, a destination field representing a destination operand, a first source field representing a plurality of packed data source operands of a first type that have packed data elements of a first size, a second source field representing a plurality of packed data source operands that have packed data elements of a second size, and a field for a memory location that stores a scalar value. A register file having a plurality of packed data registers includes registers for the plurality of packed data source operands that have packed data elements of a first size, the source operands that have packed data elements of a second size, and the destination operand. Execution circuitry executes the decoded single instruction to perform iterations of packed fused multiply accumulate operations by multiplying packed data elements of the sources of the first type by sub-elements of the scalar value, and adding results of these multiplications to an initial value in a first iteration and a result from a previous iteration in subsequent iterations.
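The iteration described in the abstract can be illustrated with a minimal sketch. It is not taken from the patent: the function name `chained_fma`, the use of Python lists as stand-ins for packed registers, and the floating-point element type are all assumptions made for illustration. Each iteration multiplies one packed source by one sub-element of the scalar value and adds the result to the running accumulator, which starts at the initial value.

```python
from typing import List, Sequence


def chained_fma(initial: Sequence[float],
                sources: Sequence[Sequence[float]],
                scalar_sub_elements: Sequence[float]) -> List[float]:
    """Hypothetical model of a chained packed multiply-accumulate.

    initial             -- starting accumulator (one value per lane)
    sources             -- packed sources, one per iteration
    scalar_sub_elements -- sub-elements of the scalar value, one per iteration
    """
    if len(sources) != len(scalar_sub_elements):
        raise ValueError("need one scalar sub-element per packed source")

    acc = list(initial)
    for src, s in zip(sources, scalar_sub_elements):
        if len(src) != len(acc):
            raise ValueError("every packed source must match the lane count")
        # Iteration 1 adds to the initial value; later iterations add to
        # the previous iteration's result.
        # Note: hardware FMA rounds once per multiply-add; this plain
        # Python expression rounds after the multiply and after the add.
        acc = [a + x * s for a, x in zip(acc, src)]
    return acc


if __name__ == "__main__":
    result = chained_fma(
        initial=[1.0, 2.0],
        sources=[[1.0, 1.0], [2.0, 3.0]],
        scalar_sub_elements=[10.0, 100.0],
    )
    print(result)
```

With two lanes and two iterations, the first pass adds `10.0 * [1.0, 1.0]` to the initial value and the second pass adds `100.0 * [2.0, 3.0]` to that intermediate result. The abstract does not fix the number of iterations or the element sizes, so both are left as inputs here.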

First claim

Opening claim text (preview).

What is claimed is:

1. A processor comprising: a memory controller; an interconnect fabric coupled to the memory controller; an instruction cache to store instructions fetched from a system memory via the memory controller; a data cache to store source data elements to be processed in response to the instructions and result data elements comprising results of the instructions; a next level cache to store the source data elements, the result data elements, and the instructions; and a plurality of data parallel processing circuits coupled to the interconnect fabric, the plurality of data parallel processing circuits to perform parallel operations on a plurality of the source data elements, at least one data parallel processing circuit comprising: operand storage to store a first plurality of the source data elements at a first precision and a second plurality of source data elements at a second precision which is four times the first precision; execution circuitry comprising a plurality of multiply-accumulate circuits to execute a plurality of fused multiply-accumulate (FMA) instructions to perform parallel FMA operations using at least a portion of the first and second plurality of source data elements to generate a plurality of result data elements at the second precision, a multiply-accumulate circuit of the plurality of multiply-accumulate circuits comprising: one or more multipliers to perform four parallel multiplications of source data elements from the first plurality, the multiplications including a first multiplication of a first source data element and a second source data element to generate a first product, a second multiplication of a third source data element and a fourth source data element to generate a second product, a third multiplication of a fifth source data element and a sixth source data element to generate a third product, and a fourth multiplication of a seventh source data element and an eighth source data element to generate a fourth product, and one or more adder circuits to add the first product, second product, third product, fourth product, and a ninth source data element of the second plurality of source data elements to generate a first result data element at the second precision.

2. The apparatus of claim 1 wherein the first plurality of source data elements at the first precision comprise data elements of first and second matrices on which a dot-product is to be performed.

3. The apparatus of claim 1 further comprising: an I/O interface coupled to the interconnect fabric, the I/O interface to couple the plurality of data parallel processing circuits to an I/O device.

4. The apparatus of claim 1 wherein the execution circuitry comprises vector execution circuitry and the operand storage comprises a vector register file.

5. The apparatus of claim 4 further comprising: scalar execution circuitry to execute scalar instructions; and a scalar register file to store operands to be used as source values for the scalar instructions.

6. The apparatus of claim 5 further comprising: direct memory access (DMA) circuitry coupled to the interconnect fabric, the DMA circuitry to provide direct access to the system memory.

7. The apparatus of claim 1 further comprising: cache coherency circuitry to maintain coherency of the data elements across the data cache, the next level cache, and the system memory.

8. The apparatus of claim 1 wherein the plurality of multiply-accumulate circuits comprise a plurality of fused multiply-accumulate (FMA) circuits.

9. A method comprising: fetching instructions from a system memory via a memory controller, the instructions to be stored in an instruction cache; storing source data elements of the instructions in a data cache; performing parallel operations on a plurality of data parallel processing circuits using a plurality of the source data elements in accordance with a first one or more of the instructions, at least a portion of the parallel operations comprising: storing in an operand storage a first plurality of the source data elements at a first precision and a second plurality of source data elements at a second precision which is four times the first precision; performing parallel fused multiply-accumulate (FMA) operations on a plurality multiply-accumulate circuits using at least a portion of the first and second plurality of source data elements to generate a plurality of result data elements at the second precision, a parallel FMA operation performed on one of the plurality of multiply-accumulate circuits comprising: performing four parallel multiplications of source data elements from the first plurality, the multiplications including a first multiplication of a first source data element and a second source data element to generate a first product, a second multiplication of a third source data element and a fourth source data element to generate a second product, a third multiplication of a fifth source data element and a sixth source data element to generate a third product, and a fourth multiplication of a seventh source data element and an eighth source data element to generate a fourth product, and adding the first product, second product, third product, fourth product, and a ninth source data element of the second plurality of source data elements to generate a first result data element at the second precision.

10. The method of claim 9 wherein the first plurality of source data elements at the first precision comprise data elements of first and second matrices on which a dot-product is to be performed.

11. The method of claim 9 wherein the operand storage comprises a vector register file.

12. The method of claim 9 wherein a second one or more of the instructions comprise scalar instructions, the method further comprising: executing the scalar instructions with scalar execution circuitry, the scalar execution circuitry including a scalar register file to store operands to be used as source values for the scalar instructions.

13. The method of claim 12 further comprising: providing direct access to the system memory via direct memory access (DMA) circuitry.

14. The method of claim 9 further comprising: maintaining coherency of the source data elements across the data cache, the system memory, and one or more additional caches.

15. The method of claim 9 wherein the plurality of multiply-accumulate circuits comprise a plurality of fused multiply-accumulate (FMA) circuits.

16. A machine-readable medium having instructions stored thereon which, when executed by a machine, causes the machine to perform the operations of: storing source data elements of the instructions in a data cache; performing parallel operations on a plurality of data parallel processing circuits using a plurality of the source data elements in accordance with a first one or more of the instructions, at least a portion of the parallel operations comprising: storing in an operand storage a first plurality of the source data elements at a first precision and a second plurality of source data elements at a second precision which is four times the first precision; performing parallel fused multiply-accumulate (FMA) operations on a plurality multiply-accumulate circuits using at least a portion of the first and second plurality of source data elements to generate a plurality of result data elements at the second precision, a parallel FMA operation performed on one of the plurality of multiply-accumulate circuits comprising: performing four parallel multiplications of source data elements from the first plurality, the mu
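The multiply-accumulate step recited in claims 1 and 9 (four parallel multiplications of eight first-precision elements, summed with a ninth element at four times that precision) can be sketched as follows. The claims do not name concrete widths, so this sketch assumes signed 8-bit inputs and a signed 32-bit accumulator, and it assumes wrap-around on overflow rather than saturation. The names `mac4` and `packed_mac4` are hypothetical.

```python
from typing import List, Sequence

INT8_MIN, INT8_MAX = -128, 127  # assumed first precision: signed 8-bit


def _wrap_int32(value: int) -> int:
    """Reduce to a signed 32-bit value (assumed wrap-around overflow)."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def mac4(a: Sequence[int], b: Sequence[int], acc: int) -> int:
    """One multiply-accumulate circuit.

    a   -- first, third, fifth, and seventh source elements (8-bit)
    b   -- second, fourth, sixth, and eighth source elements (8-bit)
    acc -- ninth source element (32-bit accumulator input)
    """
    if len(a) != 4 or len(b) != 4:
        raise ValueError("exactly four element pairs are required")
    for v in list(a) + list(b):
        if not INT8_MIN <= v <= INT8_MAX:
            raise ValueError(f"{v} does not fit in a signed 8-bit element")
    # Four products, conceptually computed in parallel.
    products = [x * y for x, y in zip(a, b)]
    # The adder stage sums the four products and the ninth element.
    return _wrap_int32(acc + sum(products))


def packed_mac4(a: Sequence[int], b: Sequence[int],
                accs: Sequence[int]) -> List[int]:
    """Apply mac4 per 32-bit lane; lane i consumes 8-bit elements 4i..4i+3."""
    if len(a) != 4 * len(accs) or len(b) != 4 * len(accs):
        raise ValueError("need four 8-bit elements per 32-bit lane")
    return [mac4(a[4 * i:4 * i + 4], b[4 * i:4 * i + 4], acc)
            for i, acc in enumerate(accs)]


if __name__ == "__main__":
    print(packed_mac4([1, 2, 3, 4, -1, -2, -3, -4],
                      [5, 6, 7, 8, 5, 6, 7, 8],
                      [100, 0]))
```

Each lane is a four-element dot product added to an existing accumulator, which matches the dot-product use described in claims 2 and 10. Accumulating at the wider precision keeps the individual 8-bit by 8-bit products exact in this model; only the final sum can overflow.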

Assignees

Intel Corp

Inventors

Classifications

  • having multiple operands in a single register · CPC title

  • comprising data of variable length · CPC title

  • controlled in tandem, e.g. multiplier-accumulator · CPC title

  • Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers {(G06F7/4806, G06F7/4824, G06F7/49, G06F7/491, G06F7/544 take precedence)} · CPC title

  • Sum of products (for applications thereof, see the relevant places, e.g. G06F17/10, H03H17/00) · CPC title

Frequently asked questions

What does patent US11487541B2 cover?
Embodiments of systems, apparatuses, and methods for chained fused multiply add. In some embodiments, an apparatus includes a decoder to decode a single instruction having an opcode, a destination field representing a destination operand, a first source field representing a plurality of packed data source operands of a first type that have packed data elements of a first size, a second source f…
Who is the assignee on this patent?
Intel Corp
What technology area does this patent fall under?
Primary CPC classification G06F9/3001. Mapped technology areas include Physics.
When was this patent published?
Published Nov 1, 2022 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).