Instructions and logic to perform floating point and integer operations for machine learning
US-2021182058-A1 · Jun 17, 2021 · US
US11816481B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11816481-B2 |
| Application number | US-202217890540-A |
| Country | US |
| Kind code | B2 |
| Filing date | Aug 18, 2022 |
| Priority date | May 8, 2017 |
| Publication date | Nov 14, 2023 |
| Grant date | Nov 14, 2023 |
Official abstract text for this publication.
A method, computer readable medium, and processor are disclosed for performing matrix multiply and accumulate (MMA) operations. The processor includes a datapath configured to execute the MMA operation to generate a plurality of elements of a result matrix at an output of the datapath. Each element of the result matrix is generated by calculating at least one dot product of corresponding pairs of vectors associated with matrix operands specified in an instruction for the MMA operation. A dot product operation includes the steps of: generating a plurality of partial products by multiplying each element of a first vector with a corresponding element of a second vector; aligning the plurality of partial products based on the exponents associated with each element of the first vector and each element of the second vector; and accumulating the plurality of aligned partial products into a result queue utilizing at least one adder.
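The three dot-product steps named in the abstract (generate partial products, align them by exponent, accumulate with an adder) can be sketched in software as follows. This is a minimal illustration, not the patented datapath: the 11-bit significand width, the `guard_bits` parameter, truncation toward zero during alignment, and the helper names `decompose` and `aligned_dot` are all assumptions introduced here.

```python
import math

MANT_BITS = 11  # assumption: half-precision-like significand width


def decompose(x):
    """Split x into (integer significand, exponent) so that x ~= sig * 2**exp."""
    m, e = math.frexp(x)                 # x == m * 2**e with 0.5 <= |m| < 1, or m == 0
    sig = int(math.ldexp(m, MANT_BITS))  # drops any bits beyond MANT_BITS
    return sig, e - MANT_BITS


def aligned_dot(a, b, guard_bits=8):
    """Dot product via partial products, exponent alignment, and integer accumulation."""
    # Step 1: one partial product per element pair, kept as (significand, exponent).
    partials = []
    for x, y in zip(a, b):
        sx, ex = decompose(x)
        sy, ey = decompose(y)
        if sx != 0 and sy != 0:
            partials.append((sx * sy, ex + ey))
    if not partials:
        return 0.0

    # Step 2: align every partial product to the largest exponent.
    # Bits shifted out below the guard bits are truncated toward zero (assumption).
    max_exp = max(e for _, e in partials)
    aligned = []
    for p, e in partials:
        magnitude = (abs(p) << guard_bits) >> (max_exp - e)
        aligned.append(magnitude if p > 0 else -magnitude)

    # Step 3: accumulate the aligned values with a plain integer adder.
    total = sum(aligned)
    return math.ldexp(total, max_exp - guard_bits)


if __name__ == "__main__":
    print(aligned_dot([1.5, 2.0, -0.25, 4.0], [2.0, 0.5, 8.0, 0.125]))
```

Because all products share one exponent after alignment, the accumulation is integer addition; the cost is that a partial product much smaller than the largest one can be shifted out entirely, which the `guard_bits` parameter only partly mitigates.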
Opening claim text (preview).
What is claimed is:
1. A processor, comprising: an instruction cache; an L1 cache; an L2 cache; a crossbar (Xbar); arithmetic logic units (ALUs); a front end unit to read commands written by a host processor; a work distribution unit to dispatch tasks to a plurality of processing clusters; a register file to store matrices specified in a matrix-fused multiply accumulate (MFMA) instruction, wherein the MFMA instruction is to multiply a first matrix with a second matrix and sum a result with a third matrix, and wherein each element of the matrices is to be encoded as floating point; logic circuitry to calculate a dot product, wherein the dot product includes: accumulating a plurality of partial products generated by multiplying each element of a first vector with a corresponding element of a second vector; and summing the plurality of partial products with an element of a matrix; and wherein results of the MFMA instruction are to be accumulated in the register file.
2. The processor of claim 1, wherein the processor comprises the plurality of processing clusters and the plurality of processing clusters comprise the ALUs.
3. The processor of claim 1, wherein the plurality of processing clusters are general processing clusters (GPCs).
4. The processor of claim 1, wherein the processor is a parallel processing unit (PPU).
5. The processor of claim 1, wherein the processor is a graphics processing unit (GPU).
6. The processor of claim 1, further comprising one or more streaming multiprocessors (SMs) to calculate, at least in part, the dot product.
7. The processor of claim 1, further comprising one or more streaming multiprocessors (SMs), wherein the one or more SMs comprise the ALUs.
8. The processor of claim 1, further comprising a memory management unit (MMU).
9. The processor of claim 1, wherein the ALUs comprise a floating point ALU and an integer ALU.
10. The processor of claim 1, further comprising a host interface unit to decode packets received from the host processor.
11. A machine-readable medium comprising instructions that, if performed by one or more processors, cause the one or more processors to: calculate a dot product by: accumulating a plurality of partial products generated by multiplying each element of a first vector with a corresponding element of a second vector; and summing the plurality of partial products with an element of a matrix; wherein the one or more processors comprise: an instruction cache; an L1 cache; an L2 cache; a crossbar (Xbar); arithmetic logic units (ALUs); a front end unit to read commands written by a host processor; a work distribution unit to dispatch tasks to a plurality of processing clusters; and a register file to store matrices specified in a matrix-fused multiply accumulate (MFMA) instruction, wherein the MFMA instruction is to multiply a first matrix with a second matrix and sum a result with a third matrix, and wherein each element of the matrices is to be encoded as floating point; and wherein results of the MFMA instruction are to be accumulated in the register file.
12. The machine-readable medium of claim 11, wherein the plurality of processing clusters are general processing clusters (GPCs) and the GPCs comprise the ALUs.
13. The machine-readable medium of claim 11, wherein the one or more processors are one or more parallel processing units (PPUs).
14. The machine-readable medium of claim 11, wherein the one or more processors are graphics processing units (GPUs).
15. The machine-readable medium of claim 11, further comprising instructions that, if performed by the one or more processors, cause the one or more processors to calculate the dot product by one or more streaming multiprocessors (SMs).
16. The machine-readable medium of claim 11, wherein the one or more processors further comprise one or more streaming multiprocessors (SMs) and the one or more SMs comprise the ALUs.
17. The machine-readable medium of claim 11, wherein the one or more processors further comprise a unit to manage memory.
18. The machine-readable medium of claim 11, wherein the ALUs comprise at least one of a floating point ALU and an integer ALU.
19. The machine-readable medium of claim 11, further comprising instructions that, if performed by the one or more processors, cause the one or more processors to decode packets received from the host processor by a host interface unit.
20. The machine-readable medium of claim 11, wherein each element of the matrices is to be encoded as half-precision floating point.
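The arithmetic that the MFMA instruction of claims 1 and 11 performs (multiply a first matrix by a second and sum the result with a third) can be illustrated with a short reference model. This is only a sketch under stated assumptions: the claims do not fix matrix dimensions, accumulation precision, or rounding, so the function names `to_half` and `mfma`, the list-of-lists layout, and the choice to round only the inputs of the multiply to half precision (per claim 20) are hypothetical.

```python
import struct


def to_half(x):
    """Round x to the nearest IEEE 754 half-precision value (encoding named in claim 20)."""
    return struct.unpack("<e", struct.pack("<e", float(x)))[0]


def mfma(a, b, c):
    """Reference model of D = A x B + C for row-major lists of lists.

    Each output element is one dot product: the partial products of a row of A
    and a column of B are accumulated, then summed with the matching element of C.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            # Partial products of row i of A with column j of B.
            partials = [to_half(a[i][k]) * to_half(b[k][j]) for k in range(inner)]
            # Accumulate the partial products, then add the element of the third matrix.
            d[i][j] = sum(partials) + c[i][j]
    return d


if __name__ == "__main__":
    A = [[1.0, 2.0], [3.0, 4.0]]
    B = [[5.0, 6.0], [7.0, 8.0]]
    C = [[1.0, 1.0], [1.0, 1.0]]
    print(mfma(A, B, C))
```

In the claimed processor the operands and results live in the register file and the dot products run in dedicated logic circuitry; the nested loops above only model the numerical result, not that hardware organization.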
controlled by a single instruction for multiple threads [SIMT] in parallel · CPC title
from multiple instruction streams, e.g. multistreaming · CPC title
Instructions to perform operations on packed data, e.g. vector, tile or matrix operations · CPC title
with variable precision · CPC title
Arithmetic instructions · CPC title