Generalized acceleration of matrix multiply accumulate operations

US11816481B2 · US · B2

Patent metadata
Publication number: US-11816481-B2
Application number: US-202217890540-A
Country: US
Kind code: B2
Filing date: Aug 18, 2022
Priority date: May 8, 2017
Publication date: Nov 14, 2023
Grant date: Nov 14, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method, computer readable medium, and processor are disclosed for performing matrix multiply and accumulate (MMA) operations. The processor includes a datapath configured to execute the MMA operation to generate a plurality of elements of a result matrix at an output of the datapath. Each element of the result matrix is generated by calculating at least one dot product of corresponding pairs of vectors associated with matrix operands specified in an instruction for the MMA operation. A dot product operation includes the steps of: generating a plurality of partial products by multiplying each element of a first vector with a corresponding element of a second vector; aligning the plurality of partial products based on the exponents associated with each element of the first vector and each element of the second vector; and accumulating the plurality of aligned partial products into a result queue utilizing at least one adder.
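The three dot-product steps named in the abstract (multiply, align by exponent, accumulate) can be illustrated in software. The Python sketch below is a loose model under stated assumptions, not the patented datapath: the function name `aligned_dot`, the `frac_bits` parameter, and the 48-bit fixed-point accumulator width are all hypothetical choices made for illustration, and the inputs are assumed to be finite.

```python
import math

def aligned_dot(a, b, frac_bits=48):
    """Dot product that aligns partial products to a shared exponent before summing.

    Illustrative only: frac_bits and the integer accumulator are assumptions,
    not details taken from the patent. Inputs must be finite floats.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")

    # Step 1: form one partial product per element pair, keeping mantissa
    # and exponent separate. The product exponent is the sum of the two
    # operand exponents, so alignment depends only on operand exponents.
    parts = []
    for x, y in zip(a, b):
        mx, ex = math.frexp(x)  # x == mx * 2**ex, with 0.5 <= |mx| < 1 or mx == 0
        my, ey = math.frexp(y)
        mantissa = mx * my
        if mantissa != 0.0:     # zero products contribute nothing
            parts.append((mantissa, ex + ey))

    if not parts:
        return 0.0

    # Step 2: align every partial product to the largest exponent.
    max_exp = max(e for _, e in parts)

    # Step 3: accumulate the aligned values with plain integer addition.
    acc = 0
    for mantissa, e in parts:
        fixed = int(mantissa * (1 << frac_bits))  # mantissa as a fixed-point integer
        acc += fixed >> (max_exp - e)             # bits shifted out are dropped

    # Convert the fixed-point sum back to a float.
    return math.ldexp(acc, max_exp - frac_bits)
```

Because alignment discards the low-order bits of small partial products, the result can differ slightly from a floating-point sum when the operands span a wide range of magnitudes. A wider accumulator (a larger `frac_bits` in this sketch) reduces that loss.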

First claim

Opening claim text (preview).

What is claimed is:

1. A processor, comprising: an instruction cache; an L1 cache; an L2 cache; a crossbar (Xbar); arithmetic logic units (ALUs); a front end unit to read commands written by a host processor; a work distribution unit to dispatch tasks to a plurality of processing clusters; a register file to store matrices specified in a matrix-fused multiply accumulate (MFMA) instruction, wherein the MFMA instruction is to multiply a first matrix with a second matrix and sum a result with a third matrix, and wherein each element of the matrices is to be encoded as floating point; logic circuitry to calculate a dot product, wherein the dot product includes: accumulating a plurality of partial products generated by multiplying each element of a first vector with a corresponding element of a second vector; and summing the plurality of partial products with an element of a matrix; and wherein results of the MFMA instruction are to be accumulated in the register file.

2. The processor of claim 1, wherein the processor comprises the plurality of processing clusters and the plurality of processing clusters comprise the ALUs.

3. The processor of claim 1, wherein the plurality of processing clusters are general processing clusters (GPCs).

4. The processor of claim 1, wherein the processor is a parallel processing unit (PPU).

5. The processor of claim 1, wherein the processor is a graphics processing unit (GPU).

6. The processor of claim 1, further comprising one or more streaming multiprocessors (SMs) to calculate, at least in part, the dot product.

7. The processor of claim 1, further comprising one or more streaming multiprocessors (SMs), wherein the one or more SMs comprise the ALUs.

8. The processor of claim 1, further comprising a memory management unit (MMU).

9. The processor of claim 1, wherein the ALUs comprise a floating point ALU and an integer ALU.

10. The processor of claim 1, further comprising a host interface unit to decode packets received from the host processor.

11. A machine-readable medium comprising instructions that, if performed by one or more processors, cause the one or more processors to: calculate a dot product by: accumulating a plurality of partial products generated by multiplying each element of a first vector with a corresponding element of a second vector; and summing the plurality of partial products with an element of a matrix; wherein the one or more processors comprise: an instruction cache; an L1 cache; an L2 cache; a crossbar (Xbar); arithmetic logic units (ALUs); a front end unit to read commands written by a host processor; a work distribution unit to dispatch tasks to a plurality of processing clusters; and a register file to store matrices specified in a matrix-fused multiply accumulate (MFMA) instruction, wherein the MFMA instruction is to multiply a first matrix with a second matrix and sum a result with a third matrix, and wherein each element of the matrices is to be encoded as floating point; and wherein results of the MFMA instruction are to be accumulated in the register file.

12. The machine-readable medium of claim 11, wherein the plurality of processing clusters are general processing clusters (GPCs) and the GPCs comprise the ALUs.

13. The machine-readable medium of claim 11, wherein the one or more processors are one or more parallel processing units (PPUs).

14. The machine-readable medium of claim 11, wherein the one or more processors are graphics processing units (GPUs).

15. The machine-readable medium of claim 11, further comprising instructions that, if performed by the one or more processors, cause the one or more processors to calculate the dot product by one or more streaming multiprocessors (SMs).

16. The machine-readable medium of claim 11, wherein the one or more processors further comprise one or more streaming multiprocessors (SMs) and the one or more SMs comprise the ALUs.

17. The machine-readable medium of claim 11, wherein the one or more processors further comprise a unit to manage memory.

18. The machine-readable medium of claim 11, wherein the ALUs comprise at least one of a floating point ALU and an integer ALU.

19. The machine-readable medium of claim 11, further comprising instructions that, if performed by the one or more processors, cause the one or more processors to decode packets received from the host processor by a host interface unit.

20. The machine-readable medium of claim 11, wherein each element of the matrices is to be encoded as half-precision floating point.
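The claims describe an MFMA instruction that multiplies a first matrix by a second matrix and sums the result with a third matrix, where each output element is a dot product of partial products summed with an element of a matrix. The arithmetic can be sketched in plain Python as below. This is only a reference model of the math, not the claimed hardware: the names `mma` and `to_half` are hypothetical, matrices are assumed to be lists of rows, and `to_half` uses the standard library's `struct` half-precision format (`"e"`) to mimic the half-precision encoding mentioned in the final claim.

```python
import struct

def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value.

    Hypothetical helper: models half-precision operand encoding in software.
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]

def mma(a, b, c):
    """Return D = A x B + C for matrices given as lists of rows.

    Hypothetical reference model of a matrix multiply-accumulate.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("A must have as many columns as B has rows")
    if any(len(row) != cols for row in b):
        raise ValueError("B must be rectangular")
    if len(c) != rows or any(len(row) != cols for row in c):
        raise ValueError("C must have the same shape as A x B")

    d = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            # Partial products for the dot product of row i of A
            # with column j of B.
            partial = [a[i][k] * b[k][j] for k in range(inner)]
            # Sum the partial products with the matching element of C.
            out_row.append(sum(partial) + c[i][j])
        d.append(out_row)
    return d
```

A caller modeling half-precision operands could pass each element of `a` and `b` through `to_half` before calling `mma`, while keeping `c` and the result at higher precision. That split is an illustrative choice here, not something the claim text specifies.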

Assignees

Nvidia Corp

Inventors

Classifications

  • controlled by a single instruction for multiple threads [SIMT] in parallel · CPC title

  • from multiple instruction streams, e.g. multistreaming · CPC title

  • Instructions to perform operations on packed data, e.g. vector, tile or matrix operations · CPC title

  • with variable precision · CPC title

  • Arithmetic instructions · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11816481B2 cover?
A method, computer readable medium, and processor are disclosed for performing matrix multiply and accumulate (MMA) operations. The processor includes a datapath configured to execute the MMA operation to generate a plurality of elements of a result matrix at an output of the datapath. Each element of the result matrix is generated by calculating at least one dot product of corresponding pairs …
Who is the assignee on this patent?
Nvidia Corp
What technology area does this patent fall under?
The primary CPC classification is G06F9/30014. Mapped technology areas include Physics.
When was this patent published?
Published Nov 14, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).