What technology area does this patent fall under?

Primary CPC classification G06F9/30014. Mapped technology areas include Physics.

When was this patent published?

Publication date: Oct 24, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Generalized acceleration of matrix multiply accumulate operations

US11797302B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11797302-B2
Application number	US-202117351161-A
Country	US
Kind code	B2
Filing date	Jun 17, 2021
Priority date	May 8, 2017
Publication date	Oct 24, 2023
Grant date	Oct 24, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method, computer readable medium, and processor are disclosed for performing matrix multiply and accumulate (MMA) operations. The processor includes a datapath configured to execute the MMA operation to generate a plurality of elements of a result matrix at an output of the datapath. Each element of the result matrix is generated by calculating at least one dot product of corresponding pairs of vectors associated with matrix operands specified in an instruction for the MMA operation. A dot product operation includes the steps of: generating a plurality of partial products by multiplying each element of a first vector with a corresponding element of a second vector; aligning the plurality of partial products based on the exponents associated with each element of the first vector and each element of the second vector; and accumulating the plurality of aligned partial products into a result queue utilizing at least one adder.
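The three dot-product steps in the abstract (generate partial products, align them by exponent, accumulate with an adder) can be illustrated with a minimal Python sketch. This is not the patented datapath. The names `decompose`, `aligned_dot`, and `MANT_BITS` are hypothetical, the 11-bit significand is an assumption, and the sketch aligns to the smallest exponent with exact left shifts, whereas hardware would more plausibly align to the largest exponent and truncate.

```python
import math

MANT_BITS = 11  # assumption: half-precision-like significand width


def decompose(x):
    """Split a float into (integer mantissa, exponent) with x ~= mant * 2**exp."""
    m, e = math.frexp(x)  # x == m * 2**e, with 0.5 <= |m| < 1 (or m == 0)
    mant = int(m * (1 << MANT_BITS))  # truncates if x needs more than MANT_BITS bits
    return mant, e - MANT_BITS


def aligned_dot(a, b):
    """Dot product via integer partial products aligned to a common exponent."""
    # Step 1: partial products. Mantissas multiply, exponents add.
    products = []
    for x, y in zip(a, b):
        mx, ex = decompose(x)
        my, ey = decompose(y)
        products.append((mx * my, ex + ey))
    if not products:
        return 0.0

    # Step 2: align every partial product to the smallest exponent.
    # Left shifts on Python ints are exact, so no bits are lost here.
    min_exp = min(e for _, e in products)
    aligned = [m << (e - min_exp) for m, e in products]

    # Step 3: accumulate the aligned values with a plain integer add.
    total = sum(aligned)
    return math.ldexp(total, min_exp)


if __name__ == "__main__":
    print(aligned_dot([1.5, -2.0, 0.25], [4.0, 0.5, 8.0]))
```

Once the partial products share an exponent, the accumulation reduces to integer addition, which is why the alignment step comes before the adder.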

First claim

Opening claim text (preview).

What is claimed is:
1. A multi-threaded processor, comprising: a decoder to decode a matrix multiply and accumulate (MMA) instruction for signed matrix data; a buffer to store the signed matrix data specified by operands of the MMA instruction; a scheduler to schedule the MMA instruction; a fused multiply accumulate (FMA) unit to perform a dot product of corresponding elements of the signed matrix data; an arithmetic logic unit (ALU) to add a plurality of partial product results of the dot product to be accumulated into a register; and memory to store a result of the MMA instruction.
2. The multi-threaded processor of claim 1, wherein the signed matrix data comprises 32-bit two's complement integer data.
3. The multi-threaded processor of claim 1, wherein the signed matrix data comprises 16-bit two's complement integer data.
4. The multi-threaded processor of claim 1, further comprising a tree of adders, where the tree of adders comprises at least a 3:2 carry sum adder (CSA).
5. The multi-threaded processor of claim 1, further comprising a dispatch unit to transmit the MMA instruction to the FMA unit.
6. The multi-threaded processor of claim 1, further comprising a register file to provide the register.
7. The multi-threaded processor of claim 1, further comprising a special function unit (SFU).
8. The multi-threaded processor of claim 1, further comprising an interconnect to connect the ALU and the register.
9. A system comprising the multi-threaded processor of claim 1, wherein the system further comprises: a system bus to connect the multi-threaded processor to one or more peripheral devices; and one or more dynamic random access memory (DRAM) devices.
10. A single instruction multiple data (SIMD) multi-threaded processor, comprising: a plurality of cores to perform a matrix multiply and accumulate (MMA) instruction, where each of the plurality of cores comprises: a front end to fetch the MMA instruction; an instruction cache to store the MMA instruction; an L1 cache to store data; an L2 cache to store data; a plurality of ports to read from and write to a memory; one or more load/store units to read and write the memory; an interconnect to couple the memory and the plurality of cores; a decoder to decode the MMA instruction; a buffer to store signed matrix data specified by operands of the MMA instruction; a scheduler to schedule the MMA instruction; a fused multiply accumulate (FMA) unit to perform a dot product of corresponding elements of the signed matrix data; an arithmetic logic unit (ALU) to add a plurality of partial product results of the dot product to be accumulated into a register; and wherein the memory is to store a result of the MMA instruction.
11. The SIMD multi-threaded processor of claim 10, wherein the L1 cache comprises at least 24 kilobytes (KB) of storage.
12. The SIMD multi-threaded processor of claim 10, wherein the memory comprises at least 64 kilobytes (KB) of storage.
13. The SIMD multi-threaded processor of claim 10, wherein the interconnect connects the one or more load/store units to the register.
14. The SIMD multi-threaded processor of claim 10, wherein the scheduler dispatches the MMA instruction to one or more cores of the plurality of cores.
15. The SIMD multi-threaded processor of claim 10, wherein the signed matrix data comprises 32-bit two's complement integer data.
16. The SIMD multi-threaded processor of claim 10, wherein the signed matrix data comprises 16-bit two's complement integer data.
17. A computer-implemented method, comprising: decoding, by a decoder, a matrix multiply and accumulate (MMA) instruction for signed matrix data; storing, by a buffer, the signed matrix data specified by operands of the MMA instruction; scheduling, by a scheduler, the MMA instruction; performing, by a fused multiply accumulate (FMA) unit, a dot product of corresponding elements of the signed matrix data; adding, by an arithmetic logic unit (ALU), a plurality of partial product results of the dot product to be accumulated into a register; and storing, by a memory, a result of the MMA instruction.
18. The computer-implemented method of claim 17, wherein the ALU comprises at least one adder.
19. The computer-implemented method of claim 17, wherein the signed matrix data comprises 32-bit two's complement integer data.
20. The computer-implemented method of claim 17, wherein the signed matrix data comprises 16-bit two's complement integer data.
21. The computer-implemented method of claim 17, wherein the scheduler comprises a dispatch unit to dispatch the MMA instruction.
22. The computer-implemented method of claim 17, further comprising accumulating the plurality of partial product results into the register using an interconnect.
23. The computer-implemented method of claim 17, wherein a register file provides the register.
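As a rough illustration of what the claims describe functionally, the sketch below computes D = A x B + C on signed integers and reduces the partial products of each dot product with 3:2 carry-save compressors before one final addition. It is an assumption-laden model, not the claimed hardware. The function names (`csa_3_2`, `reduce_partial_products`, `mma_int`) are hypothetical, the compressors are chained rather than arranged as a balanced tree, and Python's unbounded integers mean that fixed-width overflow of a 16-bit or 32-bit register is not modeled.

```python
def csa_3_2(a, b, c):
    """3:2 carry-save adder: compress three addends into (sum, carry).

    The invariant a + b + c == sum + carry also holds for negative Python
    ints, which behave as infinitely sign-extended two's complement values.
    """
    s = a ^ b ^ c                               # bitwise sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1  # majority bits, shifted up
    return s, carry


def reduce_partial_products(terms):
    """Reduce addends with 3:2 compressors, then do a single final add."""
    terms = list(terms)
    while len(terms) > 2:
        a, b, c = terms.pop(), terms.pop(), terms.pop()
        terms.extend(csa_3_2(a, b, c))  # three addends in, two out
    return sum(terms)


def mma_int(A, B, C):
    """Return D = A x B + C for integer matrices given as lists of rows."""
    n, k, m = len(A), len(B), len(B[0])
    D = []
    for i in range(n):
        row = []
        for j in range(m):
            # Partial products of the dot product of row i of A and column j of B.
            partials = [A[i][p] * B[p][j] for p in range(k)]
            # The accumulator input C[i][j] is treated as one more addend.
            row.append(reduce_partial_products(partials + [C[i][j]]))
        D.append(row)
    return D


if __name__ == "__main__":
    A = [[1, -2], [3, 4]]
    B = [[5, 6], [-7, 8]]
    C = [[1, 1], [1, 1]]
    print(mma_int(A, B, C))
```

Carry-save compression defers carry propagation to the final addition, which is the usual motivation for a 3:2 CSA stage ahead of a conventional adder.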

Assignees

Nvidia Corp

Inventors

Classifications

G06F9/3888
controlled by a single instruction for multiple threads [SIMT] in parallel · CPC title
G06F9/3851
from multiple instruction streams, e.g. multistreaming · CPC title
G06F9/30036
Instructions to perform operations on packed data, e.g. vector, tile or matrix operations · CPC title
G06F9/30014 (Primary)
with variable precision · CPC title
G06F9/3001
Arithmetic instructions · CPC title

Patent family

Related publications grouped by family.

View patent family 64015316

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11797302B2 cover?: A method, computer readable medium, and processor are disclosed for performing matrix multiply and accumulate (MMA) operations. The processor includes a datapath configured to execute the MMA operation to generate a plurality of elements of a result matrix at an output of the datapath. Each element of the result matrix is generated by calculating at least one dot product of corresponding pairs …
Who is the assignee on this patent?: Nvidia Corp

Related patents

Instructions and logic to perform floating point and integer operations for machine learning

Multiply-accumulate “0” data gating

Generalized acceleration of matrix multiply accumulate operations

Scalable memory-optimized hardware for matrix-solve

Memory interconnect network architecture for vector processor

Multiplying and adding matrices
