Implementing Fundamental Computational Primitives Using A Matrix Multiplication Accelerator (MMA)
US-2018253402-A1 · Sep 6, 2018 · US
US12164593B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12164593-B2 |
| Application number | US-202117374988-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jul 13, 2021 |
| Priority date | Dec 7, 2018 |
| Publication date | Dec 10, 2024 |
| Grant date | Dec 10, 2024 |
Official abstract text for this publication.
A general matrix-matrix multiplication (GEMM) dataflow accelerator circuit is disclosed that includes a smart 3D stacking DRAM architecture. The accelerator circuit includes a memory bank, a peripheral lookup table stored in the memory bank, and a first vector buffer to store a first vector that is used as a row address into the lookup table. The circuit includes a second vector buffer to store a second vector that is used as a column address into the lookup table, and lookup table buffers to receive and store lookup table entries from the lookup table. The circuit further includes adders to sum the first product and a second product, and an output buffer to store the sum. The lookup table buffers determine a product of the first vector and the second vector without performing a multiply operation. The embodiments include a hierarchical lookup architecture to reduce latency. Accumulation results are propagated in a systolic manner.
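The lookup idea in the abstract can be illustrated with a minimal software sketch. It is not the patented circuit: the 4-bit operand width, the names `LUT`, `lut_product`, and `lut_dot`, and the reading of the vector "product" as element-wise products that are then summed are all assumptions made for illustration. The table is filled once up front (using ordinary multiplication here, purely to build it); after that, each product is fetched by using one operand as the row address and the other as the column address, and only additions are performed at run time.

```python
# Illustrative sketch only; operand width and all names are assumptions.
BITS = 4                 # assumed operand width (values 0..15)
SIZE = 1 << BITS

# One-time table construction: LUT[row][col] holds row * col.
# Multiplication is used here only to populate the table.
LUT = [[row * col for col in range(SIZE)] for row in range(SIZE)]


def lut_product(first_vector, second_vector):
    """Return element-wise products fetched from the table.

    Each value of first_vector is used as a row address and the matching
    value of second_vector as a column address; no multiply is performed.
    """
    if len(first_vector) != len(second_vector):
        raise ValueError("vectors must have the same length")
    return [LUT[a][b] for a, b in zip(first_vector, second_vector)]


def lut_dot(first_vector, second_vector):
    """Sum the looked-up products, mimicking the adders and output buffer."""
    return sum(lut_product(first_vector, second_vector))


if __name__ == "__main__":
    print(lut_product([1, 2, 3], [4, 5, 6]))
    print(lut_dot([1, 2, 3], [4, 5, 6]))
```

A table indexed this way grows as the square of the number of representable operand values, which is one reason the sketch assumes a narrow, quantised operand width.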
Opening claim text (preview).
What is claimed is:
1. A system, comprising: a memory; a lookup data structure stored in the memory; a first vector buffer configured to store a first vector that is used as a first address into the lookup data structure; and a second vector buffer configured to store a second vector that is used as a second address into the lookup data structure; wherein the lookup data structure is configured to provide a result based on a lookup operation, the result being generated based on a computation of the first vector and the second vector.
2. The system of claim 1, further comprising one or more lookup data structure buffers configured to receive one or more lookup data structure entries from the lookup data structure, and to determine a product of the first vector and the second vector based at least in part on the one or more lookup data structure entries.
3. The system of claim 2, wherein the one or more lookup data structure buffers are configured to store the one or more lookup data structure entries from the lookup data structure, and wherein the second vector buffer is configured to stream the second vector to the one or more lookup data structure buffers.
4. The system of claim 2, wherein the product is a first product, the circuit further comprising: one or more adders configured to sum the first product and a second product; and an output buffer configured to store a result of the sum of the first product and the second product.
5. The system of claim 4, wherein the one or more lookup data structure buffers are configured to determine the second product using a value of a third vector as the second address and a value of a fourth vector as the first address, into the lookup data structure.
6. The system of claim 2, wherein the one or more lookup data structure buffers are configured to determine the product using a value of the first vector as the second address and a value of the second vector as the first address, into the lookup data structure.
7. The system of claim 2, wherein the memory, the lookup data structure, the first vector buffer, the one or more lookup data structure buffers, and the second vector buffer are configured to reduce latency.
8. The system of claim 2, wherein the one or more lookup data structure buffers are configured to store one or more vectors to determine one or more products of the one or more vectors.
9. The system of claim 2, further comprising one or more bank units.
10. The system of claim 9, wherein, the one or more bank units are configured to form a pipelined dataflow chain in which partial output data from one bank unit from among the one or more bank units is fed into another bank unit from among the one or more bank units for data accumulation.
11. The system of claim 9, wherein a first bank unit from among the one or more bank units is configured to output the product to a second bank unit.
12. The system of claim 11, wherein the second bank unit is configured to store the product received from the first bank unit.
13. The system of claim 12, wherein: the product is a first product; the second bank unit is configured to receive a third vector from the memory in a streaming fashion; the one or more lookup data structure buffers of the second bank unit are configured to determine a second product based on the third vector using the lookup data structure; and an output buffer of the second bank unit is configured to store a sum of the first product and the second product.
14. The system of claim 9, wherein: the one or more bank units is a systolic array that is configured to propagate partial sums; and the one or more bank units is configured to receive one or more input matrix vectors in a streaming fashion, and to propagate the one or more input matrix vectors in a direction that is different from a data flow direction of the partial sums.
15. The system of claim 14, wherein the memory is a DRAM memory, the system further comprising: a near-DRAM-processing dataflow (NDP-DF) accelerator unit die including one or more channels, wherein: the one or more channels include the one or more bank units; and the one or more bank units include the DRAM, the lookup data structure and the one or more lookup data structure buffers.
16. The system of claim 15, wherein the NDP-DF accelerator unit die is one of a plurality of NDP-DF accelerator unit dies that are stacked one atop another.
17. A memory device, comprising: a lookup table; a first vector buffer configured to store a first vector that is used as a row address into the lookup table; a second vector buffer configured to store a second vector that is used as a column address into the lookup table; and one or more lookup table buffers configured to receive one or more lookup table entries from the lookup table, and to determine a product of the first vector and the second vector based at least in part on the one or more lookup table entries.
18. The device of claim 17, further comprising one or more bank units configured to form a pipelined dataflow chain in which partial output data from one bank unit from among the one or more bank units is fed into another bank unit from among the one or more bank units for data accumulation, wherein: the product is a first product; the second bank unit is configured to receive a third vector in a streaming fashion; the one or more lookup table buffers are configured to determine a second product based on the third vector using the lookup table; and an output buffer is configured to store a sum of the first product and the second product.
19. The device of claim 18, further comprising a near-DRAM-processing dataflow (NDP-DF) accelerator unit die including one or more channels, wherein: the one or more bank units is a systolic array that is configured to propagate partial sums; the one or more bank units is configured to receive one or more input matrix vectors in a streaming fashion, and to propagate the one or more input matrix vectors in a direction that is different from a data flow direction of the partial sums; the one or more channels include the one or more bank units; and the one or more bank units include the lookup table and the one or more lookup table buffers.
20. A dataflow accelerator method, comprising: storing a lookup table stored in a memory; storing, by a first vector buffer, a first vector that is used as a row address into the lookup table; storing, by a second vector buffer, a second vector that is used as a column address into the lookup table; receiving, by one or more lookup table buffers, one or more lookup table entries from the lookup table; and determining, by the one or more lookup table buffers, a product of the first vector and the second vector based at least in part on the one or more lookup table entries.
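The pipelined dataflow chain recited in claims 10 to 13 and 18 can be sketched in software as follows. This is a rough model under stated assumptions, not the claimed hardware: the class `BankUnit`, the function `chain_accumulate`, the 4-bit operand range, and the choice to treat each bank's product as a dot product over a slice of the operands are all hypothetical. Each bank unit holds one stored vector, receives another vector in a streaming fashion, looks up the products in its table, adds the partial sum handed over by the previous bank unit, and stores the result in its output buffer before passing it on.

```python
# Illustrative sketch only; class names, widths, and structure are assumptions.
SIZE = 16  # assumed 4-bit operands (values 0..15)

# Shared product table: LUT[row][col] holds row * col, built once.
LUT = [[row * col for col in range(SIZE)] for row in range(SIZE)]


class BankUnit:
    """Hypothetical bank unit: a stored vector plus an output buffer."""

    def __init__(self, stored_vector):
        self.stored_vector = list(stored_vector)
        self.output_buffer = 0

    def step(self, streamed_vector, partial_in=0):
        """Look up products for the streamed vector and accumulate.

        partial_in is the partial sum received from the previous bank unit.
        """
        if len(streamed_vector) != len(self.stored_vector):
            raise ValueError("vector lengths must match")
        product = sum(
            LUT[a][b] for a, b in zip(self.stored_vector, streamed_vector)
        )
        self.output_buffer = partial_in + product
        return self.output_buffer


def chain_accumulate(banks, streamed_vectors):
    """Feed each bank's partial output into the next bank for accumulation."""
    partial = 0
    for bank, vector in zip(banks, streamed_vectors):
        partial = bank.step(vector, partial)
    return partial


if __name__ == "__main__":
    banks = [BankUnit([1, 2]), BankUnit([3, 4])]
    print(chain_accumulate(banks, [[5, 6], [7, 8]]))
```

The loop runs the bank units one after another for clarity; in a systolic arrangement the units would operate concurrently on successive inputs, with partial sums and input vectors moving in different directions as claim 14 describes.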
Quantised networks; Sparse networks; Compressed networks · CPC title
Three-dimensional [3D] integrated devices · CPC title
Sum of products (for applications thereof, see the relevant places, e.g. G06F17/10, H03H17/00) · CPC title
Combinations of networks · CPC title
based on physical entities controlled by simulated intelligence so as to replicate intelligent life forms, e.g. based on robots replicating pets or humans in their appearance or behaviour · CPC title