Dataflow accelerator architecture for general matrix-matrix multiplication and tensor computation in deep learning

US12164593B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12164593-B2
Application number: US-202117374988-A
Country: US
Kind code: B2
Filing date: Jul 13, 2021
Priority date: Dec 7, 2018
Publication date: Dec 10, 2024
Grant date: Dec 10, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A general matrix-matrix multiplication (GEMM) dataflow accelerator circuit is disclosed that includes a smart 3D stacking DRAM architecture. The accelerator circuit includes a memory bank, a peripheral lookup table stored in the memory bank, and a first vector buffer to store a first vector that is used as a row address into the lookup table. The circuit includes a second vector buffer to store a second vector that is used as a column address into the lookup table, and lookup table buffers to receive and store lookup table entries from the lookup table. The circuit further includes adders to sum the first product and a second product, and an output buffer to store the sum. The lookup table buffers determine a product of the first vector and the second vector without performing a multiply operation. The embodiments include a hierarchical lookup architecture to reduce latency. Accumulation results are propagated in a systolic manner.
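
The lookup idea in the abstract can be illustrated in software. The following is a minimal sketch, not the patented circuit: it assumes 4-bit unsigned operands (the abstract does not state a bit width), and the names `build_lut`, `lut_product`, and `lut_dot` are hypothetical. The table is filled once in advance; after that, each product is a table read indexed by a row address (a value from the first vector) and a column address (a value from the second vector), and the products are combined with additions only.

```python
BITS = 4            # assumed operand width; not specified in the abstract
SIZE = 1 << BITS    # number of rows and columns in the table


def build_lut(size=SIZE):
    """Precompute every product once; row index a, column index b."""
    return [[a * b for b in range(size)] for a in range(size)]


LUT = build_lut()


def lut_product(a, b, lut=LUT):
    """Return a*b by a table read: a is the row address, b the column address."""
    return lut[a][b]


def lut_dot(first, second, lut=LUT):
    """Dot product of two vectors using only table reads and additions."""
    if len(first) != len(second):
        raise ValueError("vectors must have the same length")
    total = 0
    for a, b in zip(first, second):
        total += lut_product(a, b, lut)  # adder stage: accumulate products
    return total


print(lut_dot([1, 2, 3], [4, 5, 6]))  # 4 + 10 + 18 = 32
```

The multiplications in `build_lut` happen only while the table is being populated; the per-element work in `lut_dot` is indexing and addition, which is the property the abstract describes.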

First claim

Opening claim text (preview).

What is claimed is:

1. A system, comprising: a memory; a lookup data structure stored in the memory; a first vector buffer configured to store a first vector that is used as a first address into the lookup data structure; and a second vector buffer configured to store a second vector that is used as a second address into the lookup data structure; wherein the lookup data structure is configured to provide a result based on a lookup operation, the result being generated based on a computation of the first vector and the second vector.

2. The system of claim 1, further comprising one or more lookup data structure buffers configured to receive one or more lookup data structure entries from the lookup data structure, and to determine a product of the first vector and the second vector based at least in part on the one or more lookup data structure entries.

3. The system of claim 2, wherein the one or more lookup data structure buffers are configured to store the one or more lookup data structure entries from the lookup data structure, and wherein the second vector buffer is configured to stream the second vector to the one or more lookup data structure buffers.

4. The system of claim 2, wherein the product is a first product, the circuit further comprising: one or more adders configured to sum the first product and a second product; and an output buffer configured to store a result of the sum of the first product and the second product.

5. The system of claim 4, wherein the one or more lookup data structure buffers are configured to determine the second product using a value of a third vector as the second address and a value of a fourth vector as the first address, into the lookup data structure.

6. The system of claim 2, wherein the one or more lookup data structure buffers are configured to determine the product using a value of the first vector as the second address and a value of the second vector as the first address, into the lookup data structure.

7. The system of claim 2, wherein the memory, the lookup data structure, the first vector buffer, the one or more lookup data structure buffers, and the second vector buffer are configured to reduce latency.

8. The system of claim 2, wherein the one or more lookup data structure buffers are configured to store one or more vectors to determine one or more products of the one or more vectors.

9. The system of claim 2, further comprising one or more bank units.

10. The system of claim 9, wherein, the one or more bank units are configured to form a pipelined dataflow chain in which partial output data from one bank unit from among the one or more bank units is fed into another bank unit from among the one or more bank units for data accumulation.

11. The system of claim 9, wherein a first bank unit from among the one or more bank units is configured to output the product to a second bank unit.

12. The system of claim 11, wherein the second bank unit is configured to store the product received from the first bank unit.

13. The system of claim 12, wherein: the product is a first product; the second bank unit is configured to receive a third vector from the memory in a streaming fashion; the one or more lookup data structure buffers of the second bank unit are configured to determine a second product based on the third vector using the lookup data structure; and an output buffer of the second bank unit is configured to store a sum of the first product and the second product.

14. The system of claim 9, wherein: the one or more bank units is a systolic array that is configured to propagate partial sums; and the one or more bank units is configured to receive one or more input matrix vectors in a streaming fashion, and to propagate the one or more input matrix vectors in a direction that is different from a data flow direction of the partial sums.

15. The system of claim 14, wherein the memory is a DRAM memory, the system further comprising: a near-DRAM-processing dataflow (NDP-DF) accelerator unit die including one or more channels, wherein: the one or more channels include the one or more bank units; and the one or more bank units include the DRAM, the lookup data structure and the one or more lookup data structure buffers.

16. The system of claim 15, wherein the NDP-DF accelerator unit die is one of a plurality of NDP-DF accelerator unit dies that are stacked one atop another.

17. A memory device, comprising: a lookup table; a first vector buffer configured to store a first vector that is used as a row address into the lookup table; a second vector buffer configured to store a second vector that is used as a column address into the lookup table; and one or more lookup table buffers configured to receive one or more lookup table entries from the lookup table, and to determine a product of the first vector and the second vector based at least in part on the one or more lookup table entries.

18. The device of claim 17, further comprising one or more bank units configured to form a pipelined dataflow chain in which partial output data from one bank unit from among the one or more bank units is fed into another bank unit from among the one or more bank units for data accumulation, wherein: the product is a first product; the second bank unit is configured to receive a third vector in a streaming fashion; the one or more lookup table buffers are configured to determine a second product based on the third vector using the lookup table; and an output buffer is configured to store a sum of the first product and the second product.

19. The device of claim 18, further comprising a near-DRAM-processing dataflow (NDP-DF) accelerator unit die including one or more channels, wherein: the one or more bank units is a systolic array that is configured to propagate partial sums; the one or more bank units is configured to receive one or more input matrix vectors in a streaming fashion, and to propagate the one or more input matrix vectors in a direction that is different from a data flow direction of the partial sums; the one or more channels include the one or more bank units; and the one or more bank units include the lookup table and the one or more lookup table buffers.

20. A dataflow accelerator method, comprising: storing a lookup table stored in a memory; storing, by a first vector buffer, a first vector that is used as a row address into the lookup table; storing, by a second vector buffer, a second vector that is used as a column address into the lookup table; receiving, by one or more lookup table buffers, one or more lookup table entries from the lookup table; and determining, by the one or more lookup table buffers, a product of the first vector and the second vector based at least in part on the one or more lookup table entries.
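
The pipelined dataflow chain of claims 10 to 13 can be pictured with a small software model. This is a hedged sketch only: the class `BankUnit`, the function `run_chain`, the 4-bit table, and the way the matrix is split into tiles are assumptions made for illustration, and the loop runs the banks one after another rather than modelling hardware timing. Each bank holds one tile of a matrix, receives its slice of the input vector, forms products by table lookup, adds them to the partial sums arriving from the previous bank, and passes the result on.

```python
SIZE = 16  # assumed 4-bit operands; the claims do not fix a width
LUT = [[a * b for b in range(SIZE)] for a in range(SIZE)]


class BankUnit:
    """Hypothetical model of one bank: a weight tile plus an output buffer."""

    def __init__(self, tile, lut=LUT):
        self.tile = tile            # rows x slice-width block of the matrix
        self.lut = lut
        self.output_buffer = None   # holds the accumulated partial sums

    def step(self, x_slice, partial_in):
        """Add this bank's looked-up products to the incoming partial sums."""
        out = []
        for row, acc in zip(self.tile, partial_in):
            for w, x in zip(row, x_slice):
                acc += self.lut[w][x]   # product by lookup, then accumulate
            out.append(acc)
        self.output_buffer = out
        return out


def run_chain(banks, x_slices, rows):
    """Feed partial sums from each bank into the next one in the chain."""
    partial = [0] * rows
    for bank, x_slice in zip(banks, x_slices):
        partial = bank.step(x_slice, partial)
    return partial


# W = [[1, 2, 3, 4], [5, 6, 7, 8]] split column-wise across two banks.
banks = [BankUnit([[1, 2], [5, 6]]), BankUnit([[3, 4], [7, 8]])]
result = run_chain(banks, [[1, 2], [3, 4]], rows=2)
print(result)  # [30, 70], the same as W multiplied by [1, 2, 3, 4]
```

Splitting along the shared dimension means no bank ever needs the whole matrix; only the short partial-sum vector travels between banks, which is the accumulation pattern the claims describe.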

Assignees

Samsung Electronics Co Ltd

Inventors

Classifications

  • Quantised networks; Sparse networks; Compressed networks

  • Three-dimensional [3D] integrated devices

  • Sum of products (for applications thereof, see the relevant places, e.g. G06F17/10, H03H17/00)

  • G06N3/045 (Primary): Combinations of networks

  • based on physical entities controlled by simulated intelligence so as to replicate intelligent life forms, e.g. based on robots replicating pets or humans in their appearance or behaviour

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12164593B2 cover?
A general matrix-matrix multiplication (GEMM) dataflow accelerator circuit is disclosed that includes a smart 3D stacking DRAM architecture. The accelerator circuit includes a memory bank, a peripheral lookup table stored in the memory bank, and a first vector buffer to store a first vector that is used as a row address into the lookup table. The circuit includes a second vector buffer to store…
Who is the assignee on this patent?
Samsung Electronics Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06N3/045. Mapped technology areas include Physics.
When was this patent published?
Publication date Dec 10, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).