Dataflow accelerator architecture for general matrix-matrix multiplication and tensor computation in deep learning

US11100193B2 · US · B2

Patent metadata
Publication number: US-11100193-B2
Application number: US-201916388860-A
Country: US
Kind code: B2
Filing date: Apr 18, 2019
Priority date: Dec 7, 2018
Publication date: Aug 24, 2021
Grant date: Aug 24, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A general matrix-matrix multiplication (GEMM) dataflow accelerator circuit is disclosed that includes a smart 3D stacking DRAM architecture. The accelerator circuit includes a memory bank, a peripheral lookup table stored in the memory bank, and a first vector buffer to store a first vector that is used as a row address into the lookup table. The circuit includes a second vector buffer to store a second vector that is used as a column address into the lookup table, and lookup table buffers to receive and store lookup table entries from the lookup table. The circuit further includes adders to sum the first product and a second product, and an output buffer to store the sum. The lookup table buffers determine a product of the first vector and the second vector without performing a multiply operation. The embodiments include a hierarchical lookup architecture to reduce latency. Accumulation results are propagated in a systolic manner.
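The lookup-based multiplication described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the patented circuit: the 4-bit unsigned operand width and the names `build_lut`, `scale_vector`, and `lut_dot` are hypothetical and do not come from the patent. A value from the first vector selects a row of the table, that row is copied once into a buffer, and values streamed from the second vector then act as column addresses into the buffered row, so the run-time path contains lookups and additions only.

```python
# Software sketch of lookup-table multiplication (not the patented hardware).
# Assumption: operands are 4-bit unsigned integers; the abstract fixes no bit width.
OPERAND_BITS = 4
TABLE_SIZE = 1 << OPERAND_BITS


def build_lut(size=TABLE_SIZE):
    """Precompute all products once, offline; lut[row][col] == row * col."""
    return [[row * col for col in range(size)] for row in range(size)]


def scale_vector(lut, a_value, b_vector):
    """Multiply every element of b_vector by a_value using lookups only.

    a_value acts as the row address. The selected row is held in a local
    buffer so the streamed b values (column addresses) never touch the
    full table again.
    """
    row_buffer = lut[a_value]              # one table access per a_value
    return [row_buffer[b] for b in b_vector]


def lut_dot(lut, a_vector, b_vector):
    """Dot product: each product is a lookup, and an adder accumulates them."""
    total = 0
    for a, b in zip(a_vector, b_vector):
        total += lut[a][b]                 # no multiply at run time
    return total


LUT = build_lut()
print(scale_vector(LUT, 3, [0, 1, 15]))
print(lut_dot(LUT, [1, 2, 3], [4, 5, 6]))
```

Buffering one row per first-vector value mirrors the hierarchical lookup idea in the abstract: repeated column lookups are served from the small buffer rather than from the full table in the memory bank.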

First claim

Opening claim text (preview).

What is claimed is:

1. A general matrix-matrix multiplication (GEMM) dataflow accelerator semiconductor circuit, comprising: a memory bank; a peripheral lookup table stored in the memory bank; a first vector buffer configured to store a first vector that is used as a row address into the lookup table; a second vector buffer configured to store a second vector that is used as a column address into the lookup table; and one or more lookup table buffers configured to receive one or more lookup table entries, wherein the second vector buffer is configured to stream the second vector to the one or more lookup table buffers, and the one or more lookup table buffers are configured to store the one or more lookup table entries from the lookup table, wherein the one or more lookup table buffers are configured to determine a product of the first vector and the second vector based at least in part on the one or more lookup table entries from the lookup table.

2. The GEMM dataflow accelerator semiconductor circuit of claim 1, wherein the product is a first product, the circuit further comprising: one or more adders configured to sum the first product and a second product; and an output buffer configured to store a result of the sum of the first product and the second product.

3. The GEMM dataflow accelerator semiconductor circuit of claim 2, wherein the one or more lookup table buffers are configured to determine the first product using a value of the first vector and a value of the second vector as the column address and the row address, respectively, into the lookup table, without performing the multiply operation.

4. The GEMM dataflow accelerator semiconductor circuit of claim 2, wherein the one or more lookup table buffers are configured to determine the second product using a value of a third vector and a value of a fourth vector as the column address and the row address, respectively, into the lookup table, without performing the multiply operation.

5. The GEMM dataflow accelerator semiconductor circuit of claim 1, wherein the memory bank, the peripheral lookup table, the first vector buffer, the one or more lookup table buffers, and the second vector buffer form a hierarchical lookup architecture to reduce latency.

6. The GEMM dataflow accelerator semiconductor circuit of claim 1, further comprising a plurality of lookup table buffers including the one or more lookup table buffers, wherein the plurality of lookup table buffers are configured to store a corresponding plurality of matrix vectors to determine a plurality of products of the plurality of matrix vectors without accessing the lookup table stored in the memory bank, and without performing the multiply operation.

7. The GEMM dataflow accelerator semiconductor circuit of claim 6, further comprising a peripheral array of smart bank units, wherein the peripheral array of smart bank units are configured to form a pipelined dataflow chain in which partial output data from one smart bank unit from among the array of smart bank units is fed into another smart bank unit from among the array of smart bank units for data accumulation.

8. The GEMM dataflow accelerator semiconductor circuit of claim 7, wherein each of the smart bank units includes the memory bank, the lookup table, the plurality of lookup table buffers, one or more adders, and an output buffer.

9. The GEMM dataflow accelerator semiconductor circuit of claim 8, wherein a first smart bank unit from among the plurality of smart bank units is configured to output the product to a second smart bank unit that is adjacent to the first smart bank unit.

10. The GEMM dataflow accelerator semiconductor circuit of claim 9, wherein the second smart bank unit is configured to store the product received from the first smart bank unit.

11. The GEMM dataflow accelerator semiconductor circuit of claim 10, wherein: the product is a first product; the second smart bank unit is configured to receive a third vector from the memory bank in the streaming fashion; the one or more lookup table buffers of the second smart bank unit are configured to determine a second product based on the third vector using the lookup table without performing the multiply operation; the one or more adders of the second smart bank unit are configured to calculate a sum of the first product and the second product; and the output buffer of the second smart bank unit is configured to store the sum of the first product and the second product.

12. The GEMM dataflow accelerator semiconductor circuit of claim 11, wherein: the second smart bank unit is configured to output the sum of the first product and the second product to a third smart bank unit from among the peripheral array of smart bank units, wherein the third smart bank unit is adjacent to the second smart bank unit; and the third smart bank unit is configured to store the sum.

13. The GEMM dataflow accelerator semiconductor circuit of claim 12, wherein: the peripheral array of smart bank units is a systolic array that is configured to propagate partial sums in a serpentine fashion; and the peripheral array of smart bank units is configured to receive a plurality of input matrix vectors in a streaming fashion, and to propagate the plurality of input matrix vectors in a direction that is perpendicular to a data flow direction of the partial sums.

14. The GEMM dataflow accelerator semiconductor circuit of claim 13, wherein the memory bank is a DRAM memory bank, the circuit further comprising: a near-DRAM-processing dataflow (NDP-DF) accelerator unit die including a plurality of channels, wherein: each of the channels includes the peripheral array of smart bank units arranged in a serpentine fashion; and each of the smart bank units includes the DRAM bank, the lookup table, the plurality of lookup table buffers, the one or more adders, and the output buffer.

15. The GEMM dataflow accelerator semiconductor circuit of claim 14, wherein the NDP-DF accelerator unit die is one of a plurality of NDP-DF accelerator unit dies that are stacked one atop another.

16. The GEMM dataflow accelerator semiconductor circuit of claim 15, further comprising: a passive silicon interposer; a processor disposed on the passive silicon interposer; and a base die disposed on the passive silicon interposer adjacent to the processor, wherein the plurality of NDP-DF accelerator unit dies are stacked atop the base die.

17. The GEMM dataflow accelerator semiconductor circuit of claim 16, further comprising: one or more through silicon vias (TSVs) disposed through the plurality of NDP-DF accelerator unit dies and the base die, wherein the one or more TSVs are configured to interconnect the plurality of NDP-DF accelerator unit dies with the base die, and the base die with the processor; and wherein the plurality of NDP-DF accelerator unit dies and the base die are configured to offload computation from the processor.

18. The GEMM dataflow accelerator semiconductor circuit of claim 15, further comprising: a passive silicon interposer; a controller disposed on the passive silicon interposer; and a base die disposed on the passive silicon interposer adjacent to the controller, wherein the plurality of NDP-DF accelerator unit dies are stacked atop the base die.

19. The GEMM dataflow accelerator semiconductor circuit of claim 18, further comprising: one or more through silicon vias (TSVs) disposed through the plurality of NDP-DF accelerator unit dies and the base die, wherein the one or more TSVs
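The pipelined accumulation chain described in claims 7 through 12 can be sketched in software as follows. This is a rough functional model under stated assumptions, not the claimed circuit: the class `SmartBankUnit`, the function `chain_row_times_matrix`, and the 4-bit operand range are hypothetical names and parameters chosen for illustration. The model has no timing and no serpentine layout; it only shows each unit forming products by table lookup, adding them to the partial sum received from its neighbor, storing the result in its output buffer, and forwarding it to the next unit.

```python
# Functional sketch of a chain of "smart bank units" (illustrative model only).
# Assumption: 4-bit unsigned operands, so the product table is 16 x 16.
TABLE_SIZE = 16


def build_lut(size=TABLE_SIZE):
    """Precompute lut[row][col] == row * col once, before any streaming."""
    return [[row * col for col in range(size)] for row in range(size)]


class SmartBankUnit:
    """One unit: a lookup table, a row buffer, an adder, and an output buffer."""

    def __init__(self, lut):
        self.lut = lut
        self.output_buffer = []

    def step(self, a_value, b_vector, partial_in):
        # a_value selects a table row, which is buffered locally.
        row_buffer = self.lut[a_value]
        # Streamed b values index the buffered row: lookups, not multiplies.
        products = [row_buffer[b] for b in b_vector]
        # The adder accumulates onto the partial sums from the previous unit.
        self.output_buffer = [p + q for p, q in zip(partial_in, products)]
        return self.output_buffer


def chain_row_times_matrix(a_row, b_matrix):
    """Compute a_row x b_matrix by passing partial sums unit to unit."""
    lut = build_lut()
    units = [SmartBankUnit(lut) for _ in a_row]
    partial = [0] * len(b_matrix[0])
    # Unit k handles the k-th term of the inner dimension and forwards its sum.
    for unit, a_value, b_vector in zip(units, a_row, b_matrix):
        partial = unit.step(a_value, b_vector, partial)
    return partial


print(chain_row_times_matrix([1, 2, 3], [[1, 2], [3, 4], [5, 6]]))  # [22, 28]
```

Running the chain once per row of the left-hand matrix yields a full matrix product; in hardware the units would operate concurrently on different rows, which this sequential loop does not model.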

Assignees

Samsung Electronics Co Ltd

Inventors

Classifications

  • G06N3/045 · Primary

    Combinations of networks · CPC title

  • Quantised networks; Sparse networks; Compressed networks · CPC title

  • Three-dimensional [3D] integrated devices · CPC title

  • Sum of products (for applications thereof, see the relevant places, e.g. G06F17/10, H03H17/00) · CPC title

  • based on physical entities controlled by simulated intelligence so as to replicate intelligent life forms, e.g. based on robots replicating pets or humans in their appearance or behaviour · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11100193B2 cover?
A general matrix-matrix multiplication (GEMM) dataflow accelerator circuit is disclosed that includes a smart 3D stacking DRAM architecture. The accelerator circuit includes a memory bank, a peripheral lookup table stored in the memory bank, and a first vector buffer to store a first vector that is used as a row address into the lookup table. The circuit includes a second vector buffer to store…
Who is the assignee on this patent?
Samsung Electronics Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06N3/045. Mapped technology areas include Physics.
When was this patent published?
Publication date: Aug 24, 2021 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).