What technology area does this patent fall under?

Primary CPC classification G06N3/045. Mapped technology areas include Physics.

When was this patent published?

Publication date Dec 10, 2024 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Dataflow accelerator architecture for general matrix-matrix multiplication and tensor computation in deep learning

US12164593B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12164593-B2
Application number	US-202117374988-A
Country	US
Kind code	B2
Filing date	Jul 13, 2021
Priority date	Dec 7, 2018
Publication date	Dec 10, 2024
Grant date	Dec 10, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A general matrix-matrix multiplication (GEMM) dataflow accelerator circuit is disclosed that includes a smart 3D stacking DRAM architecture. The accelerator circuit includes a memory bank, a peripheral lookup table stored in the memory bank, and a first vector buffer to store a first vector that is used as a row address into the lookup table. The circuit includes a second vector buffer to store a second vector that is used as a column address into the lookup table, and lookup table buffers to receive and store lookup table entries from the lookup table. The circuit further includes adders to sum the first product and a second product, and an output buffer to store the sum. The lookup table buffers determine a product of the first vector and the second vector without performing a multiply operation. The embodiments include a hierarchical lookup architecture to reduce latency. Accumulation results are propagated in a systolic manner.
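The lookup idea in the abstract can be illustrated with a small software sketch. This is not the patented circuit: the 4-bit operand width, the names `LOOKUP_TABLE`, `lut_product` and `lut_dot`, and the use of a plain Python list as the table are all assumptions made for illustration. The first operand selects a row, the second selects a column, and the stored entry is the product, so the run-time path uses only lookups and additions.

```python
# Hypothetical sketch: 4-bit unsigned operands are an assumption,
# not a figure taken from the patent.
BITS = 4
SIZE = 1 << BITS  # 16 possible values per operand

# Precompute every product once. Row index = first operand,
# column index = second operand.
LOOKUP_TABLE = [[row * col for col in range(SIZE)] for row in range(SIZE)]


def lut_product(a, b):
    """Return a*b by table lookup; no multiply is performed at lookup time."""
    return LOOKUP_TABLE[a][b]


def lut_dot(first_vector, second_vector):
    """Sum the looked-up products of two equal-length vectors using additions only."""
    if len(first_vector) != len(second_vector):
        raise ValueError("vectors must have the same length")
    total = 0
    for a, b in zip(first_vector, second_vector):
        total += lut_product(a, b)
    return total


print(lut_dot([3, 7, 15], [2, 4, 1]))  # 3*2 + 7*4 + 15*1 = 49
```

The table costs `SIZE * SIZE` entries, which is why a scheme like this is usually associated with low-precision (quantized) operands; the patent's actual table organization and operand widths are described in the full specification, not here.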

First claim

Opening claim text (preview).

What is claimed is:
1. A system, comprising: a memory; a lookup data structure stored in the memory; a first vector buffer configured to store a first vector that is used as a first address into the lookup data structure; and a second vector buffer configured to store a second vector that is used as a second address into the lookup data structure; wherein the lookup data structure is configured to provide a result based on a lookup operation, the result being generated based on a computation of the first vector and the second vector.
2. The system of claim 1, further comprising one or more lookup data structure buffers configured to receive one or more lookup data structure entries from the lookup data structure, and to determine a product of the first vector and the second vector based at least in part on the one or more lookup data structure entries.
3. The system of claim 2, wherein the one or more lookup data structure buffers are configured to store the one or more lookup data structure entries from the lookup data structure, and wherein the second vector buffer is configured to stream the second vector to the one or more lookup data structure buffers.
4. The system of claim 2, wherein the product is a first product, the circuit further comprising: one or more adders configured to sum the first product and a second product; and an output buffer configured to store a result of the sum of the first product and the second product.
5. The system of claim 4, wherein the one or more lookup data structure buffers are configured to determine the second product using a value of a third vector as the second address and a value of a fourth vector as the first address, into the lookup data structure.
6. The system of claim 2, wherein the one or more lookup data structure buffers are configured to determine the product using a value of the first vector as the second address and a value of the second vector as the first address, into the lookup data structure.
7. The system of claim 2, wherein the memory, the lookup data structure, the first vector buffer, the one or more lookup data structure buffers, and the second vector buffer are configured to reduce latency.
8. The system of claim 2, wherein the one or more lookup data structure buffers are configured to store one or more vectors to determine one or more products of the one or more vectors.
9. The system of claim 2, further comprising one or more bank units.
10. The system of claim 9, wherein, the one or more bank units are configured to form a pipelined dataflow chain in which partial output data from one bank unit from among the one or more bank units is fed into another bank unit from among the one or more bank units for data accumulation.
11. The system of claim 9, wherein a first bank unit from among the one or more bank units is configured to output the product to a second bank unit.
12. The system of claim 11, wherein the second bank unit is configured to store the product received from the first bank unit.
13. The system of claim 12, wherein: the product is a first product; the second bank unit is configured to receive a third vector from the memory in a streaming fashion; the one or more lookup data structure buffers of the second bank unit are configured to determine a second product based on the third vector using the lookup data structure; and an output buffer of the second bank unit is configured to store a sum of the first product and the second product.
14. The system of claim 9, wherein: the one or more bank units is a systolic array that is configured to propagate partial sums; and the one or more bank units is configured to receive one or more input matrix vectors in a streaming fashion, and to propagate the one or more input matrix vectors in a direction that is different from a data flow direction of the partial sums.
15. The system of claim 14, wherein the memory is a DRAM memory, the system further comprising: a near-DRAM-processing dataflow (NDP-DF) accelerator unit die including one or more channels, wherein: the one or more channels include the one or more bank units; and the one or more bank units include the DRAM, the lookup data structure and the one or more lookup data structure buffers.
16. The system of claim 15, wherein the NDP-DF accelerator unit die is one of a plurality of NDP-DF accelerator unit dies that are stacked one atop another.
17. A memory device, comprising: a lookup table; a first vector buffer configured to store a first vector that is used as a row address into the lookup table; a second vector buffer configured to store a second vector that is used as a column address into the lookup table; and one or more lookup table buffers configured to receive one or more lookup table entries from the lookup table, and to determine a product of the first vector and the second vector based at least in part on the one or more lookup table entries.
18. The device of claim 17, further comprising one or more bank units configured to form a pipelined dataflow chain in which partial output data from one bank unit from among the one or more bank units is fed into another bank unit from among the one or more bank units for data accumulation, wherein: the product is a first product; the second bank unit is configured to receive a third vector in a streaming fashion; the one or more lookup table buffers are configured to determine a second product based on the third vector using the lookup table; and an output buffer is configured to store a sum of the first product and the second product.
19. The device of claim 18, further comprising a near-DRAM-processing dataflow (NDP-DF) accelerator unit die including one or more channels, wherein: the one or more bank units is a systolic array that is configured to propagate partial sums; the one or more bank units is configured to receive one or more input matrix vectors in a streaming fashion, and to propagate the one or more input matrix vectors in a direction that is different from a data flow direction of the partial sums; the one or more channels include the one or more bank units; and the one or more bank units include the lookup table and the one or more lookup table buffers.
20. A dataflow accelerator method, comprising: storing a lookup table stored in a memory; storing, by a first vector buffer, a first vector that is used as a row address into the lookup table; storing, by a second vector buffer, a second vector that is used as a column address into the lookup table; receiving, by one or more lookup table buffers, one or more lookup table entries from the lookup table; and determining, by the one or more lookup table buffers, a product of the first vector and the second vector based at least in part on the one or more lookup table entries.
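The accumulation chain described in claims 10 to 13 can be pictured with a rough software model. Everything named here is hypothetical: the `BankUnit` class, the `run_chain` helper, and the 4-bit operand range are assumptions for illustration, and the loop models only the data dependency between bank units (each unit adds its own looked-up product to the partial sum it receives), not the hardware timing or the stacked-die layout.

```python
# Hypothetical sketch of a chain of bank units; names and the 4-bit
# operand range are assumptions, not details from the patent.
SIZE = 16
LOOKUP_TABLE = [[row * col for col in range(SIZE)] for row in range(SIZE)]


class BankUnit:
    """Holds one stored vector and an output buffer for the running sum."""

    def __init__(self, stored_vector):
        self.stored_vector = stored_vector
        self.output_buffer = 0

    def process(self, streamed_vector, incoming_partial=0):
        """Look up products for the streamed vector and add the incoming partial sum."""
        product = 0
        for a, b in zip(self.stored_vector, streamed_vector):
            product += LOOKUP_TABLE[a][b]
        self.output_buffer = incoming_partial + product
        return self.output_buffer


def run_chain(banks, streamed_vectors):
    """Feed each bank's output into the next bank for accumulation."""
    partial = 0
    for bank, vector in zip(banks, streamed_vectors):
        partial = bank.process(vector, partial)
    return partial


banks = [BankUnit([1, 2]), BankUnit([3, 4])]
# First bank: 1*5 + 2*6 = 17. Second bank: 17 + (3*7 + 4*8) = 70.
print(run_chain(banks, [[5, 6], [7, 8]]))
```

In this model the second bank's output buffer ends up holding the sum of the first product and the second product, which mirrors the wording of claim 13; a real systolic arrangement would overlap these steps in time rather than run them one after another.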

Assignees

Samsung Electronics Co Ltd

Inventors

Classifications

G06N3/0495
Quantised networks; Sparse networks; Compressed networks · CPC title
H10D88/00
Three-dimensional [3D] integrated devices · CPC title
G06F7/5443
Sum of products (for applications thereof, see the relevant places, e.g. G06F17/10, H03H17/00) · CPC title
G06N3/045 (Primary)
Combinations of networks · CPC title
G06N3/008
Artificial life, i.e. computing arrangements simulating life based on physical entities controlled by simulated intelligence so as to replicate intelligent life forms, e.g. based on robots replicating pets or humans in their appearance or behaviour · CPC title

Patent family

Related publications grouped by family.

View patent family 70971724

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12164593B2 cover?: A general matrix-matrix multiplication (GEMM) dataflow accelerator circuit is disclosed that includes a smart 3D stacking DRAM architecture. The accelerator circuit includes a memory bank, a peripheral lookup table stored in the memory bank, and a first vector buffer to store a first vector that is used as a row address into the lookup table. The circuit includes a second vector buffer to store…
Who is the assignee on this patent?: Samsung Electronics Co Ltd