Time Domain Unrolling Sparse Matrix Multiplication System and Method
US-2022035890-A1 · Feb 3, 2022 · US
US12493779B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12493779-B2 |
| Application number | US-202117551876-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 15, 2021 |
| Priority date | Dec 15, 2020 |
| Publication date | Dec 9, 2025 |
| Grant date | Dec 9, 2025 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
We introduce SGCNAX, a scalable GCN accelerator architecture for the high-performance and energy-efficient acceleration of GCNs. Unlike prior GCN accelerators that either employ limited loop optimization techniques, or determine the design variables based on random sampling, we systematically explore the loop optimization techniques for GCN acceleration and provide a flexible GCN dataflow that adapts to different GCN configurations to achieve optimal efficiency. We further provide two hardware-based techniques to address the workload imbalance problem caused by the unbalanced distribution of zeros in GCNs. Specifically, SGCNAX exploits an outer-product-based computation architecture that mitigates the intra-PE (Processing Elements) workload imbalance, and employs a group-and-shuffle approach to mitigate the inter-PE workload imbalance.
Opening claim text (preview).
The invention claimed is: 1 . A scalable graph convolutional neural network (GCN) accelerator, comprising: an accelerator circuit including: a processing element (PE) array that includes a plurality of PEs, wherein the plurality of PEs each include; a multiply-and-accumulate (MAC) array configured to perform multiplication and accumulation, a sparse matrix buffer configured to store data of a sparse matrix, an input dense matrix buffer configured to store data of a first dense matrix, an output dense matrix buffer configured to store data of a second dense matrix; and a PE control unit configured to: control data reading from the sparse matrix buffer, the input dense matrix buffer, and the output dense matrix buffer to the MAC array, and control data writing from the MAC array to the output dense matrix buffer. 2 . The accelerator of claim 1 , wherein the multiply-and-accumulate (MAC) array is configured to: receive a non-zero scalar and a first position associated with the non-zero scalar from the sparse matrix buffer; receive a first row vector and a second position associated with a first element in the first row vector from the input dense matrix buffer; and multiply each of element values in the first row vector with the non-zero scalar to produce a second row vector. 3 . The accelerator of claim 2 , wherein the MAC array is further configured to: receive a third row vector from the output dense matrix buffer according to a third position; add the second row vector to the third row vector to generate a fourth row vector; and store the fourth row vector to the output dense matrix buffer according to the third position. 4 . The accelerator of claim 2 , wherein: the data of the sparse matrix includes a block of an input feature matrix, and the non-zero scalar is a scalar of the block of the input feature matrix; the data of the first dense matrix includes a block of an weight matrix, and the first row vector is a row vector of the block of the weight matrix; and the MAC array is configured to multiply each of element values in the first row vector of the weight matrix with the non-zero scalar of the input feature matrix to produce a second row vector. 5 . The accelerator of claim 4 , wherein the MAC array is further configured to: receive a third row vector from the output dense matrix buffer according to a third position, wherein the second row vector and the third row vector each are a partial product for an intermediate matrix; add the second row vector to the third row vector to generate a fourth row vector for the intermediate matrix; and store the fourth row vector to the output dense matrix buffer according to the third position. 6 . The accelerator of claim 2 , wherein: the data of the sparse matrix includes a block of an adjacency matrix and, the non-zero scalar is a scalar of the block of the adjacency matrix; the data of the first dense matrix includes a block of an intermediate matrix, and the first row vector is a row vector of the block of the intermediate matrix; and the MAC array is configured to multiply each of element values in the first row vector of the intermediate matrix with the non-zero scalar of the adjacency matrix to produce a second row vector. 7 . The accelerator of claim 6 , wherein the MAC array is further configured to: receive a third row vector from the output dense matrix buffer according to a third position, wherein the second row vector and the third row vector each are a partial product for an output matrix; add the second row vector to the third row vector to generate a fourth row vector for the output matrix; and store the fourth row vector to the output dense matrix buffer according to the third position. 8 . The accelerator of claim 1 , further comprising: a processing unit configured to: receive a graph input to be processed using a graph convolutional neural network having a plurality of layers, determine a partitioning of each of the plurality of layers of the graph convolutional neural network into a sequence of sublayers, and determine a processing order of the sublayers in the sequence; wherein: a PE of the plurality of PEs is configured to process the graph input by processing the sublayers in the sequence according to the determined processing order. 9 . The accelerator of claim 1 , wherein the accelerator circuit further includes: a data dispatcher coupled to the PE array and configured to distribute input data to the PE array; and a permutation network coupled to the PE array and configured to collect output data from the PE array. 10 . The accelerator of claim 9 , wherein: the data dispatcher is configured to group a plurality of rows of a matrix into a plurality of groups according to a density-sorted rank order; the data dispatcher is further configured to map the plurality of groups to the plurality of PEs of the PE array, respectively; and the permutation network is configured to unshuffle outputs from the plurality of PEs of the PE array. 11 . The accelerator of claim 9 , wherein: the accelerator chip further includes an accumulator buffer; and the permutation network is configured to send output data from the PE array to the accumulator buffer. 12 . The accelerator of claim 1 , further comprising: an off-chip memory external to the accelerator circuit; wherein the accelerator circuit further includes: a first buffer that is inside the accelerator chip; and a circuit control unit configured to control the reading of data from the off-chip memory to the first buffer.
Matrix or vector computation {, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization (matrix transposition G06F7/78)} · CPC title
Sum of products (for applications thereof, see the relevant places, e.g. G06F17/10, H03H17/00) · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Supervised learning · CPC title
Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.