SGCNAX: a scalable graph convolutional neural network accelerator with workload balancing

US12493779B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-12493779-B2
Application numberUS-202117551876-A
CountryUS
Kind codeB2
Filing dateDec 15, 2021
Priority dateDec 15, 2020
Publication dateDec 9, 2025
Grant dateDec 9, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

We introduce SGCNAX, a scalable GCN accelerator architecture for the high-performance and energy-efficient acceleration of GCNs. Unlike prior GCN accelerators that either employ limited loop optimization techniques, or determine the design variables based on random sampling, we systematically explore the loop optimization techniques for GCN acceleration and provide a flexible GCN dataflow that adapts to different GCN configurations to achieve optimal efficiency. We further provide two hardware-based techniques to address the workload imbalance problem caused by the unbalanced distribution of zeros in GCNs. Specifically, SGCNAX exploits an outer-product-based computation architecture that mitigates the intra-PE (Processing Elements) workload imbalance, and employs a group-and-shuffle approach to mitigate the inter-PE workload imbalance.

First claim

Opening claim text (preview).

The invention claimed is: 1 . A scalable graph convolutional neural network (GCN) accelerator, comprising: an accelerator circuit including: a processing element (PE) array that includes a plurality of PEs, wherein the plurality of PEs each include; a multiply-and-accumulate (MAC) array configured to perform multiplication and accumulation, a sparse matrix buffer configured to store data of a sparse matrix, an input dense matrix buffer configured to store data of a first dense matrix, an output dense matrix buffer configured to store data of a second dense matrix; and a PE control unit configured to: control data reading from the sparse matrix buffer, the input dense matrix buffer, and the output dense matrix buffer to the MAC array, and control data writing from the MAC array to the output dense matrix buffer. 2 . The accelerator of claim 1 , wherein the multiply-and-accumulate (MAC) array is configured to: receive a non-zero scalar and a first position associated with the non-zero scalar from the sparse matrix buffer; receive a first row vector and a second position associated with a first element in the first row vector from the input dense matrix buffer; and multiply each of element values in the first row vector with the non-zero scalar to produce a second row vector. 3 . The accelerator of claim 2 , wherein the MAC array is further configured to: receive a third row vector from the output dense matrix buffer according to a third position; add the second row vector to the third row vector to generate a fourth row vector; and store the fourth row vector to the output dense matrix buffer according to the third position. 4 . The accelerator of claim 2 , wherein: the data of the sparse matrix includes a block of an input feature matrix, and the non-zero scalar is a scalar of the block of the input feature matrix; the data of the first dense matrix includes a block of an weight matrix, and the first row vector is a row vector of the block of the weight matrix; and the MAC array is configured to multiply each of element values in the first row vector of the weight matrix with the non-zero scalar of the input feature matrix to produce a second row vector. 5 . The accelerator of claim 4 , wherein the MAC array is further configured to: receive a third row vector from the output dense matrix buffer according to a third position, wherein the second row vector and the third row vector each are a partial product for an intermediate matrix; add the second row vector to the third row vector to generate a fourth row vector for the intermediate matrix; and store the fourth row vector to the output dense matrix buffer according to the third position. 6 . The accelerator of claim 2 , wherein: the data of the sparse matrix includes a block of an adjacency matrix and, the non-zero scalar is a scalar of the block of the adjacency matrix; the data of the first dense matrix includes a block of an intermediate matrix, and the first row vector is a row vector of the block of the intermediate matrix; and the MAC array is configured to multiply each of element values in the first row vector of the intermediate matrix with the non-zero scalar of the adjacency matrix to produce a second row vector. 7 . The accelerator of claim 6 , wherein the MAC array is further configured to: receive a third row vector from the output dense matrix buffer according to a third position, wherein the second row vector and the third row vector each are a partial product for an output matrix; add the second row vector to the third row vector to generate a fourth row vector for the output matrix; and store the fourth row vector to the output dense matrix buffer according to the third position. 8 . The accelerator of claim 1 , further comprising: a processing unit configured to: receive a graph input to be processed using a graph convolutional neural network having a plurality of layers, determine a partitioning of each of the plurality of layers of the graph convolutional neural network into a sequence of sublayers, and determine a processing order of the sublayers in the sequence; wherein: a PE of the plurality of PEs is configured to process the graph input by processing the sublayers in the sequence according to the determined processing order. 9 . The accelerator of claim 1 , wherein the accelerator circuit further includes: a data dispatcher coupled to the PE array and configured to distribute input data to the PE array; and a permutation network coupled to the PE array and configured to collect output data from the PE array. 10 . The accelerator of claim 9 , wherein: the data dispatcher is configured to group a plurality of rows of a matrix into a plurality of groups according to a density-sorted rank order; the data dispatcher is further configured to map the plurality of groups to the plurality of PEs of the PE array, respectively; and the permutation network is configured to unshuffle outputs from the plurality of PEs of the PE array. 11 . The accelerator of claim 9 , wherein: the accelerator chip further includes an accumulator buffer; and the permutation network is configured to send output data from the PE array to the accumulator buffer. 12 . The accelerator of claim 1 , further comprising: an off-chip memory external to the accelerator circuit; wherein the accelerator circuit further includes: a first buffer that is inside the accelerator chip; and a circuit control unit configured to control the reading of data from the off-chip memory to the first buffer.

Assignees

Inventors

Classifications

  • Matrix or vector computation {, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization (matrix transposition G06F7/78)} · CPC title

  • G06F7/5443Primary

    Sum of products (for applications thereof, see the relevant places, e.g. G06F17/10, H03H17/00) · CPC title

  • Convolutional networks [CNN, ConvNet] · CPC title

  • Supervised learning · CPC title

  • Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12493779B2 cover?
We introduce SGCNAX, a scalable GCN accelerator architecture for the high-performance and energy-efficient acceleration of GCNs. Unlike prior GCN accelerators that either employ limited loop optimization techniques, or determine the design variables based on random sampling, we systematically explore the loop optimization techniques for GCN acceleration and provide a flexible GCN dataflow that …
Who is the assignee on this patent?
Univ George Washington
What technology area does this patent fall under?
Primary CPC classification G06F7/5443. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Dec 09 2025 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).