Compression-encoding scheduled inputs for matrix computations
US-2020159812-A1 · May 21, 2020 · US
US12008067B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12008067-B2 |
| Application number | US-202117527324-A |
| Country | US |
| Kind code | B2 |
| Filing date | Nov 16, 2021 |
| Priority date | Sep 5, 2019 |
| Publication date | Jun 11, 2024 |
| Grant date | Jun 11, 2024 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
An apparatus to facilitate acceleration of matrix multiplication operations. The apparatus comprises a systolic array including matrix multiplication hardware to perform multiply-add operations on received matrix data comprising data from a plurality of input matrices and sparse matrix acceleration hardware to detect zero values in the matrix data and perform one or more optimizations on the matrix data to reduce multiply-add operations to be performed by the matrix multiplication hardware.
Opening claim text (preview).
What is claimed is: 1. An apparatus comprising: a systolic multiplier, including: a first set of first in first out (FIFO) buffers to store data in a first input matrix; a second set of FIFO buffers to store data in a second input matrix; a plurality of processing elements (PEs) to receive the matrix data from the first set of FIFO buffers and the second set of FIFO buffers and perform multiply-add operations on first input matrix and the second input matrix; a plurality of storage elements to locally store intermediate matrix multiplication values; and sparse matrix acceleration circuitry to detect zero values in the matrix data and perform one or more optimizations on the matrix data to reduce the multiply-add operations to be performed by the matrix systolic multiplier, including swapping rows with other rows in a sub-matrix for each of a plurality of sub-matrices of the first input matrix. 2. The apparatus of claim 1 , wherein the swapping comprises swapping a first row having the predetermined threshold zero values with a second row not having the predetermined threshold zero values to enable the first row to be adjacent to a third row having the predetermined threshold zero values. 3. The apparatus of claim 1 , further comprising compression circuitry to compress the matrix data. 4. The apparatus of claim 3 , wherein compressing the matrix data comprises generating packed matrix data by removing zero values from the matrix data and generating an indicator vector to identify the location of the zero values in the packed matrix data. 5. The apparatus of claim 4 , wherein the sparse matrix acceleration circuitry receives the compressed matrix data and identifies the zero values in the compressed matrix data based on the indicator vector. 6. The apparatus of claim 1 , wherein the sparse matrix acceleration circuitry optimizing the matrix data further comprises swapping rows in each of a plurality of sub-matrices of a second of the plurality of input matrices to add to results of a multiplication operation of the first matrix and a third of the plurality of input matrices. 7. The apparatus of claim 6 , wherein the sparse matrix acceleration circuitry optimizing the matrix data further comprises performing a reverse swapping on an output matrix. 8. The apparatus of claim 7 , wherein the sparse matrix acceleration circuitry optimizing the matrix data further comprises performing row adjustments in the first matrix. 9. The apparatus of claim 8 , wherein performing the row adjustments comprises shifting all non-zero values of a row having more than the predetermined number of zero values. 10. The apparatus of claim 8 , wherein performing the row adjustments comprises combining adjacent rows having more than the predetermined number of zero values. 11. A method comprising: receiving data in a first input matrix at a systolic multiplier; receiving data in a second input matrix at the systolic multiplier; detecting, at the systolic multiplier, zero values in the first matrix data; performing, at the systolic multiplier, one or more optimizations on the matrix data to reduce multiply-add operations to be performed, including swapping rows with other rows in a sub-matrix for each of a plurality of sub-matrices of the first input matrix; and performing, at the systolic multiplier, the multiply-add operations on the optimized matrix data at matrix multiplication circuitry. 12. The method of claim 11 , wherein the swapping comprises swapping a first row having the predetermined threshold zero values with a second row not having the predetermined threshold zero values to enable the first row to be adjacent to a third row having the predetermined threshold zero values. 13. The method of claim 11 , wherein performing the one or more optimizations on the matrix data comprises swapping rows in each of a plurality of sub-matrices of a first of the plurality of input matrices to achieve a maximum number of adjacent rows having a predetermined threshold zero values. 14. The method of claim 11 , wherein performing the one or more optimizations on the matrix data further comprises performing row adjustments in the first matrix. 15. The method of claim 14 , wherein performing the one or more optimizations on the matrix data further comprises performing a reverse swapping on an output matrix resulting from the multiply-add operations. 16. A graphics processing unit (GPU) comprising: a first set of first in first out (FIFO) buffers to store data in a first input matrix; a second set of FIFO buffers to store data in a second input matrix; a plurality of processing elements (PEs) to receive the matrix data from the first set of FIFO buffers and the second set of FIFO buffers and perform multiply-add operations on first input matrix and the second input matrix; and a plurality of storage elements to locally store intermediate matrix multiplication values; and sparse matrix acceleration circuitry to detect zero values in the matrix data and perform one or more optimizations on the matrix data to reduce multiply-add operations to be performed by the matrix systolic multiplier, including swapping rows with other rows in a sub-matrix for each of a plurality of sub-matrices of the first input matrix. 17. The GPU of claim 16 , wherein the swapping comprises swapping a first row having the predetermined threshold zero values with a second row not having the predetermined threshold zero values to enable the first row to be adjacent to a third row having the predetermined threshold zero values. 18. The GPU of claim 16 , wherein performing the one or more optimizations on the matrix data comprises swapping rows in each of a plurality of sub-matrices of a first of the plurality of input matrices to achieve a maximum number of adjacent rows having a predetermined threshold zero values. 19. The GPU of claim 16 , wherein performing the one or more optimizations on the matrix data further comprises performing row adjustments in the first matrix. 20. The GPU of claim 19 , wherein performing the one or more optimizations on the matrix data further comprises performing a reverse swapping on an output matrix resulting from the multiply-add operations.
Quantised networks; Sparse networks; Compressed networks · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Instructions to perform operations on packed data, e.g. vector, tile or matrix operations · CPC title
Systolic array · CPC title
using buffers · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.