Matrix multiplication acceleration of sparse matrices using column folding and squeezing

US10620951B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-10620951-B2
Application numberUS-201816016278-A
CountryUS
Kind codeB2
Filing dateJun 22, 2018
Priority dateJun 22, 2018
Publication dateApr 14, 2020
Grant dateApr 14, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed embodiments relate to sparse matrix multiplication (SMM) acceleration using column folding and squeezing. In one example, a processor, in response to a SMM instruction having fields to specify locations of first, second, and output matrices, the second matrix being a sparse matrix, uses execution circuitry to pack the second matrix by replacing one or more zero-valued elements with non-zero elements yet to be processed, each of the replaced elements further including a field to identify its logical position within the second matrix, and, the execution circuitry further to, for each non-zero element at row M and column K of the specified first matrix, generate a product of the element and each corresponding non-zero element at row K, column N of the packed second matrix, and accumulate each generated product with a previous value of a corresponding element at row M and column N of the specified output matrix.

First claim

Opening claim text (preview).

What is claimed is: 1. A processor comprising: fetch and decode circuitry to fetch and decode a sparse matrix multiplication (SMM) instruction having fields to specify locations of first, second, and output matrices, the second matrix being a sparse matrix, the fetch circuitry further to fetch and store elements of the first and second matrices into a register file; and execution circuitry, responsive to the SMM instruction, to pack the second matrix stored in the register file by replacing one or more zero-valued elements in a column N of the second matrix with non-zero elements yet to be processed, each of the replaced elements further including a field to identify its logical position within the second matrix, and the execution circuitry further to generate a product of a non-zero element at column K of the first matrix and a corresponding non-zero element at row K of the packed second matrix for each non-zero element in row M of the first matrix and each non-zero element in column N of the packed second matrix, and accumulate each generated product with a previous value of a corresponding element at row M and column N of the output matrix. 2. The processor of claim 1 , wherein the execution circuitry, for each row K of the second matrix, is to determine whether the row contains any zero-valued elements, and, if so, determine whether the row contains any non-zero elements yet to be processed from the zero-valued element, and, if so, for each zero-valued element having a non-zero element yet to be processed, fold the non-zero element into the zero-valued element. 3. The processor of claim 1 , wherein the execution circuitry, for each column N of the second matrix, is to determine whether the column contains any zero-valued elements, and, if so, determine whether any of P elements of a subsequent column is a non-zero value, and, for each zero-valued element having a non-zero element in the subsequent column, squeeze the non-zero element into the zero-valued element. 4. The processor of claim 1 , wherein each of the elements stored in the register file includes a field to specify whether it has a zero value, and wherein the execution circuitry is to use the field when determining whether the element has a zero value. 5. The processor of claim 1 , wherein the execution circuitry is to avoid generating any products of elements having zero values. 6. The processor of claim 1 , wherein the execution circuitry is to comprise a processing array of (X×Y) processing units, wherein X is less than M and Y is less than N, the execution circuitry to use the processing array iteratively over a plurality of clock cycles to perform the same processing as an actual, physical array of (M×N) processing units. 7. The processor of claim 1 , wherein the execution circuitry is to comprise a processing array of (X×Y) processing units, wherein X is less than M and Y is less than N, the execution circuitry to cascade a plurality of instances of the processing array to perform the same processing as an actual, physical array of (M×N) processing units. 8. A method executed by a processor, the method comprising: fetching and decoding, using fetch and decode circuitry, a sparse matrix multiplication (SMM) instruction having fields to specify locations of first, second, and output matrices, the second matrix being a sparse matrix, the fetch circuitry further fetching and storing elements of the first and second matrices into a register file; and responding, using execution circuitry, to the SMM instruction by packing the second matrix stored in the register file by replacing one or more zero-valued elements in a column N of the second matrix with non-zero elements yet to be processed, each replaced element to include a field to identify its logical position within the second matrix, the execution circuitry further to generate a product of a non-zero element at column K of the first matrix and a corresponding non-zero element at row K of the packed second matrix for each non-zero element in row M of the first matrix and each non-zero element in column N of the packed second matrix, and accumulate each generated product with a previous value of a corresponding element at row M and column N of the output matrix. 9. The method of claim 8 , wherein the execution circuitry, for each row K of the second matrix, is to determine whether the row contains any zero-valued elements, and, if so, determine whether the row contains any non-zero elements yet to be processed from the zero-valued element, and, if so, for each zero-valued element having a non-zero element yet to be processed, fold the non-zero element into the zero-valued element. 10. The method of claim 8 , wherein the execution circuitry, for each column N of the second matrix, is to determine whether the column contains any zero-valued elements, and, if so, determine whether any of P elements of a subsequent column is a non-zero value, and, for each zero-valued element having a non-zero element in the subsequent column, squeeze the non-zero element into the zero-valued element. 11. The method of claim 8 , wherein each of the elements stored in the register file includes a field to specify whether it has a zero value, and wherein the execution circuitry is to use the field when determining whether the element has a zero value. 12. The method of claim 8 , wherein the execution circuitry is to avoid generating any products of elements having zero values. 13. The method of claim 8 , wherein the execution circuitry comprises a processing array of (X×Y) processing units, X being less than M and Y being less than N, the execution circuitry using the processing array iteratively over a plurality of clock cycles to perform the same processing as an actual, physical array of (M×N) processing units. 14. The method of claim 8 , wherein the execution circuitry comprises a processing array of (X×Y) processing units, X being less than M and Y being less than N, the execution circuitry cascading a plurality of instances of the processing array to perform the same processing as an actual, physical array of (M×N) processing units. 15. A system comprising a memory and a processor, the processor comprising: fetch and decode circuitry to fetch and decode a sparse matrix multiplication (SMM) instruction having fields to specify locations of first, second, and output matrices, the second matrix being a sparse matrix, the fetch circuitry further to fetch and store elements of the first and second matrices into a register file; and execution circuitry, responsive to the SMM instruction, to pack the second matrix stored in the register file by replacing one or more zero-valued elements in a column N of the second matrix with non-zero elements yet to be processed, each replaced element to include a field to identify its logical position within the second matrix, the execution circuitry further to generate a product of a non-zero element at column K of the first matrix and a corresponding non-zero element at row K of the packed second matrix for each non-zero element in row M of the first matrix and each non-zero element in column N of the packed second matrix, and accumulate each generated product with a previous value of a corresponding element at row M and column N of the output matrix. 16. The system of claim 15 , wherein the execution circuitry, for each row K of the second matrix, is to determine whether the row contains any zero-valued elements, and, if so, determine whether the row contains any non-zero elements yet to be processed from the zero-valued element, and, if so, for each zero-valued element having a n

Assignees

Inventors

Classifications

  • Learning methods · CPC title

  • G06F9/3001Primary

    Arithmetic instructions · CPC title

  • Decoding the operand specifier, e.g. specifier format · CPC title

  • Matrix or vector computation {, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization (matrix transposition G06F7/78)} · CPC title

  • Instruction prefetching · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10620951B2 cover?
Disclosed embodiments relate to sparse matrix multiplication (SMM) acceleration using column folding and squeezing. In one example, a processor, in response to a SMM instruction having fields to specify locations of first, second, and output matrices, the second matrix being a sparse matrix, uses execution circuitry to pack the second matrix by replacing one or more zero-valued elements with no…
Who is the assignee on this patent?
Intel Corp
What technology area does this patent fall under?
Primary CPC classification G06F9/3001. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Apr 14 2020 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).