Systolic array having support for output sparsity

US12399685B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12399685-B2
Application number: US-202117304803-A
Country: US
Kind code: B2
Filing date: Jun 25, 2021
Priority date: Jun 25, 2021
Publication date: Aug 26, 2025
Grant date: Aug 26, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A processing apparatus is described herein that includes a general-purpose parallel processing engine comprising a matrix accelerator including one or more systolic arrays, at least one of the one or more systolic arrays comprising multiple pipeline stages, each pipeline stage of the multiple pipeline stages including multiple processing elements, the multiple processing elements configured to perform processing operations on input matrix elements based on output sparsity metadata. The output sparsity metadata indicates to the multiple processing elements to bypass multiplication for a first row of elements of a second matrix and multiply a second row of elements of the second matrix with a column of matrix elements of a first matrix.
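The behavior summarized in the abstract can be modeled in software. The following is a minimal sketch, not the patented hardware: it assumes the output is the product of the second matrix with the first, that the output sparsity metadata is one bit per row of the second matrix, and that the compact format simply omits bypassed rows. The names `sparse_output_matmul`, `decompact`, and `row_mask` are hypothetical and do not appear in the patent.

```python
def sparse_output_matmul(first, second, row_mask):
    """Multiply only the rows of `second` whose mask bit is set.

    Returns the output in an assumed compact format: one output row per
    set bit in `row_mask`, in order; bypassed rows are not stored.
    """
    columns = list(zip(*first))  # columns of the first matrix
    compact = []
    for row, keep in zip(second, row_mask):
        if not keep:
            continue  # bypass multiplication for this row
        compact.append([sum(a * b for a, b in zip(row, col)) for col in columns])
    return compact


def decompact(compact, row_mask, width):
    """Rebuild the full output by inserting zero rows for bypassed rows."""
    rows = iter(compact)
    return [next(rows) if keep else [0] * width for keep in row_mask]


first = [[1, 2], [3, 4]]
second = [[1, 0], [0, 1], [2, 2]]
row_mask = [0, 1, 1]  # bypass row 0, compute rows 1 and 2

compact = sparse_output_matmul(first, second, row_mask)
full = decompact(compact, row_mask, width=len(first[0]))
print(compact)  # [[3, 4], [8, 12]]
print(full)     # [[0, 0], [3, 4], [8, 12]]
```

Note that the mask is applied regardless of the values in either input matrix, which mirrors the claim language that the output sparsity metadata is independent of input sparsity.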

First claim

Opening claim text (preview).

What is claimed is:

1. A processing apparatus including: a general-purpose parallel processing engine comprising a matrix accelerator including one or more systolic arrays, at least one of the one or more systolic arrays comprising multiple pipeline stages, each pipeline stage of the multiple pipeline stages including multiple processing elements, the multiple processing elements configured to: perform processing operations on input matrix elements based on output sparsity metadata, the output sparsity metadata to indicate to the multiple processing elements to bypass multiplication for a first row of elements of a second matrix and multiply a second row of elements of the second matrix with a column of matrix elements of a first matrix, wherein the output sparsity metadata is independent of input sparsity of the input matrix elements, the multiple processing elements are to generate a sparse output matrix in a compact format, and the output sparsity metadata is to enable de-compaction of the sparse output matrix via insertion of a zero value for a bypassed row of elements.

2. The processing apparatus as in claim 1, comprising circuitry to decode an instruction into a decoded instruction, the decoded instruction including multiple sub-instructions.

3. The processing apparatus as in claim 1, wherein the multiple processing elements include a processing element having a first source input associated with an accumulator value, a second source input associated with the first matrix, and multiple third source inputs associated with the second matrix.

4. The processing apparatus as in claim 3, wherein the output sparsity metadata is to indicate which of the multiple third source inputs to multiply with the second source input.

5. The processing apparatus as in claim 4, wherein the output sparsity metadata is to indicate to multiply at least one of the multiple third source inputs.

6. The processing apparatus as in claim 5, wherein the multiple third source inputs include three third source inputs and the output sparsity metadata includes a bit associated with each of the three third source inputs.

7. The processing apparatus as in claim 5, wherein the processing element is associated with a first processing channel and each pipeline stage of the multiple pipeline stages includes multiple processing channels.

8. The processing apparatus as in claim 1, wherein, in a first processing cycle, the output sparsity metadata is to indicate to a first processing element to multiply the second row of elements of the second matrix and bypass multiplication of the first row of elements of the second matrix and a third row of elements of the second matrix and, based on the output sparsity metadata, the first processing element is to multiply the second row of elements of the second matrix with the column of matrix elements of the first matrix.

9. The processing apparatus as in claim 8, wherein, in a second processing cycle, the output sparsity metadata is to indicate to the first processing element to bypass multiplication of the third row of elements of the second matrix and to multiply each of the second row of elements of the second matrix and a fourth row of elements of the second matrix with the column of matrix elements of the first matrix, and, based on the output sparsity metadata and in response to a determination that the second row of elements of the second matrix was multiplied in the first processing cycle, the first processing element is to multiply the fourth row of elements of the second matrix with the column of matrix elements of the first matrix.

10. A method comprising: fetching an instruction at a processing resource of a graphics processor to perform operations associated with a matrix instruction that specifies metadata for output sparsity; decoding the instruction into a decoded instruction; reading operand data for the decoded instruction from a register file of the processing resource, the operand data including matrix elements and the metadata for output sparsity; executing the decoded instruction via a matrix accelerator by performing multiply-accumulate operations on matrix elements selected based on the metadata for output sparsity; and writing output of the multiply-accumulate operations to the register file, wherein the metadata for output sparsity is independent of input sparsity of the matrix elements, output of the multiply-accumulate operations includes a sparse output matrix in a compact format, and the metadata for output sparsity enables de-compaction of the sparse output matrix via insertion of a zero value for a bypassed row of elements.

11. The method as in claim 10, comprising, in a first processing cycle based on the metadata for output sparsity: multiplying via first processing element, a second row of elements of a second matrix and bypassing multiplication of a first row of elements of the second matrix and a third row of elements of the second matrix; and multiplying the second row of elements of the second matrix with a column of matrix elements of a first matrix.

12. The method as in claim 10, wherein decoding the instruction into the decoded instruction includes decoding the instruction into multiple sub-instructions and executing the decoded instruction includes executing the multiple sub-instructions.

13. The method as in claim 10, wherein executing the decoded instruction via the matrix accelerator includes: receiving a first set of matrix elements at a first processing element of the matrix accelerator; receiving multiple second sets of matrix elements at the first processing element of the matrix accelerator; selecting a second set of matrix elements from the multiple second sets of matrix elements using the metadata for output sparsity; and performing a multiply-accumulate operation at the first processing element on the first set of matrix elements and the second set of matrix elements selected using the metadata for output sparsity.

14. The method as in claim 13, further comprising outputting a result of the multiply-accumulate operation and the metadata for output sparsity to a second processing element of the matrix accelerator.

15. The method as in claim 14, wherein the matrix accelerator includes a systolic array including multiple pipeline stages, the first processing element is associated with a first pipeline stage, and the second processing element is associated with a second pipeline stage.

16. A system comprising: a memory device; and a graphics processor coupled to the memory device, the graphics processor comprising a general-purpose parallel processing engine including: a matrix accelerator including one or more systolic arrays, at least one of the one or more systolic arrays comprising multiple pipeline stages, each pipeline stage of the multiple pipeline stages including multiple processing elements, the multiple processing elements configured to: perform processing operations on input matrix elements based on output sparsity metadata, the output sparsity metadata to indicate to the multiple processing elements to bypass multiplication for a first row of elements of a second matrix and multiply a second row of elements of the second matrix with a column of matrix elements of a first matrix, wherein the output sparsity metadata is independent of input sparsity of the input matrix elements, the multiple processing elements are to generate a sparse output matrix in a compact format, and the output sparsity metadata is to enable de-compaction of the sparse output matrix via insertion of a zero value for a bypassed row of elem
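The processing-element behavior in claims 3 to 6 and 13 to 15 can be illustrated with a small software model. This is a hedged sketch under stated assumptions, not the claimed circuit: it assumes three candidate inputs from the second matrix, one metadata bit per candidate, and that the element picks the first candidate whose bit is set (the claims do not specify a selection order). The names `select_src2` and `pe_stage` are hypothetical.

```python
def select_src2(src2_inputs, mask_bits):
    """Return the first candidate whose metadata bit is set, or None.

    Assumption: lowest-index set bit wins; the claims only say the
    metadata indicates which third source input to multiply.
    """
    for operand, bit in zip(src2_inputs, mask_bits):
        if bit:
            return operand
    return None


def pe_stage(acc, src1, src2_inputs, mask_bits):
    """One multiply-accumulate step of a modeled processing element.

    acc         -- accumulator value (first source input)
    src1        -- elements from the first matrix (second source input)
    src2_inputs -- candidate element sets from the second matrix
    mask_bits   -- output sparsity metadata, one bit per candidate
    Returns the updated accumulator and the metadata, both of which are
    forwarded to the next pipeline stage.
    """
    chosen = select_src2(src2_inputs, mask_bits)
    if chosen is not None:
        acc += sum(a * b for a, b in zip(src1, chosen))
    return acc, mask_bits


mask = (0, 1, 0)  # multiply only the middle candidate

# First pipeline stage
acc1, fwd_mask = pe_stage(0, [1, 2], [[9, 9], [3, 4], [7, 7]], mask)
# Second pipeline stage reuses the forwarded accumulator and metadata
acc2, _ = pe_stage(acc1, [5, 6], [[9, 9], [1, 1], [7, 7]], fwd_mask)

print(acc1)  # 11
print(acc2)  # 22
```

Forwarding the metadata alongside the partial sum keeps each stage's selection consistent without a separate control path, which is one plausible reading of claim 14.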

Assignees

Intel Corp

Inventors

Classifications

  • Instructions to perform operations on packed data, e.g. vector, tile or matrix operations · CPC title

  • using a mask · CPC title

  • G06F17/16 · Primary

    Matrix or vector computation {, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization (matrix transposition G06F7/78)} · CPC title

  • Systolic arrays · CPC title

  • Multiplying only · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12399685B2 cover?
A processing apparatus is described herein that includes a general-purpose parallel processing engine comprising a matrix accelerator including one or more systolic arrays, at least one of the one or more systolic arrays comprising multiple pipeline stages, each pipeline stage of the multiple pipeline stages including multiple processing elements, the multiple processing elements configured to …
Who is the assignee on this patent?
Intel Corp
What technology area does this patent fall under?
Primary CPC classification G06F17/16. Mapped technology areas include Physics.
When was this patent published?
Publication date Aug 26, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).