Systems and methods of instructions to accelerate multiplication of sparse matrices using bitmasks that identify non-zero elements

US11847185B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-11847185-B2
Application numberUS-202117485055-A
CountryUS
Kind codeB2
Filing dateSep 24, 2021
Priority dateDec 27, 2018
Publication dateDec 19, 2023
Grant dateDec 19, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed embodiments relate to accelerating multiplication of sparse matrices. In one example, a processor is to fetch and decode an instruction having fields to specify locations of first, second, and third matrices, and an opcode indicating the processor is to multiply and accumulate matching non-zero (NZ) elements of the first and second matrices with corresponding elements of the third matrix, and executing the decoded instruction as per the opcode to generate NZ bitmasks for the first and second matrices, broadcast up to two NZ elements at a time from each row of the first matrix and each column of the second matrix to a processing engine (PE) grid, each PE to multiply and accumulate matching NZ elements of the first and second matrices with corresponding elements of the third matrix. Each PE further to store an NZ element for use in a subsequent multiplications.

First claim

Opening claim text (preview).

What is claimed is: 1. An apparatus comprising: a first register to store elements from a first matrix that has zero and non-zero values in a compressed format; a second register to store elements from a second matrix; a scheduler circuit to schedule an instruction for execution, the instruction comprising fields to specify the first register, the second register, an accumulation matrix, a destination matrix, indications of a logical matrix position of the elements in at least the first matrix in a non-compressed format, and an opcode to indicate the instruction is a sparse matrix instruction and that execution circuitry including a processing engine is to select a proper subset of elements of the second register from the second matrix as an input into a multiply-accumulator circuit of the processing engine based on the indications, multiply the elements from the first matrix with corresponding elements of the proper subset of elements of the second matrix to generate products, accumulate the products with corresponding elements of the accumulation matrix to produce sums, and store the sums in corresponding elements of the destination matrix; and the execution circuitry, including the processing engine, to execute the instruction according to the opcode, wherein the first matrix has M rows by K columns, the second matrix has K rows by N columns, the accumulation matrix has M rows by N columns, and the instruction includes a suffix to the opcode that when set to a first value is to explicitly specify a first set of K, M, and N values, and when set to a different second value is to explicitly specify a different second set of K, M, and N values. 2. The apparatus of claim 1 , wherein the processing engine comprises a multiplexer, and the multiplexer is to be controlled based on the indications of the logical matrix position of the elements in at least the first matrix in the non-compressed format to select the proper subset of elements of the second register from the second matrix as the input into the multiply-accumulator circuit. 3. The apparatus of claim 2 , wherein the apparatus comprises a plurality of instances of the processing engine. 4. The apparatus of claim 1 , wherein the apparatus supports multithreading. 5. The apparatus of claim 1 , wherein the elements from the first matrix are in the first register in row major layout. 6. The apparatus of claim 1 , wherein the instruction includes a second suffix to the opcode that when set to a first value is to explicitly specify the elements are floating-point format and when set to a different second value is to explicitly specify the elements are integer format. 7. The apparatus of claim 1 , wherein the execution circuitry further comprises an integer execution circuit and a floating-point execution circuit. 8. A method comprising: storing elements from a first matrix that has zero and non-zero values in a compressed format in a first register of an apparatus; storing elements from a second matrix in a second register of the apparatus; scheduling, by a scheduler circuit of the apparatus, an instruction for execution, the instruction comprising fields to specify the first register, the second register, an accumulation matrix, a destination matrix, indications of a logical matrix position of the elements in at least the first matrix in a non-compressed format, and an opcode to indicate the instruction is a sparse matrix instruction and that execution circuitry of the apparatus including a processing engine is to select a proper subset of elements of the second register from the second matrix as an input into a multiply-accumulator circuit of the processing engine based on the indications, multiply the elements from the first matrix with corresponding elements of the proper subset of elements of the second matrix to generate products, accumulate the products with corresponding elements of the accumulation matrix to produce sums, and store the sums in corresponding elements of the destination matrix; and executing, by the execution circuitry including the processing engine, the instruction according to the opcode, wherein the first matrix has M rows by K columns, the second matrix has K rows by N columns, the accumulation matrix has M rows by N columns, and the instruction includes a suffix to the opcode that when set to a first value explicitly specifies a first set of K, M, and N values, and when set to a different second value explicitly specifies a different second set of K, M, and N values. 9. The method of claim 8 , wherein the processing engine comprises a multiplexer, and further comprising controlling the multiplexer based on the indications of the logical matrix position of the elements in at least the first matrix in the non-compressed format to select the proper subset of elements of the second register from the second matrix as the input into the multiply-accumulator circuit. 10. The method of claim 9 , wherein the apparatus comprises a plurality of instances of the processing engine. 11. The method of claim 8 , wherein the apparatus supports multithreading. 12. The method of claim 8 , wherein the elements from the first matrix are in the first register in row major layout. 13. The method of claim 8 , wherein the instruction includes a second suffix to the opcode that when set to a first value explicitly specifies the elements are floating-point format and when set to a different second value explicitly specifies the elements are integer format. 14. The method of claim 8 , wherein the execution circuitry further comprises an integer execution circuit and a floating-point execution circuit. 15. A non-transitory machine-readable medium containing code to which a machine is to respond by: storing elements from a first matrix that has zero and non-zero values in a compressed format in a first register of an apparatus; storing elements from a second matrix in a second register of the apparatus; scheduling, by a scheduler circuit of the apparatus, an instruction for execution, the instruction comprising fields to specify the first register, the second register, an accumulation matrix, a destination matrix, indications of a logical matrix position of the elements in at least the first matrix in a non-compressed format, and an opcode to indicate the instruction is a sparse matrix instruction and that execution circuitry of the apparatus including a processing engine is to select a proper subset of elements of the second register from the second matrix as an input into a multiply-accumulator circuit of the processing engine based on the indications, multiply the elements from the first matrix with corresponding elements of the proper subset of elements of the second matrix to generate products, accumulate the products with corresponding elements of the accumulation matrix to produce sums, and store the sums in corresponding elements of the destination matrix; and executing, by the execution circuitry including the processing engine, the instruction according to the opcode, wherein the first matrix has M rows by K columns, the second matrix has K rows by N columns, the accumulation matrix has M rows by N columns, and the instruction includes a suffix to the opcode that when set to a first value explicitly specifies a first set of K, M, and N values, and when set to a different second value explicitly specifies a different second set of K, M, and N values. 16. The non-transitory machine-readable medium of claim 15 , wherein the processing engine comprises a multiplexer, and the machine is further to respond to the code by controlling the multipl

Assignees

Inventors

Classifications

  • controlled in tandem, e.g. multiplier-accumulator · CPC title

  • Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution · CPC title

  • G06F17/16Primary

    Matrix or vector computation {, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization (matrix transposition G06F7/78)} · CPC title

  • Arithmetic instructions · CPC title

  • Decoding the operand specifier, e.g. specifier format · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11847185B2 cover?
Disclosed embodiments relate to accelerating multiplication of sparse matrices. In one example, a processor is to fetch and decode an instruction having fields to specify locations of first, second, and third matrices, and an opcode indicating the processor is to multiply and accumulate matching non-zero (NZ) elements of the first and second matrices with corresponding elements of the third mat…
Who is the assignee on this patent?
Intel Corp
What technology area does this patent fall under?
Primary CPC classification G06F17/16. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Dec 19 2023 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).