Efficient sparse matrix-vector multiplication on parallel processors

US2016140084A1 · US · A1

Patent metadata
FieldValue
Publication numberUS-2016140084-A1
Application numberUS-201414542003-A
CountryUS
Kind codeA1
Filing dateNov 14, 2014
Priority dateNov 14, 2014
Publication dateMay 19, 2016
Grant date

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method of multiplication of a sparse matrix and a vector to obtain a new vector and a system for implementing the method are claimed. Embodiments of the method are intended to optimize the performance of sparse matrix-vector multiplication in highly parallel processors, such as GPUs. The sparse matrix is stored in compressed sparse row (CSR) format.

First claim

Opening claim text (preview).

What is claimed is: 1 . A method of multiplication of a sparse matrix and a vector to obtain a new vector, the sparse matrix being stored in compressed sparse row (CSR) format, the method to be executed on a parallel processor, the method comprising: partitioning the matrix into blocks of consecutive rows; and for each of said blocks having more than a minimum number of rows: determining a number of non-zero matrix elements in the block; executing a first process for sparse-matrix vector multiplication on rows of the block if a number of non-zero matrix elements in the block is less than or equal to a maximum; and executing a second process for sparse-matrix vector multiplication, distinct from the first process, if the number of non-zero matrix elements is greater than the maximum. 2 . The method of claim 1 , wherein the determining a number of non-zero matrix elements in the block, and the executing of the first process or the executing of the second process are performed in parallel by more than one workgroup, each workgroup performing on a different set of blocks. 3 . The method of claim 1 , further comprising executing the second process on a block having a number of rows that is less than or equal to the minimum number of rows. 4 . The method of claim 1 , wherein executing the first process comprises: streaming non-zero elements in each block to contiguous locations in a local memory; multiplying each non-zero element in each block with a corresponding element of the vector; and adding results of the multiplying from each row of each block to obtain corresponding elements of the new vector. 5 . The method of claim 4 , wherein the streaming comprises using consecutive threads in a wavefront. 6 . The method of claim 4 , wherein the maximum is determined by a capacity of the local memory. 7 . The method of claim 4 , wherein the multiplying and adding are performed in parallel for all rows in each block. 8 . The method of claim 4 , wherein the adding for each row is performed by more than one thread in a wavefront. 9 . The method of claim 1 , wherein the second process comprises a CSR-vector process. 10 . The method of claim 1 , further comprising using the second process to determine an element of the new vector corresponding to a row in the block if said row is the only row in said block and the number of non-zero elements in said row exceeds the maximum. 11 . The method of claim 1 , wherein the partitioning of the matrix comprises deriving a rows block buffer, the rows block buffer indicating a number of rows in each block. 12 . The method of claim 11 , wherein the determining a number of non-zero matrix elements in the block comprises: determining a start row and a stop row for each block from the rows block buffer; determining two values in a row delimiter buffer corresponding to the start row and to the stop row; and subtracting said two values to determine the number of non-zero elements in each block. 13 . The method of claim 1 , wherein the multiplication of a sparse matrix and a vector are used in operation of a stored application, results of said operation being stored in a memory. 14 . The method of claim 1 , wherein the multiplication of a sparse matrix and a vector are used in operation of a stored application, results of said operation being visually displayed on a display device. 15 . A system configured to multiply a sparse matrix and a vector to obtain a new vector, the system comprising: a parallel processor; a storage configured to exchange information with the processor and store the vector and the sparse matrix, the sparse matrix being stored in compressed sparse row (CSR) format; and a local memory configured to exchange information with the processor; wherein the parallel processor is configured to implement a method of multiplication of the sparse matrix and the vector; the method comprising: partitioning the matrix into blocks of consecutive rows; and for each of said blocks having more than a minimum number of rows: determining a number of non-zero matrix elements in the block; executing a first process for sparse-matrix vector multiplication on rows of the block if a number of non-zero matrix elements in the block is less than or equal to a maximum; and executing a second process for sparse-matrix vector multiplication, distinct from the first process, if the number of non-zero matrix elements is greater than the maximum. 16 . The system of claim 15 , wherein the processor is configured to perform the determining a number of non-zero matrix elements in the block, and the executing of the first process or the executing of the second process in parallel in more than one workgroup, each workgroup performing on a different set of blocks. 17 . The system of claim 15 , wherein the processor is further configured to execute the second process on a block having a number of rows that is less than or equal to the minimum number of rows. 18 . The system of claim 15 , wherein the processor is configured to execute the first process by: streaming non-zero elements in each block to contiguous locations in the local memory; multiplying each non-zero element in each block with a corresponding element of the vector; and adding results of the multiplying from each row of each block to obtain corresponding elements of the new vector. 19 . The system of claim 18 , wherein the processor is configured to stream non-zero elements in each block to contiguous locations in the local memory by using consecutive threads in a wavefront. 20 . The system of claim 18 , wherein the processor is configured to perform the multiplying and the adding in parallel for all rows in each block. 21 . The system of claim 18 , wherein the processor is configured to perform the adding for each row in more than one thread in a wavefront. 22 . The system of claim 15 , wherein a capacity of the local memory determines the maximum. 23 . The system of claim 15 , wherein the local memory is integrated with the processor. 24 . The system of claim 15 , wherein the second process comprises a CSR-vector process. 25 . The system of claim 15 , wherein the processor is configured to use the second process to determine an element of the new vector corresponding to a row in the block if said row is the only row in said block and the number of non-zero elements in said row exceeds the maximum. 26 . The system of claim 15 wherein the processor is configured to store the new vector in the storage. 27 . The system of claim 15 , wherein the processor is configured to perform the partitioning of the matrix by deriving a rows block buffer, the rows block buffer indicating a number of rows in each block. 28 . The system of claim 27 , wherein the processor is configured to determine a number of rows in each block by: determining a start row and a stop row for each block from the rows block buffer; determining two values in a row delimiter buffer corresponding to the start row and to the stop row; and subtracting said two values to determine the number of non-zero elements in each block. 29 . The system of claim 15 , wherein the processor is configured to: operate an application stored in the storage, wherein the multiplication of the sparse matrix and the vector are used in the a

Assignees

Inventors

Classifications

  • G06F17/16Primary

    Matrix or vector computation {, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization (matrix transposition G06F7/78)} · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2016140084A1 cover?
A method of multiplication of a sparse matrix and a vector to obtain a new vector and a system for implementing the method are claimed. Embodiments of the method are intended to optimize the performance of sparse matrix-vector multiplication in highly parallel processors, such as GPUs. The sparse matrix is stored in compressed sparse row (CSR) format.
Who is the assignee on this patent?
Advanced Micro Devices Inc
What technology area does this patent fall under?
Primary CPC classification G06F17/16. Mapped technology areas include Physics.
When was this patent published?
Publication date Thu May 19 2016 00:00:00 GMT+0000 (Coordinated Universal Time) (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).