Hardware environment and method of performing matrix multiplication in artificial intelligence applications

US10877812B2 · US · B2

Patent metadata
Field               Value
Publication number  US-10877812-B2
Application number  US-201816123098-A
Country             US
Kind code           B2
Filing date         Sep 6, 2018
Priority date       Sep 6, 2018
Publication date    Dec 29, 2020
Grant date          Dec 29, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A plurality of hardware accelerators are interconnected and include a special processing unit and accelerator memory. At least one host computer is coupled to each of the plurality of hardware accelerators and includes a general processing unit and host memory. The plurality of hardware accelerators exchange data in a ring communication pattern in computing a linear layer of a neural network.
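The ring communication pattern named in the abstract can be illustrated with a small simulation. This is a minimal sketch, not the patented implementation: the function name `ring_exchange` and the block labels are hypothetical, and the accelerators are modeled as list positions rather than real devices. It assumes each accelerator starts with one block and, at every step, reads the block currently held by its neighbor, always in the same direction.

```python
def ring_exchange(blocks):
    """Simulate P accelerators arranged in a ring (hypothetical helper).

    Accelerator p starts with blocks[p]. At each step, every accelerator
    reads the block held by neighbor (p + 1) % P, always in that direction.
    Returns, for each accelerator, the blocks it has held, in order.
    """
    P = len(blocks)
    held = list(blocks)            # block currently resident on each accelerator
    seen = [[b] for b in blocks]   # history of blocks per accelerator
    for _ in range(P - 1):         # P - 1 steps are enough to see every block
        held = [held[(p + 1) % P] for p in range(P)]
        for p in range(P):
            seen[p].append(held[p])
    return seen


history = ring_exchange(["B0", "B1", "B2", "B3"])
for p, blocks_seen in enumerate(history):
    print(f"accelerator {p}: {blocks_seen}")
```

In this sketch every accelerator sees all P blocks after P - 1 reads, so only the P - 1 blocks it did not start with ever cross the interconnect.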

First claim

Opening claim text (preview).

We claim:

1. A system comprising: a plurality of hardware accelerators interconnected via an accelerator interconnect, each of the plurality of hardware accelerators comprising a special processing unit and accelerator memory; and at least one host computer coupled to each of the plurality of hardware accelerators via an accelerator link, the at least one host computer comprising a general processing unit and host memory, the plurality of hardware accelerators exchanging data in a ring communication pattern in computing a linear layer of a neural network, wherein each of the plurality of hardware accelerators, in parallel, reads a data block stored on a neighbor accelerator in the ring communication pattern employing a consistently same direction.

2. A system comprising: a plurality of hardware accelerators interconnected via an accelerator interconnect, each of the plurality of hardware accelerators comprising a special processing unit and accelerator memory; and at least one host computer coupled to each of the plurality of hardware accelerators via an accelerator link, the at least one host computer comprising a general processing unit and host memory, the plurality of hardware accelerators exchanging data in a ring communication pattern in computing a linear layer of a neural network, wherein input data comprising a matrix is partitioned into P parts, wherein P represents a number of the hardware accelerators, wherein a hardware accelerator in the plurality of hardware accelerators stores one part of the P parts in the accelerator memory associated with the hardware accelerator, and wherein the plurality of hardware accelerators exchanging data in a ring communication pattern comprises the hardware accelerator transferring a sub-block of the one part it stores to another hardware accelerator in the plurality of hardware accelerators.

3. The system of claim 2, wherein the hardware accelerator transfers the sub-block in parallel with performing a matrix computation.

4. The system of claim 2, wherein the plurality of hardware accelerators exchanging data in a ring communication pattern comprises the hardware accelerator receiving a sub-block of a part stored in another one of the plurality of hardware accelerators from said another one of the plurality of hardware accelerators.

5. The system of claim 4, wherein only (P−1)/P partitions are streamed into and out of the hardware accelerator.

6. The system of claim 4, wherein only (P−1)/P partitions are streamed into and out of said another one of the plurality of hardware accelerators.

7. The system of claim 2, wherein the input data is initially stored on the host computer entirely, and the P parts are distributed to the hardware accelerators.

8. The system of claim 2, wherein the data exchanged comprises at least a part of a flattened matrix resulting at a fully connected layer of a convolutional neural network.

9. The system of claim 2, wherein the plurality of hardware accelerators comprises P number of hardware accelerators, and wherein the system performs a general matrix to matrix multiplication (GEMM) wherein at least three matrices comprising a first matrix, a second matrix and a third matrix are involved, wherein the first matrix and the third matrix are split by row into P partitions and stored by column, each partition of the first matrix is stored on a different accelerator of the P accelerators and each partition of the third matrix is stored on the different accelerator of the P accelerators, and wherein the second matrix is split by row into P partitions and each of the P partitions of the second matrix is stored by column on the different accelerator of the P accelerators.

10. The system of claim 9, wherein each of the P accelerators in parallel: multiplies one block of the second matrix stored locally by corresponding columns of the partition of the first matrix stored locally and accumulates a result into a local partition of the third matrix; and reads a block of the second matrix stored on its neighbor accelerator in the ring communication pattern and multiplies the block of the second matrix by the corresponding columns of the partition of the first matrix stored locally and accumulates a result into the local partition of the third matrix, each of the P hardware accelerators repeating the reading of the block of the second matrix stored on its neighbor accelerator in the ring communication pattern and multiplying the block of the second matrix by the corresponding columns of the partition of the first matrix stored locally and accumulating the result into the local partition of the third matrix, until all partitions of the second matrix have taken part in the multiplying.

11. The system of claim 9, wherein the P partitions of the second matrix are further split into n sub-blocks, and the n sub-blocks are operated at a time in a pipelined fashion.

12. The system of claim 2, wherein the plurality of hardware accelerators comprises P number of hardware accelerators, and wherein the system performs a general matrix to matrix multiplication (GEMM) wherein at least three matrices comprising a first matrix, a second matrix and a third matrix are involved, wherein the first matrix and the third matrix are split by row into P partitions and stored by column, all partitions of the first matrix are stored in the host memory, each partition of the third matrix is stored on a different accelerator of the P accelerators, and wherein the second matrix is split by row into P partitions and each of the P partitions of the second matrix is stored by column on the different accelerator of the P accelerators.

13. The system of claim 12, wherein each of the P accelerators in parallel: fetches a block of the first matrix from the host memory corresponding to a block of the second matrix stored in the respective accelerator; multiplies one block of the second matrix stored locally by corresponding columns of the fetched block of the first matrix, and accumulates results in the corresponding partition of the third matrix stored locally; and reads a block of the second matrix stored in a neighboring accelerator in the ring communication pattern and fetches a next block of the first matrix, each of the P accelerators repeating the multiplying, reading and fetching until all partitions of the second matrix have taken part in the multiplying.

14. The system of claim 2, wherein the plurality of hardware accelerators comprises P number of hardware accelerators, and wherein the system performs a general matrix to matrix multiplication (GEMM) wherein at least three matrices comprising a first matrix, a second matrix and a third matrix are involved, wherein the first matrix and the third matrix are split by row into P partitions and stored by column, all partitions of the first matrix and the third matrix are stored in the host memory, and wherein the second matrix is split by row into P partitions and all of the P partitions of the second matrix is stored by column in the host memory.

15. The system of claim 14, wherein each of the P hardware accelerators in parallel fetches a block of the third matrix wherein all of the P hardware accelerators work on a separate partition of the third matrix; each of the P hardware accelerators in parallel fetches a block of the first matrix and a block of the second matrix from the host computer; each of the P accelerators in parallel multiplies the block of the second matrix by corresponding columns of the partition of the first matrix fetched from the host computer and accumulates a result into the l
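The data layout of claim 9 and the multiply-and-rotate loop of claim 10 can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the patented implementation: the names `ring_gemm`, `matmul_acc`, `split_rows`, and `naive_matmul` are hypothetical, matrices are lists of lists, the row counts of the first and second matrices are assumed divisible by P, and the parallel accelerators are simulated sequentially (real hardware would overlap the neighbor read with the multiply, as in claim 3). The first matrix is called A, the second B, and the third C, with C = A × B.

```python
def matmul_acc(C, A, B):
    """Accumulate A @ B into C in place (all are lists of lists)."""
    for i, a_row in enumerate(A):
        for k, a_ik in enumerate(a_row):
            for j, b_kj in enumerate(B[k]):
                C[i][j] += a_ik * b_kj


def naive_matmul(A, B):
    """Reference single-device product, used only for checking."""
    C = [[0] * len(B[0]) for _ in A]
    matmul_acc(C, A, B)
    return C


def split_rows(M, P):
    """Split matrix M by row into P equal partitions."""
    rows = len(M) // P
    return [M[p * rows:(p + 1) * rows] for p in range(P)]


def ring_gemm(A, B, P):
    """Simulate C = A @ B on P accelerators using a ring exchange of B.

    Accelerator p holds row partition p of A, B, and C. Returns the
    assembled C and the number of B partitions each accelerator read
    from its neighbor.
    """
    assert len(A) % P == 0 and len(B) % P == 0, "sketch assumes divisibility"
    kb = len(B) // P                       # rows per B partition
    n_cols = len(B[0])
    A_parts, B_parts = split_rows(A, P), split_rows(B, P)
    C_parts = [[[0] * n_cols for _ in part] for part in A_parts]
    held = list(range(P))                  # index of the B partition on each accelerator
    reads_per_accel = 0
    for step in range(P):
        for p in range(P):                 # conceptually parallel
            q = held[p]
            # Columns of the local A partition that correspond to B partition q.
            A_cols = [row[q * kb:(q + 1) * kb] for row in A_parts[p]]
            matmul_acc(C_parts[p], A_cols, B_parts[q])
        if step < P - 1:
            # Every accelerator reads its neighbor's block, same direction each time.
            held = [held[(p + 1) % P] for p in range(P)]
            reads_per_accel += 1
    C = [row for part in C_parts for row in part]
    return C, reads_per_accel


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 0], [0, 1], [2, 1], [1, 2]]
C, reads = ring_gemm(A, B, P=2)
print(C == naive_matmul(A, B), reads)
```

In this sketch each accelerator reads P − 1 of the P partitions of B from its neighbor, which is consistent with the (P−1)/P figure in claims 5 and 6. The sub-block pipelining of claim 11 and the host-memory variants of claims 12 to 15 are not modeled here.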

Assignees

Inventors

Classifications

  • Combinations of networks · CPC title

  • Convolutional networks [CNN, ConvNet] · CPC title

  • Matrix or vector computation {, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization (matrix transposition G06F7/78)} · CPC title

  • G06N3/063 · Primary

    using electronic means · CPC title

  • the resource being a machine, e.g. CPUs, Servers, Terminals · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10877812B2 cover?
A plurality of hardware accelerators are interconnected and include a special processing unit and accelerator memory. At least one host computer is coupled to each of the plurality of hardware accelerators and includes a general processing unit and host memory. The plurality of hardware accelerators exchange data in a ring communication pattern in computing a linear layer of a neural network.
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06N3/063. Mapped technology areas include Physics.
When was this patent published?
Publication date: Dec 29, 2020 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).