Free form expression accelerator with thread length-based thread assignment to clustered soft processor cores that share a functional circuit
US-10606651-B2 · Mar 31, 2020 · US
US10877812B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10877812-B2 |
| Application number | US-201816123098-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 6, 2018 |
| Priority date | Sep 6, 2018 |
| Publication date | Dec 29, 2020 |
| Grant date | Dec 29, 2020 |
Official abstract text for this publication.
A plurality of hardware accelerators are interconnected and include a special processing unit and accelerator memory. At least one host computer is coupled to each of the plurality of hardware accelerators and includes a general processing unit and host memory. The plurality of hardware accelerators exchange data in a ring communication pattern in computing a linear layer of a neural network.
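The ring communication pattern in the abstract can be illustrated with a small schedule table. This is a minimal sketch, not the patent's implementation: the function name `ring_schedule`, the zero-based numbering, and the choice that accelerator `p` reads from accelerator `p + 1` are all assumptions made for illustration. It shows that when every accelerator keeps reading from its neighbor in the same direction, each one works on every block exactly once over P steps, with P − 1 of those blocks arriving from a neighbor.

```python
def ring_schedule(num_accelerators):
    """Return schedule[step][p]: index of the block accelerator p works on at that step.

    Assumption (illustrative only): blocks start on the accelerator with the same
    index, and each accelerator always reads from neighbor (p + 1) mod P.
    """
    return [
        [(p + step) % num_accelerators for p in range(num_accelerators)]
        for step in range(num_accelerators)
    ]


if __name__ == "__main__":
    for step, blocks in enumerate(ring_schedule(4)):
        print(f"step {step}: accelerator -> block {blocks}")
```

Step 0 uses the locally stored block, so only the remaining P − 1 steps need a transfer over the accelerator interconnect.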
Opening claim text (preview).
We claim:

1. A system comprising: a plurality of hardware accelerators interconnected via an accelerator interconnect, each of the plurality of hardware accelerators comprising a special processing unit and accelerator memory; and at least one host computer coupled to each of the plurality of hardware accelerators via an accelerator link, the at least one host computer comprising a general processing unit and host memory, the plurality of hardware accelerators exchanging data in a ring communication pattern in computing a linear layer of a neural network, wherein each of the plurality of hardware accelerators, in parallel, reads a data block stored on a neighbor accelerator in the ring communication pattern employing a consistently same direction.

2. A system comprising: a plurality of hardware accelerators interconnected via an accelerator interconnect, each of the plurality of hardware accelerators comprising a special processing unit and accelerator memory; and at least one host computer coupled to each of the plurality of hardware accelerators via an accelerator link, the at least one host computer comprising a general processing unit and host memory, the plurality of hardware accelerators exchanging data in a ring communication pattern in computing a linear layer of a neural network, wherein input data comprising a matrix is partitioned into P parts, wherein P represents a number of the hardware accelerators, wherein a hardware accelerator in the plurality of hardware accelerators stores one part of the P parts in the accelerator memory associated with the hardware accelerator, and wherein the plurality of hardware accelerators exchanging data in a ring communication pattern comprises the hardware accelerator transferring a sub-block of the one part it stores to another hardware accelerator in the plurality of hardware accelerators.

3. The system of claim 2, wherein the hardware accelerator transfers the sub-block in parallel with performing a matrix computation.

4. The system of claim 2, wherein the plurality of hardware accelerators exchanging data in a ring communication pattern comprises the hardware accelerator receiving a sub-block of a part stored in another one of the plurality of hardware accelerators from said another one of the plurality of hardware accelerators.

5. The system of claim 4, wherein only (P−1)/P partitions are streamed into and out of the hardware accelerator.

6. The system of claim 4, wherein only (P−1)/P partitions are streamed into and out of said another one of the plurality of hardware accelerators.

7. The system of claim 2, wherein the input data is initially stored on the host computer entirely, and the P parts are distributed to the hardware accelerators.

8. The system of claim 2, wherein the data exchanged comprises at least a part of a flattened matrix resulting at a fully connected layer of a convolutional neural network.

9. The system of claim 2, wherein the plurality of hardware accelerators comprises P number of hardware accelerators, and wherein the system performs a general matrix to matrix multiplication (GEMM) wherein at least three matrices comprising a first matrix, a second matrix and a third matrix are involved, wherein the first matrix and the third matrix are split by row into P partitions and stored by column, each partition of the first matrix is stored on a different accelerator of the P accelerators and each partition of the third matrix is stored on the different accelerator of the P accelerators, and wherein the second matrix is split by row into P partitions and each of the P partitions of the second matrix is stored by column on the different accelerator of the P accelerators.

10. The system of claim 9, wherein each of the P accelerators in parallel: multiplies one block of the second matrix stored locally by corresponding columns of the partition of the first matrix stored locally and accumulates a result into a local partition of the third matrix; and reads a block of the second matrix stored on its neighbor accelerator in the ring communication pattern and multiplies the block of the second matrix by the corresponding columns of the partition of the first matrix stored locally and accumulates a result into the local partition of the third matrix, each of the P hardware accelerators repeating the reading of the block of the second matrix stored on its neighbor accelerator in the ring communication pattern and multiplying the block of the second matrix by the corresponding columns of the partition of the first matrix stored locally and accumulating the result into the local partition of the third matrix, until all partitions of the second matrix have taken part in the multiplying.

11. The system of claim 9, wherein the P partitions of the second matrix are further split into n sub-blocks, and the n sub-blocks are operated at a time in a pipelined fashion.

12. The system of claim 2, wherein the plurality of hardware accelerators comprises P number of hardware accelerators, and wherein the system performs a general matrix to matrix multiplication (GEMM) wherein at least three matrices comprising a first matrix, a second matrix and a third matrix are involved, wherein the first matrix and the third matrix are split by row into P partitions and stored by column, all partitions of the first matrix are stored in the host memory, each partition of the third matrix is stored on a different accelerator of the P accelerators, and wherein the second matrix is split by row into P partitions and each of the P partitions of the second matrix is stored by column on the different accelerator of the P accelerators.

13. The system of claim 12, wherein each of the P accelerators in parallel: fetches a block of the first matrix from the host memory corresponding to a block of the second matrix stored in the respective accelerator; multiplies one block of the second matrix stored locally by corresponding columns of the fetched block of the first matrix, and accumulates results in the corresponding partition of the third matrix stored locally; and reads a block of the second matrix stored in a neighboring accelerator in the ring communication pattern and fetches a next block of the first matrix, each of the P accelerators repeating the multiplying, reading and fetching until all partitions of the second matrix have taken part in the multiplying.

14. The system of claim 2, wherein the plurality of hardware accelerators comprises P number of hardware accelerators, and wherein the system performs a general matrix to matrix multiplication (GEMM) wherein at least three matrices comprising a first matrix, a second matrix and a third matrix are involved, wherein the first matrix and the third matrix are split by row into P partitions and stored by column, all partitions of the first matrix and the third matrix are stored in the host memory, and wherein the second matrix is split by row into P partitions and all of the P partitions of the second matrix is stored by column in the host memory.

15. The system of claim 14, wherein each of the P hardware accelerators in parallel fetches a block of the third matrix wherein all of the P hardware accelerators work on a separate partition of the third matrix; each of the P hardware accelerators in parallel fetches a block of the first matrix and a block of the second matrix from the host computer; each of the P accelerators in parallel multiplies the block of the second matrix by corresponding columns of the partition of the first matrix fetched from the host computer and accumulates a result into the l
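
The data layout of claim 9 and the multiply-accumulate loop of claim 10 can be simulated on a single machine. The following is a rough sketch under stated assumptions, not the claimed system: accelerators are modeled as list entries, the parallel work is run sequentially, the third matrix starts at zero, the row counts divide evenly by P, and each accelerator reads from neighbor `p + 1`. The names `matmul`, `split_rows`, and `ring_gemm` are hypothetical helpers introduced for illustration. With first matrix A, second matrix B, and third matrix C, accelerator `p` holds row partitions A_p, B_p, and C_p, and accumulates C_p += A_p[:, columns of block q] · B_q for each block q that passes through it.

```python
def matmul(a, b):
    """Plain dense product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_rows(matrix, parts):
    """Split a matrix by row into `parts` equal partitions (assumes divisibility)."""
    size = len(matrix) // parts
    return [matrix[i * size:(i + 1) * size] for i in range(parts)]


def ring_gemm(a, b, num_acc):
    """Simulate C = A * B with A, B, C row-partitioned across num_acc accelerators."""
    a_parts = split_rows(a, num_acc)      # A_p stays on accelerator p
    b_blocks = split_rows(b, num_acc)     # b_blocks[p]: B block currently on accelerator p
    k = len(b) // num_acc                 # rows per B block = columns of A it pairs with
    n_cols = len(b[0])
    c_parts = [[[0] * n_cols for _ in part] for part in a_parts]
    owner = list(range(num_acc))          # owner[p]: original index of the block on p

    for step in range(num_acc):
        # Sequential stand-in for the work every accelerator does in parallel.
        for p in range(num_acc):
            q = owner[p]
            a_cols = [row[q * k:(q + 1) * k] for row in a_parts[p]]
            partial = matmul(a_cols, b_blocks[p])
            for i, row in enumerate(partial):
                for j, value in enumerate(row):
                    c_parts[p][i][j] += value
        # Ring exchange in one fixed direction; not needed after the last step.
        if step < num_acc - 1:
            b_blocks = [b_blocks[(p + 1) % num_acc] for p in range(num_acc)]
            owner = [owner[(p + 1) % num_acc] for p in range(num_acc)]

    return [row for part in c_parts for row in part]


if __name__ == "__main__":
    a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    b = [[1, 0], [0, 1], [2, 3], [4, 5]]
    assert ring_gemm(a, b, 2) == matmul(a, b)
```

In this simulation the exchange happens after the multiply; a real system in the spirit of claim 3 would overlap the transfer with the matrix computation, which the sketch does not attempt to model.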
Combinations of networks · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Matrix or vector computation {, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization (matrix transposition G06F7/78)} · CPC title
using electronic means · CPC title
the resource being a machine, e.g. CPUs, Servers, Terminals · CPC title