Distributed matrix multiplication for neural networks
| Field | Value |
|---|---|
| Publication number | US-11748625-B2 |
| Application number | US-201615395675-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 30, 2016 |
| Priority date | Dec 30, 2016 |
| Publication date | Sep 5, 2023 |
| Grant date | Sep 5, 2023 |
Official abstract text for this publication.
In one embodiment, a matrix operation may be performed using a plurality of input matrices, wherein the matrix operation is associated with one or more convolution operations. The plurality of input matrices may be partitioned into a plurality of input partitions, wherein the plurality of input matrices is partitioned based on a number of available processing elements. The plurality of input partitions may be distributed among a plurality of processing elements, wherein each input partition is distributed to a particular processing element of the plurality of processing elements. A plurality of partial matrix operations may be performed using the plurality of processing elements, and partial matrix data may be transmitted between the plurality of processing elements while performing the plurality of partial matrix operations. A result of the matrix operation may be determined based on the plurality of partial matrix operations.
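The flow in the abstract (partition by the number of processing elements, distribute, compute partial results in stages while passing data between elements, then assemble) can be sketched for a plain matrix product. This is a minimal single-process simulation, not the patented hardware: the names `ring_matmul`, `split_rows`, and `split_cols` are hypothetical, the row/column partitioning scheme is one assumed choice among several, and both matrix dimensions are assumed to divide evenly by the number of processing elements.

```python
def matmul(a, b):
    """Plain dense product of two row-major list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_rows(m, parts):
    """Split the rows of m into `parts` contiguous, equally sized blocks."""
    step = len(m) // parts
    return [m[i * step:(i + 1) * step] for i in range(parts)]


def split_cols(m, parts):
    """Split the columns of m into `parts` contiguous, equally sized blocks."""
    step = len(m[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in m] for i in range(parts)]


def ring_matmul(a, b, num_pes):
    """Compute a @ b with `num_pes` simulated processing elements (PEs).

    Assumption: len(a) and len(b[0]) are both divisible by num_pes.
    """
    if len(a) % num_pes or len(b[0]) % num_pes:
        raise ValueError("matrix dimensions must be divisible by num_pes")

    a_parts = split_rows(a, num_pes)  # PE p keeps row block p in every stage
    b_parts = split_cols(b, num_pes)  # column blocks circulate around the ring
    held = list(range(num_pes))       # held[p] = index of the B block at PE p
    blocks = [[None] * num_pes for _ in range(num_pes)]

    for _stage in range(num_pes):
        # Every PE performs one partial operation on the partitions it holds.
        for pe in range(num_pes):
            j = held[pe]
            blocks[pe][j] = matmul(a_parts[pe], b_parts[j])
        # Between stages, each PE receives the B block of its ring predecessor.
        held = [held[(pe - 1) % num_pes] for pe in range(num_pes)]

    # Assemble the full result from the per-PE partial result blocks.
    result = []
    for pe in range(num_pes):
        for r in range(len(a_parts[pe])):
            row = []
            for j in range(num_pes):
                row.extend(blocks[pe][j][r])
            result.append(row)
    return result


if __name__ == "__main__":
    print(ring_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], 2))
```

With two processing elements, the first element uses the first column block in stage one and the second column block in stage two, while the second element does the reverse. In this sketch only the circulating blocks move between stages; the row blocks and the partial results stay where they were produced until final assembly.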
Opening claim text (preview).
What is claimed is:

1. An apparatus, comprising: interface circuitry; a matrix processing cluster (MPC) circuitry, communicatively coupled to the interface circuitry, the MPC circuitry including: memory resource block circuitry to store a plurality of input matrices; a plurality of matrix processing units (MPUs), wherein each MPU includes processing circuitry to perform matrix arithmetic; master control central processing unit (MCC) circuitry to distribute a matrix instruction, received from a controller via the interface circuitry, across the plurality of matrix processing units (MPUs), wherein the matrix instruction is to perform a neural network operation on the plurality of input matrices, wherein the neural network operation includes a plurality of convolution operations; slicing engine circuitry to partition the plurality of input matrices into a plurality of input partitions based on a number of available MPUs; the MCC circuitry to distribute the plurality of input partitions among the plurality of MPUs, wherein each input partition is distributed to a particular MPU of the plurality of MPUs, wherein the MCC circuitry to shift each input partition to a different MPU of the plurality of MPUs between each of a plurality of stages of the matrix operation; and at least two or more of the plurality of MPUs to perform a plurality of partial matrix operations in the plurality of stages including at least a first partial matrix operation in a first stage by a first MPU using a first input partition and a second partial matrix operation in the first stage by a second MPU using a second input partition, and including at least a third partial matrix operation in a stage subsequent to the first stage by the first MPU using the second input partition and a fourth partial matrix operation in a stage subsequent to the first stage by the second MPU using the first input partition, wherein the first and second input partitions are shifted between at least the first and second MPUs during one or more weight update operations; and the controller to determine a result of the neural network operation based on the plurality of partial matrix operations.

2. The apparatus of claim 1, wherein the plurality of input matrices includes matrix data associated with one or more images and one or more filters, wherein the one or more images are associated with one or more channels.

3. The apparatus of claim 2, wherein the slicing engine circuitry to partition the plurality of input matrices into the plurality of input partitions based on the number of available MPUs is further to partition the plurality of input matrices based on one or more of: a number of channels associated with the one or more images; a number of filters; and a number of images.

4. The apparatus of claim 1, wherein the MCC circuitry is further to distribute the plurality of partial matrix operations among the plurality of MPUs based on a height and a width of the result of the neural network operation.

5. The apparatus of claim 1, wherein: the plurality of MPUs is configured in a cyclic arrangement such that each MPU is communicatively coupled to a plurality of neighbor MPUs; the MCC circuitry to transmit, via the interface circuitry, the partial matrix data between the plurality of MPUs while performing the plurality of partial matrix operations is further to transmit a portion of the partial matrix data from each MPU to one or more of the neighbor MPUs while performing a particular stage of the partial matrix operations.

6. The apparatus of claim 5, wherein the neural network operation is associated with the one or more weight update operations in a neural network.

7. The apparatus of claim 5, wherein the partial matrix data includes a partial result matrix determined by a first MPU in a particular stage of the partial matrix operations, and wherein the partial result matrix is to be used by a second MPU in a subsequent stage of the partial matrix operations.

8. The apparatus of claim 7, wherein the neural network operation is associated with a forward propagation operation in a neural network.

9. The apparatus of claim 7, wherein the neural network operation is associated with a backward propagation operation in a neural network.

10. A method of performing a neural network operation on a matrix processor, comprising: distribute a matrix instruction to perform the neural network operation on a plurality of input matrices, wherein the neural network operation includes a plurality of convolution operations; partitioning the plurality of input matrices into a plurality of input partitions based on a number of available matrix processing units (MPUs) in the matrix processor; distributing the plurality of input partitions among a plurality of MPUs in the matrix processor, wherein each input partition is distributed to a particular MPU of the plurality of MPUs; shifting each input partition to a different MPU of the plurality of MPUs between each of a plurality of stages of the matrix operation; and performing a plurality of partial matrix operations in a plurality of stages, including at least a first partial matrix operation in a first stage by a first MPU using a first input partition and a second partial matrix operation in the first stage by a second MPU using a second input partition, and including at least a third partial matrix operation in a stage subsequent to the first stage by the first MPU using the second input partition and a fourth partial matrix operation in a stage subsequent to the first stage by the second MPU using the first input partition, wherein the first and second input partitions are shifted between at least the first and second MPUs during one or more weight update operations; and determining a result of the neural network operation based on the plurality of partial matrix operations.

11. The method of claim 10, wherein: the plurality of input matrices includes matrix data associated with one or more images and one or more filters, wherein the one or more images are associated with one or more channels; and the plurality of input matrices is further partitioned based on one or more of: a number of channels associated with the one or more images; a number of filters; and a number of images.

12. The method of claim 10, further including distributing the plurality of partial matrix operations to the plurality of MPUs based on a height and a width of the result of the neural network operation.

13. The method of claim 10, wherein the plurality of MPUs is configured in a cyclic arrangement such that each MPU is communicatively coupled to a plurality of neighbor MPUs.

14. The method of claim 13, wherein each MPU transmits a portion of the partial matrix data to one or more of the neighbor MPUs while performing a particular stage of the partial matrix operations.

15. A system, comprising: memory circuitry to store a plurality of input matrices; a plurality of matrix processing chips, wherein each matrix processing chip includes a plurality of matrix processing cluster (MPC) circuitries, the plurality of MPC circuitries to each include a plurality of matrix processing units (MPUs) to perform matrix arithmetic; interface circuitry to communicatively couple the plurality of matrix processing chips; and host processor circuitry to instruct at least one of the plurality of matrix processing chips to perform a neural network operation on the plurality of input matrices, wherein the neural network operation includes a plurality of convolution operations; the at least one of the plurality of matrix processing chips t
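
Claims 3 and 11 mention partitioning the inputs by number of channels, filters, or images. As an illustration of the channel case only, the sketch below splits a multi-channel image and a matching filter into channel groups, computes a partial output per group, and sums the partials. It is one possible reading under stated assumptions, not the claimed circuitry: the function names (`conv2d_single`, `conv_channels`, `partitioned_conv`) are hypothetical, the convolution is the stride-1, no-padding cross-correlation commonly used in CNNs, and the channel count is assumed to divide evenly by the number of MPUs.

```python
def conv2d_single(image, kernel):
    """Valid, stride-1 2D cross-correlation of one channel with one kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def add_maps(x, y):
    """Elementwise sum of two equally sized 2D maps."""
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def conv_channels(image, filt):
    """Sum the per-channel correlations; image and filt are lists of 2D channels."""
    total = None
    for ch_img, ch_k in zip(image, filt):
        part = conv2d_single(ch_img, ch_k)
        total = part if total is None else add_maps(total, part)
    return total


def partitioned_conv(image, filt, num_mpus):
    """Split channels into num_mpus groups, convolve each group, sum the partials.

    Assumption: len(image) == len(filt) and is divisible by num_mpus.
    """
    channels = len(image)
    if channels != len(filt) or channels % num_mpus:
        raise ValueError("channel count must match and divide by num_mpus")
    step = channels // num_mpus
    # Each simulated MPU handles one contiguous group of channels.
    partials = [
        conv_channels(image[m * step:(m + 1) * step], filt[m * step:(m + 1) * step])
        for m in range(num_mpus)
    ]
    # Combining the partial results is an elementwise reduction.
    result = partials[0]
    for part in partials[1:]:
        result = add_maps(result, part)
    return result


if __name__ == "__main__":
    image = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
             [[1, 1, 1], [1, 1, 1], [1, 1, 1]]]
    filt = [[[1, 0], [0, 1]],
            [[1, 1], [1, 1]]]
    print(partitioned_conv(image, filt, 2))
```

Partitioning by filters or by images would instead yield independent output maps that are concatenated rather than summed, so the combining step depends on which dimension is split.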
Convolutional networks [CNN, ConvNet] · CPC title
Backpropagation, e.g. using gradient descent · CPC title
Multidimensional correlation or convolution · CPC title
Matrix or vector computation {, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization (matrix transposition G06F7/78)} · CPC title
Combinations of networks · CPC title