Data format suitable for fast massively parallel general matrix multiplication in a programmable IC
US-10515135-B1 · Dec 24, 2019 · US
US11003619B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11003619-B2 |
| Application number | US-201916283795-A |
| Country | US |
| Kind code | B2 |
| Filing date | Feb 24, 2019 |
| Priority date | Feb 24, 2019 |
| Publication date | May 11, 2021 |
| Grant date | May 11, 2021 |
Official abstract text for this publication.
The present disclosure is directed to systems and methods for decomposing systolic array circuitry to provide a plurality of N×N systolic sub-array circuits, apportioning a first tensor or array into a plurality of N×M first input arrays, and apportioning a second tensor or array into a plurality of M×N second input arrays. Systolic array control circuitry transfers corresponding ones of the first input arrays and second input arrays to a respective one of the plurality of N×N systolic sub-array circuits. As the elements included in the first input array and the elements included in the second input array are transferred to the systolic sub-array, the systolic sub-array performs one or more mathematical operations using the first and the second input arrays. The systems and methods beneficially improve the usage of the systolic array circuitry thereby advantageously reducing the number of clock cycles needed to perform a given number of calculations.
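The flow described in the abstract can be sketched in software. The following is a minimal, sequential simulation, not the patented circuitry: the function names `multiply_block` and `matmul_by_subarrays` and the parameters `n` and `m` are hypothetical, and the sketch assumes the tensor dimensions divide evenly by `n` and `m`. Each pass of the inner loop stands in for one N×N sub-array consuming one N×M input array and one M×N input array; real hardware would run the sub-arrays concurrently.

```python
def multiply_block(a_blk, b_blk):
    """Multiply an n×m block by an m×n block, giving an n×n partial result."""
    m = len(b_blk)
    cols = len(b_blk[0])
    return [
        [sum(a_row[t] * b_blk[t][j] for t in range(m)) for j in range(cols)]
        for a_row in a_blk
    ]


def matmul_by_subarrays(a, b, n, m):
    """Multiply a (P×K) by b (K×Q) using n×n tiles fed by n×m and m×n input arrays.

    Assumption for this sketch: P and Q are multiples of n, and K is a multiple of m.
    """
    p, k, q = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b do not match")
    if p % n or q % n or k % m:
        raise ValueError("dimensions must divide evenly by n and m")

    out = [[0] * q for _ in range(p)]
    for r0 in range(0, p, n):              # row offset of an n×n output tile
        for c0 in range(0, q, n):          # column offset of an n×n output tile
            for k0 in range(0, k, m):      # one n×m / m×n input-array pair
                a_blk = [row[k0:k0 + m] for row in a[r0:r0 + n]]   # n×m
                b_blk = [row[c0:c0 + n] for row in b[k0:k0 + m]]   # m×n
                partial = multiply_block(a_blk, b_blk)             # n×n result
                # Sum corresponding elements of the n×n partial results.
                for i in range(n):
                    for j in range(n):
                        out[r0 + i][c0 + j] += partial[i][j]
    return out


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    # 2×2 sub-array with 2×1 and 1×2 input arrays.
    print(matmul_by_subarrays(a, b, n=2, m=1))
```

With `n=2` and `m=1`, every partial result is the outer product of one column of the first tensor with one row of the second, and the output is the element-wise sum of those partial results.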
Opening claim text (preview).
What is claimed:
1. A system, comprising: systolic array circuitry; and systolic array control circuitry to: decompose the systolic array circuitry into a plurality of N×N systolic sub-arrays; and apportion a first input tensor into a first plurality of N×M input arrays and a second input tensor into a second plurality of M×N input arrays; and wherein each respective one of at least a portion of the plurality of N×N systolic sub-arrays is to perform at least one mathematical operation to provide a respective one of a plurality of N×N results using corresponding ones of the N×M input arrays included in the first plurality of N×M input arrays and the M×N input arrays included in the second plurality of M×N input arrays.
2. The system of claim 1, the systolic array control circuitry to further: combine the plurality of N×N results to provide one or more N×N output tensors.
3. The system of claim 2, the systolic array control circuitry to further: cause a transfer of the one or more N×N output tensors to memory circuitry.
4. The system of claim 1: wherein the at least one mathematical operation includes a multiplication operation; and the systolic array control circuitry to further: sum corresponding elements in each of the plurality of N×N results to provide one or more N×N output tensors.
5. The system of claim 1: wherein the plurality of N×N systolic sub-arrays comprise a plurality of 2×2 systolic sub arrays; wherein the first plurality of N×M input arrays includes a plurality of 2×1 input arrays; and wherein the second plurality of M×N input arrays includes a plurality of 1×2 input arrays.
6. The system of claim 1, the systolic array control circuitry to further: cause a transfer of the first input tensor from memory circuitry; and cause a transfer of the second input tensor from the memory circuitry.
7. A non-transitory storage device that includes instructions that, when executed by systolic array control circuitry, cause the systolic array control circuitry to: decompose a systolic array circuitry into a plurality of N×N systolic sub-arrays; apportion a first input tensor into a first plurality of N×M input arrays and a second input tensor into a second plurality of M×N input arrays; and wherein each respective one of at least a portion of the plurality of N×N systolic sub-arrays is to perform at least one mathematical operation to provide a respective one of a plurality of N×N results using corresponding ones of the N×M input arrays included in the first plurality of N×M input arrays and the M×N input arrays included in the second plurality of M×N input arrays.
8. The non-transitory storage device of claim 7 wherein the instructions further cause the systolic array control circuitry to: combine the plurality of N×N results to provide one or more N×N output tensors.
9. The non-transitory storage device of claim 8 wherein the instructions further cause the systolic array control circuitry to: cause a transfer of the one or more N×N output tensors to memory circuitry.
10. The non-transitory storage device of claim 7 wherein the instructions further cause the systolic array control circuitry to: transfer corresponding elements of the first plurality of N×M input arrays and the second plurality of M×N input arrays to the plurality of N×N systolic sub-arrays to cause performance of a multiplication operation to provide a respective one of the plurality of N×N results.
11. The non-transitory storage device of claim 10 wherein the instructions further cause the systolic array control circuitry to: sum corresponding elements in each of the plurality of N×N results to provide the one or more N×N output tensors.
12. The non-transitory storage device of claim 7: wherein the instructions that cause the systolic array control circuitry to decompose the systolic array circuitry into a plurality of N×N systolic sub-arrays further cause the systolic array control circuitry to: decompose the systolic array circuitry into a plurality of 2×2 systolic sub-arrays; wherein the instructions that cause the systolic array control circuitry to apportion the first input tensor into the first plurality of N×M input arrays further cause the systolic array control circuitry to: apportion the first input tensor into a first plurality of 2×1 input arrays; and wherein the instructions that cause the systolic array control circuitry to apportion the second input tensor into the second plurality of M×N input arrays further cause the systolic array control circuitry to: apportion the second input tensor into a second plurality of 1×2 input arrays.
13. The non-transitory storage device of claim 10 wherein the instructions further cause the systolic array control circuitry to: cause a transfer of the first input tensor from memory circuitry; and cause a transfer of the second input tensor from the memory circuitry.
14. A method, comprising: decomposing, by systolic array control circuitry, a systolic array circuitry into a plurality of N×N systolic sub-arrays; apportioning, by the systolic array control circuitry, a first input tensor into a first plurality of N×M input arrays and a second input tensor into a second plurality of M×N input arrays; and for each respective one of at least a portion of the plurality of N×N systolic sub-arrays, performing at least one mathematical operation to provide a respective one of a plurality of N×N results using corresponding ones of the N×M input arrays included in the first plurality of N×M input arrays and the M×N input arrays included in the second plurality of M×N input arrays.
15. The method of claim 14, further comprising: combining, by the systolic array control circuitry, the plurality of N×N results to provide one or more N×N output tensors.
16. The method of claim 15, further comprising: transferring, by the systolic array control circuitry, the one or more N×N output tensors to memory circuitry.
17. The method of claim 14 wherein performing the at least one mathematical operation to provide the plurality of N×N results comprises: causing the systolic array circuitry to perform a multiplication operation to provide the plurality of N×N results.
18. The method of claim 17, further comprising: summing corresponding elements in each of the plurality of N×N results to provide one or more N×N output tensors.
19. The method of claim 14: wherein decomposing the systolic array circuitry into the plurality of N×N systolic sub-arrays comprises: decomposing, by the systolic array control circuitry, the systolic array circuitry into a plurality of 2×2 systolic sub-arrays; wherein apportioning the first input tensor into the first plurality of N×M input arrays comprises: apportioning, by the systolic array control circuitry, the first input tensor into a first plurality of 2×1 input arrays; and wherein apportioning the second input tensor into the second plurality of M×N input arrays comprises apportioning, by the systolic array control circuitry, the second input tensor into a second plurality of 1×2 input arrays.
20. The method of claim 14, further comprising: transferring the first input tensor from memory circuitry coupled to the systolic array circuitry; and transferring the second input tensor from the memory circuitry.
21. A system, comprising: means for decomposing a systolic array circuitry into a plurality of N×N systolic sub-arrays; means for apportioning a first input tensor int
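The 2×2 case recited in claims 5, 12, and 19 can be illustrated with a worked identity. This is an illustrative reading, assuming the multiplication and summing steps of claims 4, 11, and 18 are applied to 2×1 and 1×2 input arrays; the symbols $a_{ij}$ and $b_{ij}$ are introduced here and do not appear in the claims. A 2×2 product splits into two outer products, each of which yields a 2×2 result, and summing corresponding elements recovers the full product:

```latex
\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}
\begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}
=
\begin{pmatrix} a_{11} \\ a_{21} \end{pmatrix}
\begin{pmatrix} b_{11} & b_{12} \end{pmatrix}
+
\begin{pmatrix} a_{12} \\ a_{22} \end{pmatrix}
\begin{pmatrix} b_{21} & b_{22} \end{pmatrix}
=
\begin{pmatrix}
a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\
a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22}
\end{pmatrix}
```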
single instruction multiple data [SIMD] multiprocessors · CPC title
Matrix or vector computation {, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization (matrix transposition G06F7/78)} · CPC title
Two dimensional arrays, e.g. mesh, torus · CPC title
Systolic arrays · CPC title
using electronic means · CPC title