Performing matrix multiplication in hardware
US-2018336165-A1 · Nov 22, 2018 · US
US11762803B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11762803-B2 |
| Application number | US-202217659642-A |
| Country | US |
| Kind code | B2 |
| Filing date | Apr 18, 2022 |
| Priority date | Jun 29, 2020 |
| Publication date | Sep 19, 2023 |
| Grant date | Sep 19, 2023 |
Official abstract text for this publication.
Systems and methods are provided to enable parallelized multiply-accumulate operations in a systolic array. Each column of the systolic array can include multiple busses enabling independent transmission of input partial sums along the respective bus. Each processing element of a given columnar bus can receive an input partial sum from a prior element of the given columnar bus, and perform arithmetic operations on the input partial sum. Each processing element can generate an output partial sum based on the arithmetic operations, provide the output partial sum to a next processing element of the given columnar bus, without the output partial sum being processed by a processing element of the column located between the two processing elements that uses a different columnar bus. Use of columnar busses can enable parallelization to increase speed or enable increased latency at individual processing elements.
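The abstract's columnar-bus idea can be illustrated with a small software model. This is a minimal sketch, not the patented hardware: the function name `column_partial_sums`, the round-robin assignment of rows to buses (`row % num_buses`), and the final summation of the per-bus results are all assumptions made for illustration. The abstract states only that each processing element adds its product to the partial sum received from the prior element on the same bus, and that elements on other buses are skipped.

```python
def column_partial_sums(inputs, weights, num_buses):
    """Model one column of a systolic array with several columnar buses.

    Each row holds one processing element (PE). A PE multiplies its input
    data element by its weight and adds the product to the partial sum
    arriving on its own bus. PEs on other buses never touch that sum.

    ASSUMPTION: rows are assigned to buses round-robin (row % num_buses).
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    if num_buses < 1:
        raise ValueError("num_buses must be at least 1")

    # One running partial sum per bus; every chain starts at zero.
    partial_sums = [0] * num_buses
    for row, (x, w) in enumerate(zip(inputs, weights)):
        bus = row % num_buses  # hypothetical interleaving of rows
        # Multiply-accumulate: output partial sum = input partial sum + x * w
        partial_sums[bus] = partial_sums[bus] + x * w
    return partial_sums


# Four rows split across two buses: rows 0 and 2 share bus 0,
# rows 1 and 3 share bus 1, so each chain is two PEs long instead of four.
per_bus = column_partial_sums([1, 2, 3, 4], [5, 6, 7, 8], num_buses=2)

# ASSUMPTION: the per-bus results are added together afterwards to
# recover the full dot product for the column.
column_total = sum(per_bus)
```

Because the two chains do not depend on each other, they can advance in parallel. In this model each chain has half as many sequential additions as a single-bus column, which is the source of the speed-up (or of the extra latency budget per processing element) that the abstract describes.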
Opening claim text (preview).
What is claimed is:
1. A systolic processor comprising: a systolic array of processing elements arranged in rows and columns, each processing element of the processing elements comprising: a multiplier configured to multiply an input data element by a weight to generate a multiplier product, and an adder configured to generate an output partial sum by adding the multiplier product and an input partial sum; wherein the columns comprise a first and second column that each comprise a plurality of processing elements, the second column located within the systolic array subsequent to the first column; wherein the first column is coupled to a first row-oriented bus, and the second column is coupled to a second row-oriented bus; wherein the first column is an initial column of the first row-oriented bus and the second column is an initial column of the second row-oriented bus; wherein the systolic array comprises a first set of delay registers associated with the first row-oriented bus and a second set of delay registers associated with the second row-oriented bus, and wherein the first set of delay registers and the second set of delay registers include the same number of delay registers, wherein a final delay register, within each of the first and second sets of delay registers, is located within the systolic array prior to a final column of the row-oriented bus to which the respective set of delay registers is associated; and wherein a first processing element of the first column is configured to receive a first input data element and a first weight from the first set of delay registers and a second processing element of the second column is configured to receive a second input data element and a second weight from the second set of delay registers.
2. The systolic processor of claim 1, wherein the systolic array of processing elements is divided into a first sub-array of processing elements and a second sub-array of processing elements, each sub-array of processing elements including one or more consecutive columns of the systolic array, wherein the first sub-array of processing elements includes the first column and the second sub-array of processing elements include the second column.
3. The systolic processor of claim 1, wherein the first set of delay registers comprises a first set of pipelining registers and the second set of delay registers comprises a second set of pipelining registers.
4. The systolic processor of claim 1, wherein each processing element of the processing elements has a latency of a given number, n, clock cycles, wherein each of the first set of delay registers and the second set of delay registers stores respective data for 2n clock cycles.
5. The systolic processor of claim 1, wherein each processing element of the processing elements has a latency of a given number, n, clock cycles, wherein each of the first set of delay registers and the second set of delay registers comprises 2n delay registers.
6. A systolic circuit comprising: a systolic array comprising a plurality of processing elements arranged into rows and columns, each row of the systolic array including a plurality of row-oriented buses, each row-oriented bus configured to pass data through one or more delay registers to a set of processing elements for use in performance of mathematical operations by the set of processing elements corresponding to the row-oriented bus, wherein a final delay register, within the one or more delay registers, is located within the systolic array prior to a final column of the row-oriented bus to which the one or more delay registers is associated, wherein a final column of a second row-oriented bus, within the plurality of row-oriented buses is located in the systolic array subsequent to a final column of a first row-oriented bus within the plurality of row-oriented buses, and wherein an initial column of the second row-oriented bus is located, in the systolic array, subsequent to an initial column of the first row-oriented bus, each processing element within an individual column of the systolic array configured to: receive a respective weight and a respective input data element, perform one or more operations on the weight and the input data element, and provide the weight and the input data element to a processing element of a subsequent column corresponding to the row-oriented bus of the individual column.
7. The systolic circuit of claim 6, wherein each column of the columns includes a plurality of columnar buses.
8. The systolic circuit of claim 7, wherein each of the plurality of columnar buses is implemented as an additional set of processing elements in non-consecutive rows.
9. The systolic circuit of claim 6, wherein: each processing element of the plurality of processing elements comprises: a multiplier configured to calculate a product, and an adder configured to add the product and an input partial sum and generate an output partial sum, wherein the one or more operations comprises at least a multiplication operation by the multiplier and an addition operation by the adder.
10. The systolic circuit of claim 6, wherein each row of the rows is configured to receive a plurality of weights or a plurality of input data elements.
11. The systolic circuit of claim 6, wherein a first processing element of each row-oriented bus is configured to receive respective data during a first time period based on the one or more delay registers.
12. The systolic circuit of claim 6, wherein a first processing element of each row-oriented bus is configured to receive a respective input data element and a respective weight during a first time period based on the one or more delay registers.
13. The systolic circuit of claim 6, wherein each of the one or more delay registers is configured to store respective data for a systolic interval.
14. The systolic circuit of claim 13, wherein the systolic interval corresponds to a latency of one or more processing elements of the plurality of processing elements.
15. The systolic circuit of claim 14, wherein the systolic interval is based on a number of processing elements of the set of processing elements.
16. The systolic circuit of claim 6, wherein each processing element of the plurality of processing elements has a latency of a first number, n, clock cycles and the set of processing elements comprises a second number, m, processing elements, each of the one or more delay registers may store respective data for n multiplied by m clock cycles.
17. The systolic circuit of claim 6, wherein each processing element of the set of processing elements are: adjacent to at least one processing element of the set of processing elements, or separated from each other processing element of the set of processing elements by at least one processing element.
18. The systolic circuit of claim 6, wherein the one or more delay registers are distributed within the systolic array based on a distance that respective data can travel during a clock cycle.
19. A method comprising: receiving first data corresponding to a first row-oriented bus of a systolic array and second data corresponding to a second row-oriented bus of a systolic array; passing the first data through one or more first delay registers, wherein a final delay register, within the one or more first delay registers, is located within the systolic array prior to a final column of the first row-oriented bus; passing the second data through one or more second delay registers, wherein a final delay register, within t
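The timing relationship recited in claim 16 is simple arithmetic and can be written out as follows. This is a minimal sketch: the function name `delay_hold_cycles` and its parameter names are hypothetical, and the remark that m = 2 reproduces the 2n figure of claim 4 is a numerical observation, not a relationship the claims themselves state.

```python
def delay_hold_cycles(pe_latency_n, pes_per_bus_m):
    """Clock cycles a delay register holds its data, per claim 16.

    pe_latency_n:  latency of one processing element, in clock cycles (n)
    pes_per_bus_m: number of processing elements served by the
                   row-oriented bus (m)
    """
    if pe_latency_n < 1 or pes_per_bus_m < 1:
        raise ValueError("n and m must be positive integers")
    return pe_latency_n * pes_per_bus_m


# Example values chosen for illustration only.
hold_for_claim_16 = delay_hold_cycles(pe_latency_n=3, pes_per_bus_m=4)

# OBSERVATION (an assumption, not claim language): with m = 2 the result
# equals the 2n clock cycles recited in claim 4.
hold_with_two_pes = delay_hold_cycles(pe_latency_n=3, pes_per_bus_m=2)
```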
Sum of products (for applications thereof, see the relevant places, e.g. G06F17/10, H03H17/00) · CPC title
Systolic arrays · CPC title
Neural networks · CPC title
in parallel-parallel fashion, i.e. both operands being entered in parallel (G06F7/533 takes precedence) · CPC title
Arithmetic instructions · CPC title