US2022405560A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2022405560-A1 |
| Application number | US-202217807082-A |
| Country | US |
| Kind code | A1 |
| Filing date | Jun 15, 2022 |
| Priority date | Jun 17, 2021 |
| Publication date | Dec 22, 2022 |
| Grant date | — |
Official abstract text for this publication.
The present disclosure discloses a processing element and a neural processing device including the processing element. The processing element includes a weight register configured to store a weight, an input activation register configured to store an input activation, a flexible multiplier configured to generate result data by performing a multiplication operation of the weight and the input activation by using a first multiplier of a first precision or using both the first multiplier and a second multiplier of the first precision in response to a calculation mode signal, and a saturating adder configured to generate a partial sum by using the result data.
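The idea of building a wider multiplication out of narrower multipliers, followed by a saturating accumulation, can be illustrated with a minimal Python sketch. It is not the patented circuit: the function names, the 4-bit/8-bit split, and the 32-bit accumulator width are assumptions chosen for illustration. The unsigned low half ranges over 0 to 15, so a signed multiplier needs one extra bit to hold it as a positive value.

```python
def split_int8(x):
    """Split a signed 8-bit value into a signed high half and an unsigned low half."""
    assert -128 <= x <= 127
    low = x & 0xF   # unsigned low 4 bits, 0..15
    high = x >> 4   # arithmetic shift keeps the sign, -8..7
    return high, low


def mul_int8_from_halves(w, a):
    """Multiply two signed 8-bit values using only narrow partial products.

    Hypothetical helper: four narrow products are aligned by shifting
    (weights 256, 16, 16, 1) and then summed.
    """
    wh, wl = split_int8(w)
    ah, al = split_int8(a)
    hh = wh * ah    # high x high, aligned at bit 8
    hl = wh * al    # high x low, aligned at bit 4
    lh = wl * ah    # low x high, aligned at bit 4
    ll = wl * al    # low x low, aligned at bit 0
    return (hh << 8) + ((hl + lh) << 4) + ll


def saturating_add(acc, value, bits=32):
    """Add and clamp to the signed range of an assumed `bits`-wide accumulator."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, acc + value))
```

For any pair of signed 8-bit inputs, `mul_int8_from_halves(w, a)` equals `w * a`, and `saturating_add` holds the partial sum at the range limit instead of wrapping around on overflow.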
Opening claim text (preview).
What is claimed is:
1. A processing element comprising: a weight register configured to store a weight; an input activation register configured to store input activation; a flexible multiplier configured to generate result data by performing a multiplication operation of the weight and the input activation by using a first multiplier of a first precision or using both the first multiplier and a second multiplier of the first precision in response to a calculation mode signal; and a saturating adder configured to generate a partial sum by using the result data.
2. The processing element of claim 1, wherein the flexible multiplier performs a multiplication operation of the weight and the input activation by using the first multiplier when the calculation mode signal is a first mode signal associated with the first precision, and performs the multiplication operation of the weight and the input activation by using both the first multiplier and the second multiplier when the calculation mode signal is a second mode signal associated with a second precision greater than the first precision.
3. The processing element of claim 2, wherein the flexible multiplier comprises an aligner that generates a first aligned partial multiplication group and a second aligned partial multiplication group by aligning digits of the first partial multiplication group generated by the first multiplier and a second partial multiplication group generated by the second multiplier when the calculation mode signal is the second mode signal.
4. The processing element of claim 3, wherein the flexible multiplier comprises a first booth reduction tree configured to calculate the first aligned partial multiplication group, and a second booth reduction tree configured to calculate the second aligned partial multiplication group, and a depth of the first aligned partial multiplication group is greater than a depth of the second aligned partial multiplication group.
5. The processing element of claim 3, wherein the flexible multiplier comprises a first booth reduction tree configured to calculate the first aligned partial multiplication group, and a second booth reduction tree configured to calculate the second aligned partial multiplication group, and a calculable depth of the first booth reduction tree is greater than a calculable depth of the second booth reduction tree.
6. The processing element of claim 3, wherein the flexible multiplier comprises one first booth reduction tree configured to calculate the first aligned partial multiplication group, and a plurality of second booth reduction trees configured to calculate the second aligned partial multiplication group.
7. The processing element of claim 3, wherein, when the weight and the input activation are each 32-bit data, the first precision is INT4, and the second precision is INT8, the flexible multiplier comprises one first booth reduction tree that calculates the first aligned partial multiplication group, and four second booth reduction trees that calculate the second aligned partial multiplication group.
8. The processing element of claim 1, wherein the flexible multiplier comprises a booth reduction tree that generates the result data by using partial multiplication groups generated by the first multiplier and the second multiplier.
9. The processing element of claim 8, wherein the booth reduction tree comprises a depth reducer that reduces depths of the partial multiplication groups, and an adder that performs an addition operation of the partial multiplication groups of which depths are reduced by the depth reducer.
10. The processing element of claim 1, wherein each of the first multiplier and the second multiplier is composed of k multipliers.
11. The processing element of claim 10, wherein k is 8 if the weight and the input activation are each 32-bit data, the first precision is INT4, and the second precision is INT8.
12. The processing element of claim 1, wherein the flexible multiplier comprises an aligner that generates a first aligned partial multiplication group and a second aligned partial multiplication group by using partial multiplication groups generated by the first multiplier and the second multiplier, a first booth reduction tree that calculates the first aligned partial multiplication group, a second booth reduction tree that calculates the second aligned partial multiplication group, and a pre-adder that performs an addition operation on an operation result of the second booth reduction tree, and a calculation result of the first booth reduction tree and a calculation result of the pre-adder are provided to the saturating adder.
13. The processing element of claim 1, wherein the flexible multiplier comprises a bit division logic that generates a first divided weight of the first precision by using the weight and generates a first divided input activation of the first precision by using the input activation.
14. The processing element of claim 13, wherein, when the calculation mode signal is a first mode signal associated with the first precision, the first multiplier generates the result data by using the first divided weight and the first divided input activation.
15. The processing element of claim 13, wherein, when the calculation mode signal is a second mode signal associated with a second precision greater than the first precision, the bit division logic generates a first high-order divided weight and a first low-order divided weight by using the first divided weight, and generates a first high-order divided input activation and a first low-order divided input activation by using the first divided input activation.
16. The processing element of claim 15, wherein the first low-order divided weight and the first low-order divided input activation each comprise an extra bit for having a positive value.
17. A neural processing device comprising: at least one neural core, wherein the neural core comprises a processing unit that performs calculation, and a L0 memory for storing input/output data of the processing unit, the processing unit comprises a PE array including at least one processing element, and the PE array comprises a flexible multiplier that receives a weight and an input activation and generates a plurality of partial multiplication groups by using a first multiplier of a first precision or both the first multiplier and a second multiplier of the first precision in response to a calculation mode signal and generates result data by using the plurality of partial multiplication groups, and a saturating adder that receives the result data and generates a partial sum.
18. The neural processing device of claim 17, wherein the flexible multiplier generates the result data by performing an addition operation of the plurality of partial multiplication groups by using a Booth algorithm.
19. The neural processing device of claim 17, wherein the flexible multiplier groups the plurality of partial multiplication groups into a plurality of aligned partial multiplication groups based on digits thereof, and generates the result data by performing an addition operation on the plurality of aligned partial multiplication groups.
20. A multiplication operation method comprising: receiving a weight, an input activation, and a calculation mode signal; generating a plurality of divided weights by using the weight; generating a plurality of divided input activations by using the input activation; determi
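The lane counts in claims 7, 10, and 11 can be checked with a short Python sketch. A 32-bit word holds eight INT4 values or four INT8 values. If each INT8 product is formed from four 4-bit partial products, INT4 mode needs 8 narrow multiplications and INT8 mode needs 16, which is consistent with k = 8 multipliers in each of the two multiplier groups. The lane ordering (least significant lane first), the function names, and the summing of lane products into one result are assumptions for illustration, not details taken from the claims.

```python
def unpack_lanes(word, lane_bits):
    """Split a 32-bit word into signed lanes, least significant lane first (assumed order)."""
    assert 0 <= word < (1 << 32)
    mask = (1 << lane_bits) - 1
    sign = 1 << (lane_bits - 1)
    lanes = []
    for i in range(32 // lane_bits):
        raw = (word >> (i * lane_bits)) & mask
        # Reinterpret the raw bits as a two's-complement value.
        lanes.append(raw - (1 << lane_bits) if raw & sign else raw)
    return lanes


def narrow_multiplies_needed(lane_bits, base_bits=4):
    """Count base-precision multiplications for one 32-bit weight/activation pair."""
    parts = lane_bits // base_bits          # pieces each operand is divided into
    return (32 // lane_bits) * parts * parts


def sum_of_products(weight_word, act_word, lane_bits):
    """Hypothetical result data: lane-wise products summed into one value."""
    weights = unpack_lanes(weight_word, lane_bits)
    acts = unpack_lanes(act_word, lane_bits)
    return sum(w * x for w, x in zip(weights, acts))


print(narrow_multiplies_needed(4))  # INT4 mode
print(narrow_multiplies_needed(8))  # INT8 mode
```

Passing `lane_bits=4` or `lane_bits=8` plays the role of the calculation mode signal in this sketch.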
Sum of products (for applications thereof, see the relevant places, e.g. G06F17/10, H03H17/00) · CPC title
Reduction of the number of iteration steps or stages, e.g. using the Booth algorithm, log-sum, odd-even · CPC title
Dividing only · CPC title
Adding; Subtracting (G06F7/483 - G06F7/491, G06F7/544 - G06F7/556 take precedence) · CPC title
using electronic means · CPC title