Architecture of crossbar of inference engine
US-2019244118-A1 · Aug 8, 2019 · US
US11462003B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11462003-B2 |
| Application number | US-202016830167-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 25, 2020 |
| Priority date | Mar 25, 2020 |
| Publication date | Oct 4, 2022 |
| Grant date | Oct 4, 2022 |
Official abstract text for this publication.
A system with a multiplication circuit having a plurality of multipliers is disclosed. Each of the plurality of multipliers is configured to receive a data value and a weight value to generate a product value in a convolution operation of a machine learning application. The system also includes an accumulator configured to receive the product value from each of the plurality of multipliers and a register bank configured to store an output of the convolution operation. The accumulator is further configured to receive a portion of values stored in the register bank and combine the received portion of values with the product values to generate combined values. The register bank is further configured to replace the portion of values with the combined values.
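The read-modify-write loop described in the abstract can be illustrated in software. The following is a minimal sketch, not the patented hardware: the function name `mac_step`, the `start` offset used to select the "portion of values", and the use of plain Python lists for the register bank are all assumptions made for illustration.

```python
from typing import List


def mac_step(data: List[int], weights: List[int],
             register_bank: List[int], start: int) -> List[int]:
    """One multiply-accumulate step (hypothetical helper, for illustration).

    Each (data, weight) pair models one multiplier. The accumulator reads a
    portion of the register bank, adds the products, and the register bank
    replaces that portion with the combined values.
    """
    if len(data) != len(weights):
        raise ValueError("each multiplier needs one data value and one weight value")

    # Multiplication circuit: one product per multiplier.
    products = [d * w for d, w in zip(data, weights)]

    # Accumulator receives a portion of the values stored in the register bank.
    portion = register_bank[start:start + len(products)]
    if len(portion) != len(products):
        raise ValueError("selected portion does not fit inside the register bank")

    # Combine the received portion with the product values.
    combined = [r + p for r, p in zip(portion, products)]

    # Register bank replaces the portion with the combined values.
    register_bank[start:start + len(products)] = combined
    return combined


bank = [0, 0, 0, 0]
mac_step([1, 2], [3, 3], bank, start=1)   # bank is now [0, 3, 6, 0]
mac_step([1, 1], [2, 2], bank, start=1)   # bank is now [0, 5, 8, 0]
```

Repeating the step over the same portion accumulates partial sums in place, which is why the register bank can serve as both the working storage and the final output of the convolution.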
Opening claim text (preview).
What is claimed is:

1. A system comprising: a multiplication circuit comprising a plurality of multipliers, each of the plurality of multipliers configured to receive a data value and a weight value to generate a product value in a convolution operation of a machine learning application, the data value is part of one of a plurality of sub-feature maps that are generated from an input feature map; an accumulator configured to receive the product value from each of the plurality of multipliers; and a register bank configured to store an output of the convolution operation, wherein the accumulator is further configured to receive a portion of values stored in the register bank and combine the received portion of values with the product values to generate combined values; and wherein the register bank is further configured to replace the portion of values with the combined values.

2. The system of claim 1, wherein the register bank comprises a plurality of row registers configured to shift in a row direction and a plurality of column registers configured to shift in a column direction.

3. The system of claim 1, wherein combining the received portion of values with the product values comprises adding each of the received portion of values with a corresponding one of the product values.

4. The system of claim 1, further comprising a reconfigurable tree adder configured to receive the product values and combine groups of the product values.

5. The system of claim 1, wherein the register bank is configured to shift a subset of values from a previous iteration by one position and send the shifted subset of values to the accumulator.

6. A system comprising: a multiplication circuit comprising a plurality of multipliers, each of the plurality of multipliers configured to receive a data value and a weight value to generate a product value in a convolution operation of a machine learning application; an accumulator configured to receive the product value from each of the plurality of multipliers; a register bank configured to store an output of the convolution operation, wherein the accumulator is further configured to receive a portion of values stored in the register bank and combine the received portion of values with the product values to generate combined values; and wherein the register bank is further configured to replace the portion of values with the combined values; and a first multi-stage interconnection network configured to receive the combined values from the accumulator.

7. The system of claim 6, wherein the first multi-stage interconnection network is configured to sort the combined values and write the sorted combined values into a vector accumulator register.

8. The system of claim 7, further comprising a second multi-stage interconnection network configured to read a subset of values from the vector accumulator register and send the subset of values to the accumulator.

9. The system of claim 7, wherein the vector accumulator register is further configured to receive the portion of values from the register bank before sending the portion of values to the accumulator.

10. The system of claim 6, further comprising a third multi-stage interconnection network configured to receive the product values from the plurality of multipliers, and send at least some of the product values to the accumulator based on an index value of each of the product values.

11. A method comprising: inputting, by a processor in a machine learning application, a data value and a weight value into each of a plurality of multipliers to generate a plurality of product values in each iteration of a plurality of iterations of a convolution operation; combining, by the processor in each iteration of the plurality of iterations, each of the plurality of product values with one of a plurality of accumulator values in an accumulator to generate a plurality of combined values, wherein the plurality of accumulator values of a current iteration are received from a register bank and are obtained by shifting a subset of values in the register bank after a previous iteration by one position; and replacing, by the processor in each iteration of the plurality of iterations, the plurality of accumulator values with the plurality of combined values in the register bank.

12. The method of claim 11, wherein values in the register bank after a last iteration of the plurality of iterations provide an output of the convolution operation on an input sub-feature map generated from an input feature map.

13. The method of claim 11, wherein each of the plurality of multipliers receive a same weight value.

14. The method of claim 11, wherein at least one of the plurality of multipliers receive the weight value that is different from the weight value received by a remaining one of the plurality of multipliers.

15. The method of claim 11, further comprising receiving the combined values from the accumulator in a first multi-stage interconnection network.

16. The method of claim 11, further comprising shifting, by the processor, values in the register bank after a last iteration of the plurality of iterations to obtain an output sub-feature map.

17. A non-transitory computer-readable media comprising computer-readable instructions stored thereon that when executed by a processor associated with a machine learning application cause the processor to: partition an input feature map into a plurality of sub-feature maps; input each of the plurality of sub-feature maps into a tensor compute unit of a plurality of tensor compute units to generate an output sub-feature map, wherein generating the output sub-feature map for a first sub-feature map of the plurality of sub-feature maps comprises: inputting a plurality of data values of the first sub-feature map into a plurality of multipliers of a first tensor compute unit of the plurality of tensor compute units; inputting a weight value into the plurality of multipliers for generating a plurality of product values; combining each of the plurality of product values with one of a previously computed product value to obtain a plurality of combined values; and shifting the plurality of combined values to obtain the output sub-feature map for the first sub-feature map; and combine the output sub-feature map from each of the plurality of tensor compute units to obtain an output feature map.

18. The non-transitory computer-readable media of claim 17, further comprising performing a non-linear Rectified Linear Unit operation and a pooling operation on the shifted plurality of combined values to obtain the output sub-feature map.

19. The non-transitory computer-readable media of claim 17, further comprising compressing the output sub-feature map before combining to obtain the output feature map.

20. The non-transitory computer-readable media of claim 18, wherein each of the plurality of data values that are input into the plurality of multipliers is a non-zero value, and wherein the weight value is a non-zero value.
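The iteration scheme in claims 11 to 17 (broadcast a weight, multiply, add to accumulator values taken from a register bank shifted by one position, then partition and recombine sub-feature maps) can be sketched in one dimension. This is an illustrative assumption-laden model, not the claimed implementation: the claims concern hardware operating on feature maps, whereas the sketch uses 1-D lists, a single shared weight per iteration (as in claim 13), zero fill for the vacated register, and overlapping tiles of `tile + len(w) - 1` inputs as the partitioning rule. All function names are hypothetical.

```python
from typing import List


def shift_accumulate_conv1d(x: List[int], w: List[int]) -> List[int]:
    """Sliding dot product of x with w, computed by shift-and-accumulate."""
    n, k = len(x), len(w)
    if k == 0 or k > n:
        raise ValueError("kernel must be non-empty and no longer than the input")
    bank = [0] * n  # register bank holding partial sums
    for weight in w:
        # Accumulator values: register bank contents shifted by one position
        # (assumption: the vacated slot is filled with zero).
        shifted = [0] + bank[:-1]
        # Every multiplier receives the same weight in this iteration.
        products = [weight * value for value in x]
        # Combined values replace the accumulator values in the register bank.
        bank = [s + p for s, p in zip(shifted, products)]
    # After the last iteration the valid outputs sit at positions k-1 .. n-1.
    return bank[k - 1:]


def direct_conv1d(x: List[int], w: List[int]) -> List[int]:
    """Reference result: y[i] = sum_j w[j] * x[i + j]."""
    return [sum(w[j] * x[i + j] for j in range(len(w)))
            for i in range(len(x) - len(w) + 1)]


def tiled_conv1d(x: List[int], w: List[int], tile: int) -> List[int]:
    """Partition x into overlapping sub-maps, convolve each, then concatenate."""
    k = len(w)
    out: List[int] = []
    for start in range(0, len(x) - k + 1, tile):
        # Each sub-map produces up to `tile` outputs and needs k-1 extra inputs.
        sub_map = x[start:start + tile + k - 1]
        out.extend(shift_accumulate_conv1d(sub_map, w))
    return out


x = [1, 2, 3, 4, 5]
w = [1, 0, -1]
print(shift_accumulate_conv1d(x, w))  # [-2, -2, -2]
print(tiled_conv1d(x, w, tile=2))     # [-2, -2, -2]
```

In this model each sub-map can be processed independently, which mirrors the role the claims give to separate tensor compute units; the overlap between neighbouring tiles is a choice of this sketch and is not stated in the claims.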
Matrix or vector computation {, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization (matrix transposition G06F7/78)} · CPC title
using specific electronic processors · CPC title
using neural networks · CPC title
Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods · CPC title
using classification, e.g. of video objects · CPC title