Neural network processing system having multiple processors and a neural network accelerator
US-2019114534-A1 · Apr 18, 2019 · US
US11354133B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11354133-B2 |
| Application number | US-201916663210-A |
| Country | US |
| Kind code | B2 |
| Filing date | Oct 24, 2019 |
| Priority date | Aug 31, 2017 |
| Publication date | Jun 7, 2022 |
| Grant date | Jun 7, 2022 |
Official abstract text for this publication.
A matrix-multiplying-vector operation method and a processing device for performing the same are provided. The matrix-multiplying-vector method includes distributing, by a main processing circuit, basic data blocks of the matrix and broadcasting the vector to a plurality of basic processing circuits. That way, the basic processing circuits can perform inner-product operations between the basic data blocks and the broadcast vector in parallel. The results are then provided back to the main processing circuit for combining. The technical solutions proposed by the present disclosure provide short operation time and low energy consumption.
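The data flow described in the abstract can be illustrated with a minimal software sketch. This is not the patented hardware: the function names (`split_into_row_blocks`, `basic_circuit`, `matrix_times_vector`), the choice of consecutive rows as basic data blocks, the round-robin distribution, and the use of threads to stand in for parallel basic processing circuits are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_row_blocks(matrix, rows_per_block):
    """Divide the matrix into basic data blocks of consecutive rows.

    Row-wise blocking is an assumption; each block remembers its starting row.
    """
    return [
        (start, matrix[start:start + rows_per_block])
        for start in range(0, len(matrix), rows_per_block)
    ]


def basic_circuit(assigned_blocks, vector):
    """Model one basic processing circuit.

    Computes the inner product of every row in its assigned blocks with the
    same broadcast copy of the vector.
    """
    results = []
    for start, block in assigned_blocks:
        partial = [sum(a * b for a, b in zip(row, vector)) for row in block]
        results.append((start, partial))
    return results


def matrix_times_vector(matrix, vector, num_circuits=4, rows_per_block=1):
    """Model the main processing circuit: divide, distribute, broadcast, combine."""
    blocks = split_into_row_blocks(matrix, rows_per_block)

    # Round-robin distribution (an assumption). When there are more blocks
    # than circuits, some circuits receive two or more blocks.
    assignments = [blocks[i::num_circuits] for i in range(num_circuits)]

    # Threads stand in for circuits working in parallel; every worker gets
    # the same vector, which models the broadcast.
    with ThreadPoolExecutor(max_workers=num_circuits) as pool:
        per_circuit = list(pool.map(lambda a: basic_circuit(a, vector), assignments))

    # Combine: place each partial result at the rows its block came from.
    output = [0] * len(matrix)
    for circuit_results in per_circuit:
        for start, partial in circuit_results:
            output[start:start + len(partial)] = partial
    return output


if __name__ == "__main__":
    weights = [[1, 2], [3, 4], [5, 6]]
    sample = [1, 1]
    print(matrix_times_vector(weights, sample, num_circuits=2))
```

With three single-row blocks and two circuits, the first circuit in this sketch handles rows 0 and 2 and the second handles row 1, so the case of more blocks than circuits is exercised. Blocking by columns instead would require the main circuit to sum partial results rather than concatenate them.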
Opening claim text (preview).
What is claimed is:

1. A matrix-multiplying-vector operation method, performed by a processing device comprising a main processing circuit and a plurality of basic processing circuits, the matrix-multiplying-vector operation method comprising: receiving, by the main processing circuit, a matrix, a vector, and a matrix-multiplying-vector operation instruction; dividing, by the main processing circuit, the matrix into a plurality of basic data blocks; distributing, by the main processing circuit, the plurality of basic data blocks to the plurality of basic processing circuits, wherein: each of the plurality of basic data blocks is distributed to one of the plurality of the basic processing circuits and at least two basic processing circuits receive different basic data blocks; and at least two of the plurality of basic data blocks are distributed to a same basic processing circuit at one time when a number of the plurality of basic data blocks is larger than a number of the plurality of basic processing circuits; broadcasting, by the main processing circuit, at least a subset of the vector to the plurality of basic processing circuits simultaneously, wherein each of the plurality of basic processing circuits receives a same copy of the subset of the vector; performing, by each of the plurality of basic processing circuits, one or more inner-product operations on one or more basic data blocks distributed to that basic processing circuit and the same copy of the subset of the vector broadcast to that basic processing circuit to obtain a processing result, wherein the plurality of basic processing circuits perform the respective inner-product operations in parallel; providing, by the plurality of basic processing circuits, the respective processing results to the main processing circuit; and combining, by the main processing circuit, the processing results provided by the plurality of basic processing circuits to obtain a computation result of the matrix-multiplying-vector operation instruction.

2. The matrix-multiplying-vector operation method of claim 1, wherein distributing the plurality of basic data blocks to the plurality of basic processing circuits includes: distributing the plurality of basic data blocks to the plurality of basic processing circuits non-repetitively and in an arbitrary order.

3. The matrix-multiplying-vector operation method of claim 1, wherein distributing, by the main processing circuit, the plurality of basic data blocks to the plurality of basic processing circuits includes: when the number of the plurality of basic data blocks is smaller than or equal to the number of the plurality of basic processing circuits, distributing, by the main processing circuit, each of the plurality of basic data blocks to a separate basic processing circuit.

4. The matrix-multiplying-vector operation method of claim 1, wherein the processing device further includes multiple branch processing circuits configured to connect the main processing circuit to the plurality of basic processing circuits, and the matrix-multiplying-vector operation method further includes: transmitting, by the multiple branch processing circuits, data among the main processing circuit and the plurality of basic processing circuits.

5. The matrix-multiplying-vector operation method of claim 1, wherein the main processing circuit includes at least one of a vector arithmetic unit circuit, an arithmetic logic unit (ALU) circuit, an accumulator circuit, a matrix transposition circuit, a direct memory access (DMA) circuit, or a data rearrangement circuit.

6. The matrix-multiplying-vector operation method of claim 1, wherein each of the plurality of basic processing circuits includes at least one of an inner-product arithmetic unit circuit or an accumulator circuit.

7. The matrix-multiplying-vector operation method of claim 1, wherein the matrix is a weight matrix of a fully connected layer of a neural network, and the vector is input data of a single sample to the fully connected layer.

8. The matrix-multiplying-vector operation method of claim 1, wherein at least one dimension of the matrix has a same size as that of the vector.

9. A processing device comprising a main processing circuit and a plurality of basic processing circuits, wherein: the main processing circuit is configured to: receive a matrix, a vector, and a matrix-multiplying-vector operation instruction; divide the matrix into a plurality of basic data blocks; distribute the plurality of basic data blocks to the plurality of basic processing circuits, wherein: each of the plurality of basic data blocks is distributed to one of the plurality of the basic processing circuits and at least two basic processing circuits receive different basic data blocks; and at least two of the plurality of basic data blocks are distributed to a same basic processing circuit at one time when a number of the plurality of basic data blocks is larger than a number of the plurality of basic processing circuits; and broadcast at least a subset of the vector to the plurality of basic processing circuits simultaneously, wherein each of the plurality of basic processing circuits receives a same copy of the subset of the vector; each of the plurality of basic processing circuits is configured to: perform one or more inner-product operations on one or more basic data blocks distributed to that basic processing circuit and the same copy of the subset of the vector broadcast to that basic processing circuit to obtain a processing result; and provide the processing result to the main processing circuit; wherein the plurality of basic processing circuits are configured to perform the respective inner-product operations in parallel; and the main processing circuit is further configured to combine the processing results provided by the plurality of basic processing circuits to obtain a computation result of the matrix-multiplying-vector operation instruction.

10. The processing device of claim 9, wherein the main processing circuit is configured to distribute the plurality of basic data blocks to the plurality of basic processing circuits non-repetitively and in an arbitrary order.

11. The processing device of claim 9, wherein the main processing circuit is configured to: distribute each of the plurality of basic data blocks to a separate basic processing circuit when the number of the plurality of basic data blocks is smaller than or equal to the number of the plurality of basic processing circuits.

12. The processing device of claim 9, further comprising multiple branch processing circuits configured to connect the main processing circuit to the plurality of basic processing circuits, wherein the multiple branch processing circuits are configured to: transmit data among the main processing circuit and the plurality of basic processing circuits.

13. The processing device of claim 12, wherein each of the multiple branch processing circuit is connected between the main processing circuit and at least one of the plurality of basic processing circuits.

14. The processing device of claim 9, wherein the main processing circuit includes at least one of a vector arithmetic unit circuit, an arithmetic logic unit (ALU) circuit, an accumulator circuit, a matrix transposition circuit, a direct memory access (DMA) circuit, or a data rearrangement circuit.

15. The processing device of claim 9, wherein each of the plurality of basic processing circuits includes at least one of an inner-product arithmetic unit circuit or an accumulator circuit.

16. The processing de
Preprocessing · CPC title
Activation functions · CPC title
Combinations of networks · CPC title
using electronic means · CPC title
Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons · CPC title