Data reuse and efficient processing scheme in executing convolutional neural network
US-2021209442-A1 · Jul 8, 2021 · US
US11663446B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11663446-B2 |
| Application number | US-202016734792-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 6, 2020 |
| Priority date | Jan 6, 2020 |
| Publication date | May 30, 2023 |
| Grant date | May 30, 2023 |
Official abstract text for this publication.
The present disclosure relates to a device for executing a convolutional neural network operation. The device comprises a first memory, a processing array comprising a plurality of processing strings, and a controller. The controller can be configured to fetch one or more batches of data into the first memory, regroup the one or more batches of data into multiple work items, wherein a first work item partially overlaps one or more work items among the multiple work items, and broadcast the multiple work items to the processing array, wherein the first work item is transferred to two or more processing strings of the processing array.
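The regroup-and-broadcast idea in the abstract can be illustrated with a minimal one-dimensional sketch. This is not the patented implementation: the function names `regroup` and `broadcast_and_compute`, the 1-D data layout, the stride parameter, and the modelling of each subset of processing strings as a plain Python loop over one filter are all assumptions made for illustration.

```python
from typing import List, Sequence


def regroup(batch: Sequence[float], item_size: int, stride: int = 1) -> List[List[float]]:
    """Slice fetched data into work items (hypothetical helper).

    With stride < item_size, neighbouring work items share elements,
    which is the partial overlap the abstract describes.
    """
    return [list(batch[i:i + item_size])
            for i in range(0, len(batch) - item_size + 1, stride)]


def broadcast_and_compute(work_items: List[List[float]],
                          filters: List[List[float]]) -> List[List[float]]:
    """Send every work item to one processing string per filter (assumed mapping).

    Each inner loop stands in for a processing string that multiplies
    and accumulates; the same work item is reused once per filter.
    """
    outputs = []
    for weights in filters:            # one subset of strings per filter
        subset_out = []
        for item in work_items:        # one processing string per work item
            acc = 0.0
            for x, w in zip(item, weights):
                acc += x * w           # multiply, then accumulate
            subset_out.append(acc)
        outputs.append(subset_out)
    return outputs


batch = [1.0, 2.0, 3.0, 4.0, 5.0]
items = regroup(batch, item_size=3)
filters = [[1.0, 0.0, -1.0], [1.0, 1.0, 1.0]]
result = broadcast_and_compute(items, filters)
```

Here the work item `[2.0, 3.0, 4.0]` shares elements with both of its neighbours and is consumed by two processing strings, one per filter, so the data is fetched once and reused.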
Opening claim text (preview).
What is claimed is:
1. A device for executing a convolutional neural network operation, comprising: a first memory; a processing array comprising a plurality of processing strings; and a controller configured to: fetch one or more batches of data into the first memory; regroup the fetched one or more batches of data into multiple work items, wherein a first work item partially overlaps one or more work items among the multiple work items; broadcast the multiple work items to the processing array, wherein the first work item is transferred to two or more processing strings of the processing array; and deallocate a portion of the one or more batches of data when the portion of the one or more batches of data is determined not to be used in a predetermined time period.
2. The device of claim 1, wherein the plurality of processing strings are classified into a plurality of subsets and the first work item is transferred to a first processing string in each of the plurality of subsets.
3. The device of claim 2, further comprising a second memory storing a plurality of filters of which number corresponds to a number of the subsets.
4. The device of claim 1, wherein each of the processing strings includes a multiplier and an accumulator.
5. The device of claim 3, wherein each of the processing strings includes a multiplier and an accumulator, and wherein the processing array includes an element-wise operation processor in each of the plurality of subsets.
6. The device of claim 1, wherein the controller is further configured to: traverse the one or more batches of data in the first memory to determine a size of the one or more batches of data covers a predetermined data size corresponding to a size of each of the multiple work items.
7. The device of claim 6, wherein the controller is further configured to: fetch an additional batch of data into the first memory when the size of the one or more batches of data is determined not to cover a predetermined data size corresponding to the size of each of the multiple work items.
8. The device of claim 1, wherein each of the multiple work items has a first data size, the one or more batches of data has a plurality of channels, and each channel has a second data size covering the first data size.
9. A method for executing a convolutional neural network operation, comprising: fetching one or more batches of data in a first memory; regrouping the fetched one or more batches of data into multiple work items, wherein a first work item partially overlaps one or more work items among the multiple work items; broadcasting the multiple work items to a processing array comprising a plurality of processing strings, wherein the first work item is transferred to two or more processing strings of the processing array; and deallocating a portion of the one or more batches of data when the portion of the one or more batches of data is determined not to be used in a predetermined time period.
10. The method of claim 9, wherein the plurality of processing strings are classified into a plurality of subsets and the first work item is transferred to a first processing string in each of the plurality of subsets.
11. The method of claim 10, further comprising: transferring a plurality of filters to the processing array, wherein a number of the plurality of filters corresponds to a number of the plurality of subsets and each of the plurality of filter is transferred to a corresponding subset among the plurality of subsets.
12. The method of claim 9, further comprising: performing a multiplication operation on the first work item in the two or more processing strings in parallel.
13. The method of claim 12, further comprising: performing an addition operation on multiplication results in the two or more processing strings in parallel.
14. The method of claim 9, further comprising: traversing the one or more batches of data in the first memory to determine a size of the one or more batches of data covers a predetermined data size corresponding to a size of each of the multiple work items.
15. The method of claim 14, further comprising: fetching an additional batch of data into the first memory when the size of the one or more batches of data is determined not to cover a predetermined data size corresponding to the size of each of the multiple work items.
16. The method of claim 9, further comprising: generating a plurality of outputs by the plurality of processing strings in parallel.
17. A non-transitory computer readable storage medium storing a set of instructions that are executable by at least one processor of a computing device to cause the computing device to perform a method for executing a convolutional neural network operation, the method comprising: fetching one or more batches of data in a first memory; regrouping the fetched one or more batches of data into multiple work items, wherein a first work item partially overlaps one or more work items among the multiple work items; broadcasting the multiple work items to a processing array comprising a plurality of processing strings, wherein the first work item is transferred to two or more processing strings of the processing array; and deallocating a portion of the one or more batches of data when the portion of the one or more batches of data is determined not to be used in a predetermined time period.
18. The computer readable storage medium of claim 17, wherein the plurality of processing strings are classified into a plurality of subsets and the first work item is transferred to a first processing string in each of the plurality of subsets.
19. The computer readable storage medium of claim 18, wherein the set of instructions that are executable by at least one processor of the computing device to cause the computing device to further perform: transferring a plurality of filters to the processing array, wherein a number of the plurality of filters corresponds to a number of the plurality of subsets and each of the plurality of filter is transferred to a corresponding subset among the plurality of subsets.
20. The computer readable storage medium of claim 17, wherein the set of instructions that are executable by at least one processor of the computing device to cause the computing device to further perform: performing a multiplication operation on the first work item in the two or more processing strings in parallel.
21. The computer readable storage medium of claim 20, wherein the set of instructions that are executable by at least one processor of the computing device to cause the computing device to further perform: performing an addition operation on multiplication results in the two or more processing strings in parallel.
22. The computer readable storage medium of claim 17, wherein the set of instructions that are executable by at least one processor of the computing device to cause the computing device to further perform: traversing the one or more batches of data in the first memory to determine a size of the one or more batches of data covers a predetermined data size corresponding to a size of each of the multiple work items.
23. The computer readable storage medium of claim 22, wherein the set of instructions that are executable by at least one processor of the computing device to cause the computing device to further perform: fetching an additional batch of data into the first memory when the size of the one
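
The controller steps recited in the claims (fetch a batch, check whether the buffered data covers one work item, fetch an additional batch if it does not, and deallocate data that is no longer needed) could be sketched roughly as follows. This is an illustrative sketch only: the name `stream_work_items`, the use of a `deque` to stand in for the first memory, the 1-D layout, and the simplification of "not to be used in a predetermined time period" to "not read by any later work item" are assumptions, not details taken from the patent.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List


def stream_work_items(batches: Iterable[List[float]],
                      item_size: int,
                      stride: int = 1) -> Iterator[List[float]]:
    """Yield overlapping work items while holding only live data (hypothetical helper)."""
    if not 1 <= stride <= item_size:
        raise ValueError("stride must be between 1 and item_size")
    first_memory: Deque[float] = deque()       # stands in for the first memory
    for batch in batches:                      # fetch one batch of data
        first_memory.extend(batch)
        # Coverage check: is there enough buffered data for one work item?
        # If not, the loop falls through and an additional batch is fetched.
        while len(first_memory) >= item_size:
            yield [first_memory[i] for i in range(item_size)]
            # Deallocate the leading elements that no later work item reads
            # (simplified stand-in for the claimed time-period condition).
            for _ in range(stride):
                first_memory.popleft()


items = list(stream_work_items([[1.0, 2.0], [3.0, 4.0], [5.0]], item_size=3))
```

In this run the first batch alone does not cover a work item of size 3, so nothing is emitted until the second batch arrives; after each work item is emitted, the oldest element is dropped, so the buffer never holds more than one batch beyond the current window.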
Quantised networks; Sparse networks; Compressed networks · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Interfaces, programming languages or software development kits, e.g. for simulating neural networks · CPC title
Combinations of networks · CPC title
Architecture, e.g. interconnection topology · CPC title