Methods and apparatus to tile walk a tensor for convolution operations
US-2019370631-A1 · Dec 5, 2019 · US
US11544191B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11544191-B2 |
| Application number | US-202016830457-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 26, 2020 |
| Priority date | Mar 26, 2020 |
| Publication date | Jan 3, 2023 |
| Grant date | Jan 3, 2023 |
Official abstract text for this publication.
Hardware accelerators for accelerated grouped convolution operations. A first buffer of a hardware accelerator may receive a first row of an input feature map (IFM) from a memory. A first group comprising a plurality of tiles may receive a first row of the IFM. A plurality of processing elements of the first group may compute a portion of a first row of an output feature map (OFM) based on the first row of the IFM and a kernel. A second buffer of the accelerator may receive a third row of the IFM from the memory. A second group comprising a plurality of tiles may receive the third row of the IFM. A plurality of processing elements of the second group may compute a portion of a third row of the OFM based on the third row of the IFM and the kernel as part of a grouped convolution operation.
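The row-parallel dataflow in the abstract can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the patented hardware: the function names `ofm_row`, `assign_rows`, and `grouped_rows` are hypothetical, the convolution is assumed to be unpadded with stride 1, and the contiguous assignment of OFM rows to groups is only one possible reading of "first row" and "third row". The groups run one after another here, whereas the accelerator computes them in parallel.

```python
def ofm_row(ifm, kernel, r):
    """Compute OFM row r of an unpadded, stride-1 2-D convolution."""
    kh, kw = len(kernel), len(kernel[0])
    out_w = len(ifm[0]) - kw + 1
    row = [0] * out_w
    for kr in range(kh):                  # each IFM row adds a partial sum
        ifm_row = ifm[r + kr]
        for c in range(out_w):
            for kc in range(kw):
                row[c] += ifm_row[c + kc] * kernel[kr][kc]  # multiply-accumulate
    return row


def assign_rows(out_h, num_groups):
    """Assumed policy: split OFM rows into contiguous blocks, one per group."""
    per_group = -(-out_h // num_groups)   # ceiling division
    return [list(range(g * per_group, min((g + 1) * per_group, out_h)))
            for g in range(num_groups)]


def grouped_rows(ifm, kernel, num_groups=2):
    """Compute the whole OFM, with each group producing its own rows."""
    out_h = len(ifm) - len(kernel) + 1
    ofm = [None] * out_h
    for rows in assign_rows(out_h, num_groups):  # sequential stand-in for parallel groups
        for r in rows:
            ofm[r] = ofm_row(ifm, kernel, r)
    return ofm


ifm = [[1] * 5 for _ in range(6)]         # 6x5 input feature map of ones
kernel = [[1] * 3 for _ in range(3)]      # 3x3 kernel of ones
print(assign_rows(4, 2))
print(grouped_rows(ifm, kernel))
```

With a 6x5 IFM and a 3x3 kernel the OFM has four rows, so under the assumed policy the first group starts at OFM row index 0 and the second group at row index 2, which corresponds to the first and third rows in the abstract's one-based numbering.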
Opening claim text (preview).
What is claimed is:
1. A method, comprising: receiving, by a first buffer of a hardware accelerator, a first row of an input feature map (IFM) from a memory; receiving, by a first group comprising a first plurality of tiles, a first row of the IFM from the first buffer, wherein each tile of the first group comprises a plurality of processing elements; computing, by the plurality of processing elements of the first plurality of tiles, at least a portion of a first row of an output feature map (OFM) based on the first row of the IFM and a kernel; receiving, by a second buffer, a third row of the IFM from the memory; receiving, by a second group comprising a plurality of tiles, the third row of the IFM from the second buffer, wherein each tile of the second group comprises a plurality of processing elements; and computing, by the plurality of processing elements of the second plurality of tiles, at least a portion of a third row of the OFM based on the third row of the IFM and the kernel, wherein the portions of the first and third rows of the OFM are computed by the hardware accelerator in parallel as part of a grouped convolution operation.
2. The method of claim 1, wherein computing the portion of the first row of the OFM comprises: performing a multiply-accumulate (MAC) operation on the first row of the IFM and the kernel in a first cycle; shifting, in the first buffer, the first row of the IFM to produce a first shifted first row of the IFM; receiving, by the first group, the first shifted first row of the IFM; performing the MAC operation on the first shifted first row of the IFM and the kernel in a second cycle; shifting, in the first buffer, the first shifted first row of the IFM to produce a second shifted first row of the IFM; and performing the MAC operation on the second shifted first row of the IFM and the kernel in a third cycle.
3. The method of claim 2, wherein computing the portion of the third row of the OFM comprises: performing the MAC operation on the third row of the IFM and the kernel in the second cycle; shifting, in the second buffer, the first row of the IFM to produce a first shifted third row of the IFM; receiving, by the second group, the first shifted third row of the IFM; performing the MAC operation on the first shifted third row of the IFM and the kernel in the third cycle; shifting, in the second buffer, the first shifted third row of the IFM to produce a second shifted third row of the IFM; and performing the MAC operation on the second shifted third row of the IFM and the kernel in a fourth cycle.
4. The method of claim 3, further comprising: receiving, by the first buffer, a second row of the IFM from the memory; receiving, by the second buffer, a fourth row of the IFM from the memory; computing at least a second portion of the first row of the OFM based at least in part on the second row of the IFM; and computing at least a portion of a fourth row of the OFM based at least in part on the fourth row of the IFM, wherein the second portion of the first OFM and the portion of the fourth row of the OFM are computed by the hardware accelerator in parallel.
5. The method of claim 1, further comprising: determining a size of the kernel or a stride size of the kernel specified in one or more configuration registers; generating the first group and the second group based at least in part on the size of the kernel or the stride size of the kernel; and organizing the tiles of the first group and the second group into a compute grid based at least in part on the size of the kernel or the stride size of the kernel.
6. The method of claim 1, wherein the hardware accelerator comprises logic for a convolutional neural network, wherein the hardware accelerator is configured to perform grouped convolution operations using the convolutional neural network.
7. The method of claim 1, further comprising: computing each of a plurality of rows of the OFM, wherein the first and third rows of the OFM are of the plurality of rows of the OFM; and storing the plurality of rows of the OFM in the memory.
8. An apparatus, comprising: memory; and a hardware accelerator coupled to the memory, the hardware accelerator comprising logic configured to: receive, by a first buffer of the hardware accelerator, a first row of an input feature map (IFM) from the memory; receive, by a first group comprising a first plurality of tiles, a first row of the IFM from the first buffer, wherein each tile of the first group comprises a plurality of processing elements; compute, by the plurality of processing elements of the first plurality of tiles, at least a portion of a first row of an output feature map (OFM) based on the first row of the IFM and a kernel; receive, by a second buffer, a third row of the IFM from the memory; receive, by a second group comprising a plurality of tiles, the third row of the IFM from the second buffer, wherein each tile of the second group comprises a plurality of processing elements; and compute, by the plurality of processing elements of the second plurality of tiles, at least a portion of a third row of the OFM based on the third row of the IFM and the kernel, wherein the portions of the first and third rows of the OFM are computed by the hardware accelerator in parallel as part of a grouped convolution operation.
9. The apparatus of claim 8, the logic to compute the portion of the first row of the OFM to comprise logic to: perform a multiply-accumulate (MAC) operation on the first row of the IFM and the kernel in a first cycle; shift, in the first buffer, the first row of the IFM to produce a first shifted first row of the IFM; receive, by the first group, the first shifted first row of the IFM; perform the MAC operation on the first shifted first row of the IFM and the kernel in a second cycle; shift, in the first buffer, the first shifted first row of the IFM to produce a second shifted first row of the IFM; and perform the MAC operation on the second shifted first row of the IFM and the kernel in a third cycle.
10. The apparatus of claim 9, the logic to compute the portion of the third row of the OFM to comprise logic to: perform the MAC operation on the third row of the IFM and the kernel in the second cycle; shift, in the second buffer, the first row of the IFM to produce a first shifted third row of the IFM; receive, by the second group, the first shifted third row of the IFM; perform the MAC operation on the first shifted third row of the IFM and the kernel in the third cycle; shift, in the second buffer, the first shifted third row of the IFM to produce a second shifted third row of the IFM; and perform the MAC operation on the second shifted third row of the IFM and the kernel in a fourth cycle.
11. The apparatus of claim 10, the hardware accelerator comprising logic configured to: receive, by the first buffer, a second row of the IFM from the memory; receive, by the second buffer, a fourth row of the IFM from the memory; compute at least a second portion of the first row of the OFM based at least in part on the second row of the IFM; and compute at least a portion of a fourth row of the OFM based at least in part on the fourth row of the IFM, wherein the second portion of the first OFM and the portion of the fourth row of the OFM are computed by the hardware accelerator in parallel.
12. The apparatus of claim 8, the hardware accelerator comprising logic configured to: determine a size of the kernel or a stride size of the kernel specified in one or more configuration registers; generate the first group and the second group based at least in part on the size
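The shift-and-MAC sequence recited in claims 2 and 3 can be modeled for a single row as follows. This is an illustrative sketch only: the function name `shift_mac_row`, the left shift by one element, the zero fill on the right, and the one-accumulator-per-output layout are all assumptions, since the claims state only that the buffered row is shifted between MAC cycles and that the second group runs one cycle behind the first.

```python
def shift_mac_row(ifm_row, kernel_row, start_cycle=1):
    """Accumulate a 1-D convolution by alternating MAC and buffer-shift steps.

    Returns the accumulated outputs and a list of (cycle, shift_count) pairs.
    """
    out_w = len(ifm_row) - len(kernel_row) + 1
    buf = list(ifm_row)                   # contents of the row buffer
    acc = [0] * out_w                     # assumed: one accumulator per output column
    schedule = []
    for k, weight in enumerate(kernel_row):
        for c in range(out_w):
            acc[c] += buf[c] * weight     # MAC on the current buffer contents
        schedule.append((start_cycle + k, k))
        buf = buf[1:] + [0]               # assumed: shift left by one, zero fill
    return acc, schedule


row = [1, 2, 3, 4, 5]
weights = [1, 0, -1]
acc_first, sched_first = shift_mac_row(row, weights, start_cycle=1)
acc_second, sched_second = shift_mac_row(row, weights, start_cycle=2)
print(acc_first, sched_first)
print(acc_second, sched_second)
```

With a kernel row of width three, the first group issues MACs in cycles 1 to 3 and the second group in cycles 2 to 4, matching the cycle numbering in the claims. Each shift lines up the next kernel weight with the same accumulator, so after the last cycle every accumulator holds the full weighted sum for its output column.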
Combinations of networks · CPC title
Caching of specific data in cache memory · CPC title
with multilevel cache hierarchies · CPC title
Sum of products (for applications thereof, see the relevant places, e.g. G06F17/10, H03H17/00) · CPC title
Performance improvement · CPC title