Efficient hardware architecture for accelerating grouped convolutions

US11544191B2 · US · B2

Patent metadata
Publication number: US-11544191-B2
Application number: US-202016830457-A
Country: US
Kind code: B2
Filing date: Mar 26, 2020
Priority date: Mar 26, 2020
Publication date: Jan 3, 2023
Grant date: Jan 3, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Hardware accelerators for accelerated grouped convolution operations. A first buffer of a hardware accelerator may receive a first row of an input feature map (IFM) from a memory. A first group comprising a plurality of tiles may receive a first row of the IFM. A plurality of processing elements of the first group may compute a portion of a first row of an output feature map (OFM) based on the first row of the IFM and a kernel. A second buffer of the accelerator may receive a third row of the IFM from the memory. A second group comprising a plurality of tiles may receive the third row of the IFM. A plurality of processing elements of the second group may compute a portion of a third row of the OFM based on the third row of the IFM and the kernel as part of a grouped convolution operation.
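
The dataflow in the abstract can be pictured with a small software model. The sketch below is illustrative only and is not taken from the patent: it assumes a single-channel IFM, stride 1, no padding, and that each group accumulates one OFM row from consecutive IFM rows loaded into its buffer one at a time. The names `partial_row`, `group_compute`, and `two_group_step` are hypothetical, and 0-based rows 0 and 2 stand in for the "first" and "third" rows.

```python
# Illustrative model only; function names and the row-to-group mapping are assumptions.

def partial_row(ifm_row, kernel_row):
    """Partial sums that one IFM row contributes to one OFM row (stride 1, no padding)."""
    k = len(kernel_row)
    return [sum(ifm_row[x + j] * kernel_row[j] for j in range(k))
            for x in range(len(ifm_row) - k + 1)]

def group_compute(ifm, kernel, ofm_row_index):
    """One tile group: accumulate a full OFM row from consecutive IFM rows."""
    width = len(ifm[0]) - len(kernel[0]) + 1
    acc = [0] * width
    for kr, kernel_row in enumerate(kernel):
        buffer_row = ifm[ofm_row_index + kr]  # row currently held in this group's buffer
        part = partial_row(buffer_row, kernel_row)
        acc = [a + p for a, p in zip(acc, part)]
    return acc

def two_group_step(ifm, kernel, rows=(0, 2)):
    """Two groups, two OFM rows. Hardware would run these side by side;
    this model simply evaluates them one after the other."""
    return {r: group_compute(ifm, kernel, r) for r in rows}

ifm = [[5 * r + c for c in range(5)] for r in range(5)]  # 5x5 test pattern
kernel = [[1, 1, 1]] * 3                                  # 3x3 box kernel
result = two_group_step(ifm, kernel)
```

Here `result[0]` and `result[2]` hold the two OFM rows that the two groups would produce concurrently; each pass through the loop in `group_compute` corresponds to one "portion" of a row in the abstract's wording.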

First claim

Opening claim text (preview).

What is claimed is:

1. A method, comprising: receiving, by a first buffer of a hardware accelerator, a first row of an input feature map (IFM) from a memory; receiving, by a first group comprising a first plurality of tiles, a first row of the IFM from the first buffer, wherein each tile of the first group comprises a plurality of processing elements; computing, by the plurality of processing elements of the first plurality of tiles, at least a portion of a first row of an output feature map (OFM) based on the first row of the IFM and a kernel; receiving, by a second buffer, a third row of the IFM from the memory; receiving, by a second group comprising a plurality of tiles, the third row of the IFM from the second buffer, wherein each tile of the second group comprises a plurality of processing elements; and computing, by the plurality of processing elements of the second plurality of tiles, at least a portion of a third row of the OFM based on the third row of the IFM and the kernel, wherein the portions of the first and third rows of the OFM are computed by the hardware accelerator in parallel as part of a grouped convolution operation.

2. The method of claim 1, wherein computing the portion of the first row of the OFM comprises: performing a multiply-accumulate (MAC) operation on the first row of the IFM and the kernel in a first cycle; shifting, in the first buffer, the first row of the IFM to produce a first shifted first row of the IFM; receiving, by the first group, the first shifted first row of the IFM; performing the MAC operation on the first shifted first row of the IFM and the kernel in a second cycle; shifting, in the first buffer, the first shifted first row of the IFM to produce a second shifted first row of the IFM; and performing the MAC operation on the second shifted first row of the IFM and the kernel in a third cycle.

3. The method of claim 2, wherein computing the portion of the third row of the OFM comprises: performing the MAC operation on the third row of the IFM and the kernel in the second cycle; shifting, in the second buffer, the first row of the IFM to produce a first shifted third row of the IFM; receiving, by the second group, the first shifted third row of the IFM; performing the MAC operation on the first shifted third row of the IFM and the kernel in the third cycle; shifting, in the second buffer, the first shifted third row of the IFM to produce a second shifted third row of the IFM; and performing the MAC operation on the second shifted third row of the IFM and the kernel in a fourth cycle.

4. The method of claim 3, further comprising: receiving, by the first buffer, a second row of the IFM from the memory; receiving, by the second buffer, a fourth row of the IFM from the memory; computing at least a second portion of the first row of the OFM based at least in part on the second row of the IFM; and computing at least a portion of a fourth row of the OFM based at least in part on the fourth row of the IFM, wherein the second portion of the first OFM and the portion of the fourth row of the OFM are computed by the hardware accelerator in parallel.

5. The method of claim 1, further comprising: determining a size of the kernel or a stride size of the kernel specified in one or more configuration registers; generating the first group and the second group based at least in part on the size of the kernel or the stride size of the kernel; and organizing the tiles of the first group and the second group into a compute grid based at least in part on the size of the kernel or the stride size of the kernel.

6. The method of claim 1, wherein the hardware accelerator comprises logic for a convolutional neural network, wherein the hardware accelerator is configured to perform grouped convolution operations using the convolutional neural network.

7. The method of claim 1, further comprising: computing each of a plurality of rows of the OFM, wherein the first and third rows of the OFM are of the plurality of rows of the OFM; and storing the plurality of rows of the OFM in the memory.

8. An apparatus, comprising: memory; and a hardware accelerator coupled to the memory, the hardware accelerator comprising logic configured to: receive, by a first buffer of the hardware accelerator, a first row of an input feature map (IFM) from the memory; receive, by a first group comprising a first plurality of tiles, a first row of the IFM from the first buffer, wherein each tile of the first group comprises a plurality of processing elements; compute, by the plurality of processing elements of the first plurality of tiles, at least a portion of a first row of an output feature map (OFM) based on the first row of the IFM and a kernel; receive, by a second buffer, a third row of the IFM from the memory; receive, by a second group comprising a plurality of tiles, the third row of the IFM from the second buffer, wherein each tile of the second group comprises a plurality of processing elements; and compute, by the plurality of processing elements of the second plurality of tiles, at least a portion of a third row of the OFM based on the third row of the IFM and the kernel, wherein the portions of the first and third rows of the OFM are computed by the hardware accelerator in parallel as part of a grouped convolution operation.

9. The apparatus of claim 8, the logic to compute the portion of the first row of the OFM to comprise logic to: perform a multiply-accumulate (MAC) operation on the first row of the IFM and the kernel in a first cycle; shift, in the first buffer, the first row of the IFM to produce a first shifted first row of the IFM; receive, by the first group, the first shifted first row of the IFM; perform the MAC operation on the first shifted first row of the IFM and the kernel in a second cycle; shift, in the first buffer, the first shifted first row of the IFM to produce a second shifted first row of the IFM; and perform the MAC operation on the second shifted first row of the IFM and the kernel in a third cycle.

10. The apparatus of claim 9, the logic to compute the portion of the third row of the OFM to comprise logic to: perform the MAC operation on the third row of the IFM and the kernel in the second cycle; shift, in the second buffer, the first row of the IFM to produce a first shifted third row of the IFM; receive, by the second group, the first shifted third row of the IFM; perform the MAC operation on the first shifted third row of the IFM and the kernel in the third cycle; shift, in the second buffer, the first shifted third row of the IFM to produce a second shifted third row of the IFM; and perform the MAC operation on the second shifted third row of the IFM and the kernel in a fourth cycle.

11. The apparatus of claim 10, the hardware accelerator comprising logic configured to: receive, by the first buffer, a second row of the IFM from the memory; receive, by the second buffer, a fourth row of the IFM from the memory; compute at least a second portion of the first row of the OFM based at least in part on the second row of the IFM; and compute at least a portion of a fourth row of the OFM based at least in part on the fourth row of the IFM, wherein the second portion of the first OFM and the portion of the fourth row of the OFM are computed by the hardware accelerator in parallel.

12. The apparatus of claim 8, the hardware accelerator comprising logic configured to: determine a size of the kernel or a stride size of the kernel specified in one or more configuration registers; generate the first group and the second group based at least in part on the size
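
The shift-and-MAC schedule recited in claims 2, 3, 9, and 10 can be sketched as a short simulation. This is a hedged illustration rather than the patented implementation: it assumes one processing element per buffer position, a left shift with zero fill, and one kernel weight applied per cycle, none of which the claim text above specifies. The helper names `shift_left` and `shift_mac` are hypothetical.

```python
# Illustrative model only; shift direction, zero fill, and weight order are assumptions.

def shift_left(row):
    """Shift the buffered row by one element, filling the tail with zero."""
    return row[1:] + [0]

def shift_mac(ifm_row, kernel_row, start_cycle):
    """Run MAC, shift, MAC, shift, MAC for one group.

    Returns the valid accumulator values and the cycles in which a MAC occurred.
    """
    buffer = list(ifm_row)
    acc = [0] * len(ifm_row)
    schedule = []
    for step, weight in enumerate(kernel_row):
        if step > 0:
            buffer = shift_left(buffer)  # shift happens in the buffer between MACs
        acc = [a + b * weight for a, b in zip(acc, buffer)]
        schedule.append(start_cycle + step)
    valid = len(ifm_row) - len(kernel_row) + 1  # positions with a full kernel window
    return acc[:valid], schedule

# First group starts in cycle 1; second group starts one cycle later, as in claim 3.
first_out, first_cycles = shift_mac([1, 2, 3, 4, 5], [1, 0, -1], start_cycle=1)
third_out, third_cycles = shift_mac([2, 4, 6, 8, 10], [1, 0, -1], start_cycle=2)
```

With a three-element kernel row, the first group is busy in cycles 1 to 3 and the second in cycles 2 to 4, so the two groups overlap in cycles 2 and 3, which matches the one-cycle offset described in the claims.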

Assignees

Intel Corp

Inventors

Classifications

  • Combinations of networks · CPC title

  • Caching of specific data in cache memory · CPC title

  • with multilevel cache hierarchies · CPC title

  • Sum of products (for applications thereof, see the relevant places, e.g. G06F17/10, H03H17/00) · CPC title

  • Performance improvement · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11544191B2 cover?
Hardware accelerators for accelerated grouped convolution operations. A first buffer of a hardware accelerator may receive a first row of an input feature map (IFM) from a memory. A first group comprising a plurality of tiles may receive a first row of the IFM. A plurality of processing elements of the first group may compute a portion of a first row of an output feature map (OFM) based on the …
Who is the assignee on this patent?
Intel Corp
What technology area does this patent fall under?
Primary CPC classification G06F12/0811. Mapped technology areas include Physics.
When was this patent published?
Publication date Jan 3, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).