Efficient hardware architecture for accelerating grouped convolutions

US11544191B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11544191-B2
Application number	US-202016830457-A
Country	US
Kind code	B2
Filing date	Mar 26, 2020
Priority date	Mar 26, 2020
Publication date	Jan 3, 2023
Grant date	Jan 3, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Hardware accelerators for accelerated grouped convolution operations. A first buffer of a hardware accelerator may receive a first row of an input feature map (IFM) from a memory. A first group comprising a plurality of tiles may receive a first row of the IFM. A plurality of processing elements of the first group may compute a portion of a first row of an output feature map (OFM) based on the first row of the IFM and a kernel. A second buffer of the accelerator may receive a third row of the IFM from the memory. A second group comprising a plurality of tiles may receive the third row of the IFM. A plurality of processing elements of the second group may compute a portion of a third row of the OFM based on the third row of the IFM and the kernel as part of a grouped convolution operation.
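
The dataflow in the abstract can be illustrated with a small software model. This is only a sketch under assumptions the abstract does not state: a single channel, stride 1, no padding, and each tile group accumulating one OFM row from consecutive IFM rows. The function names `ofm_row_partial` and `group_compute_ofm_row` are hypothetical and do not come from the patent.

```python
def ofm_row_partial(ifm_row, kernel_row):
    """Sliding multiply-accumulate of one IFM row against one kernel row
    (assumed: stride 1, no padding)."""
    k = len(kernel_row)
    return [
        sum(ifm_row[x + j] * kernel_row[j] for j in range(k))
        for x in range(len(ifm_row) - k + 1)
    ]


def group_compute_ofm_row(ifm, kernel, out_row):
    """Model of one tile group: build OFM row `out_row` by adding the
    partial result contributed by each IFM row it receives."""
    width = len(ifm[0]) - len(kernel[0]) + 1
    acc = [0] * width
    for ky, kernel_row in enumerate(kernel):
        partial = ofm_row_partial(ifm[out_row + ky], kernel_row)
        acc = [a + p for a, p in zip(acc, partial)]
    return acc


# Illustrative data only (not from the patent).
ifm = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
    [17, 18, 19, 20],
]
kernel = [[1, 0], [0, 1]]

# The first group starts from the first IFM row, the second group from the
# third IFM row. In hardware the two groups would run in parallel; here
# they are simply evaluated one after the other.
first_ofm_row = group_compute_ofm_row(ifm, kernel, 0)
third_ofm_row = group_compute_ofm_row(ifm, kernel, 2)
print(first_ofm_row, third_ofm_row)
```

The two calls are independent of each other, which is the property that lets separate tile groups work on different OFM rows at the same time.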

First claim

Opening claim text (preview).

What is claimed is:

1. A method, comprising: receiving, by a first buffer of a hardware accelerator, a first row of an input feature map (IFM) from a memory; receiving, by a first group comprising a first plurality of tiles, a first row of the IFM from the first buffer, wherein each tile of the first group comprises a plurality of processing elements; computing, by the plurality of processing elements of the first plurality of tiles, at least a portion of a first row of an output feature map (OFM) based on the first row of the IFM and a kernel; receiving, by a second buffer, a third row of the IFM from the memory; receiving, by a second group comprising a plurality of tiles, the third row of the IFM from the second buffer, wherein each tile of the second group comprises a plurality of processing elements; and computing, by the plurality of processing elements of the second plurality of tiles, at least a portion of a third row of the OFM based on the third row of the IFM and the kernel, wherein the portions of the first and third rows of the OFM are computed by the hardware accelerator in parallel as part of a grouped convolution operation.

2. The method of claim 1, wherein computing the portion of the first row of the OFM comprises: performing a multiply-accumulate (MAC) operation on the first row of the IFM and the kernel in a first cycle; shifting, in the first buffer, the first row of the IFM to produce a first shifted first row of the IFM; receiving, by the first group, the first shifted first row of the IFM; performing the MAC operation on the first shifted first row of the IFM and the kernel in a second cycle; shifting, in the first buffer, the first shifted first row of the IFM to produce a second shifted first row of the IFM; and performing the MAC operation on the second shifted first row of the IFM and the kernel in a third cycle.

3. The method of claim 2, wherein computing the portion of the third row of the OFM comprises: performing the MAC operation on the third row of the IFM and the kernel in the second cycle; shifting, in the second buffer, the first row of the IFM to produce a first shifted third row of the IFM; receiving, by the second group, the first shifted third row of the IFM; performing the MAC operation on the first shifted third row of the IFM and the kernel in the third cycle; shifting, in the second buffer, the first shifted third row of the IFM to produce a second shifted third row of the IFM; and performing the MAC operation on the second shifted third row of the IFM and the kernel in a fourth cycle.

4. The method of claim 3, further comprising: receiving, by the first buffer, a second row of the IFM from the memory; receiving, by the second buffer, a fourth row of the IFM from the memory; computing at least a second portion of the first row of the OFM based at least in part on the second row of the IFM; and computing at least a portion of a fourth row of the OFM based at least in part on the fourth row of the IFM, wherein the second portion of the first OFM and the portion of the fourth row of the OFM are computed by the hardware accelerator in parallel.

5. The method of claim 1, further comprising: determining a size of the kernel or a stride size of the kernel specified in one or more configuration registers; generating the first group and the second group based at least in part on the size of the kernel or the stride size of the kernel; and organizing the tiles of the first group and the second group into a compute grid based at least in part on the size of the kernel or the stride size of the kernel.

6. The method of claim 1, wherein the hardware accelerator comprises logic for a convolutional neural network, wherein the hardware accelerator is configured to perform grouped convolution operations using the convolutional neural network.

7. The method of claim 1, further comprising: computing each of a plurality of rows of the OFM, wherein the first and third rows of the OFM are of the plurality of rows of the OFM; and storing the plurality of rows of the OFM in the memory.

8. An apparatus, comprising: memory; and a hardware accelerator coupled to the memory, the hardware accelerator comprising logic configured to: receive, by a first buffer of the hardware accelerator, a first row of an input feature map (IFM) from the memory; receive, by a first group comprising a first plurality of tiles, a first row of the IFM from the first buffer, wherein each tile of the first group comprises a plurality of processing elements; compute, by the plurality of processing elements of the first plurality of tiles, at least a portion of a first row of an output feature map (OFM) based on the first row of the IFM and a kernel; receive, by a second buffer, a third row of the IFM from the memory; receive, by a second group comprising a plurality of tiles, the third row of the IFM from the second buffer, wherein each tile of the second group comprises a plurality of processing elements; and compute, by the plurality of processing elements of the second plurality of tiles, at least a portion of a third row of the OFM based on the third row of the IFM and the kernel, wherein the portions of the first and third rows of the OFM are computed by the hardware accelerator in parallel as part of a grouped convolution operation.

9. The apparatus of claim 8, the logic to compute the portion of the first row of the OFM to comprise logic to: perform a multiply-accumulate (MAC) operation on the first row of the IFM and the kernel in a first cycle; shift, in the first buffer, the first row of the IFM to produce a first shifted first row of the IFM; receive, by the first group, the first shifted first row of the IFM; perform the MAC operation on the first shifted first row of the IFM and the kernel in a second cycle; shift, in the first buffer, the first shifted first row of the IFM to produce a second shifted first row of the IFM; and perform the MAC operation on the second shifted first row of the IFM and the kernel in a third cycle.

10. The apparatus of claim 9, the logic to compute the portion of the third row of the OFM to comprise logic to: perform the MAC operation on the third row of the IFM and the kernel in the second cycle; shift, in the second buffer, the first row of the IFM to produce a first shifted third row of the IFM; receive, by the second group, the first shifted third row of the IFM; perform the MAC operation on the first shifted third row of the IFM and the kernel in the third cycle; shift, in the second buffer, the first shifted third row of the IFM to produce a second shifted third row of the IFM; and perform the MAC operation on the second shifted third row of the IFM and the kernel in a fourth cycle.

11. The apparatus of claim 10, the hardware accelerator comprising logic configured to: receive, by the first buffer, a second row of the IFM from the memory; receive, by the second buffer, a fourth row of the IFM from the memory; compute at least a second portion of the first row of the OFM based at least in part on the second row of the IFM; and compute at least a portion of a fourth row of the OFM based at least in part on the fourth row of the IFM, wherein the second portion of the first OFM and the portion of the fourth row of the OFM are computed by the hardware accelerator in parallel.

12. The apparatus of claim 8, the hardware accelerator comprising logic configured to: determine a size of the kernel or a stride size of the kernel specified in one or more configuration registers; generate the first group and the second group based at least in part on the size
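
The shift-and-MAC sequence recited in claims 2 and 3 can be modelled in a few lines of software. This is a rough sketch, not the claimed hardware: it assumes that each cycle applies one kernel weight to every output position and that the buffer shifts by one element between cycles, neither of which the claim text spells out. The function name `shift_mac_row` and the `start_cycle` parameter are hypothetical.

```python
from collections import deque


def shift_mac_row(ifm_row, kernel_row, start_cycle=1):
    """Accumulate one partial OFM row by alternating MAC and shift steps.

    Returns the accumulators and the cycle number used by each MAC step.
    Assumption: one kernel weight is applied per cycle.
    """
    buf = deque(ifm_row)                       # models the row buffer
    n_out = len(ifm_row) - len(kernel_row) + 1
    acc = [0] * n_out                          # one accumulator per output
    cycles = []
    for step, weight in enumerate(kernel_row):
        if step > 0:
            buf.rotate(-1)                     # shift the row left by one
        for x in range(n_out):
            acc[x] += buf[x] * weight          # MAC at every output position
        cycles.append(start_cycle + step)
    return acc, cycles


# Illustrative data only (not from the patent).
row_a = [1, 2, 3, 4, 5]
row_b = [5, 4, 3, 2, 1]
weights = [1, 10, 100]

# Claim 2: the first group performs its MACs in cycles 1, 2 and 3.
acc_a, cycles_a = shift_mac_row(row_a, weights, start_cycle=1)
# Claim 3: the second group runs one cycle later, in cycles 2, 3 and 4.
acc_b, cycles_b = shift_mac_row(row_b, weights, start_cycle=2)
print(acc_a, cycles_a)
print(acc_b, cycles_b)
```

With a three-element kernel row, each group needs three MAC cycles, and the one-cycle offset between the groups matches the second-through-fourth-cycle timing recited for the second group.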

Assignees

Intel Corp

Inventors

Classifications

G06N3/045
Combinations of networks · CPC title
G06F2212/45
Caching of specific data in cache memory · CPC title
G06F12/0811 (Primary)
with multilevel cache hierarchies · CPC title
G06F7/5443
Sum of products (for applications thereof, see the relevant places, e.g. G06F17/10, H03H17/00) · CPC title
G06F2212/1016
Performance improvement · CPC title

Patent family

Related publications grouped by family.

View patent family 71608924

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11544191B2 cover?: Hardware accelerators for accelerated grouped convolution operations. A first buffer of a hardware accelerator may receive a first row of an input feature map (IFM) from a memory. A first group comprising a plurality of tiles may receive a first row of the IFM. A plurality of processing elements of the first group may compute a portion of a first row of an output feature map (OFM) based on the …
Who is the assignee on this patent?: Intel Corp
What technology area does this patent fall under?: Primary CPC classification G06F12/0811. Mapped technology areas include Physics.
When was this patent published?: Publication date Jan 3, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Related patents

Methods and apparatus to tile walk a tensor for convolution operations

Inner product convolutional neural network accelerator

Efficient memory layout for enabling smart data compression in machine learning environments

Layer-based operations scheduling to optimise memory for CNN applications