Flexible accelerator for sparse tensors in convolutional neural networks
US-11462003-B2 · Oct 4, 2022 · US
US12430544B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12430544-B2 |
| Application number | US-202117460584-A |
| Country | US |
| Kind code | B2 |
| Filing date | Aug 30, 2021 |
| Priority date | Aug 30, 2021 |
| Publication date | Sep 30, 2025 |
| Grant date | Sep 30, 2025 |
Official abstract text for this publication.
The present invention relates to a method and a system for performing depthwise separable convolution on input data in a convolutional neural network. The invention uses a heterogeneous architecture with multiple MAC arrays, including 1D MAC arrays and 2D MAC arrays, together with Winograd conversion logic to perform the depthwise separable convolution. Depthwise separable convolution uses fewer weight parameters, and thus fewer multiplications, than traditional convolution while obtaining the same computation results.
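The saving in weights and multiplications that the abstract mentions can be illustrated with a small cost count. This is a minimal sketch, not taken from the patent: it assumes the textbook factorization (one k x k depthwise filter per input channel followed by a 1 x 1 pointwise convolution), stride 1, and an output the same size as the input. The function names and the layer sizes are hypothetical.

```python
def standard_conv_cost(h, w, c_in, c_out, k):
    """Weights and multiplications for a standard k x k convolution.

    Assumes stride 1 and an h x w output (illustrative assumption).
    """
    weights = k * k * c_in * c_out
    mults = h * w * weights  # every output pixel uses every weight once
    return weights, mults


def depthwise_separable_cost(h, w, c_in, c_out, k):
    """Weights and multiplications for depthwise (k x k) plus pointwise (1 x 1)."""
    depthwise_weights = k * k * c_in   # one k x k filter per input channel
    pointwise_weights = c_in * c_out   # 1 x 1 convolution mixes the channels
    weights = depthwise_weights + pointwise_weights
    mults = h * w * weights
    return weights, mults


if __name__ == "__main__":
    # Hypothetical layer: 56 x 56 feature map, 64 -> 128 channels, 3 x 3 kernel.
    std_w, std_m = standard_conv_cost(56, 56, 64, 128, 3)
    sep_w, sep_m = depthwise_separable_cost(56, 56, 64, 128, 3)
    print("standard:  ", std_w, "weights,", std_m, "multiplications")
    print("separable: ", sep_w, "weights,", sep_m, "multiplications")
    print("reduction factor:", round(std_m / sep_m, 2))
```

Under these assumptions the reduction factor is 1 / (1/c_out + 1/k^2), so it approaches k^2 as the number of output channels grows.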
Opening claim text (preview).
The invention claimed is:
1. A method for achieving high utilization of a neural network based computation using depthwise-separable convolution, wherein the method comprising: performing a point-wise convolution with a two dimensional MAC array on an input data for generating a first output within a spatial domain, wherein the first output is distributed and stored in a pipeline buffer; performing a depth-wise convolution with a one dimensional MAC array on the first output for generating a second output within a Winograd domain; performing a point-wise convolution with the two dimensional MAC array on the second output for generating a final output within the spatial domain from the Winograd domain; splitting the final output into a plurality of units by a processing unit; stripping one or more units of the plurality of units by the processing unit, wherein stripping the one or more units allow processing depthwise-separable convolution by a single DDR load and a single DDR store limits access to the DDR; and accumulating the one or more units of the plurality of units for computing the depthwise-separable convolution.
2. The method in accordance with claim 1, wherein processing of the first output to the second output from the point-wise convolution to the depth-wise convolution is performed by using a number of buffers.
3. The method in accordance with claim 2, wherein the number of buffers form a pseudo pipeline.
4. The method in accordance with claim 1, wherein the conversion from the spatial domain to the Winograd domain is performed by an adder tree structure.
5. The method in accordance with claim 4, wherein the conversion from the Winograd domain to the spatial domain is performed by an adder tree structure.
6. The method in accordance to claim 4, wherein the adder tree structure supports different kernel sizes.
7. The method in accordance with claim 1, wherein the neural network architecture is a heterogeneous architecture.
8. The method in accordance with claim 1, wherein the depthwise-separable convolution reduces computation complexity and power demand.
9. A heterogeneous architecture for depthwise-separable convolution based neural network computation acceleration, wherein the heterogeneous architecture comprising: a plurality of MAC arrays to perform depthwise-separable convolution, wherein the depthwise-separable convolution, further wherein the plurality of MAC arrays comprising: one or more two-dimensional MAC-arrays for performing a point-wise convolution in a spatial domain, wherein the one or more two-dimensional MAC-arrays performs the point-wise convolution on an input data to generates a first output, wherein the first output is distributed and stored in a pipeline buffer; and one or more one-dimensional MAC-arrays for performing a depthwise convolution in a Winograd domain, wherein the one or more one-dimensional MAC-arrays performs the Winograd convolution on the first output to generate a second output, further wherein the one or more two-dimensional MAC-arrays performs the point-wise convolution on the second output with an adder tree structure to generate a final output; a processing unit, wherein the processing unit comprising: a splitting unit, wherein the splitting unit splits the final output into a plurality of tiles; a stripping unit, wherein the stripping unit strips one or more units of the plurality of tiles, wherein stripping the one or more units allow processing depthwise-separable convolution by a single DDR load and a single DDR store limits access to the DDR; and an accumulator, wherein the accumulator accumulates the one or more units of the plurality of tiles for computing the depthwise-separable convolution.
10. A computer program product comprising a non-transitory computer useable medium having computer program logic for enabling at least one processor in a computer system for performing a high utilization of a neural network based computation using depthwise-separable convolution, said computer program logic comprising: performing a point-wise convolution with a two dimensional MAC array on an input data for generating a first output within a spatial domain, wherein the first output is distributed and stored in a pipeline buffer; performing a depth-wise convolution with a one dimensional MAC array on the first output for generating a second output within a Winograd domain; performing a point-wise convolution with the two dimensional MAC array on the second output for generating a final output within the spatial domain from the Winograd domain; splitting the final output into a plurality of units by a processing unit; stripping one or more units of the plurality of units by the processing unit, wherein stripping the one or more units allow processing depthwise-separable convolution by a single DDR load and a single DDR store limits access to the DDR; and accumulating the one or more units of the plurality of units for computing the depthwise-separable convolution.
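Claims 4 and 5 describe converting between the spatial domain and the Winograd domain with an adder tree. The idea can be illustrated with the standard one-dimensional Winograd F(2,3) algorithm, in which the input and output transforms need only additions and subtractions and the multiplications happen element-wise in between. This is a minimal sketch under that assumption; the claims do not state which tile size or transform the accelerator uses, and the function names are hypothetical.

```python
def winograd_f23(d, g):
    """Two outputs of a 3-tap 1D filter over a 4-sample tile, Winograd F(2,3)."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter transform; it depends only on the kernel, so it can be precomputed.
    u = (g0, (g0 + g1 + g2) / 2, (g0 - g1 + g2) / 2, g2)
    # Input transform into the Winograd domain: additions and subtractions only.
    v = (d0 - d2, d1 + d2, d2 - d1, d1 - d3)
    # Element-wise products in the Winograd domain: 4 multiplications.
    m = [vi * ui for vi, ui in zip(v, u)]
    # Output transform back to the spatial domain: additions and subtractions only.
    return (m[0] + m[1] + m[2], m[1] - m[2] - m[3])


def direct_conv(d, g):
    """Reference: the same two outputs computed directly with 6 multiplications."""
    return tuple(sum(d[i + j] * g[j] for j in range(3)) for i in range(2))


if __name__ == "__main__":
    tile = [1, 2, 3, 4]      # hypothetical input samples
    kernel = [1, 0, -1]      # hypothetical 3-tap filter
    print("winograd:", winograd_f23(tile, kernel))
    print("direct:  ", direct_conv(tile, kernel))
```

Both functions return the same pair of outputs; the Winograd form trades two of the six multiplications for extra additions, which is the kind of work an adder tree can absorb.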
Activation functions · CPC title
modifying the architecture, e.g. adding, deleting or silencing nodes or connections · CPC title
Quantised networks; Sparse networks; Compressed networks · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Energy efficient computing, e.g. low power processors, power management or thermal management · CPC title