Computing convolutions using a neural network processor
US-10438117-B1 · Oct 8, 2019 · US
US11055063B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11055063-B2 |
| Application number | US-201715582420-A |
| Country | US |
| Kind code | B2 |
| Filing date | Apr 28, 2017 |
| Priority date | May 2, 2016 |
| Publication date | Jul 6, 2021 |
| Grant date | Jul 6, 2021 |
Official abstract text for this publication.
A hardware-based programmable deep learning processor (DLP) is proposed, wherein the DLP comprises a plurality of accelerators dedicated to deep learning processing. Specifically, the DLP includes a plurality of tensor engines configured to perform operations for pattern recognition and classification based on a neural network. Each tensor engine includes one or more matrix multiplier (MatrixMul) engines each configured to perform a plurality of dense and/or sparse vector-matrix and matrix-matrix multiplication operations, one or more convolutional network (ConvNet) engines each configured to perform a plurality of efficient convolution operations on sparse or dense matrices, one or more vector floating point units (VectorFPUs) each configured to perform floating point vector operations, and a data engine configured to retrieve and store multi-dimensional data to both on-chip and external memories.
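To make the sparse vector-matrix multiplication mentioned in the abstract concrete, here is a minimal software sketch, not the patented hardware. It assumes a sparse vector stored as a plain Python list and skips every zero entry, so only the matrix rows paired with nonzero entries are touched. The function name `sparse_vec_times_dense` and the `rows_loaded` counter are hypothetical, introduced only for illustration.

```python
def sparse_vec_times_dense(vec, mat):
    """Compute vec @ mat, touching only the rows of mat paired with nonzero vec entries.

    Illustrative assumption: vec is a list of length R, mat is an R x C list of lists.
    Returns the product and the number of matrix rows that had to be read.
    """
    out = [0] * len(mat[0])
    rows_loaded = 0
    for i, v in enumerate(vec):
        if v == 0:
            continue  # a zero entry contributes nothing, so its row is never read
        rows_loaded += 1
        for j, w in enumerate(mat[i]):
            out[j] += v * w
    return out, rows_loaded


if __name__ == "__main__":
    vec = [0, 2, 0, 0, 3]
    mat = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]]
    product, rows_loaded = sparse_vec_times_dense(vec, mat)
    print(product, rows_loaded)
```

In this example only two of the five matrix rows are read, which is the kind of saving in data movement that a sparse multiplier is meant to exploit.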
Opening claim text (preview).
What is claimed is:

1. A hardware-based programmable deep learning processor (DLP), comprising:
   an on-system memory (OSM) and one or more controllers configured to access a plurality of external memory resources via direct memory access (DMA);
   a plurality of programmable tensor engines configured to perform a plurality of operations on input data to generate deep learning processing results for pattern recognition and classification based on a neural network, wherein at least one or more programmable tensor engine of the plurality of tensor engines further comprises a plurality types of hardware engines to accelerate the operations on data at one or more layers of the neural network, wherein the types of hardware engines include:
   one or more matrix multiplier engines configured to perform a plurality of dense and/or sparse vector-matrix and matrix-matrix multiplication operations, and wherein the one or more of the matrix multiplier engines is configured to reduce the number of times the input data and a weight matrix need to be read and/or the output matrix needs to be written at the one or more layers of the neural network, wherein in a matrix-matrix multiplication operation an input matrix associated with the input data is N rows by M columns and the weight matrix associated therewith is M rows by K columns, and wherein a T row by a T column submatrix of the input matrix is multiplied by a T row by a T column submatrix of the weight matrix, and wherein the input matrix is read K/T times and wherein the weight matrix is read N/T times;
   one or more convolutional network engines configured to perform a plurality of convolution operations by applying a function to increase a sparsity of the vectors and/or matrices, and wherein the one or more convolutional network engines is configured to reduce a number of computations on zero values of the vectors and/or matrices;
   one or more vector floating point units configured to perform a vector operation in floating point format;
   a data engine configured to prefetch the input data from the OSM and/or the external memory resources.

2. The processor of claim 1, wherein: the DLP is configured to multiplex the data prefetched from the OSM and/or the external memory resources to the at least one or more programmable tensor engine of the plurality of tensors engines via a crossbar.

3. The processor of claim 2, wherein: the at least one or more programmable tensor engine of the plurality of tensor engines further includes a programmable CPU having its own instruction RAM and data RAM configured to store instructions from a host and the retrieved data from the OSM and/or the external memory resources, respectively.

4. The processor of claim 3, wherein: the DLP is configured to accept a plurality of instructions from the host and submit the instructions to the at least one or more programmable tensor engine of the plurality of tensor engines and their respective components in the DLP via a DLP interface, wherein the instructions are stored in the instruction RAM of the tensor engines.

5. The processor of claim 3, wherein: the DLP is also configured to provide the deep learning processing results by the DLP back to the host via the DLP interface.

6. The processor of claim 1, wherein: the configuration of the neural network is dynamically adjusted based on current deep learning application of the DLP.

7. The processor of claim 1, wherein: the neural network includes a plurality of layers each having a plurality of neurons connecting to neurons on a neighboring layer, wherein data processed progresses from one layer to the next in sequence along a processing pipeline.

8. The processor of claim 7, wherein: the DLP is configured to trim the neural network by pruning the neurons at each layer of the neural network as well as edges connecting the neurons of different layers to create a compact neural network while maintaining accuracy of the neural network to reduce size of the vectors and/or the matrices to be multiplied by the matrix multiplier engines and the data that needs to be read from the memory.

9. The processor of claim 7, wherein: the neural network utilized for convolution operations has three types of layers: one or more convolutional layers, each of which is configured to apply one or more local filters and/or a non-linear activation function to data from the input layer, one or more sub-sampling layers, each of which is configured to aggregate information amongst a set of neighbors of a neuron of the layer; one or more classification layers, each of which is configured to perform a linear or multi-layer perceptron (MLP) operation on the neural network and apply a non-linear activation function to output from the neuron.

10. The processor of claim 9, wherein: one or more kernels are applied to source pixels in an image for image classification, wherein a center element of each kernel is placed over a source pixel to replace the source pixel with a weighted sum of the source pixel and its neighboring pixels.

11. The processor of claim 10, wherein: each kernel is a multi-dimensional matrix having its own values for elements in the matrix, wherein the dimensions represent (x, y, time) coordinates as well as depth of the elements of the kernel.

12. The processor of claim 1, wherein: the DLP is configured to partition each operation for pattern classification among the plurality of tensor engines, wherein the at least one or more programmable tensor engines of the plurality of programmable tensor engines is configured to perform a sub-task of the operation in parallel.

13. The processor of claim 12, wherein: the DLP is configured to replicate a sub-task among multiple tensor engines or move a sub-task from one tensor engine to another for efficient use of compute resources.

14. The processor of claim 1, wherein: each of the vector floating point units is a simplified arithmetic-logic unit (ALU) that handles on vector operations only and does not handle loops, branches, and branch predictions.

15. The processor of claim 1, wherein: the one or more of the matrix multiplier engines is configured to perform one or more of: multiplication between a dense vector or matrix and a dense matrix, multiplication between a sparse vector and a dense matrix, and multiplication between a sparse vector and a sparse matrix, wherein a sparse vector or matrix has more zero elements than nonzero elements, while a dense vector or matrix has more nonzero elements than zero elements.

16. The processor of claim 1, wherein: the one or more of the matrix multiplier engines is configured to reduce data movement associated with multiplication involving a sparse vector, wherein only data that corresponds to non-zero values in the sparse vector is loaded into the memory of the tensor engine upon request.

17. The processor of claim 1, wherein: the at least one or more programmable tensor engines of the plurality of tensor engines is configured to reuse data in the memory across one or more of the convolutional network engines efficiently to reduce data movement for read and/or write to memory during the convolution operations.

18. The processor of claim 17, wherein: the one or more convolutional network engines is configured to keep and repeatedly apply a same kernel on different parts of the input data at the at least one or more layers of the neural network wherein the kernel is loaded into the memory only once during the convolution operations.

19. The processor of claim
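The read counts stated in claim 1 (input matrix read K/T times, weight matrix read N/T times) can be checked with a small software model. The sketch below is an illustration only, not the claimed hardware: it assumes T divides N, M, and K evenly, counts one element read per element of each T x T tile fetched, and uses the hypothetical function name `tiled_matmul`.

```python
def tiled_matmul(a, b, t):
    """Multiply a (N x M) by b (M x K) using T x T tiles and count operand reads.

    Returns (c, input_passes, weight_passes), where each "passes" value is the
    total number of element reads divided by the size of that matrix.
    Illustrative assumption: T divides N, M and K evenly.
    """
    n, m, k = len(a), len(b), len(b[0])
    assert n % t == 0 and m % t == 0 and k % t == 0, "sketch assumes T divides N, M, K"
    c = [[0] * k for _ in range(n)]
    reads_a = reads_b = 0
    for i0 in range(0, n, t):              # row band of the output
        for j0 in range(0, k, t):          # column band of the output
            for p0 in range(0, m, t):      # walk the shared dimension tile by tile
                # Fetch one T x T tile from each operand and count the reads.
                tile_a = [row[p0:p0 + t] for row in a[i0:i0 + t]]
                tile_b = [row[j0:j0 + t] for row in b[p0:p0 + t]]
                reads_a += t * t
                reads_b += t * t
                # Multiply the two tiles and accumulate into the output tile.
                for i in range(t):
                    for j in range(t):
                        acc = 0
                        for p in range(t):
                            acc += tile_a[i][p] * tile_b[p][j]
                        c[i0 + i][j0 + j] += acc
    return c, reads_a / (n * m), reads_b / (m * k)


if __name__ == "__main__":
    n, m, k, t = 4, 6, 8, 2
    a = [[i * m + j for j in range(m)] for i in range(n)]
    b = [[i * k + j for j in range(k)] for i in range(m)]
    _, input_passes, weight_passes = tiled_matmul(a, b, t)
    print(input_passes, weight_passes)  # K/T and N/T for this tiling
```

With N = 4, M = 6, K = 8, and T = 2, the model reads the input matrix 4 times (K/T) and the weight matrix 2 times (N/T). An untiled row-by-column product (T = 1) would read them K and N times respectively, so a larger T means fewer passes over memory.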
Combinations of networks · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Quantised networks; Sparse networks; Compressed networks · CPC title
modifying the architecture, e.g. adding, deleting or silencing nodes or connections · CPC title
Matrix or vector computation {, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization (matrix transposition G06F7/78)} · CPC title