Systems and methods for deep learning processor

US11055063B2 · US · B2

Patent metadata
Publication number: US-11055063-B2
Application number: US-201715582420-A
Country: US
Kind code: B2
Filing date: Apr 28, 2017
Priority date: May 2, 2016
Publication date: Jul 6, 2021
Grant date: Jul 6, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A hardware-based programmable deep learning processor (DLP) is proposed, wherein the DLP comprises with a plurality of accelerators dedicated for deep learning processing. Specifically, the DLP includes a plurality of tensor engines configured to perform operations for pattern recognition and classification based on a neural network. Each tensor engine includes one or more matrix multiplier (MatrixMul) engines each configured to perform a plurality of dense and/or sparse vector-matrix and matrix-matrix multiplication operations, one or more convolutional network (ConvNet) engines each configured to perform a plurality of efficient convolution operations on sparse or dense matrices, one or more vector floating point units (VectorFPUs) each configured to perform floating point vector operations, and a data engine configured to retrieve and store multi-dimensional data to both on-chip and external memories.
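
As an illustration of the sparse vector-matrix multiplication mentioned in the abstract, the following is a minimal software sketch, not the patented hardware design. It assumes a plain list-of-lists matrix and a hypothetical function name, `sparse_vec_times_dense`; the point it shows is that rows of the matrix paired with zero entries of the vector never need to be fetched.

```python
def sparse_vec_times_dense(vec, matrix):
    """Multiply a 1 x M vector by an M x K matrix (list of M rows).

    Hypothetical helper for illustration only: rows of `matrix` that
    line up with zero entries of `vec` are skipped, so they are never
    "loaded". Returns the 1 x K result and the number of rows touched.
    """
    k = len(matrix[0]) if matrix else 0
    out = [0.0] * k
    rows_loaded = 0
    for i, v in enumerate(vec):
        if v == 0:
            continue  # zero entry: no multiply, no row fetch
        rows_loaded += 1
        row = matrix[i]
        for j in range(k):
            out[j] += v * row[j]
    return out, rows_loaded


if __name__ == "__main__":
    vec = [0, 2, 0, 1]                      # two of four entries are zero
    weights = [[1, 2], [3, 4], [5, 6], [7, 8]]
    result, touched = sparse_vec_times_dense(vec, weights)
    print(result, touched)
```

In this example only two of the four weight rows are read, which is the kind of data-movement saving the abstract attributes to sparse operands.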

First claim

Opening claim text (preview).

What is claimed is:

1. A hardware-based programmable deep learning processor (DLP), comprising: an on-system memory (OSM) and one or more controllers configured to access a plurality of external memory resources via direct memory access (DMA); a plurality of programmable tensor engines configured to perform a plurality of operations on input data to generate deep learning processing results for pattern recognition and classification based on a neural network, wherein at least one or more programmable tensor engine of the plurality of tensor engines further comprises a plurality types of hardware engines to accelerate the operations on data at one or more layers of the neural network, wherein the types of hardware engines include: one or more matrix multiplier engines configured to perform a plurality of dense and/or sparse vector-matrix and matrix-matrix multiplication operations, and wherein the one or more of the matrix multiplier engines is configured to reduce the number of times the input data and a weight matrix need to be read and/or the output matrix needs to be written at the one or more layers of the neural network, wherein in a matrix-matrix multiplication operation an input matrix associated with the input data is N rows by M columns and the weight matrix associated therewith is M rows by K columns, and wherein a T row by a T column submatrix of the input matrix is multiplied by a T row by a T column submatrix of the weight matrix, and wherein the input matrix is read K/T times and wherein the weight matrix is read N/T times; one or more convolutional network engines configured to perform a plurality of convolution operations by applying a function to increase a sparsity of the vectors and/or matrices, and wherein the one or more convolutional network engines is configured to reduce a number of computations on zero values of the vectors and/or matrices; one or more vector floating point units configured to perform a vector operation in floating point format; a data engine configured to prefetch the input data from the OSM and/or the external memory resources.

2. The processor of claim 1, wherein: the DLP is configured to multiplex the data prefetched from the OSM and/or the external memory resources to the at least one or more programmable tensor engine of the plurality of tensors engines via a crossbar.

3. The processor of claim 2, wherein: the at least one or more programmable tensor engine of the plurality of tensor engines further includes a programmable CPU having its own instruction RAM and data RAM configured to store instructions from a host and the retrieved data from the OSM and/or the external memory resources, respectively.

4. The processor of claim 3, wherein: the DLP is configured to accept a plurality of instructions from the host and submit the instructions to the at least one or more programmable tensor engine of the plurality of tensor engines and their respective components in the DLP via a DLP interface, wherein the instructions are stored in the instruction RAM of the tensor engines.

5. The processor of claim 3, wherein: the DLP is also configured to provide the deep learning processing results by the DLP back to the host via the DLP interface.

6. The processor of claim 1, wherein: the configuration of the neural network is dynamically adjusted based on current deep learning application of the DLP.

7. The processor of claim 1, wherein: the neural network includes a plurality of layers each having a plurality of neurons connecting to neurons on a neighboring layer, wherein data processed progresses from one layer to the next in sequence along a processing pipeline.

8. The processor of claim 7, wherein: the DLP is configured to trim the neural network by pruning the neurons at each layer of the neural network as well as edges connecting the neurons of different layers to create a compact neural network while maintaining accuracy of the neural network to reduce size of the vectors and/or the matrices to be multiplied by the matrix multiplier engines and the data that needs to be read from the memory.

9. The processor of claim 7, wherein: the neural network utilized for convolution operations has three types of layers: one or more convolutional layers, each of which is configured to apply one or more local filters and/or a non-linear activation function to data from the input layer, one or more sub-sampling layers, each of which is configured to aggregate information amongst a set of neighbors of a neuron of the layer; one or more classification layers, each of which is configured to perform a linear or multi-layer perceptron (MLP) operation on the neural network and apply a non-linear activation function to output from the neuron.

10. The processor of claim 9, wherein: one or more kernels are applied to source pixels in an image for image classification, wherein a center element of each kernel is placed over a source pixel to replace the source pixel with a weighted sum of the source pixel and its neighboring pixels.

11. The processor of claim 10, wherein: each kernel is a multi-dimensional matrix having its own values for elements in the matrix, wherein the dimensions represent (x, y, time) coordinates as well as depth of the elements of the kernel.

12. The processor of claim 1, wherein: the DLP is configured to partition each operation for pattern classification among the plurality of tensor engines, wherein the at least one or more programmable tensor engines of the plurality of programmable tensor engines is configured to perform a sub-task of the operation in parallel.

13. The processor of claim 12, wherein: the DLP is configured to replicate a sub-task among multiple tensor engines or move a sub-task from one tensor engine to another for efficient use of compute resources.

14. The processor of claim 1, wherein: each of the vector floating point units is a simplified arithmetic-logic unit (ALU) that handles on vector operations only and does not handle loops, branches, and branch predictions.

15. The processor of claim 1, wherein: the one or more of the matrix multiplier engines is configured to perform one or more of: multiplication between a dense vector or matrix and a dense matrix, multiplication between a sparse vector and a dense matrix, and multiplication between a sparse vector and a sparse matrix, wherein a sparse vector or matrix has more zero elements than nonzero elements, while a dense vector or matrix has more nonzero elements than zero elements.

16. The processor of claim 1, wherein: the one or more of the matrix multiplier engines is configured to reduce data movement associated with multiplication involving a sparse vector, wherein only data that corresponds to non-zero values in the sparse vector is loaded into the memory of the tensor engine upon request.

17. The processor of claim 1, wherein: the at least one or more programmable tensor engines of the plurality of tensor engines is configured to reuse data in the memory across one or more of the convolutional network engines efficiently to reduce data movement for read and/or write to memory during the convolution operations.

18. The processor of claim 17, wherein: the one or more convolutional network engines is configured to keep and repeatedly apply a same kernel on different parts of the input data at the at least one or more layers of the neural network wherein the kernel is loaded into the memory only once during the convolution operations.

19. The processor of claim
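
The read counts recited in claim 1 (input matrix read K/T times, weight matrix read N/T times) can be checked with a small software model. The sketch below is only one plausible reading of the claim, assuming a straightforward blocked loop order and that N, M, and K are all multiples of T; the function name `tiled_matmul` and its counters are hypothetical and are not taken from the patent.

```python
def tiled_matmul(a, w, t):
    """Blocked product of an N x M input `a` and an M x K weight matrix `w`.

    Illustrative model only: each T x T tile fetch is counted, and the
    totals are converted into "full passes" over each matrix.
    Returns (product, passes_over_a, passes_over_w).
    """
    n, m, k = len(a), len(w), len(w[0])
    assert len(a[0]) == m, "inner dimensions must match"
    assert n % t == 0 and m % t == 0 and k % t == 0, "assumes T divides N, M, K"

    c = [[0.0] * k for _ in range(n)]
    a_tile_loads = 0
    w_tile_loads = 0
    for i0 in range(0, n, t):            # tile row of the output
        for j0 in range(0, k, t):        # tile column of the output
            for m0 in range(0, m, t):    # walk the shared dimension
                a_tile_loads += 1        # fetch tile of a at (i0, m0)
                w_tile_loads += 1        # fetch tile of w at (m0, j0)
                for i in range(i0, i0 + t):
                    for j in range(j0, j0 + t):
                        acc = 0.0
                        for p in range(m0, m0 + t):
                            acc += a[i][p] * w[p][j]
                        c[i][j] += acc

    # One full pass over a matrix equals one load of every tile in it.
    passes_over_a = a_tile_loads // ((n // t) * (m // t))
    passes_over_w = w_tile_loads // ((m // t) * (k // t))
    return c, passes_over_a, passes_over_w


if __name__ == "__main__":
    a = [[1, 2], [3, 4], [5, 6], [7, 8]]            # N=4, M=2
    w = [[1, 0, 2, 0, 1, 1], [0, 1, 0, 2, 1, -1]]   # M=2, K=6
    product, a_passes, w_passes = tiled_matmul(a, w, t=2)
    print(a_passes, w_passes)
```

With N=4, K=6, and T=2, the model reports 3 passes over the input (K/T) and 2 passes over the weights (N/T). By comparison, computing the output one element at a time (T=1 in this model) would read the input K times and the weights N times, so larger tiles reduce memory traffic in proportion to T.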

Assignees

Marvell Asia Pte Ltd

Inventors

Classifications

  • Combinations of networks · CPC title

  • Convolutional networks [CNN, ConvNet] · CPC title

  • Quantised networks; Sparse networks; Compressed networks · CPC title

  • modifying the architecture, e.g. adding, deleting or silencing nodes or connections · CPC title

  • G06F17/16 · Primary

    Matrix or vector computation {, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization (matrix transposition G06F7/78)} · CPC title


Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11055063B2 cover?
A hardware-based programmable deep learning processor (DLP) is proposed, wherein the DLP comprises with a plurality of accelerators dedicated for deep learning processing. Specifically, the DLP includes a plurality of tensor engines configured to perform operations for pattern recognition and classification based on a neural network. Each tensor engine includes one or more matrix multiplier (Ma…
Who is the assignee on this patent?
Marvell Asia Pte Ltd
What technology area does this patent fall under?
Primary CPC classification G06F17/16. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jul 6, 2021 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).