Systems and methods for deep learning processor
US-11055063-B2 · Jul 6, 2021 · US
US2019392297A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2019392297-A1 |
| Application number | US-201716474029-A |
| Country | US |
| Kind code | A1 |
| Filing date | Dec 28, 2017 |
| Priority date | Dec 30, 2016 |
| Publication date | Dec 26, 2019 |
| Grant date | — |
Official abstract text for this publication.
A network of matrix processing units (MPUs) is provided on a device, where each MPU is connected to at least one other MPU in the network, and each MPU is to perform matrix multiplication operations. Computer memory stores tensor data and a master control central processing unit (MCC) is provided on the device to receive an instruction from a host device, where the instruction includes one or more tensor operands based on the tensor data. The MCC invokes a set of operations on one or more of the MPUs based on the instruction, where the set of operations includes operations on the tensor operands. A result is generated from the set of operations, the result embodied as a tensor value.
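The control flow described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented design: the class names `MCC`, `MPU`, and `Instruction`, the `"matmul"` opcode, the dictionary standing in for memory, and the row-wise partitioning scheme are all hypothetical names and choices introduced here, and the MPUs run one after another in software rather than as parallel hardware.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain triple-loop matrix product, standing in for MPU circuitry."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


@dataclass
class Instruction:
    """Hypothetical host instruction: an opcode plus names of tensor operands."""
    op: str
    operands: Tuple[str, str]


class MPU:
    """Hypothetical matrix processing unit that multiplies a block by a matrix."""

    def __init__(self, mpu_id: int) -> None:
        self.mpu_id = mpu_id

    def multiply(self, block: Matrix, b: Matrix) -> Matrix:
        return matmul(block, b)


class MCC:
    """Hypothetical master controller: decodes an instruction and dispatches work."""

    def __init__(self, mpus: List[MPU], memory: Dict[str, Matrix]) -> None:
        self.mpus = mpus
        self.memory = memory

    def execute(self, instr: Instruction) -> Matrix:
        if instr.op != "matmul":
            raise ValueError(f"unsupported op: {instr.op}")
        a = self.memory[instr.operands[0]]
        b = self.memory[instr.operands[1]]
        # Assumed scheme: split the rows of A into contiguous blocks, one per MPU.
        chunk = -(-len(a) // len(self.mpus))  # ceiling division
        result: Matrix = []
        for i, mpu in enumerate(self.mpus):
            block = a[i * chunk:(i + 1) * chunk]
            if block:  # an MPU may receive no rows when A is small
                result.extend(mpu.multiply(block, b))
        return result  # the result is itself a tensor (matrix) value


memory = {"A": [[1, 2], [3, 4], [5, 6]], "B": [[5, 6], [7, 8]]}
mcc = MCC([MPU(0), MPU(1)], memory)
result = mcc.execute(Instruction("matmul", ("A", "B")))
```

Here the first hypothetical MPU receives two rows of `A` and the second receives one; concatenating their partial products yields the same matrix as a single multiplication of `A` by `B`.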
Claim text (claims 1 to 30).
1. An apparatus comprising: a network of matrix processing units (MPUs), wherein each MPU is connected to at least one other MPU in the network, and each MPU is to perform matrix multiplication operations; a memory to store tensor data; and a master control central processing unit (MCC) to: receive an instruction from a host device, wherein the instruction comprises one or more tensor operands based on the tensor data; invoke a set of operations on one or more of the MPUs based on the instruction, wherein the set of operations comprises operations on the tensor operands; and output a result of the set of operations, wherein the result comprises a tensor value.
2. The apparatus of claim 1, wherein the MCC is further to send the result for storage in memory, wherein the result is stored as a tensor value in memory.
3. The apparatus of claim 1, wherein the MCC sends the result to the host device, and the host device comprises a host processor connected to the MCC.
4. The apparatus of claim 1, wherein the network of MPUs comprises a plurality of MPUs, and the MCC is to select a subset of the plurality of MPUs to perform the set of operations.
5. The apparatus of claim 4, wherein the subset of MPUs comprises two or more of the MPUs.
6. The apparatus of claim 1, wherein the instruction comprises a stream of instructions and the MCC is to coordinate data flow and a sequence of operations to be performed by the network of MPUs based on the stream of operations.
7. The apparatus of claim 6, wherein the sequence of operations comprises a sequence of tensor arithmetic operations.
8. The apparatus of claim 7, wherein the sequence of tensor operations comprises matrix-matrix operations.
9. The apparatus of claim 1, wherein the memory comprises a memory resource block to be shared by two or more MPUs in the network of MPUs.
10. The apparatus of claim 9, wherein invoking the set of operations comprises pointing one or more of the MPUs to the memory resource block to access the tensor data.
11. The apparatus of claim 10, wherein the set of operations comprise at least one of a row/column broadcast, block shifting, matrix copy, matrix transpose, and matrix expansion.
12. The apparatus of claim 9, wherein the memory comprises a super memory block (SMB) to group a plurality of memory resource blocks, and two or more MPUs in the network of MPUs have read/write access to the plurality of memory resource blocks in the SMB.
13. The apparatus of claim 1, further comprising a convolutional slicing engine to: interface with the memory; read a set of rows from the memory; flatten two-dimension data in the set of rows to generate a flat version of the two-dimensional data; and provide the two-dimensional data to one or more MPUs in the network of MPUs for use in a convolution operation performed using the one or more MPUs.
14. The apparatus of claim 1, further comprising an on-chip router to route data multi-directionally between components of the apparatus.
15. The apparatus of claim 1, wherein the memory comprises one or more barrel shifters to shift a matrix described in memory to target a read or write to a particular row or column of the matrix.
16. The apparatus of claim 1, wherein the set of operations comprises a max pooling operation.
17. The apparatus of claim 1, wherein the set of operations comprises performing a Winograd transformation on the operands and performing a matrix multiplication on the operands transformed by the Winograd transformation.
18. The apparatus of claim 1, wherein the tensor operand comprises a matrix and invoking the set of operations comprises partitioning the matrix and distributing the partitioned matrix to a plurality of MPUs in the network of MPUs to perform one or more of the set of operations on the partitioned matrix.
19. The apparatus of claim 1, wherein the tensor operands comprise a particular input matrix and the set of operations comprises a matrix dimension shuffle operation to reorder a plurality of dimensions of the particular input matrix.
20. The apparatus of claim 1, wherein at least a particular MPU in the network of MPUs comprises local memory to store a set of matrix subroutines, and the particular MPU is to: translate an operation received from the MCC into a subset of the matrix subroutines; and perform the operation through execution of the subset of the matrix subroutines.
21. The apparatus of claim 1, wherein the set of operations are used to implement one of a set of deep learning models, and the set of deep learning models comprises a multilayer perceptron model, a restricted Boltzmann machine model, a deep belief network models, an auto-encoder model, and a convolutional neural network.
22. A method comprising: storing tensor data in memory, wherein the memory is accessible to a network of matrix processing units (MPUs); receiving an instruction from a host device, wherein the instruction comprises one or more tensor operands based on the tensor data; and causing a set of operations to be performed by one or more of the MPUs based on the instruction, wherein the set of operations comprise operations on the tensor operands; and generating a result from performance of the set of operations, wherein the result comprises a tensor value.
23. (canceled)
24. A system comprising: a deep learning processor comprising: a port to connect to a host processor; a plurality of interconnected matrix processing units (MPUs), wherein each MPU comprises circuitry to perform tensor arithmetic operations; a memory to store tensor data; and a master control central processing unit (MCC) to: receive an instruction from the host processor, wherein the instruction comprises one or more tensor operands based on the tensor data; and cause one or more of the MPUs to perform a set of operations based on the instruction, wherein the set of operations comprise operations on the tensor operands; and return a result of the set of operations to the host processor, wherein the result comprises a tensor value connected to the host.
25. The system of claim 24, further comprising the host processor.
26. The system of claim 25, wherein the system comprises a system on chip.
27. The system of claim 25, wherein the system comprises a server blade.
28. (canceled)
29. (canceled)
30. (canceled)
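Claim 13 describes flattening two-dimensional data so that a convolution can be carried out on the MPUs. One common way to do this in software is the "im2col" technique, sketched below in Python. This is an assumption about how such flattening might look, not the slicing engine's actual method: the function names `im2col` and `conv2d_via_flatten` are hypothetical, the stride is fixed at 1 with no padding, and the kernel is applied without flipping (cross-correlation, as is usual in deep learning).

```python
from typing import List

Matrix = List[List[float]]


def im2col(image: Matrix, kh: int, kw: int) -> Matrix:
    """Flatten every kh-by-kw window of the image into one row (stride 1, no padding)."""
    rows, cols = len(image), len(image[0])
    patches: Matrix = []
    for i in range(rows - kh + 1):
        for j in range(cols - kw + 1):
            patches.append([image[i + di][j + dj]
                            for di in range(kh) for dj in range(kw)])
    return patches


def conv2d_via_flatten(image: Matrix, kernel: Matrix) -> Matrix:
    """Convolve by multiplying the flattened patches with the flattened kernel."""
    kh, kw = len(kernel), len(kernel[0])
    flat_kernel = [v for row in kernel for v in row]
    patches = im2col(image, kh, kw)
    # Each output element is one row of the patch matrix times the kernel vector,
    # which is the matrix-vector product a matrix unit could compute.
    flat_out = [sum(p * k for p, k in zip(patch, flat_kernel)) for patch in patches]
    out_cols = len(image[0]) - kw + 1
    # Reshape the flat result back into a two-dimensional output.
    return [flat_out[r:r + out_cols] for r in range(0, len(flat_out), out_cols)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
output = conv2d_via_flatten(image, kernel)
```

With several kernels, the flattened kernels would form the columns of a second matrix, so the whole convolution becomes a single matrix-matrix multiplication of the kind the claims assign to the MPUs.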
Matrix or vector computation {, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization (matrix transposition G06F7/78)} · CPC title
Combinations of networks · CPC title
using electronic means · CPC title
Architecture, e.g. interconnection topology · CPC title
Learning methods · CPC title