Deep learning hardware

US2019392297A1 · US · A1

Patent metadata
Field               Value
Publication number  US-2019392297-A1
Application number  US-201716474029-A
Country             US
Kind code           A1
Filing date         Dec 28, 2017
Priority date       Dec 30, 2016
Publication date    Dec 26, 2019
Grant date          (not listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A network of matrix processing units (MPUs) is provided on a device, where each MPU is connected to at least one other MPU in the network, and each MPU is to perform matrix multiplication operations. Computer memory stores tensor data and a master control central processing unit (MCC) is provided on the device to receive an instruction from a host device, where the instruction includes one or more tensor operands based on the tensor data. The MCC invokes a set of operations on one or more of the MPUs based on the instruction, where the set of operations includes operations on the tensor operands. A result is generated from the set of operations, the result embodied as a tensor value.
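The control flow described in the abstract can be illustrated with a small software model. This is a minimal sketch, not the patented hardware: the class names `MPU`, `MCC`, and `Instruction`, the `"matmul"` opcode, the dictionary used as tensor memory, and the row-block splitting policy are all assumptions made for illustration. Each toy MPU only multiplies matrices, and the toy MCC receives a host instruction naming two tensor operands, hands row blocks to the MPUs, and assembles the result as a tensor value.

```python
from dataclasses import dataclass
from typing import Dict, List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain triple-loop matrix multiplication (software stand-in for MPU circuitry)."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


@dataclass
class MPU:
    """Hypothetical matrix processing unit: it only knows how to multiply."""
    ident: int

    def multiply(self, a: Matrix, b: Matrix) -> Matrix:
        return matmul(a, b)


@dataclass
class Instruction:
    """Hypothetical host instruction: an opcode plus the names of two tensor operands."""
    opcode: str
    lhs: str
    rhs: str


class MCC:
    """Hypothetical master control CPU that dispatches work to a set of MPUs."""

    def __init__(self, mpus: List[MPU], memory: Dict[str, Matrix]):
        self.mpus = mpus
        self.memory = memory  # tensor name -> matrix (assumed layout)

    def execute(self, instr: Instruction) -> Matrix:
        if instr.opcode != "matmul":
            raise ValueError(f"unsupported opcode: {instr.opcode}")
        a, b = self.memory[instr.lhs], self.memory[instr.rhs]
        if not a:
            return []
        # Assumed policy: split the rows of A into contiguous blocks, one per MPU.
        n = min(len(self.mpus), len(a))
        step = -(-len(a) // n)  # ceiling division
        result: Matrix = []
        for mpu, start in zip(self.mpus, range(0, len(a), step)):
            result.extend(mpu.multiply(a[start:start + step], b))
        return result


memory = {"A": [[1, 2], [3, 4], [5, 6]], "B": [[7, 8], [9, 10]]}
mcc = MCC([MPU(0), MPU(1)], memory)
print(mcc.execute(Instruction("matmul", "A", "B")))
```

Splitting by rows keeps each block's product independent of the others, so the partial results can simply be concatenated. Real hardware would run the blocks concurrently; the loop here is sequential for clarity.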

First claim

Opening claim text (preview).

1. An apparatus comprising: a network of matrix processing units (MPUs), wherein each MPU is connected to at least one other MPU in the network, and each MPU is to perform matrix multiplication operations; a memory to store tensor data; and a master control central processing unit (MCC) to: receive an instruction from a host device, wherein the instruction comprises one or more tensor operands based on the tensor data; invoke a set of operations on one or more of the MPUs based on the instruction, wherein the set of operations comprises operations on the tensor operands; and output a result of the set of operations, wherein the result comprises a tensor value.

2. The apparatus of claim 1, wherein the MCC is further to send the result for storage in memory, wherein the result is stored as a tensor value in memory.

3. The apparatus of claim 1, wherein the MCC sends the result to the host device, and the host device comprises a host processor connected to the MCC.

4. The apparatus of claim 1, wherein the network of MPUs comprises a plurality of MPUs, and the MCC is to select a subset of the plurality of MPUs to perform the set of operations.

5. The apparatus of claim 4, wherein the subset of MPUs comprises two or more of the MPUs.

6. The apparatus of claim 1, wherein the instruction comprises a stream of instructions and the MCC is to coordinate data flow and a sequence of operations to be performed by the network of MPUs based on the stream of operations.

7. The apparatus of claim 6, wherein the sequence of operations comprises a sequence of tensor arithmetic operations.

8. The apparatus of claim 7, wherein the sequence of tensor operations comprises matrix-matrix operations.

9. The apparatus of claim 1, wherein the memory comprises a memory resource block to be shared by two or more MPUs in the network of MPUs.

10. The apparatus of claim 9, wherein invoking the set of operations comprises pointing one or more of the MPUs to the memory resource block to access the tensor data.

11. The apparatus of claim 10, wherein the set of operations comprise at least one of a row/column broadcast, block shifting, matrix copy, matrix transpose, and matrix expansion.

12. The apparatus of claim 9, wherein the memory comprises a super memory block (SMB) to group a plurality of memory resource blocks, and two or more MPUs in the network of MPUs have read/write access to the plurality of memory resource blocks in the SMB.

13. The apparatus of claim 1, further comprising a convolutional slicing engine to: interface with the memory; read a set of rows from the memory; flatten two-dimension data in the set of rows to generate a flat version of the two-dimensional data; and provide the two-dimensional data to one or more MPUs in the network of MPUs for use in a convolution operation performed using the one or more MPUs.

14. The apparatus of claim 1, further comprising an on-chip router to route data multi-directionally between components of the apparatus.

15. The apparatus of claim 1, wherein the memory comprises one or more barrel shifters to shift a matrix described in memory to target a read or write to a particular row or column of the matrix.

16. The apparatus of claim 1, wherein the set of operations comprises a max pooling operation.

17. The apparatus of claim 1, wherein the set of operations comprises performing a Winograd transformation on the operands and performing a matrix multiplication on the operands transformed by the Winograd transformation.

18. The apparatus of claim 1, wherein the tensor operand comprises a matrix and invoking the set of operations comprises partitioning the matrix and distributing the partitioned matrix to a plurality of MPUs in the network of MPUs to perform one or more of the set of operations on the partitioned matrix.

19. The apparatus of claim 1, wherein the tensor operands comprise a particular input matrix and the set of operations comprises a matrix dimension shuffle operation to reorder a plurality of dimensions of the particular input matrix.

20. The apparatus of claim 1, wherein at least a particular MPU in the network of MPUs comprises local memory to store a set of matrix subroutines, and the particular MPU is to: translate an operation received from the MCC into a subset of the matrix subroutines; and perform the operation through execution of the subset of the matrix subroutines.

21. The apparatus of claim 1, wherein the set of operations are used to implement one of a set of deep learning models, and the set of deep learning models comprises a multilayer perceptron model, a restricted Boltzmann machine model, a deep belief network models, an auto-encoder model, and a convolutional neural network.

22. A method comprising: storing tensor data in memory, wherein the memory is accessible to a network of matrix processing units (MPUs); receiving an instruction from a host device, wherein the instruction comprises one or more tensor operands based on the tensor data; and causing a set of operations to be performed by one or more of the MPUs based on the instruction, wherein the set of operations comprise operations on the tensor operands; and generating a result from performance of the set of operations, wherein the result comprises a tensor value.

23. (canceled)

24. A system comprising: a deep learning processor comprising: a port to connect to a host processor; a plurality of interconnected matrix processing units (MPUs), wherein each MPU comprises circuitry to perform tensor arithmetic operations; a memory to store tensor data; and a master control central processing unit (MCC) to: receive an instruction from the host processor, wherein the instruction comprises one or more tensor operands based on the tensor data; and cause one or more of the MPUs to perform a set of operations based on the instruction, wherein the set of operations comprise operations on the tensor operands; and return a result of the set of operations to the host processor, wherein the result comprises a tensor value connected to the host.

25. The system of claim 24, further comprising the host processor.

26. The system of claim 25, wherein the system comprises a system on chip.

27. The system of claim 25, wherein the system comprises a server blade.

28. (canceled)

29. (canceled)

30. (canceled)
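The idea behind the convolutional slicing engine of claim 13, flattening two-dimensional data so that a convolution can be carried out as a matrix multiplication, resembles the widely used "im2col" technique. The following is a minimal sketch under that assumption; the claims do not name im2col, and the function names `flatten_patches` and `conv2d_via_matmul`, the stride of 1, and the absence of padding are illustrative choices. As is conventional in deep learning, the kernel is applied without flipping (cross-correlation).

```python
from typing import List

Matrix = List[List[float]]


def flatten_patches(image: Matrix, kh: int, kw: int) -> Matrix:
    """Flatten every kh x kw window of `image` into one row (stride 1, no padding)."""
    rows, cols = len(image), len(image[0])
    patches: Matrix = []
    for r in range(rows - kh + 1):
        for c in range(cols - kw + 1):
            patches.append([image[r + i][c + j]
                            for i in range(kh) for j in range(kw)])
    return patches


def conv2d_via_matmul(image: Matrix, kernel: Matrix) -> Matrix:
    """Compute a 2D convolution as (flattened patches) x (flattened kernel)."""
    kh, kw = len(kernel), len(kernel[0])
    flat_kernel = [v for row in kernel for v in row]
    patches = flatten_patches(image, kh, kw)
    # Matrix-vector product: one dot product per output pixel.
    flat_out = [sum(p * k for p, k in zip(patch, flat_kernel)) for patch in patches]
    # Reshape the flat result back into a 2D output map.
    out_cols = len(image[0]) - kw + 1
    return [flat_out[i:i + out_cols] for i in range(0, len(flat_out), out_cols)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
print(conv2d_via_matmul(image, kernel))
```

Once the windows are laid out as rows, the heavy lifting is an ordinary matrix product, which is the kind of operation the claims assign to the MPUs.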

Assignees

Intel Corp

Inventors

Classifications

  • Matrix or vector computation {, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization (matrix transposition G06F7/78)} · CPC title

  • Combinations of networks · CPC title

  • G06N3/063 (Primary)

    using electronic means · CPC title

  • Architecture, e.g. interconnection topology · CPC title

  • Learning methods · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2019392297A1 cover?
A network of matrix processing units (MPUs) is provided on a device, where each MPU is connected to at least one other MPU in the network, and each MPU is to perform matrix multiplication operations. Computer memory stores tensor data and a master control central processing unit (MCC) is provided on the device to receive an instruction from a host device, where the instruction includes one or m…
Who is the assignee on this patent?
Intel Corp
What technology area does this patent fall under?
Primary CPC classification G06N3/063. Mapped technology areas include Physics.
When was this patent published?
Publication date: Dec 26, 2019 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).