US11954580B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11954580-B2 |
| Application number | US-202017022950-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 16, 2020 |
| Priority date | Sep 16, 2020 |
| Publication date | Apr 9, 2024 |
| Grant date | Apr 9, 2024 |
Official abstract text for this publication.
In one embodiment, a method for machine learning acceleration includes receiving, by a shared controller of a tensor processor cluster that includes multiple tensor processors, a multi-cycle instruction, determining, based on the instruction, a sequence of vector operations to be executed by the tensor processors and address information usable to determine a respective spatial partition of an input tensor on which each tensor processor is to operate when performing each vector operation. The method also includes, for each vector operation in the sequence, generating, based on the address information, a common address offset, relative to a respective base address associated with each tensor processor, at which each tensor processor is to retrieve the respective spatial partition on which the tensor processor is to operate, multicasting the common address offset to the tensor processors, and controlling the tensor processors to execute the vector operation in parallel and in lock step.
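The addressing scheme in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions: the patent describes hardware, and every name here (`TensorProcessor`, `ClusterController`, `MultiCycleInstruction`, and the `start_offset`, `stride`, and `length` fields) is hypothetical and does not come from the publication. Flat list indexing stands in for memory access, and an inner loop stands in for the multicast and for lock-step execution.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class TensorProcessor:
    """Hypothetical model of one tensor processor in a cluster."""
    base_address: int  # where this processor's spatial partition starts
    results: List[int] = field(default_factory=list)

    def execute(self, op, memory, offset, length):
        # Each processor adds the SAME offset to its OWN base address,
        # so one multicast value selects a different partition per processor.
        start = self.base_address + offset
        self.results.append(op(memory[start:start + length]))


@dataclass
class MultiCycleInstruction:
    """Hypothetical encoding: a list of vector ops plus address information."""
    ops: Sequence[Callable[[Sequence[int]], int]]
    start_offset: int
    stride: int
    length: int


class ClusterController:
    """Hypothetical shared controller for all processors in one cluster."""

    def __init__(self, processors, memory):
        self.processors = processors
        self.memory = memory

    def run(self, instr):
        for step, op in enumerate(instr.ops):
            # Generate one common offset per vector operation.
            offset = instr.start_offset + step * instr.stride
            # "Multicast" the offset: every processor gets the same value
            # and runs the same operation before the next step begins.
            for tp in self.processors:
                tp.execute(op, self.memory, offset, instr.length)


memory = list(range(16))
tp0 = TensorProcessor(base_address=0)
tp1 = TensorProcessor(base_address=8)
controller = ClusterController([tp0, tp1], memory)
controller.run(MultiCycleInstruction(ops=[sum, max], start_offset=0, stride=2, length=2))
print(tp0.results, tp1.results)  # [1, 3] [17, 11]
```

In this model the controller never computes a per-processor address. It sends a single offset, and the differing base addresses alone determine which partition each processor reads.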
Opening claim text (preview).
What is claimed is:
1. A system for machine learning acceleration, comprising: a plurality of tensor processor clusters, each comprising: a plurality of tensor processors; and a cluster-level controller configured to: receive a multi-cycle instruction, wherein each of the plurality of tensor processor clusters receives a respective multi-cycle instruction, and wherein the respective multi-cycle instructions are distributed across the plurality of tensor processor clusters in accordance with single-program-multiple-data (SPMD) parallelism such that at least two of the plurality of tensor processor clusters receive and execute different multi-cycle instructions while operating on an input feature map; determine, based on the multi-cycle instruction, (1) a sequence of vector operations to be executed by the tensor processors and (2) address information usable to determine a respective spatial partition of an input tensor on which each tensor processor is to operate when performing each vector operation; and for each vector operation in the sequence: generate, based on the address information, a common address offset, relative to a respective base address associated with each tensor processor, at which each tensor processor is to retrieve the respective spatial partition of the input tensor on which the tensor processor is to operate; multicast the common address offset to the tensor processors; and control the tensor processors to execute the vector operation in lock step.
2. The system of claim 1, wherein: a first multi-cycle instruction received by a given one of the plurality of tensor processor clusters represents a portion of a machine-learning program; the machine-learning program comprises a plurality of multi-cycle instructions, each of which is associated with one or more convolution operations to be performed in a respective layer in a convolutional neural network; and a first cycle of the first multi-cycle instruction is associated with a first convolution operation.
3. The system of claim 2, wherein: the cluster-level controller in the given cluster is configured to: determine, based on the first multi-cycle instruction, (1) a first sequence of vector operations to be executed by the tensor processors in the given cluster and (2) address information usable to determine the respective spatial partition of an input tensor on which each tensor processor in the given cluster is to operate when performing each vector operation in the first sequence of vector operations; and for each vector operation in the first sequence of vector operations: generate, based on the address information, a common address offset, relative to a respective base address associated with each tensor processor in the given cluster, at which each tensor processor in the given cluster is to retrieve the respective spatial partition of the input tensor on which the tensor processor is to operate; multicast the common address offset to the tensor processors in the given cluster; and control the tensor processors in the given cluster to execute the vector operation in lock step; each of the vector operations in the first sequence of vector operations comprises one or more of: a vector read operation, a vector addition operation, and a vector multiply operation; and each tensor processor in the given cluster comprises a hardware compute array of multiply-and-accumulate (MAC) computation units configured to execute vector operations on the respective spatial partition of the input tensor on which the tensor processor is to operate when performing each vector operation in the first sequence of vector operations.
4. The system of claim 3, wherein: the cluster-level controller in the given cluster is further configured to: determine, based on the first multi-cycle instruction, (3) weight information usable to determine weights to be applied in the one or more convolution operations associated with the first multi-cycle instruction; determine, for at least one vector operation in the first sequence of vector operations and dependent on the weight information, a first subset of the weights associated with the first multi-cycle instruction that are associated with the first convolution operation; and provide the first subset of the weights that are associated with the first convolution operation to the hardware compute array of at least one of the tensor processors in the given cluster for execution of the first convolution operation in the first cycle of the first multi-cycle instruction.
5. The system of claim 4, wherein: each tensor processor in the given cluster is configured to generate a respective spatial partition of an output tensor based on the respective spatial partition of the input tensor on which each tensor processor in the given cluster is to operate using single-instruction-multiple-data (SIMD) parallelism; to implement SIMD parallelism, each tensor processor in the given cluster is configured to implement data parallelism; and the cluster-level controller is further configured to provide the first subset of the weights that are associated with the first convolution operation to the hardware compute arrays of two or more of the tensor processors in the given cluster for execution of the first convolution operation in the first cycle of the first multi-cycle instruction.
6. The system of claim 5, wherein: a second cycle of the first multi-cycle instruction is associated with a second convolution operation; and the cluster-level controller in the given cluster is further configured to: determine, for at least one vector operation in the first sequence of vector operations and dependent on the weight information, a second subset of the weights associated with the first multi-cycle instruction that are associated with the second convolution operation; and provide the second subset of the weights that are associated with the second convolution operation to the hardware compute arrays of the two or more tensor processors in the given cluster for execution of the second convolution operation in the second cycle of the first multi-cycle instruction.
7. The system of claim 4, wherein the cluster-level controller in the given cluster is further configured to: determine, for at least one vector operation in the first sequence of vector operations and dependent on the weight information, a second subset of the weights associated with the first multi-cycle instruction that are associated with the first convolution operation; and provide the second subset of the weights that are associated with the first convolution operation to the hardware compute array of one of the tensor processors in the given cluster other than the at least one of the one or more tensor processors in the given cluster for execution of the first convolution operation in the first cycle of the first multi-cycle instruction.
8. The system of claim 2, wherein the cluster-level controller in the given cluster is further configured to: receive a second multi-cycle instruction; determine, based on the second multi-cycle instruction, (1) a second sequence of vector operations to be executed by the tensor processors in the given cluster and (2) address information usable to determine the respective spatial partition of an input tensor on which each tensor processor in the given cluster is to operate when performing each vector operation in the second sequence of vector operations; and for each vector operation in the second sequence of vector operations: generate, based on the address information, a common address offset, relative to a respective base address associated with each tensor processor in the given cluster, at which each tensor processor is to retrieve t
Convolutional networks [CNN, ConvNet] · CPC title
Instructions to perform operations on packed data, e.g. vector, tile or matrix operations · CPC title
controlled by a single instruction for multiple data lanes [SIMD] · CPC title
Forward inferencing; Production systems · CPC title
Backpropagation, e.g. using gradient descent · CPC title