Spatial tiling of compute arrays with shared control

US11954580B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11954580-B2
Application number  US-202017022950-A
Country             US
Kind code           B2
Filing date         Sep 16, 2020
Priority date       Sep 16, 2020
Publication date    Apr 9, 2024
Grant date          Apr 9, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In one embodiment, a method for machine learning acceleration includes receiving, by a shared controller of a tensor processor cluster that includes multiple tensor processors, a multi-cycle instruction, determining, based on the instruction, a sequence of vector operations to be executed by the tensor processors and address information usable to determine a respective spatial partition of an input tensor on which each tensor processor is to operate when performing each vector operation. The method also includes, for each vector operation in the sequence, generating, based on the address information, a common address offset, relative to a respective base address associated with each tensor processor, at which each tensor processor is to retrieve the respective spatial partition on which the tensor processor is to operate, multicasting the common address offset to the tensor processors, and controlling the tensor processors to execute the vector operation in parallel and in lock step.
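The addressing scheme in the abstract can be illustrated with a small software model. This is a minimal sketch, not the patented hardware: the `TensorProcessor` class, the `run_instruction` function, the flat `memory` list, and the scalar `weight` are all hypothetical names and simplifications introduced here. It shows only the core idea that a single common offset, resolved against a different base address in each processor, selects a different spatial partition for each one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TensorProcessor:
    """Hypothetical model of one tensor processor with its own base address."""
    base_address: int
    accumulator: int = 0

    def execute(self, memory: List[int], offset: int, length: int, weight: int) -> None:
        # The shared offset is resolved against this processor's base address,
        # so the same offset selects a different partition in each processor.
        start = self.base_address + offset
        self.accumulator += sum(weight * x for x in memory[start:start + length])


def run_instruction(memory: List[int], processors: List[TensorProcessor],
                    offsets: List[int], length: int, weight: int) -> List[int]:
    """Shared-controller loop: one common offset per vector operation."""
    for offset in offsets:        # one iteration per vector operation in the sequence
        for tp in processors:     # stands in for a hardware multicast and lock-step execution
            tp.execute(memory, offset, length, weight)
    return [tp.accumulator for tp in processors]


memory = list(range(16))  # flat stand-in for an input tensor
processors = [TensorProcessor(base_address=b) for b in (0, 4, 8, 12)]
results = run_instruction(memory, processors, offsets=[0, 2], length=2, weight=1)
print(results)
```

In this model the controller computes each offset once, regardless of how many processors are in the cluster. The inner Python loop is sequential only because it is a simulation; the abstract describes the processors executing each vector operation in parallel and in lock step.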

First claim

Opening claim text (preview).

What is claimed is:

1. A system for machine learning acceleration, comprising: a plurality of tensor processor clusters, each comprising: a plurality of tensor processors; and a cluster-level controller configured to: receive a multi-cycle instruction, wherein each of the plurality of tensor processor clusters receives a respective multi-cycle instruction, and wherein the respective multi-cycle instructions are distributed across the plurality of tensor processor clusters in accordance with single-program-multiple-data (SPMD) parallelism such that at least two of the plurality of tensor processor clusters receive and execute different multi-cycle instructions while operating on an input feature map; determine, based on the multi-cycle instruction, (1) a sequence of vector operations to be executed by the tensor processors and (2) address information usable to determine a respective spatial partition of an input tensor on which each tensor processor is to operate when performing each vector operation; and for each vector operation in the sequence: generate, based on the address information, a common address offset, relative to a respective base address associated with each tensor processor, at which each tensor processor is to retrieve the respective spatial partition of the input tensor on which the tensor processor is to operate; multicast the common address offset to the tensor processors; and control the tensor processors to execute the vector operation in lock step.

2. The system of claim 1, wherein: a first multi-cycle instruction received by a given one of the plurality of tensor processor clusters represents a portion of a machine-learning program; the machine-learning program comprises a plurality of multi-cycle instructions, each of which is associated with one or more convolution operations to be performed in a respective layer in a convolutional neural network; and a first cycle of the first multi-cycle instruction is associated with a first convolution operation.

3. The system of claim 2, wherein: the cluster-level controller in the given cluster is configured to: determine, based on the first multi-cycle instruction, (1) a first sequence of vector operations to be executed by the tensor processors in the given cluster and (2) address information usable to determine the respective spatial partition of an input tensor on which each tensor processor in the given cluster is to operate when performing each vector operation in the first sequence of vector operations; and for each vector operation in the first sequence of vector operations: generate, based on the address information, a common address offset, relative to a respective base address associated with each tensor processor in the given cluster, at which each tensor processor in the given cluster is to retrieve the respective spatial partition of the input tensor on which the tensor processor is to operate; multicast the common address offset to the tensor processors in the given cluster; and control the tensor processors in the given cluster to execute the vector operation in lock step; each of the vector operations in the first sequence of vector operations comprises one or more of: a vector read operation, a vector addition operation, and a vector multiply operation; and each tensor processor in the given cluster comprises a hardware compute array of multiply-and-accumulate (MAC) computation units configured to execute vector operations on the respective spatial partition of the input tensor on which the tensor processor is to operate when performing each vector operation in the first sequence of vector operations.

4. The system of claim 3, wherein: the cluster-level controller in the given cluster is further configured to: determine, based on the first multi-cycle instruction, (3) weight information usable to determine weights to be applied in the one or more convolution operations associated with the first multi-cycle instruction; determine, for at least one vector operation in the first sequence of vector operations and dependent on the weight information, a first subset of the weights associated with the first multi-cycle instruction that are associated with the first convolution operation; and provide the first subset of the weights that are associated with the first convolution operation to the hardware compute array of at least one of the tensor processors in the given cluster for execution of the first convolution operation in the first cycle of the first multi-cycle instruction.

5. The system of claim 4, wherein: each tensor processor in the given cluster is configured to generate a respective spatial partition of an output tensor based on the respective spatial partition of the input tensor on which each tensor processor in the given cluster is to operate using single-instruction-multiple-data (SIMD) parallelism; to implement SIMD parallelism, each tensor processor in the given cluster is configured to implement data parallelism; and the cluster-level controller is further configured to provide the first subset of the weights that are associated with the first convolution operation to the hardware compute arrays of two or more of the tensor processors in the given cluster for execution of the first convolution operation in the first cycle of the first multi-cycle instruction.

6. The system of claim 5, wherein: a second cycle of the first multi-cycle instruction is associated with a second convolution operation; and the cluster-level controller in the given cluster is further configured to: determine, for at least one vector operation in the first sequence of vector operations and dependent on the weight information, a second subset of the weights associated with the first multi-cycle instruction that are associated with the second convolution operation; and provide the second subset of the weights that are associated with the second convolution operation to the hardware compute arrays of the two or more tensor processors in the given cluster for execution of the second convolution operation in the second cycle of the first multi-cycle instruction.

7. The system of claim 4, wherein the cluster-level controller in the given cluster is further configured to: determine, for at least one vector operation in the first sequence of vector operations and dependent on the weight information, a second subset of the weights associated with the first multi-cycle instruction that are associated with the first convolution operation; and provide the second subset of the weights that are associated with the first convolution operation to the hardware compute array of one of the tensor processors in the given cluster other than the at least one of the one or more tensor processors in the given cluster for execution of the first convolution operation in the first cycle of the first multi-cycle instruction.

8. The system of claim 2, wherein the cluster-level controller in the given cluster is further configured to: receive a second multi-cycle instruction; determine, based on the second multi-cycle instruction, (1) a second sequence of vector operations to be executed by the tensor processors in the given cluster and (2) address information usable to determine the respective spatial partition of an input tensor on which each tensor processor in the given cluster is to operate when performing each vector operation in the second sequence of vector operations; and for each vector operation in the second sequence of vector operations: generate, based on the address information, a common address offset, relative to a respective base address associated with each tensor processor in the given cluster, at which each tensor processor is to retrieve t
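
The weight handling recited in claims 4 to 6 can be pictured with a toy model: in each cycle of a multi-cycle instruction, the controller selects one subset of the weights and provides that same subset to the compute arrays of several tensor processors, each of which applies it to its own spatial partition. The sketch below is an illustration under stated assumptions, not the claimed hardware. The names `mac` and `run_cycles`, the fixed `weights_per_cycle` slicing rule, and the dot-product stand-in for a convolution are all hypothetical.

```python
from typing import List, Sequence


def mac(inputs: Sequence[int], weights: Sequence[int]) -> int:
    """Multiply-and-accumulate over one partition (toy stand-in for a MAC array)."""
    return sum(x * w for x, w in zip(inputs, weights))


def run_cycles(partitions: List[List[int]], weights: List[int],
               weights_per_cycle: int) -> List[List[int]]:
    """For each cycle, pick one weight subset and apply it to every partition."""
    outputs: List[List[int]] = [[] for _ in partitions]
    for start in range(0, len(weights), weights_per_cycle):
        # Assumed selection rule: consecutive slices of the instruction's weights.
        subset = weights[start:start + weights_per_cycle]
        for i, partition in enumerate(partitions):
            # Every processor receives the same subset but holds different input data.
            outputs[i].append(mac(partition, subset))
    return outputs


partitions = [[1, 2], [3, 4]]  # one spatial partition per tensor processor
weights = [1, 0, 0, 1]         # two cycles, two weights each
outputs = run_cycles(partitions, weights, weights_per_cycle=2)
print(outputs)
```

Each inner list in `outputs` holds one processor's results, one entry per cycle, which loosely mirrors each processor producing its own spatial partition of the output tensor.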

Assignees

Meta Platforms Inc

Inventors

Classifications

  • Convolutional networks [CNN, ConvNet] · CPC title

  • Instructions to perform operations on packed data, e.g. vector, tile or matrix operations · CPC title

  • controlled by a single instruction for multiple data lanes [SIMD] · CPC title

  • Forward inferencing; Production systems · CPC title

  • Backpropagation, e.g. using gradient descent · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11954580B2 cover?
In one embodiment, a method for machine learning acceleration includes receiving, by a shared controller of a tensor processor cluster that includes multiple tensor processors, a multi-cycle instruction, determining, based on the instruction, a sequence of vector operations to be executed by the tensor processors and address information usable to determine a respective spatial partition of an i…
Who is the assignee on this patent?
Meta Platforms Inc
What technology area does this patent fall under?
Primary CPC classification G06N3/063. Mapped technology areas include Physics.
When was this patent published?
Publication date Apr 9, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).