What technology area does this patent fall under?

The primary CPC classification is G06N3/063. Mapped technology areas include Physics.

When was this patent published?

Publication date: Apr 9, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Spatial tiling of compute arrays with shared control

US11954580B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11954580-B2
Application number	US-202017022950-A
Country	US
Kind code	B2
Filing date	Sep 16, 2020
Priority date	Sep 16, 2020
Publication date	Apr 9, 2024
Grant date	Apr 9, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In one embodiment, a method for machine learning acceleration includes receiving, by a shared controller of a tensor processor cluster that includes multiple tensor processors, a multi-cycle instruction, determining, based on the instruction, a sequence of vector operations to be executed by the tensor processors and address information usable to determine a respective spatial partition of an input tensor on which each tensor processor is to operate when performing each vector operation. The method also includes, for each vector operation in the sequence, generating, based on the address information, a common address offset, relative to a respective base address associated with each tensor processor, at which each tensor processor is to retrieve the respective spatial partition on which the tensor processor is to operate, multicasting the common address offset to the tensor processors, and controlling the tensor processors to execute the vector operation in parallel and in lock step.
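The addressing scheme in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the patented hardware design: the names `TensorProcessor`, `run_multi_cycle_instruction`, `offsets`, and `length`, the flat-list memory, and the use of a dot product as the "vector operation" are all hypothetical choices made for illustration. What it shows is the one idea the abstract states: every tensor processor receives the same offset, and each adds it to its own base address to find its spatial partition.

```python
from dataclasses import dataclass


@dataclass
class TensorProcessor:
    """Hypothetical model of one tensor processor in a cluster."""
    base_address: int          # where this processor's spatial partition starts
    accumulator: float = 0.0   # running result of the vector operations

    def execute(self, memory, offset, length, weights):
        # Each processor resolves the shared offset against its OWN base address.
        start = self.base_address + offset
        window = memory[start:start + length]
        self.accumulator += sum(x * w for x, w in zip(window, weights))


def run_multi_cycle_instruction(memory, processors, offsets, length, weights):
    """Hypothetical shared controller: one common offset per vector operation."""
    for offset in offsets:
        # "Multicast": the identical offset goes to every processor, and all of
        # them finish this step before the next offset is issued (lock step).
        for tp in processors:
            tp.execute(memory, offset, length, weights)
    return [tp.accumulator for tp in processors]


# Assumed layout: a 16-element input tensor split into four partitions of 4.
memory = list(range(16))
processors = [TensorProcessor(base_address=4 * i) for i in range(4)]
result = run_multi_cycle_instruction(
    memory, processors, offsets=[0, 2], length=2, weights=[1.0, 1.0]
)
print(result)  # [6.0, 22.0, 38.0, 54.0]
```

Because only the offset is sent, the controller issues one address per step regardless of how many processors are in the cluster; the per-processor base address is what makes the same offset select a different partition on each processor.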

First claim

Opening claim text (preview).

What is claimed is:

1. A system for machine learning acceleration, comprising: a plurality of tensor processor clusters, each comprising: a plurality of tensor processors; and a cluster-level controller configured to: receive a multi-cycle instruction, wherein each of the plurality of tensor processor clusters receives a respective multi-cycle instruction, and wherein the respective multi-cycle instructions are distributed across the plurality of tensor processor clusters in accordance with single-program-multiple-data (SPMD) parallelism such that at least two of the plurality of tensor processor clusters receive and execute different multi-cycle instructions while operating on an input feature map; determine, based on the multi-cycle instruction, (1) a sequence of vector operations to be executed by the tensor processors and (2) address information usable to determine a respective spatial partition of an input tensor on which each tensor processor is to operate when performing each vector operation; and for each vector operation in the sequence: generate, based on the address information, a common address offset, relative to a respective base address associated with each tensor processor, at which each tensor processor is to retrieve the respective spatial partition of the input tensor on which the tensor processor is to operate; multicast the common address offset to the tensor processors; and control the tensor processors to execute the vector operation in lock step.

2. The system of claim 1, wherein: a first multi-cycle instruction received by a given one of the plurality of tensor processor clusters represents a portion of a machine-learning program; the machine-learning program comprises a plurality of multi-cycle instructions, each of which is associated with one or more convolution operations to be performed in a respective layer in a convolutional neural network; and a first cycle of the first multi-cycle instruction is associated with a first convolution operation.

3. The system of claim 2, wherein: the cluster-level controller in the given cluster is configured to: determine, based on the first multi-cycle instruction, (1) a first sequence of vector operations to be executed by the tensor processors in the given cluster and (2) address information usable to determine the respective spatial partition of an input tensor on which each tensor processor in the given cluster is to operate when performing each vector operation in the first sequence of vector operations; and for each vector operation in the first sequence of vector operations: generate, based on the address information, a common address offset, relative to a respective base address associated with each tensor processor in the given cluster, at which each tensor processor in the given cluster is to retrieve the respective spatial partition of the input tensor on which the tensor processor is to operate; multicast the common address offset to the tensor processors in the given cluster; and control the tensor processors in the given cluster to execute the vector operation in lock step; each of the vector operations in the first sequence of vector operations comprises one or more of: a vector read operation, a vector addition operation, and a vector multiply operation; and each tensor processor in the given cluster comprises a hardware compute array of multiply-and-accumulate (MAC) computation units configured to execute vector operations on the respective spatial partition of the input tensor on which the tensor processor is to operate when performing each vector operation in the first sequence of vector operations.

4. The system of claim 3, wherein: the cluster-level controller in the given cluster is further configured to: determine, based on the first multi-cycle instruction, (3) weight information usable to determine weights to be applied in the one or more convolution operations associated with the first multi-cycle instruction; determine, for at least one vector operation in the first sequence of vector operations and dependent on the weight information, a first subset of the weights associated with the first multi-cycle instruction that are associated with the first convolution operation; and provide the first subset of the weights that are associated with the first convolution operation to the hardware compute array of at least one of the tensor processors in the given cluster for execution of the first convolution operation in the first cycle of the first multi-cycle instruction.

5. The system of claim 4, wherein: each tensor processor in the given cluster is configured to generate a respective spatial partition of an output tensor based on the respective spatial partition of the input tensor on which each tensor processor in the given cluster is to operate using single-instruction-multiple-data (SIMD) parallelism; to implement SIMD parallelism, each tensor processor in the given cluster is configured to implement data parallelism; and the cluster-level controller is further configured to provide the first subset of the weights that are associated with the first convolution operation to the hardware compute arrays of two or more of the tensor processors in the given cluster for execution of the first convolution operation in the first cycle of the first multi-cycle instruction.

6. The system of claim 5, wherein: a second cycle of the first multi-cycle instruction is associated with a second convolution operation; and the cluster-level controller in the given cluster is further configured to: determine, for at least one vector operation in the first sequence of vector operations and dependent on the weight information, a second subset of the weights associated with the first multi-cycle instruction that are associated with the second convolution operation; and provide the second subset of the weights that are associated with the second convolution operation to the hardware compute arrays of the two or more tensor processors in the given cluster for execution of the second convolution operation in the second cycle of the first multi-cycle instruction.

7. The system of claim 4, wherein the cluster-level controller in the given cluster is further configured to: determine, for at least one vector operation in the first sequence of vector operations and dependent on the weight information, a second subset of the weights associated with the first multi-cycle instruction that are associated with the first convolution operation; and provide the second subset of the weights that are associated with the first convolution operation to the hardware compute array of one of the tensor processors in the given cluster other than the at least one of the one or more tensor processors in the given cluster for execution of the first convolution operation in the first cycle of the first multi-cycle instruction.

8. The system of claim 2, wherein the cluster-level controller in the given cluster is further configured to: receive a second multi-cycle instruction; determine, based on the second multi-cycle instruction, (1) a second sequence of vector operations to be executed by the tensor processors in the given cluster and (2) address information usable to determine the respective spatial partition of an input tensor on which each tensor processor in the given cluster is to operate when performing each vector operation in the second sequence of vector operations; and for each vector operation in the second sequence of vector operations: generate, based on the address information, a common address offset, relative to a respective base address associated with each tensor processor in the given cluster, at which each tensor processor is to retrieve t…
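The two levels of parallelism in the claims can be sketched in a few lines. The example below is a loose illustration, not the claimed system: `run_cluster`, the `cycle_weights` key, and the way the feature map is sliced across clusters and processors are all assumptions introduced here. It models clusters that each receive their own multi-cycle instruction (claim 1) and a cluster-level controller that hands one weight subset per cycle to every tensor processor in its cluster (claims 4 to 6), with a dot product standing in for the convolution.

```python
def run_cluster(instruction, partitions):
    """Hypothetical cluster-level controller.

    For each cycle of the instruction, the same weight subset is provided to
    every tensor processor, and each applies it to its own spatial partition.
    """
    outputs = [[] for _ in partitions]
    for weights in instruction["cycle_weights"]:      # one weight subset per cycle
        for i, partition in enumerate(partitions):    # same weights, different data
            outputs[i].append(sum(x * w for x, w in zip(partition, weights)))
    return outputs


# Assumed input feature map, split across two clusters of two processors each.
feature_map = [1, 2, 3, 4, 5, 6, 7, 8]
cluster_partitions = [
    [feature_map[0:2], feature_map[2:4]],   # cluster 0
    [feature_map[4:6], feature_map[6:8]],   # cluster 1
]

# Each cluster receives a different multi-cycle instruction.
instructions = [
    {"cycle_weights": [[1, 0], [0, 1]]},    # two cycles
    {"cycle_weights": [[1, 1]]},            # one cycle
]

results = [
    run_cluster(instruction, partitions)
    for instruction, partitions in zip(instructions, cluster_partitions)
]
print(results)  # [[[1, 2], [3, 4]], [[11], [15]]]
```

Within a cluster, the processors differ only in the data they hold; across clusters, the instructions themselves may differ. That split is what the claims label SIMD and SPMD parallelism respectively.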

Assignees

Meta Platforms Inc

Inventors

Classifications

Code	CPC title
G06N3/0464	Convolutional networks [CNN, ConvNet]
G06F9/30036	Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
G06F9/3887	controlled by a single instruction for multiple data lanes [SIMD]
G06N5/046	Forward inferencing; Production systems
G06N3/084	Backpropagation, e.g. using gradient descent

Patent family

Related publications grouped by family.

View patent family 76305774

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11954580B2 cover? In one embodiment, a method for machine learning acceleration includes receiving, by a shared controller of a tensor processor cluster that includes multiple tensor processors, a multi-cycle instruction, determining, based on the instruction, a sequence of vector operations to be executed by the tensor processors and address information usable to determine a respective spatial partition of an i…
Who is the assignee on this patent? Meta Platforms Inc