US11263129B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-11263129-B1 |
| Application number | US-201916526966-A |
| Country | US |
| Kind code | B1 |
| Filing date | Jul 30, 2019 |
| Priority date | Sep 15, 2017 |
| Publication date | Mar 1, 2022 |
| Grant date | Mar 1, 2022 |
Official abstract text for this publication.
A processor having a functional slice architecture is divided into a plurality of functional units (“tiles”) organized into a plurality of slices. Each slice is configured to perform specific functions within the processor, which may include memory slices (MEM) for storing operand data, and arithmetic logic slices for performing operations on received operand data. The tiles of the processor are configured to stream operand data across a first dimension, and receive instructions across a second dimension orthogonal to the first dimension. The timing of data and instruction flows are configured such that corresponding data and instructions are received at each tile with a predetermined temporal relationship, allowing operand data to be transmitted between the slices of the processor without any accompanying metadata. Instead, each slice is able to determine what operations to perform on received data based upon the timing at which the data is received.
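The timing-based dispatch described in the abstract can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the patented implementation: the function `run_lane`, the one-cycle hop between adjacent tiles, the pass-through behavior when no instruction is scheduled, and the example operations are all hypothetical. The point it shows is that the values on the lane are bare integers with no tag or opcode attached; each tile picks its operation purely from the cycle in which the value arrives.

```python
from typing import Callable, Dict, List, Optional

Op = Callable[[int], int]


def run_lane(operands: Dict[int, int],
             schedules: List[Dict[int, Op]],
             n_cycles: int) -> Dict[int, int]:
    """Simulate one data lane crossing several functional tiles.

    operands:     cycle -> raw value a (hypothetical) memory tile puts on the lane.
    schedules[t]: cycle -> operation that tile t executes in that cycle
                  (stands in for the separate instruction flow).
    Returns:      cycle -> value leaving the last tile in that cycle.
    """
    n_tiles = len(schedules)
    regs: List[Optional[int]] = [None] * n_tiles  # output register of each tile
    results: Dict[int, int] = {}
    for cycle in range(n_cycles):
        # Assumption: data advances exactly one tile per clock cycle.
        inputs = [operands.get(cycle)] + regs[:-1]
        new_regs: List[Optional[int]] = [None] * n_tiles
        for tile, value in enumerate(inputs):
            if value is None:
                continue
            # No metadata travels with the value: the operation is looked up
            # by arrival time only. Assumption: no instruction means forward.
            op = schedules[tile].get(cycle)
            new_regs[tile] = op(value) if op is not None else value
        regs = new_regs
        if regs[-1] is not None:
            results[cycle] = regs[-1]
    return results


# Two tiles; a value injected at cycle c reaches tile t at cycle c + t,
# so the schedules are written against those arrival cycles.
schedules = [
    {0: lambda x: x + 1, 1: lambda x: x * 2},   # tile 0
    {1: lambda x: -x,    2: lambda x: x + 10},  # tile 1
]
results = run_lane({0: 3, 1: 5}, schedules, n_cycles=4)
print(results)  # {1: -4, 2: 20}
```

In this sketch the schedule plays the role that a compiler would play for a deterministic architecture: because arrival cycles are known in advance, the correct instruction can be placed at each tile for each cycle without any handshake on the data path.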
Opening claim text (preview).
What is claimed is:
1. A processor comprising: a plurality of data communication lanes, each configured to carry data as part of a data flow from a configurable source to at least one configurable destination via a data signal path, in accordance with one or more received instructions; a memory region comprising a set of memory tiles organized into a plurality of memory slices, and a functional region comprising a set of functional tiles organized into a plurality of functional slices, wherein each data communication lane connects corresponding tiles of the plurality of memory slices and plurality of functional slices; wherein each memory tile is configurable to read operand data onto a respective data communication lane, and to write results data from the respective data communication lane into memory, and each functional tile is configured to process operand data received from a respective data communication lane, and output processed operand data onto the respective data communication lane, in accordance with one or more received instructions; wherein a data flow on a data communication lane of the plurality of data communication lanes is processed by a plurality of functional tiles within the data communication lane, the data flow comprising operand data from memory tiles of the data communication lane to at least a portion of the plurality of functional tiles in a first direction, and processed operand data to the memory tiles in a second direction for storage, in accordance with the one or more received instructions; wherein a first functional tile of the plurality of functional tiles processes operand data received from the data flow in accordance with instructions of the one or more instructions received via an instruction flow separate from the data flow, the received instructions having a predetermined temporal relationship with the received operand data, and automatically outputting processed operand data, wherein the output processed operand data is used as an input operand for a subsequent functional tile of the plurality of functional tiles of the data communication lane.
2. The processor of claim 1, wherein the first functional tile processes received operand data in accordance with one or more instructions received via the instruction flow during a same clock cycle during which the operand data is received.
3. The processor of claim 1, wherein the first functional tile processes received operand data in accordance with one or more instructions received via the instruction flow, the operand data and one or more instructions received separated by a predetermined delay.
4. The processor of claim 1, wherein the instruction flow flows along a length of a first functional slice containing the first functional tile, in a direction perpendicular to the data flow.
5. The processor of claim 1, wherein an operand is a multidimensional matrix, and wherein at least one functional tile of the plurality of functional tiles is configured to perform a matrix operation on the operand in accordance with a received instruction having the predetermined temporal relationship with the operand.
6. The processor of claim 1, wherein an operand is a multi-element vector, and wherein at least one functional tile of the plurality of functional tiles processes elements of the operand in parallel in accordance with a received instruction having the predetermined temporal relationship with the operand.
7. The processor of claim 1, further comprising a plurality of switching tiles configured to route data between different communication lanes.
8. The processor of claim 1, wherein the data flow comprises a plurality of streams, and wherein at least one functional tile of the plurality of functional tiles is configured to receive operand data from multiple streams of a plurality of streams to produce one stream of results.
9. The processor of claim 1, wherein the data flow comprises a plurality of streams, wherein at least one functional tile of the plurality of functional tiles is configured to receive a first number of operand data from multiple streams of the plurality of streams to produce one or more processed operand data.
10. The processor of claim 1, wherein each memory tile of the set of memory tiles is configured to be able to execute two different memory operations during a single clock cycle, in accordance with one or more received instructions.
11. The processor of claim 1, wherein each data communication lane further comprises at least one register or storage element corresponding to a plurality of memory tiles or a plurality of functional tiles, the at least one register or storage element configured to communicate operand data of the data flow across the data communication lane over one or more clock cycles.
12. The processor of claim 1, wherein each memory slice and functional slice is associated with a respective instruction queue, each instruction queue configured to provide respective sets of instructions to tiles of its respective slice.
13. The processor of claim 12, wherein the instruction queues of each of the plurality of memory slices and functional slices operate independently from each other.
14. The processor of claim 13, wherein the plurality of memory slices and functional slices are coordinated with each other at a first time via a barrier synchronization operation.
15. The processor of claim 12, wherein instructions from an instruction queue of a first slice of the plurality of slices are provided to the first slice tile-by-tile over a plurality of clock cycles, such that during a given clock cycle, each tile of the first slice receives a different instruction of the instruction queue.
16. The processor of claim 12, wherein each tile of a first slice of the plurality of slices receives a given instruction of the respective instruction queue during a different clock cycle.
17. The processor of claim 1, wherein: the memory region comprises a first memory portion having a first subset of the plurality of memory slices and a second memory portion having a second subset of the plurality of memory slices, the first and second memory portions located on opposite sides of a first functional portion of the functional region comprising a first subset of the plurality of functional slices.
18. The processor of claim 17, wherein: the functional region further comprises at least a second functional portion comprising a second subset of the plurality of functional slices, the second functional portion located on a side of the first memory portion or second memory portion opposite from the first functional portion; and wherein the first subset of the plurality of functional tiles comprises functional tiles configured to perform vector operations on received operands, and the second subset of the plurality of functional tiles comprises functional tiles configured to perform matrix operations on received operands.
19. The processor of claim 1, wherein: a slice of the plurality of memory slices and functional slices comprises at least a first subset of tiles corresponding to a first thread and a second subset of tiles corresponding to a second thread, the tiles of the slice being designated as part of first and second subsets based upon a received tile configuration instruction.
20. The processor of claim 19, wherein the tiles of the slice receive instructions from an instruction queue associated with the slice, each instruction provided by the instruction queue associated with an identifier indicating whet
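The tile-by-tile instruction delivery recited in claims 15 and 16 can be pictured as a staggered timeline. The following is an illustrative sketch only: the helper names `arrival_cycle` and `slice_timeline`, the one-cycle delay per tile, and the instruction names are assumptions introduced here and do not come from the claims, which do not fix a particular delay.

```python
from typing import List, Optional


def arrival_cycle(issue_cycle: int, tile_index: int, hop_delay: int = 1) -> int:
    """Cycle at which an instruction issued at issue_cycle reaches a tile.

    Assumption: the instruction moves one tile further along the slice
    every hop_delay cycles.
    """
    return issue_cycle + tile_index * hop_delay


def slice_timeline(queue: List[str], n_tiles: int,
                   n_cycles: int) -> List[List[Optional[str]]]:
    """Return timeline[cycle][tile]: the instruction that tile holds, or None."""
    timeline: List[List[Optional[str]]] = [
        [None] * n_tiles for _ in range(n_cycles)
    ]
    # Assumption: the queue issues one instruction per cycle, starting at cycle 0.
    for issue_cycle, instr in enumerate(queue):
        for tile in range(n_tiles):
            cycle = arrival_cycle(issue_cycle, tile)
            if cycle < n_cycles:
                timeline[cycle][tile] = instr
    return timeline


queue = ["READ", "ADD", "MUL", "WRITE"]  # hypothetical instruction names
timeline = slice_timeline(queue, n_tiles=4, n_cycles=7)

# Claim 15 view: within one cycle, every tile holds a different instruction.
print(timeline[3])  # ['WRITE', 'MUL', 'ADD', 'READ']

# Claim 16 view: one instruction reaches each tile in a different cycle.
add_arrivals = [arrival_cycle(1, tile) for tile in range(4)]
print(add_arrivals)  # [1, 2, 3, 4]
```

Reading a row of `timeline` gives the per-cycle view, and reading a diagonal gives the path of a single instruction along the slice.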
Two dimensional, e.g. mesh, torus · CPC title
from multiple instruction streams, e.g. multistreaming · CPC title
controlled by a single instruction for multiple data lanes [SIMD] · CPC title
Machine learning · CPC title
with multidimensional access, e.g. row/column, matrix · CPC title