Accelerator systems and methods for matrix operations
US-10942738-B2 · Mar 9, 2021 · US
US12346694B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12346694-B2 |
| Application number | US-202117304794-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 25, 2021 |
| Priority date | Jun 25, 2021 |
| Publication date | Jul 1, 2025 |
| Grant date | Jul 1, 2025 |
Official abstract text for this publication.
A processing apparatus includes a general-purpose parallel processing engine including a set of multiple processing elements including a single precision floating-point unit, a double precision floating point unit, and an integer unit; a matrix accelerator including one or more systolic arrays; a first register file coupled with a first read control circuit, wherein the first read control circuit couples with the set of multiple processing elements and the matrix accelerator to arbitrate read requests to the first register file from the set of multiple processing elements and the matrix accelerator; and a second register file coupled with a second read control circuit, wherein the second read control circuit couples with the matrix accelerator to arbitrate read requests to the second register file from the matrix accelerator and limit access to the second register file by the set of multiple processing elements.
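The access pattern in the abstract can be illustrated with a small software model. This is a minimal sketch, not the patented circuit: the `ReadControl` class, the requester names, and the fixed-priority policy are all assumptions made for illustration, since the abstract does not state how competing requests are ordered. It shows only that the first register file serves both the processing elements and the matrix accelerator, while the second serves the matrix accelerator and turns away the processing elements.

```python
from dataclasses import dataclass

# Hypothetical requester names; the abstract only names the kinds of units.
PROCESSING_ELEMENTS = {"fp32", "fp64", "int"}
MATRIX_ACCELERATOR = "matrix"


@dataclass
class ReadControl:
    """Toy model of a read control circuit in front of one register file."""
    registers: list   # contents of the register file
    allowed: set      # requesters this circuit is coupled with
    priority: list    # fixed-priority order (assumed; not stated in the source)

    def arbitrate(self, requests):
        """Pick one winner among `requests` (requester -> register index).

        Returns (winner, value), or None if no coupled requester asked.
        """
        eligible = [r for r in self.priority
                    if r in requests and r in self.allowed]
        if not eligible:
            return None
        winner = eligible[0]
        return winner, self.registers[requests[winner]]


ORDER = ["matrix", "fp32", "fp64", "int"]

# First register file: shared by the processing elements and the accelerator.
first = ReadControl(registers=[10, 11, 12, 13],
                    allowed=PROCESSING_ELEMENTS | {MATRIX_ACCELERATOR},
                    priority=ORDER)

# Second register file: access by the processing elements is limited.
second = ReadControl(registers=[20, 21, 22, 23],
                     allowed={MATRIX_ACCELERATOR},
                     priority=ORDER)
```

In this model a request such as `second.arbitrate({"fp32": 1})` returns `None`, which stands in for the "limit access" behavior. The variant in which a subset of processing elements may also read the second register file could be modeled by adding those names to `allowed`.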
Opening claim text (preview).
What is claimed is:
1. A processing apparatus including: a general-purpose parallel processing engine comprising: a set of multiple processing elements including a single precision floating-point unit, a double precision floating point unit, and an integer unit; a matrix accelerator including one or more systolic arrays; a first register file coupled with a first read control circuit, wherein the first read control circuit couples with the set of multiple processing elements and the matrix accelerator to arbitrate read requests to the first register file from the set of multiple processing elements and the matrix accelerator; and a second register file coupled with a second read control circuit, wherein the second read control circuit couples with the matrix accelerator to arbitrate read requests to the second register file from the matrix accelerator and limit access to the second register file by the set of multiple processing elements.
2. The processing apparatus as in claim 1, wherein the second read control circuit couples with a subset of the set of multiple processing elements and is additionally configured to arbitrate read requests to the second register file from the subset of the set of multiple processing elements, wherein the subset of the set of multiple processing elements includes less than all processing elements in the set of multiple processing elements.
3. The processing apparatus as in claim 2, wherein the subset of the set of multiple processing elements includes the integer unit.
4. The processing apparatus as in claim 3, wherein the set of multiple processing elements additionally include a math unit to perform a transcendental math operation and the subset of the set of multiple processing elements includes the math unit.
5. The processing apparatus as in claim 1, wherein the matrix accelerator is to: receive a command to execute an instruction to perform a matrix operation on a set of input data; read a first set of input data from the first register file; read a second set of input data from the second register file; perform operations associated with the instruction according to the command; and write output of the operations to a register file selected from one of the first register file or the second register file.
6. The processing apparatus as in claim 5, further comprising a write control circuit coupled with the matrix accelerator, the set of multiple processing elements, the first register file, and the second register file, wherein the write control circuit is to arbitrate write requests to the first register file and the second register file and the matrix accelerator is to write output of the operations via the write control circuit.
7. The processing apparatus as in claim 5, wherein the operations associated with the instruction include one or more dot product operations.
8. The processing apparatus as in claim 7, wherein to read the second set of input data includes to read a first set of matrix elements associated with a first matrix and a second set of matrix elements associated with a second matrix.
9. The processing apparatus as in claim 7, wherein to read the first set of input data includes to read a value to add to a dot product computed by a pipeline stage of the one or more systolic arrays.
10. A method comprising: executing general-purpose parallel processing via a general-purpose parallel processing engine including a set of multiple processing elements including a single precision floating-point unit, a double precision floating-point unit, and an integer unit; performing a matrix operation using a matrix accelerator including a systolic array; arbitrating read requests from the set of multiple processing elements and the matrix accelerator to a first register file via a first read control circuit; and arbitrating read requests from the matrix accelerator to a second register file via a second read control circuit, wherein the second read control circuit limits access to the second register file by the set of multiple processing elements.
11. The method as in claim 10, further comprising performing the matrix operation via one or more pipeline stages of the systolic array within the matrix accelerator.
12. The method as in claim 11, wherein performing the matrix operation includes performing a dot product operation on a first set of input data read from the first register file and a second set of input data read from the second register file via the one or more pipeline stages of the systolic array.
13. The method as in claim 12, wherein reading the second set of input data includes reading a first set of matrix elements associated with a first matrix and a second set of matrix elements associated with a second matrix.
14. The method as in claim 13, wherein reading the first set of input data includes reading a value to add to a dot product computed by the dot product operation.
15. A system comprising: a memory device; and a graphics processor coupled to the memory device, the graphics processor comprising a general-purpose parallel processing engine including: a set of multiple processing elements including a single precision floating-point unit, a double precision floating point unit, and an integer unit; a matrix accelerator including one or more systolic arrays; a first register file coupled with a first read control circuit, wherein the first read control circuit couples with the set of multiple processing elements and the matrix accelerator to arbitrate read requests to the first register file from the set of multiple processing elements and the matrix accelerator; and a second register file coupled with a second read control circuit, wherein the second read control circuit couples with the matrix accelerator to arbitrate read requests to the second register file from the matrix accelerator and limit access to the second register file by the set of multiple processing elements.
16. The system as in claim 15, wherein the second read control circuit couples with a subset of the set of multiple processing elements and is additionally configured to arbitrate read requests to the second register file from the subset of the set of multiple processing elements, wherein the subset of the set of multiple processing elements includes less than all processing elements in the set of multiple processing elements.
17. The system as in claim 16, wherein the set of multiple processing elements additionally include a math unit to perform a transcendental math operation and the subset of the set of multiple processing elements includes the math unit and the integer unit.
18. The system as in claim 15, wherein the matrix accelerator is to: receive a command to execute an instruction to perform a matrix operation on a set of input data; read a first set of input data from the first register file; read a second set of input data from the second register file; perform operations associated with the instruction according to the command; and write output of the operations to a register file selected from one of the first register file or the second register file.
19. The system as in claim 18, further comprising a write control circuit coupled with the matrix accelerator, the set of multiple processing elements, the first register file, and the second register file, wherein the write control circuit is to arbitrate write requests to the first register file and the second register file and the matrix accelerator is to write output of the operations via the writ
Instructions to perform operations on packed data, e.g. vector, tile or matrix operations · CPC title
using a mask · CPC title
Matrix or vector computation {, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization (matrix transposition G06F7/78)} · CPC title
Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers {(G06F7/4806, G06F7/4824, G06F7/49, G06F7/491, G06F7/544 take precedence)} · CPC title
organised in groups of units sharing resources, e.g. clusters · CPC title