US10649772B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10649772-B2 |
| Application number | US-201815941526-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 30, 2018 |
| Priority date | Mar 30, 2018 |
| Publication date | May 12, 2020 |
| Grant date | May 12, 2020 |
Official abstract text for this publication.
Disclosed embodiments relate to a method and apparatus for efficient matrix transpose. In one example, a processor to execute a matrix transpose instruction includes fetch circuitry to fetch the matrix transpose instruction specifying a destination matrix and a source matrix having (N×M) elements and (M×N) elements, respectively, a (N×M) load buffer, decode circuitry to decode the fetched matrix transpose instruction, and execution circuitry, responsive to the decoded matrix transpose instruction to, for each row X of M rows of the specified source matrix: fetch and buffer N elements of the row in a load register, and cause the N buffered elements to be written, in the same relative order as in the row, to column X of M columns of the load buffer, and the execution circuitry subsequently to write each of N rows of the load buffer to a same row of the destination matrix.
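The data movement in the abstract can be modeled in software. The following is a minimal sketch only, assuming plain Python lists stand in for the hardware structures; the function name `transpose_via_load_buffer` and the variable names are hypothetical and do not come from the patent.

```python
def transpose_via_load_buffer(source):
    """Software model of the abstract's data flow (illustrative, not the hardware)."""
    m = len(source)       # M rows in the source matrix
    n = len(source[0])    # N elements per source row

    # The (N x M) load buffer: N rows, M columns.
    load_buffer = [[None] * m for _ in range(n)]

    for x in range(m):
        # Fetch and buffer the N elements of row X in a "load register".
        load_register = list(source[x])
        # Write them, in the same relative order, to column X of the load buffer.
        for j, element in enumerate(load_register):
            load_buffer[j][x] = element

    # Write each of the N buffer rows to the same row of the destination matrix.
    destination = [list(row) for row in load_buffer]
    return destination


source = [[1, 2, 3],
          [4, 5, 6]]          # M = 2 rows, N = 3 elements per row
print(transpose_via_load_buffer(source))   # [[1, 4], [2, 5], [3, 6]]
```

In this model the load register and the load buffer are separate objects, mirroring the claim language that the load register is distinct from the load buffer.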
Opening claim text (preview).
What is claimed is:

1. A processor to execute a matrix transpose instruction, the processor comprising: fetch circuitry to fetch the matrix transpose instruction specifying a destination matrix and a source matrix having (N×M) elements and (M×N) elements, respectively, the matrix transpose instruction further specifying M, N, and an element size being one of 1 byte, 2 bytes, 4 bytes, 8 bytes, and 16 bytes; a (N×M) load buffer; decode circuitry to decode the fetched matrix transpose instruction; and execution circuitry, responsive to the decoded matrix transpose instruction to, for each row X of M rows of the source matrix: fetch and buffer N elements of the row in a load register distinct from the load buffer; and cause the N buffered elements to be written, in the same relative order as in the row, to column X of M columns of the load buffer; and the execution circuitry subsequently to write each of N rows of the load buffer to a same row of the destination matrix.
2. The processor of claim 1, wherein the load buffer comprises a (N×M) matrix of registers within a reorder buffer of the processor, and wherein the execution circuitry is to: generate an intermediate transposed result by causing each of the M buffered rows to be written to a corresponding column M of the load buffer; and causing each of N rows of the matrix of registers to be written to a corresponding row N of the destination matrix.
3. The processor of claim 2, wherein the load register comprises a load rotator of the processor, wherein the execution circuitry is to use the load register to buffer each of the M rows of the source matrix before causing the buffered row to be written to the load buffer.
4. The processor of claim 3, wherein the execution circuitry is to: execute a first operation to, for each buffered row X of M rows of the source matrix; cause the row to be written to the load buffer in a diagonal, starting at a matrix location shifted left by X positions, and wrapping around the matrix when encountering an edge; and execute a second operation to rotate each row Y of N rows of the load buffer rightwards by Y positions; and wherein X ranges from zero to M minus one, and Y ranges from zero to N minus one.
5. The processor of claim 4, wherein the execution circuitry is further to rotate each of the X rows by X positions in the load rotator, such that each of the N buffered elements is to line up with the load buffer column to which it will be written in the first operation.
6. The processor of claim 5, wherein the load rotator is to rotate data received in response to misaligned memory loads; and wherein the processor issues at least some speculative instructions and executes at least some instructions out-of-order, and wherein the reorder buffer is to enqueue instructions upon their issue, and to dequeue instructions upon their retirement, and to thereby assist in-order retirement of instructions.
7. The processor of claim 1, wherein matrix transpose instruction is a non-blocking instruction, and wherein the execution circuitry further comprises a matrix transpose engine to manage execution of the decoded matrix transpose instruction and allow a core pipeline of the processor to continue executing other instructions.
8. A method of executing a matrix transpose instruction by a processor, the method comprising: fetching, using fetch circuitry, the matrix transpose instruction specifying a destination matrix and a source matrix having (N×M) elements and (M×N) elements, respectively, the matrix transpose instruction further specifying M, N, and an element size being one of 1 byte, 2 bytes, 4 bytes, 8 bytes, and 16 bytes; decoding, using decode circuitry, the fetched matrix transpose instruction; and executing, using execution circuitry, responsive to the decoded matrix transpose instruction to, for each row X of M rows of the source matrix: fetch and buffer N elements of the row in a load register; and cause the N buffered elements to be written, in the same relative order as in the row, to column X of M columns of a load buffer, the load buffer being distinct from the load register and having (N×M) elements; and the execution circuitry subsequently to write each of N rows of the load buffer to a same row of the destination matrix.
9. The method of claim 8, wherein the load buffer comprises a (N×M) matrix of registers within a reorder buffer of the processor, and wherein the execution circuitry is to: generate an intermediate transposed result by causing each of the M buffered rows to be written to a corresponding column M of the load buffer; and cause each of N rows of the load buffer to be written to a corresponding row N of the destination matrix.
10. The method of claim 9, wherein the load register comprises a load rotator of the processor, wherein the execution circuitry is to use the load register to buffer each of the M buffered rows before causing the buffered row to be written to the load buffer.
11. The method of claim 10, further comprising: executing, by the execution circuitry, a first operation to, for each buffered row X of the source matrix; cause the row to be written to the load buffer in a diagonal, starting at a matrix location shifted left by X positions, and wrapping around the matrix when encountering an edge; and executing, by the execution circuitry, a second operation to rotate each row Y of N rows of the load buffer rightwards by Y positions; and wherein X ranges from zero to M minus one, and Y ranges from zero to N minus one.
12. The method of claim 11, further comprising rotating, by the execution circuitry, each of the X rows by X positions in the load rotator, such that each of the N buffered elements is to line up with the load buffer column to which it will be written in the first operation.
13. The method of claim 12, wherein the load rotator is to rotate data received in response to misaligned memory loads; and wherein the processor issues at least some speculative instructions and executes at least some instructions out-of-order, and wherein the reorder buffer is to enqueue instructions upon their issue, and to dequeue instructions upon their retirement, and to thereby assist in-order retirement of instructions.
14. The method of claim 8, wherein matrix transpose instruction is a non-blocking instruction, and wherein the execution circuitry further comprises a matrix transpose engine to manage execution of the decoded matrix transpose instruction and allow a core pipeline of the processor to continue executing other instructions.
15. A non-transitory machine-readable medium containing instructions that, when executed by a processor, cause the processor to execute a matrix transpose instruction by: fetching, using fetch circuitry, the matrix transpose instruction specifying a destination matrix and a source matrix having (N×M) elements and (M×N) elements, respectively, the matrix transpose instruction further specifying M, N, and an element size being one of 1 byte, 2 bytes, 4 bytes, 8 bytes, and 16 bytes; decoding, using decode circuitry, the fetched matrix transpose instruction; and executing, using execution circuitry, responsive to the decoded matrix transpose instruction to, for each row X of M rows of the source matrix: fetch and buffer N elements of the row in a load register; and cause the N buffered elements to be written, in the same relative order as in the row, to column X of M columns of a load buffer, the load buffer being distinct from the load register and having (N×M) elements; and the execution circuitry subsequently to write each of N rows of t
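The two-operation variant in the dependent claims (a wrapped diagonal write followed by a per-row rotation) can be illustrated with a small software model. This is a sketch under stated assumptions, not the patented circuit: the index convention is one interpretation chosen so that the two steps compose to a transpose, the claim's exact left/right convention is not reproduced, the pre-rotation in the load rotator is not modeled, and the names `rotate_right` and `transpose_diagonal` are hypothetical.

```python
def rotate_right(row, k):
    """Rotate a list rightwards by k positions (index i moves to i + k, wrapping)."""
    k %= len(row)
    return row[-k:] + row[:-k] if k else list(row)


def transpose_diagonal(source):
    """Illustrative model: wrapped diagonal write, then rotate buffer row Y by Y."""
    m = len(source)       # M source rows, so X ranges over 0..M-1
    n = len(source[0])    # N elements per row, so Y ranges over 0..N-1
    load_buffer = [[None] * m for _ in range(n)]

    # First operation (assumed convention): element j of source row X goes to
    # buffer row j, column (X - j) mod M, which traces a wrapped diagonal.
    for x in range(m):
        for j in range(n):
            load_buffer[j][(x - j) % m] = source[x][j]

    # Second operation: rotate each buffer row Y rightwards by Y positions,
    # which moves the element from column (X - Y) mod M to column X.
    return [rotate_right(load_buffer[y], y) for y in range(n)]


source = [[1, 2, 3],
          [4, 5, 6]]
print(transpose_diagonal(source))   # [[1, 4], [2, 5], [3, 6]]
```

In this model every element of a given source row lands in a different buffer row, which is the property that makes a diagonal write attractive when each buffer row accepts one write at a time.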
Pipelined decoding, e.g. using predecoding · CPC title
Instruction analysis, e.g. decoding, instruction word fields · CPC title
Vector or matrix data · CPC title
Register arrangements · CPC title
LOAD or STORE instructions; Clear instruction · CPC title