Method and apparatus for efficient matrix transpose

US10649772B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-10649772-B2
Application numberUS-201815941526-A
CountryUS
Kind codeB2
Filing dateMar 30, 2018
Priority dateMar 30, 2018
Publication dateMay 12, 2020
Grant dateMay 12, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed embodiments relate to a method and apparatus for efficient matrix transpose. In one example, a processor to execute a matrix transpose instruction includes fetch circuitry to fetch the matrix transpose instruction specifying a destination matrix and a source matrix having (N×M) elements and (M×N) elements, respectively, a (N×M) load buffer, decode circuitry to decode the fetched matrix transpose instruction, and execution circuitry, responsive to the decoded matrix transpose instruction to, for each row X of M rows of the specified source matrix: fetch and buffer N elements of the row in a load register, and cause the N buffered elements to be written, in the same relative order as in the row, to column X of M columns of the load buffer, and the execution circuitry subsequently to write each of N rows of the load buffer to a same row of the load buffer.

First claim

Opening claim text (preview).

What is claimed is: 1. A processor to execute a matrix transpose instruction, the processor comprising: fetch circuitry to fetch the matrix transpose instruction specifying a destination matrix and a source matrix having (N×M) elements and (M×N) elements, respectively, the matrix transpose instruction further specifying M, N, and an element size being one of 1 byte, 2 bytes, 4 bytes, 8 bytes, and 16 bytes; a (N×M) load buffer; decode circuitry to decode the fetched matrix transpose instruction; and execution circuitry, responsive to the decoded matrix transpose instruction to, for each row X of M rows of the source matrix: fetch and buffer N elements of the row in a load register distinct from the load buffer; and cause the N buffered elements to be written, in the same relative order as in the row, to column X of M columns of the load buffer; and the execution circuitry subsequently to write each of N rows of the load buffer to a same row of the destination matrix. 2. The processor of claim 1 , wherein the load buffer comprises a (N×M) matrix of registers within a reorder buffer of the processor, and wherein the execution circuitry is to: generate an intermediate transposed result by causing each of the M buffered rows to be written to a corresponding column M of the load buffer; and causing each of N rows of the matrix of registers to be written to a corresponding row N of the destination matrix. 3. The processor of claim 2 , wherein the load register comprises a load rotator of the processor, wherein the execution circuitry is to use the load register to buffer each of the M rows of the source matrix before causing the buffered row to be written to the load buffer. 4. The processor of claim 3 , wherein the execution circuitry is to: execute a first operation to, for each buffered row X of M rows of the source matrix; cause the row to be written to the load buffer in a diagonal, starting at a matrix location shifted left by X positions, and wrapping around the matrix when encountering an edge; and execute a second operation to rotate each row Y of N rows of the load buffer rightwards by Y positions; and wherein X ranges from zero to M minus one, and Y ranges from zero to N minus one. 5. The processor of claim 4 , wherein the execution circuitry is further to rotate each of the X rows by X positions in the load rotator, such that each of the N buffered elements is to line up with the load buffer column to which it will be written in the first operation. 6. The processor of claim 5 , wherein the load rotator is to rotate data received in response to misaligned memory loads; and wherein the processor issues at least some speculative instructions and executes at least some instructions out-of-order, and wherein the reorder buffer is to enqueue instructions upon their issue, and to dequeue instructions upon their retirement, and to thereby assist in-order retirement of instructions. 7. The processor of claim 1 , wherein matrix transpose instruction is a non-blocking instruction, and wherein the execution circuitry further comprises a matrix transpose engine to manage execution of the decoded matrix transpose instruction and allow a core pipeline of the processor to continue executing other instructions. 8. A method of executing a matrix transpose instruction by a processor, the method comprising: fetching, using fetch circuitry, the matrix transpose instruction specifying a destination matrix and a source matrix having (N×M) elements and (M×N) elements, respectively, the matrix transpose instruction further specifying M, N, and an element size being one of 1 byte, 2 bytes, 4 bytes, 8 bytes, and 16 bytes; decoding, using decode circuitry, the fetched matrix transpose instruction; and executing, using execution circuitry, responsive to the decoded matrix transpose instruction to, for each row X of M rows of the source matrix: fetch and buffer N elements of the row in a load register; and cause the N buffered elements to be written, in the same relative order as in the row, to column X of M columns of a load buffer, the load buffer being distinct from the load register and having (N×M) elements; and the execution circuitry subsequently to write each of N rows of the load buffer to a same row of the destination matrix. 9. The method of claim 8 , wherein the load buffer comprises a (N×M) matrix of registers within a reorder buffer of the processor, and wherein the execution circuitry is to: generate an intermediate transposed result by causing each of the M buffered rows to be written to a corresponding column M of the load buffer; and cause each of N rows of the load buffer to be written to a corresponding row N of the destination matrix. 10. The method of claim 9 , wherein the load register comprises a load rotator of the processor, wherein the execution circuitry is to use the load register to buffer each of the M buffered rows before causing the buffered row to be written to the load buffer. 11. The method of claim 10 , further comprising: executing, by the execution circuitry, a first operation to, for each buffered row X of the source matrix; cause the row to be written to the load buffer in a diagonal, starting at a matrix location shifted left by X positions, and wrapping around the matrix when encountering an edge; and executing, by the execution circuitry, a second operation to rotate each row Y of N rows of the load buffer rightwards by Y positions; and wherein X ranges from zero to M minus one, and Y ranges from zero to N minus one. 12. The method of claim 11 , further comprising rotating, by the execution circuitry, each of the X rows by X positions in the load rotator, such that each of the N buffered elements is to line up with the load buffer column to which it will be written in the first operation. 13. The method of claim 12 , wherein the load rotator is to rotate data received in response to misaligned memory loads; and wherein the processor issues at least some speculative instructions and executes at least some instructions out-of-order, and wherein the reorder buffer is to enqueue instructions upon their issue, and to dequeue instructions upon their retirement, and to thereby assist in-order retirement of instructions. 14. The method of claim 8 , wherein matrix transpose instruction is a non-blocking instruction, and wherein the execution circuitry further comprises a matrix transpose engine to manage execution of the decoded matrix transpose instruction and allow a core pipeline of the processor to continue executing other instructions. 15. A non-transitory machine-readable medium containing instructions that, when executed by a processor, cause the processor to execute a matrix transpose instruction by: fetching, using fetch circuitry, the matrix transpose instruction specifying a destination matrix and a source matrix having (N×M) elements and (M×N) elements, respectively, the matrix transpose instruction further specifying M, N, and an element size being one of 1 byte, 2 bytes, 4 bytes, 8 bytes, and 16 bytes; decoding, using decode circuitry, the fetched matrix transpose instruction; and executing, using execution circuitry, responsive to the decoded matrix transpose instruction to, for each row X of M rows of the source matrix: fetch and buffer N elements of the row in a load register; and cause the N buffered elements to be written, in the same relative order as in the row, to column X of M columns of a load buffer, the load buffer being distinct from the load register and having (N×M) elements; and the execution circuitry subsequently to write each of N rows of t

Assignees

Inventors

Classifications

  • Pipelined decoding, e.g. using predecoding · CPC title

  • Instruction analysis, e.g. decoding, instruction word fields · CPC title

  • Vector or matrix data · CPC title

  • Register arrangements · CPC title

  • LOAD or STORE instructions; Clear instruction · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10649772B2 cover?
Disclosed embodiments relate to a method and apparatus for efficient matrix transpose. In one example, a processor to execute a matrix transpose instruction includes fetch circuitry to fetch the matrix transpose instruction specifying a destination matrix and a source matrix having (N×M) elements and (M×N) elements, respectively, a (N×M) load buffer, decode circuitry to decode the fetched matri…
Who is the assignee on this patent?
Intel Corp
What technology area does this patent fall under?
Primary CPC classification G06F9/30032. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue May 12 2020 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).