Systems, methods, and apparatuses for tile load, multiplication and accumulation

US12182571B2 · US · B2

Patent metadata
Publication number: US-12182571-B2
Application number: US-202318100194-A
Country: US
Kind code: B2
Filing date: Jan 23, 2023
Priority date: Mar 20, 2017
Publication date: Dec 31, 2024
Grant date: Dec 31, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments detailed herein relate to matrix operations. In particular, the loading of a matrix (tile) from memory. For example, support for a loading instruction is described in the form of decode circuitry to decode an instruction having fields for an opcode, a destination matrix operand identifier, and source memory information, and execution circuitry to execute the decoded instruction to load groups of strided data elements from memory into configured rows of the identified destination matrix operand to memory.
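As an illustration of the strided load the abstract describes, here is a minimal software sketch. It is not the patented circuitry or any real instruction encoding: the function name `tile_load`, its parameters, and the flat-list model of memory are all assumptions made for this example.

```python
def tile_load(memory, base, stride, rows, cols):
    """Copy `rows` groups of `cols` consecutive elements into a tile.

    Hypothetical model: `memory` is a flat list, `base` is the index of the
    first element, and `stride` is the distance (in elements) between the
    starts of consecutive groups. Each group fills one row of the tile.
    """
    if rows <= 0 or cols <= 0 or stride <= 0:
        raise ValueError("rows, cols and stride must be positive")
    last_end = base + (rows - 1) * stride + cols
    if base < 0 or last_end > len(memory):
        raise ValueError("tile would read outside of memory")

    tile = []
    for r in range(rows):
        start = base + r * stride          # start of the r-th strided group
        tile.append(list(memory[start:start + cols]))
    return tile


# A flat buffer standing in for a 4x8 row-major matrix in memory.
memory = list(range(32))
# Load a 3x4 tile whose top-left element sits at index 2; the stride of 8
# matches the width of the enclosing matrix.
tile = tile_load(memory, base=2, stride=8, rows=3, cols=4)
```

Setting the stride to the row width of the enclosing matrix is what lets a small tile be lifted out of a larger matrix without copying the elements in between.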

First claim

Opening claim text (preview).

We claim:

1. An apparatus comprising: a memory interface; matrix processing circuitry coupled to a memory via the memory interface, the matrix processing circuitry to execute instructions to perform matrix multiplication operations with a first source matrix comprising a first plurality of data elements and a second source matrix comprising a second plurality of data elements, wherein the first source matrix comprises a first plurality of matrix tiles and the second source matrix comprises a second plurality of matrix tiles, each matrix tile in the first plurality of matrix tiles comprising a subset of non-overlapping data elements of the first plurality of data elements and each matrix tile in the second plurality of matrix tiles comprising a subset of non-overlapping data elements of the second plurality of data elements; and a first plurality of vector registers to store a first tile comprising a first subset of non-overlapping data elements of the first plurality of data elements, a second plurality of vector registers to store a second tile comprising a second subset of non-overlapping data elements of the second plurality of data elements, and a third plurality of vector registers to store a result matrix tile comprising a plurality of result data elements; the matrix processing circuitry to multiply each data element of the first subset of non-overlapping data elements with a corresponding data element of the second subset of non-overlapping data elements to generate a corresponding plurality of products, and to add one or more products of the corresponding plurality of products to a corresponding accumulation data element to generate a corresponding result data element of the plurality of result data elements of the result matrix tile, wherein tile usage of the matrix processing circuitry is to be configured by an execution of a configuration instruction prior to the matrix processing circuitry to multiply each data element of the first subset of non-overlapping data elements with a corresponding data element of the second subset of non-overlapping data elements to generate a corresponding plurality of products, and to add one or more products of the corresponding plurality of products to a corresponding accumulation data element to generate a corresponding result data element of the plurality of result data elements of the result matrix tile, wherein tile usage at least includes to configure the matrix processing circuitry to handle particular tile dimensions as determined from a configuration accessed by the execution of the configuration instruction.

2. The apparatus of claim 1 wherein the matrix processing circuitry is to execute one or more load instructions to load the first subset of non-overlapping data elements of the first plurality of data elements from memory into the first plurality of vector registers and to load the second subset of non-overlapping data elements of the second plurality of data elements from memory into the second plurality of vector registers.

3. The apparatus of claim 2 further comprising: decode circuitry to decode the one or more load instructions to load the first subset of non-overlapping data elements of the first plurality of data elements from memory into the first plurality of vector registers and to load the second subset of non-overlapping data elements of the second plurality of data elements from memory into the second plurality of vector registers, each load instruction including a first operand to specify a corresponding subset of the first or second plurality of vector registers.

4. The apparatus of claim 2, wherein the one or more load instructions are to load 64-bit data elements from memory locations generated using a base and an index into the first plurality of vector registers and the second plurality of vector registers.

5. The apparatus of claim 1 wherein the first subset of non-overlapping data elements of the first plurality of data elements are to be stored in the first plurality of vector registers in column-major order and the second subset of non-overlapping data elements of the second plurality of data elements are to be stored in the second plurality of vector registers in row-major order.

6. The apparatus of claim 1 wherein each data element of the first subset of non-overlapping data elements and each data element of the second subset of non-overlapping data elements comprises a first size and each data element of the result data elements comprises a second size which is at least twice the first size.

7. The apparatus of claim 6 wherein the first and second sizes are specified in at least one opcode executed by the matrix processing circuitry.

8. The apparatus of claim 7, wherein the size of each data element of the result data elements is a doubleword.

9. The apparatus of claim 8, wherein the size of each data element of the first subset of non-overlapping data elements and each data element of the second subset of non-overlapping data elements comprises a word.

10. The apparatus of claim 9 wherein each data element of the first subset of non-overlapping data elements and each data element of the second subset of non-overlapping data elements comprises a half-precision floating-point value.

11. A system comprising: a memory interface; a plurality of cores coupled to the memory interface, one or more cores of the plurality of cores to execute program code to schedule matrix multiplication operations; matrix processing circuitry coupled to a memory via the memory interface, the matrix processing circuitry to execute instructions to perform the matrix multiplication operations with a first source matrix comprising a first plurality of data elements and a second source matrix comprising a second plurality of data elements, wherein the first source matrix comprises a first plurality of matrix tiles and the second source matrix comprises a second plurality of matrix tiles, each matrix tile in the first plurality of matrix tiles comprising a subset of non-overlapping data elements of the first plurality of data elements and each matrix tile in the second plurality of matrix tiles comprising a subset of non-overlapping data elements of the second plurality of data elements; and a first plurality of vector registers to store a first tile comprising a first subset of non-overlapping data elements of the first plurality of data elements, a second plurality of vector registers to store a second tile comprising a second subset of non-overlapping data elements of the second plurality of data elements, and a third plurality of vector registers to store a result matrix tile comprising a plurality of result data elements; the matrix processing circuitry to multiply each data element of the first subset of non-overlapping data elements with a corresponding data element of the second subset of non-overlapping data elements to generate a corresponding plurality of products, and to add one or more products of the corresponding plurality of products to a corresponding accumulation data element to generate a corresponding result data element of the plurality of result data elements of the result matrix tile, wherein tile usage of the matrix processing circuitry is to be configured by an execution of a configuration instruction prior to the matrix processing circuitry to multiply each data element of the first subset of non-overlapping data elements with a corresponding data element of the second subset of non-overlapping data elements to generate a corresponding plurality of products, and to add one or more products of the corresponding plurality of products to a corresponding accumulation data element to generate a corresponding result data element
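The flow recited in claim 1 (configure tile dimensions first, then multiply tiles and add the products into accumulation elements of a result tile) can be sketched in software. This is only an illustrative model, not the claimed hardware: `TileConfig`, `get_tile`, `tile_mac`, and `tiled_matmul` are hypothetical names, plain Python lists stand in for the vector registers, and the sketch assumes the matrix dimensions are exact multiples of the tile dimensions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileConfig:
    """Hypothetical stand-in for the configuration read before any multiply."""
    rows: int   # rows of an A tile and of a result tile
    inner: int  # columns of an A tile, equal to rows of a B tile
    cols: int   # columns of a B tile and of a result tile


def get_tile(matrix, r0, c0, rows, cols):
    """Return the non-overlapping rows x cols tile whose corner is (r0, c0)."""
    return [row[c0:c0 + cols] for row in matrix[r0:r0 + rows]]


def tile_mac(acc, a_tile, b_tile):
    """Multiply-accumulate in place: acc += a_tile x b_tile."""
    for i, a_row in enumerate(a_tile):
        for k, a in enumerate(a_row):
            for j, b in enumerate(b_tile[k]):
                acc[i][j] += a * b   # add each product to its accumulator


def tiled_matmul(a, b, cfg):
    """Compute a x b one result tile at a time, using the dimensions in cfg."""
    m, k, n = len(a), len(b), len(b[0])
    if len(a[0]) != k:
        raise ValueError("inner dimensions of a and b do not match")
    if m % cfg.rows or k % cfg.inner or n % cfg.cols:
        raise ValueError("matrix dimensions must be multiples of the tile dimensions")

    result = [[0] * n for _ in range(m)]
    for r0 in range(0, m, cfg.rows):
        for c0 in range(0, n, cfg.cols):
            acc = [[0] * cfg.cols for _ in range(cfg.rows)]  # result tile
            for k0 in range(0, k, cfg.inner):
                a_tile = get_tile(a, r0, k0, cfg.rows, cfg.inner)
                b_tile = get_tile(b, k0, c0, cfg.inner, cfg.cols)
                tile_mac(acc, a_tile, b_tile)
            for i in range(cfg.rows):
                result[r0 + i][c0:c0 + cfg.cols] = acc[i]
    return result


# Configuration happens before any multiplication, mirroring the claim order.
cfg = TileConfig(rows=1, inner=2, cols=1)
a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [1, 0], [0, 1]]
product = tiled_matmul(a, b, cfg)
```

Keeping the accumulator tile separate from the source tiles mirrors the three register groups in the claim: two hold the source tiles and a third holds the result tile that receives the summed products.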

Assignees

Intel Corp

Inventors

Classifications

  • Image or video data

  • Vector or matrix data

  • Sum of products (for applications thereof, see the relevant places, e.g. G06F17/10, H03H17/00)

  • with multidimensional access, e.g. row/column, matrix

  • Recovery, e.g. branch miss-prediction, exception handling (error detection or correction G06F11/00)

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12182571B2 cover?
Embodiments detailed herein relate to matrix operations. In particular, the loading of a matrix (tile) from memory. For example, support for a loading instruction is described in the form of decode circuitry to decode an instruction having fields for an opcode, a destination matrix operand identifier, and source memory information, and execution circuitry to execute the decoded instruction to l…
Who is the assignee on this patent?
Intel Corp
What technology area does this patent fall under?
Primary CPC classification G06F9/30036. Mapped technology areas include Physics.
When was this patent published?
Publication date: Dec 31, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).