Machine learning sparse computation mechanism for arbitrary neural networks, arithmetic compute microarchitecture, and sparsity for training mechanism

US12380326B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12380326-B2
Application number: US-202418620400-A
Country: US
Kind code: B2
Filing date: Mar 28, 2024
Priority date: Dec 29, 2017
Publication date: Aug 5, 2025
Grant date: Aug 5, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An apparatus to facilitate processing of a sparse matrix for arbitrary graph data is disclosed. The apparatus includes a graphics processing unit having a data management unit (DMU) that includes a scheduler for scheduling matrix operations, an active logic for tracking active input operands, and a skip logic for tracking unimportant input operands to be skipped by the scheduler. Processing circuitry is coupled to the DMU. The processing circuitry comprises a plurality of processing elements including logic to read operands and a multiplication unit to multiply two or more operands for the arbitrary graph data and customizable circuitry to provide custom functions.
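As a rough software analogy for the skip idea in the abstract, the sketch below multiplies a sparse matrix by a dense matrix while touching only the stored non-zero operands. It assumes a CSR (compressed sparse row) layout as the compressed representation, which the abstract does not specify, and the function names `dense_to_csr` and `csr_times_dense` are hypothetical. It illustrates the arithmetic saved by skipping zero operands, not the DMU hardware itself.

```python
from typing import List, Tuple

# CSR triple: non-zero values, their column indices, and row start offsets.
CSR = Tuple[List[float], List[int], List[int]]


def dense_to_csr(matrix: List[List[float]]) -> CSR:
    """Compress a matrix into CSR form, keeping only non-zero entries.
    CSR is an assumed format chosen for illustration."""
    values: List[float] = []
    col_idx: List[int] = []
    row_ptr: List[int] = [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # end offset of this row
    return values, col_idx, row_ptr


def csr_times_dense(csr: CSR, dense: List[List[float]]):
    """Multiply a CSR sparse matrix by a dense matrix.
    Zero operands are never read, so they cost no multiplications.
    Returns the product and the number of multiplications performed."""
    values, col_idx, row_ptr = csr
    n_rows = len(row_ptr) - 1
    n_cols = len(dense[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    multiplies = 0
    for i in range(n_rows):
        # Only the stored (active) operands of row i are visited.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            a = values[k]
            b_row = dense[col_idx[k]]
            for j in range(n_cols):
                out[i][j] += a * b_row[j]
                multiplies += 1
    return out, multiplies


if __name__ == "__main__":
    sparse = [[0, 2, 0], [0, 0, 0], [1, 0, 3]]
    dense = [[1, 2], [3, 4], [5, 6]]
    product, count = csr_times_dense(dense_to_csr(sparse), dense)
    print(product, count)
```

In this small example the sparse matrix has three non-zero entries, so the loop performs 6 multiplications where a dense 3x3 by 3x2 product would perform 18.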

First claim

Opening claim text (preview).

What is claimed is:

1. A graphics processor comprising: a data management unit (DMU) having a scheduler to schedule matrix operations; and a plurality of processing elements coupled to the DMU, the DMU configured to: determine a number of workgroups to launch; select a work group size for each of the number of workgroups; and launch the number of workgroups for execution via the plurality of processing elements, wherein the number of workgroups is determined based on a total number of hardware threads supported by the plurality of processing elements and a number of data elements associated with a hardware thread, each workgroup to configure a processing element of the plurality of processing elements to: obtain a first matrix and a second matrix, wherein the first matrix is a sparse matrix that is encoded in a compressed tensor representation and the second matrix a dense matrix; and offload the second matrix in a coalesced manner from memory to a shared local memory (SLM), wherein the number of work groups to launch is determined to reduce a number of redundant memory loads from a global memory to the SLM associated with the offload of the second matrix.

2. The graphics processor as in claim 1, wherein the redundant global memory loads are associated with redundant operands associated with an irregular neural network having a connection across non-adjacent layers of the neural network.

3. The graphics processor as in claim 1, wherein the number of data elements associated with each hardware thread is determined based on a single instruction multiple data (SIMD) width associated with each hardware thread.

4. The graphics processor as in claim 3, wherein the DMU includes circuitry to load balance between hardware threads.

5. The graphics processor as in claim 4, wherein each hardware thread is configured to process multiple software threads.

6. The graphics processor as in claim 5, wherein a hardware thread is configured to complete an operation on a first block of data and processes a second block of data if the second block of data is available to be processed.

7. The graphics processor as in claim 6, wherein the hardware thread is configured to process one of 8, 16, or 32 software threads.

8. The graphics processor as in claim 7, wherein the DMU includes circuitry to track redundant input operands to be skipped by the scheduler.

9. The graphics processor as in claim 8, the scheduler to determine scheduling of computations based on skipping redundant operations associated with the second matrix.

10. The graphics processor as in claim 9, further comprising memory having pointer circuitry to store base pointers for input and output vectors and memory to store input and output vectors, the DMU to load input vectors associated with the first matrix and the second matrix and determine redundant input operations to be skipped by the scheduler.

11. A hardware accelerator, comprising: a memory device including pointer circuitry to store base pointers for input vectors and output vectors and memory to store input vectors and output vectors; a data management unit (DMU) including a scheduler to schedule matrix operations and a buffer to store active input operands, the scheduler to determine scheduling of computations based on skipping redundant operations; and a plurality of processing elements coupled to the DMU, the DMU configured to: determine a number of workgroups to launch; select a work group size for each of the number of workgroups; and launch the number of workgroups for execution via the plurality of processing elements, wherein the number of workgroups is determined based on a total number of hardware threads supported by the plurality of processing elements and a number of data elements associated with a hardware thread, and each workgroup of the number of workgroups is to configure a processing element of the plurality of processing elements to: obtain a first matrix and a second matrix, wherein the first matrix is a sparse matrix that is encoded in a compressed tensor representation and the second matrix a dense matrix; offload the second matrix in a coalesced manner from memory to a shared local memory (SLM), wherein the number of work groups to launch is determined to reduce a number of redundant memory loads from a global memory to the SLM associated with the offload of the second matrix; load the input vectors associated with the first matrix and the second matrix; and determine the redundant operations to be skipped by the scheduler.

12. The hardware accelerator as in claim 11, wherein the redundant global memory loads are associated with redundant operands associated with an irregular neural network having a connection across non-adjacent layers of the neural network.

13. The hardware accelerator as in claim 11, wherein the number of data elements associated with each hardware thread is determined based on a single instruction multiple data (SIMD) width associated with each hardware thread.

14. The hardware accelerator as in claim 13, wherein the DMU includes circuitry to load balance between hardware threads.

15. The hardware accelerator as in claim 14, wherein each hardware thread is configured to process multiple software threads.

16. The hardware accelerator as in claim 15, wherein a hardware thread is configured to complete an operation on a first block of data and processes a second block of data if the second block of data is available to be processed.

17. The hardware accelerator as in claim 16, wherein the hardware thread is configured to process one of 8, 16, or 32 software threads.

18. The hardware accelerator as in claim 17, wherein the redundant operations to be skipped by the scheduler include redundant input operations associated with the second matrix and the DMU includes circuitry to track redundant input operands to be skipped by the scheduler.

19. A method of accelerating a hardware-based matrix operation, comprising: storing base pointers for input and output vectors and input and output vectors in a memory device; scheduling matrix operations using a scheduler in a data management unit (DMU), the scheduling including determining a number of workgroups to launch based on a total number of hardware threads supported by a plurality of processing elements and a number of data elements associated with a hardware thread and determine scheduling of computations based on skipping redundant operations; and launching the number of workgroups for execution via the plurality of processing elements, wherein each workgroup of the number of workgroups configures a processing element of the plurality of processing elements to perform operations comprising: obtaining a first matrix and a second matrix, the first matrix being a sparse matrix encoded in a compressed tensor representation and the second matrix being a dense matrix; offloading the second matrix from memory to a shared local memory (SLM) in a coalesced manner, the number of workgroups determined to reduce redundant memory loads from a global memory to the SLM; loading input vectors associated with the first matrix and the second matrix; and determining the redundant operations to be skipped by the scheduler in the DMU.

20. The method of claim 19, comprising: configuring the DMU to store active input operands in a buffer; and selecting a work group size for each of the number of workgroups, wherein the number of redundant operations to be skipped by the scheduler in the DMU include redundant
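Claims 1, 3, and 7 tie the number of workgroups to the total hardware thread count and to the data elements handled per hardware thread, which in turn follows the SIMD width (8, 16, or 32). The claims do not give a formula, so the sketch below is only one plausible reading under stated assumptions: each hardware thread covers `simd_width` elements, the launch never uses more threads than the hardware supports, and a fixed `threads_per_group` is chosen by the caller. The function `plan_launch` and all of its parameters are hypothetical names.

```python
import math
from typing import Tuple


def plan_launch(total_hw_threads: int, simd_width: int,
                num_elements: int, threads_per_group: int) -> Tuple[int, int]:
    """Return (num_workgroups, hw_threads_used) for a launch.

    Assumed heuristic, not the patented method: size the launch so that
    it covers the data without exceeding the supported hardware threads.
    """
    if simd_width not in (8, 16, 32):
        raise ValueError("SIMD width is assumed to be 8, 16, or 32")
    if min(total_hw_threads, num_elements, threads_per_group) <= 0:
        raise ValueError("all sizes must be positive")

    # Data elements per hardware thread follow the SIMD width (claim 3).
    elements_per_thread = simd_width

    # Hardware threads needed to cover every element once.
    threads_needed = math.ceil(num_elements / elements_per_thread)

    # Never plan for more threads than the processing elements support;
    # threads that finish a block would pick up the next block instead.
    threads_used = min(threads_needed, total_hw_threads)

    # Group the threads into workgroups of the selected size.
    num_workgroups = math.ceil(threads_used / threads_per_group)
    return num_workgroups, threads_used


if __name__ == "__main__":
    print(plan_launch(total_hw_threads=56, simd_width=16,
                      num_elements=1000, threads_per_group=8))
```

Capping the launch at the supported thread count is one way to keep each workgroup's copy of the dense matrix in shared local memory reused across several blocks, rather than reloaded by many short-lived workgroups; whether the patented scheduler works this way is not stated in the claim text shown here.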

Assignees

Intel Corp

Inventors

Classifications

  • Architecture, e.g. interconnection topology

  • G06N3/063 (Primary): using electronic means

  • G06N3/08 (Primary): Learning methods

  • Supervised learning

  • Distributed learning, e.g. federated learning


Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12380326B2 cover?
An apparatus to facilitate processing of a sparse matrix for arbitrary graph data is disclosed. The apparatus includes a graphics processing unit having a data management unit (DMU) that includes a scheduler for scheduling matrix operations, an active logic for tracking active input operands, and a skip logic for tracking unimportant input operands to be skipped by the scheduler. Processing cir…
Who is the assignee on this patent?
Intel Corp
What technology area does this patent fall under?
Primary CPC classification G06N3/063. Mapped technology areas include Physics.
When was this patent published?
Publication date: Aug 5, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).