Machine learning sparse computation mechanism for arbitrary neural networks, arithmetic compute microarchitecture, and sparsity for training mechanism

US11636327B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11636327-B2
Application number: US-201715859203-A
Country: US
Kind code: B2
Filing date: Dec 29, 2017
Priority date: Dec 29, 2017
Publication date: Apr 25, 2023
Grant date: Apr 25, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An apparatus to facilitate processing of a sparse matrix for arbitrary graph data is disclosed. The apparatus includes a graphics processing unit having a data management unit (DMU) that includes a scheduler for scheduling matrix operations, an active logic for tracking active input operands, and a skip logic for tracking unimportant input operands to be skipped by the scheduler. Processing circuitry is coupled to the DMU. The processing circuitry comprises a plurality of processing elements including logic to read operands and a multiplication unit to multiply two or more operands for the arbitrary graph data.
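The idea of tracking active operands and skipping unimportant ones can be illustrated in software. The following is a minimal sketch, not the patented hardware: the function name `sparse_matvec` and the multiply counter are hypothetical, and only zero operands are skipped, because the abstract does not define how other unimportant operands are detected.

```python
def sparse_matvec(weights, x):
    """Multiply a matrix (list of rows) by vector x, skipping zero operands.

    Returns the output vector and the number of multiplications performed.
    """
    # "Active" tracking (illustrative): indices of non-zero input operands.
    active = [j for j, value in enumerate(x) if value != 0]

    y = [0.0] * len(weights)
    multiplies = 0
    for i, row in enumerate(weights):
        for j in active:          # zero inputs are never scheduled
            w = row[j]
            if w == 0:            # "skip" check: zero weight contributes nothing
                continue
            y[i] += w * x[j]
            multiplies += 1
    return y, multiplies


if __name__ == "__main__":
    weights = [[1, 0, 2],
               [0, 3, 0]]
    x = [4, 0, 5]
    result, count = sparse_matvec(weights, x)
    print(result, count)  # a dense product would need 6 multiplications
```

In this toy input only two of the six possible multiplications are performed; the rest are bypassed because either the input operand or the weight is zero.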

First claim

Opening claim text (preview).

What is claimed is:

1. An apparatus to facilitate processing a sparse matrix for arbitrary graph data, comprising: a graphics processing unit, including: a data management unit (DMU) having a scheduler to schedule matrix operations, an active circuitry to track active input operands, and a skip circuitry to track zero and redundant input operands to be skipped by the scheduler; and processing circuitry coupled to the DMU, the processing circuitry comprising a plurality of processing elements including circuitry to read operands, and a multiplication unit to multiply two or more operands for the arbitrary graph data, wherein the DMU configures the processing circuitry coupled to the DMU to bypass an operation having zero or redundant input operands associated with an irregular neural network having an arbitrary connection across non-adjacent layers of the neural network.

2. The apparatus of claim 1, wherein the scheduler to schedule non-zero and non-redundant operands at the multiplication unit.

3. The apparatus of claim 1, further comprising: memory having pointer circuitry to store base pointers for input and output vectors; and memory to store input and output vectors.

4. The apparatus of claim 1, wherein each processing element includes the circuitry to read operands, pointer circuitry for providing a column pointer to a memory address of a weighted coefficient of a matrix, data circuitry to generate and send a weighted coefficient value that is identified by the column pointer to the multiplication unit.

5. The apparatus of claim 4, wherein the data circuitry sends an identifier of a memory address or a position of the output vector to an output buffer.

6. The apparatus of claim 1, wherein the arbitrary connection across the non-adjacent layers of the neural network introduces the operations having the redundant or zero input operands.

7. A hardware accelerator to facilitate processing a sparse matrix for an arbitrary irregular neural network, comprising: a data management unit (DMU) having a scheduler to schedule matrix operations and an auxiliary buffer to store active input operands; and a plurality of processing elements coupled to the DMU, each processing element includes an input buffer for edge data and message data, and customizable circuitry to support an input vertex program for the arbitrary neural network, wherein the customizable circuitry to support an input vertex program supports an activate function.

8. The hardware accelerator of claim 7, wherein the customizable circuitry to support an input vertex program additionally supports customized functions including multiply, accumulate and send message functions.

9. The hardware accelerator of claim 8, wherein each processing element further comprises on-chip memory to receive vector data from off-chip memory via the DMU.

10. The hardware accelerator of claim 9, wherein the DMU to obtain updated vector data from the on-chip memory based on the customized functions and then to send the updated vector data to the off-chip memory.

11. The hardware accelerator of claim 7, wherein the hardware accelerator supports arbitrary connections across non-adjacent layers of the arbitrary irregular neural network.

12. A graphics processing unit, comprising: a sparsity management unit to manage sparsity operations, wherein the sparsity management unit comprises: a value check mechanism to detect unimportant values within input vectors, the unimportant values including zero operands and redundant operands, and skip operations for the unimportant values of the input vectors, and a scheduler to determine scheduling of computations based on scheduling important values and skipping unimportant values of input vectors that are detected by the value check mechanism; a block floating point (FP) management unit 3120 to support block FP operations; and a variable and mix precision compute unit to support variable and mix precision operations.

13. The graphics processing unit of claim 12, wherein the scheduler is to bypass computations associated with unimportant values for an irregular neural network having an arbitrary connection across non-adjacent layers of the neural network.

14. The graphics processing unit of claim 12, wherein the block FP management unit includes select circuitry to select a shared exponent for input vectors if the input vectors have block FP and thus different exponents.

15. The graphics processing unit of claim 14, wherein the block FP management unit includes align circuitry to cause alignment of a mantissa for the input vector that has a change in exponent.

16. The graphics processing unit of claim 12, wherein the variable and mix precision compute unit include computations units and accumulators to perform computations for input vectors, wherein the computations include at least one of spatial and temporal computations including any spatial and temporal combinations.

17. A method for training of data, comprising: obtaining a first sparse matrix encoded with compressed sparse row (CSR) and a second dense matrix; offloading the second dense matrix in a coalesced manner from memory to a shared local memory (SLM); determining a minimum number of workgroups to launch to minimize a number of redundant global memory loads to the SLM; selecting a work group size for each of the minimum number of workgroups; and launching the minimum number of workgroups for execution on a graphics processing unit (GPU), wherein the minimum number of workgroups is determined based on a total number of hardware threads supported by the GPU and a number of data elements associated with a hardware thread, wherein the redundant global memory loads are associated with redundant operands associated with an irregular neural network having an arbitrary connection across non-adjacent layers of the neural network.

18. The method of claim 17, wherein the number of data elements associated with each hardware thread is determined based on a single instruction multiple data (SIMD) width associated with each hardware thread.

19. The method of claim 18, further comprising: applying a load balancing technique for hardware threads such that each hardware thread completes a first block of data and processes a second block of data that is available.

20. The method of claim 19, further comprising: generating outputs for a Sparse Dense general matrix vector multiplication (GEMV) GPU implementation for training of data.
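Claims 14 and 15 refer to selecting a shared exponent and aligning mantissas for block floating point (FP) data. The sketch below shows one common software convention for this, offered only as an illustration: the function names, the 8-bit mantissa width, and the choice of the largest exponent as the shared one are assumptions, not details taken from the claims.

```python
import math


def to_block_fp(values, mantissa_bits=8):
    """Encode floats as integer mantissas that share a single exponent."""
    # Select step (assumed convention): use the largest exponent in the block.
    exponents = [math.frexp(v)[1] for v in values if v != 0.0]
    shared_exp = max(exponents) if exponents else 0

    # Align step: scaling every value by the shared exponent is equivalent to
    # shifting the mantissas of smaller values to the right, so small values
    # lose low-order bits to rounding.
    scale = 2.0 ** (mantissa_bits - shared_exp)
    mantissas = [int(round(v * scale)) for v in values]
    return mantissas, shared_exp


def from_block_fp(mantissas, shared_exp, mantissa_bits=8):
    """Decode integer mantissas and a shared exponent back to floats."""
    scale = 2.0 ** (shared_exp - mantissa_bits)
    return [m * scale for m in mantissas]


if __name__ == "__main__":
    block, exp = to_block_fp([1.0, 0.5, 0.25])
    print(block, exp)
    print(from_block_fp(block, exp))
```

Because every element in the block shares one exponent, multiply-accumulate work can operate on the integer mantissas alone and apply the exponent once at the end, which is the usual motivation for block FP formats.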

Assignees

Inventors

Classifications

  • Architecture, e.g. interconnection topology · CPC title

  • G06N3/063 · Primary

    using electronic means · CPC title

  • G06N3/08 · Primary

    Learning methods · CPC title

  • Supervised learning · CPC title

  • Distributed learning, e.g. federated learning · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11636327B2 cover?
An apparatus to facilitate processing of a sparse matrix for arbitrary graph data is disclosed. The apparatus includes a graphics processing unit having a data management unit (DMU) that includes a scheduler for scheduling matrix operations, an active logic for tracking active input operands, and a skip logic for tracking unimportant input operands to be skipped by the scheduler. Processing cir…
Who is the assignee on this patent?
Intel Corp
What technology area does this patent fall under?
Primary CPC classification G06N3/063. Mapped technology areas include Physics.
When was this patent published?
Publication date Apr 25, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).