Sparse convolutional neural network accelerator
US-10891538-B2 · Jan 12, 2021 · US
US12499347B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12499347-B2 |
| Application number | US-202318471843-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 21, 2023 |
| Priority date | Apr 9, 2017 |
| Publication date | Dec 16, 2025 |
| Grant date | Dec 16, 2025 |
Official abstract text for this publication.
An apparatus to facilitate workload scheduling is disclosed. The apparatus includes one or more clients, one or more processing units to process workloads received from the one or more clients, including hardware resources and scheduling logic to schedule direct access of the hardware resources to the one or more clients to process the workloads.
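The scheduling idea in the abstract can be illustrated with a minimal Python sketch. It is not the patented scheduler. It assumes a priority-ordered queue (priority is mentioned in the dependent claims, but the ordering rule, the `Scheduler` class, `register`, `submit`, and `run` are all hypothetical names chosen for illustration). The callable returned by `register` is only a stand-in for the function pointer that the claims say a client receives from driver logic.

```python
import heapq
from itertools import count


class Scheduler:
    """Toy scheduler (illustrative only): a lower number means higher priority."""

    def __init__(self):
        self._queue = []
        self._order = count()  # tie-breaker: FIFO order within one priority level

    def register(self, client_name):
        # Hypothetical stand-in for the function pointer handed to a registered client.
        def submit(workload, priority):
            entry = (priority, next(self._order), client_name, workload)
            heapq.heappush(self._queue, entry)

        return submit

    def run(self):
        # Pop workloads in priority order and run each one directly.
        results = []
        while self._queue:
            _, _, client, workload = heapq.heappop(self._queue)
            results.append((client, workload()))
        return results


sched = Scheduler()
submit_a = sched.register("client_a")
submit_b = sched.register("client_b")
submit_a(lambda: "a-low", priority=5)
submit_b(lambda: "b-high", priority=1)
submit_a(lambda: "a-high", priority=1)
results = sched.run()
```

Here the two priority-1 workloads run before the priority-5 one, in submission order. A real implementation would also weigh the submission type, which this sketch leaves out.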
Opening claim text (preview).
What is claimed is:
1. A graphics processing unit comprising: a shared memory; a memory interface coupled with the shared memory; and a processing cluster including a plurality of graphics multiprocessors internal to the graphics processing unit, the processing cluster coupled with the shared memory via the memory interface, the plurality of graphics multiprocessors coupled via a data interconnect, the data interconnect to facilitate exchange of data between the plurality of graphics multiprocessors during cooperative execution of a workload, the plurality of graphics multiprocessors configured to process workloads received for execution, wherein a graphics multiprocessor of the plurality of graphics multiprocessors includes: a plurality of processing engines; a scheduler to schedule direct access to the plurality of processing engines to process the workloads, wherein the workloads are each associated with a precompiled neural network (NN) kernel; and a gather unit to bypass zero data values and gather non-zero data values associated with the workloads, the non-zero data values stored sparsely in memory.
2. The graphics processing unit of claim 1, wherein the non-zero data values are data values for a convolutional kernel to be multiplied by data elements of a feature map.
3. The graphics processing unit of claim 2, wherein convolutional kernel is an irregular convolutional kernel and the plurality of processing engines are configured to multiply the data values for the convolutional kernel by the data elements of the feature map.
4. The graphics processing unit of claim 2, wherein the scheduler is to schedule the workloads to the plurality of processing engines based on priority and a submission type associated with the workloads.
5. The graphics processing unit of claim 2, wherein the plurality of graphics multiprocessors are associated with driver logic to facilitate access to the plurality of graphics multiprocessors by one or more clients, the one or more clients registered to the driver logic, the one or more clients to bypass an operating system to access the plurality of graphics multiprocessors via a function pointer received from the driver logic.
6. The graphics processing unit of claim 5, wherein each of the one or more clients receives a function pointer to enable direct access to the plurality of processing engines.
7. The graphics processor processing unit of claim 5, wherein each of the one or more clients include an input interface to the plurality of graphics multiprocessors.
8. The graphics processing unit of claim 1, wherein the gather unit is to store a map to the non-zero data values and gather the non-zero data values based on the map.
9. A method to facilitate workload scheduling, comprising: receiving a request to access plurality of processing engines of a general purpose graphics processing unit, the general purpose graphics processing unit including a processing cluster having a plurality of graphics multiprocessors internal to the general purpose graphics processing unit, the plurality of graphics multiprocessors coupled via a data interconnect, the data interconnect to facilitate exchange of data between the plurality of graphics multiprocessors during cooperative execution of a workload, and a graphics multiprocessor of the plurality of graphics multiprocessors includes the plurality of processing engines; scheduling direct access to the plurality of processing engines to enable a client to process a workload provided by the client, wherein the workload is associated with a precompiled neural network (NN) kernel; and gathering, via a gather unit of the general purpose graphics processing unit, non-zero data values associated with the client while bypassing zero data values associated with the client, the non-zero data values stored sparsely in memory.
10. The method of claim 9, further comprising gathering the non-zero data values via a map to the non-zero data values.
11. The method of claim 9, further comprising bypassing an operating system and scheduling direct access to the plurality of processing engines via a Kernel Mode Driver (KMD) associated with the general purpose graphics processing unit, wherein the KMD provides a function pointer to enable direct access to the plurality of processing engines.
12. The method of claim 9, wherein access is provided to the client based on a priority and a submission client type.
13. The method of claim 12, further comprising registering the client with driver logic associated with the general purpose graphics processing unit.
14. The method as in claim 9, wherein the non-zero data values are data values for a convolutional kernel to be multiplied by data elements of a feature map.
15. The method as in claim 14, further comprising multiplying, via the plurality of processing engines of the general purpose graphics processing unit, the data values for the convolutional kernel by the data elements of the feature map.
16. A data processing system comprising: memory to store instructions; and one or more processors configured to execute the instructions, wherein the one or more processors include a graphics processing unit including a processing cluster having a plurality of graphics multiprocessors internal to the graphics processing unit, the plurality of graphics multiprocessors coupled via a data interconnect, the data interconnect to facilitate exchange of data between the plurality of graphics multiprocessors during cooperative execution of a workload, and the instructions configure the one or more processors to: receive a request to access plurality of processing engines of a general purpose graphics processing unit; schedule, via a scheduler, direct access to the plurality of processing engines to enable the graphics processing unit to process a workload provided by a client, wherein the workload is associated with a precompiled neural network (NN) kernel; and gather, via a gather unit of the general purpose graphics processing unit, non-zero data values associated with the client while bypassing zero data values associated with the client, the non-zero data values stored sparsely in memory.
17. The data processing system as in claim 16, wherein the non-zero data values are data values for a convolutional kernel to be multiplied by data elements of a feature map.
18. The data processing system of claim 17, wherein convolutional kernel is an irregular convolutional kernel and the plurality of processing engines are configured to multiply the data values for the convolutional kernel by the data elements of the feature map.
19. The data processing system of claim 17, wherein the scheduler is to provide access to the plurality of processing engines based on a priority and a submission client type.
20. The data processing system of claim 17, wherein the graphics processing unit is associated with driver logic to facilitate access to the plurality of processing engines and the client is registered to the driver logic, the client to bypass an operating system to access the plurality of graphics multiprocessors via a function pointer received from the driver logic.
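The gather behavior recited in the claims (bypass zero values, gather non-zero values via a map, then multiply kernel values by feature-map elements) can be sketched in software. The following Python example is a minimal illustration under stated assumptions, not the claimed hardware: the function names `build_map`, `gather`, and `sparse_conv2d` are hypothetical, the map is assumed to be a list of (row, column) indices, and the convolution is computed as a stride-1, unpadded cross-correlation.

```python
def build_map(kernel):
    """Return the (row, col) positions of non-zero kernel weights (the assumed 'map')."""
    return [(r, c) for r, row in enumerate(kernel) for c, w in enumerate(row) if w != 0]


def gather(kernel, nz_map):
    """Collect only the non-zero weights, in map order; zeros are never read."""
    return [kernel[r][c] for r, c in nz_map]


def sparse_conv2d(feature_map, kernel):
    """Stride-1, unpadded cross-correlation that multiplies only non-zero weights."""
    nz_map = build_map(kernel)
    weights = gather(kernel, nz_map)
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(feature_map) - kh + 1
    out_w = len(feature_map[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # One multiply-accumulate per non-zero weight; zero weights are skipped.
            out[i][j] = sum(
                w * feature_map[i + r][j + c] for w, (r, c) in zip(weights, nz_map)
            )
    return out


feature_map = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 2]]  # half of the weights are zero
result = sparse_conv2d(feature_map, kernel)
```

With this 2x2 kernel, each output element costs two multiplications instead of four, and `result` is `[[11, 14], [20, 23]]`, the same values a dense computation would give.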
Backpropagation, e.g. using gradient descent · CPC title
using electronic means · CPC title
considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration (scheduling strategies G06F9/4881 and subgroups) · CPC title
Priority · CPC title
Convolutional networks [CNN, ConvNet] · CPC title