Programmable coarse grained and sparse matrix compute hardware with advanced scheduling

US11210760B2 · US · B2

Patent metadata
Publication number: US-11210760-B2
Application number: US-202016928353-A
Country: US
Kind code: B2
Filing date: Jul 14, 2020
Priority date: Apr 28, 2017
Publication date: Dec 28, 2021
Grant date: Dec 28, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

One embodiment provides for a compute apparatus to perform machine learning operations, the compute apparatus comprising a decode unit to decode a single instruction into a decoded instruction, the decoded instruction to cause the compute apparatus to perform a complex machine learning compute operation.
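As a rough illustration of the abstract's idea, the sketch below models one instruction expanding into several pipeline commands that are then queued by compute unit type. The unit types (general-purpose graphics, sparse, near-data) come from the claim text on this page. The opcode `SPARSE_CONV`, the command names, and the expansion table are hypothetical and do not appear in the patent; this is a software analogy under those assumptions, not the hardware design.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PipelineCommand:
    op: str         # hypothetical operation name
    unit_type: str  # "gpgpu", "sparse", or "near_data"


# Hypothetical expansion table: one instruction opcode maps to several commands.
EXPANSIONS: Dict[str, List[PipelineCommand]] = {
    "SPARSE_CONV": [
        PipelineCommand("gather_nonzeros", "near_data"),
        PipelineCommand("sparse_matmul", "sparse"),
        PipelineCommand("bias_activation", "gpgpu"),
        PipelineCommand("scatter_output", "near_data"),
    ],
}


def decode(opcode: str) -> List[PipelineCommand]:
    """Expand a single instruction into its list of pipeline commands."""
    if opcode not in EXPANSIONS:
        raise ValueError(f"unknown opcode: {opcode}")
    return list(EXPANSIONS[opcode])


def schedule(commands: List[PipelineCommand]) -> Dict[str, List[str]]:
    """Group commands into per-unit-type queues, preserving program order."""
    queues: Dict[str, List[str]] = {}
    for cmd in commands:
        queues.setdefault(cmd.unit_type, []).append(cmd.op)
    return queues


queues = schedule(decode("SPARSE_CONV"))
```

Here `queues` holds one ordered list of operations per unit type, so a caller issues a single instruction and the expansion decides which unit handles each step.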

First claim

Opening claim text (preview).

The invention claimed is:

1. A compute apparatus comprising: a decode unit to decode a single instruction into a decoded instruction, the decoded instruction to cause the compute apparatus to perform a complex compute operation including multiple pipeline commands; a memory controller including a near-data compute unit; first circuitry to schedule the multiple pipeline commands to one or more of multiple types of compute units, wherein the multiple types of compute units include a general-purpose graphics compute unit and a near-data compute unit; and second circuitry to determine operations to perform for the single instruction, the second circuitry coupled with the memory controller, wherein the operations include to offload a compute kernel to the near-data compute unit and to offload the compute kernel to the near-data compute unit includes to determine an address range for a near-data compute operation within the compute kernel and offload the compute kernel to the near-data compute unit in response to a determination that the memory controller is associated with the address range of the near-data compute operation.

2. The compute apparatus as in claim 1, additionally including third circuitry including a fetch unit to fetch the single instruction and store the single instruction to a cache memory.

3. The compute apparatus as in claim 2, the third circuitry additionally including the decode unit.

4. The compute apparatus as in claim 3, wherein the complex compute operation is to perform a convolution for a layer of a convolutional neural network, wherein the convolution includes multiple matrix operations.

5. The compute apparatus as in claim 4, wherein the multiple types of compute units include a sparse compute unit, the sparse compute unit is configured to accelerate primitives associated with the multiple matrix operations.

6. The compute apparatus as in claim 5, wherein the multiple matrix operations are performed on one or more sparse matrices.

7. The compute apparatus as in claim 6, wherein the compute kernel offloaded to the near-data compute unit is to perform a gather operation to read elements of the one or more sparse matrices from memory.

8. The compute apparatus as in claim 6, wherein the compute kernel offloaded to the near-data compute unit is to perform a scatter operation to write elements of a sparse output matrix to memory, the sparse output matrix generated by at least one of the multiple matrix operations.

9. The compute apparatus as in claim 6, additionally including a machine learning accelerator to determine a set of operations to perform to execute the decoded instruction, wherein the set of operations includes to offload the compute kernel to the near-data compute unit and determine the multiple pipeline commands to perform for the complex compute operation.

10. The compute apparatus as in claim 9, additionally including a micro-controller to provide the machine learning accelerator.

11. A method of performing machine learning operations, the method comprising: decoding a single instruction into a decoded instruction, the decoded instruction associated with a set of multiple machine learning operations to be performed via a compute pipeline of a general-purpose graphics processing unit; determining a set of pipeline commands to perform the set of multiple machine learning operations, wherein the set of pipeline commands offload a near-data compute operation to a near-data compute unit; and scheduling the set of pipeline commands to the compute pipeline of the general-purpose graphics processing unit.

12. The method as in claim 11, wherein determining the set of pipeline commands to perform the set of multiple machine learning operations includes analyzing parameters associated with the decoded instruction.

13. The method as in claim 12, wherein analyzing parameters associated with the decoded instruction includes selecting the near-data compute unit from a set of multiple near-data compute units.

14. The method as in claim 13, wherein the set of multiple near-data compute units are associated with a set of multiple memory controllers, each memory controller in the set of multiple memory controllers having an associated address range.

15. The method as in claim 14, wherein selecting the near-data compute unit from a set of multiple near-data compute units includes selecting the near-data compute unit within a memory controller of the set of multiple memory controllers for a memory address associated with the near-data compute operation.

16. A data processing system comprising: a general-purpose graphics processing unit including a decode unit to decode a single instruction into a decoded instruction, the decoded instruction to cause the data processing system to execute multiple pipeline commands to perform a complex machine learning compute operation; a memory coupled to the general-purpose graphics processing unit; and a memory controller coupled with the general-purpose graphics processing unit and the memory, the memory controller including a near-data compute unit, wherein the multiple pipeline commands include a command to offload an operation of a compute kernel to the near-data compute unit.

17. The data processing system as in claim 16, the general-purpose graphics processing unit additionally including a sparse compute unit, wherein the multiple pipeline commands include a command to perform a matrix operation via the sparse compute unit.

18. The data processing system as in claim 17, wherein the operation offloaded to the near-data compute unit is a gather operation to read elements of a sparse matrix associated with the matrix operation.

19. The data processing system as in claim 17, wherein the operation offloaded to the near-data compute unit is a scatter operation to write elements of a sparse matrix associated with the matrix operation.

20. The data processing system as in claim 17, wherein the sparse compute unit is configured to accelerate a primitive associated with the matrix operation.
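The address-range test in claims 1 and 13 to 15, together with the gather and scatter operations in claims 7, 8, 18, and 19, can be sketched in software as follows. This is a minimal model under stated assumptions: the `MemoryController` class, the flat list standing in for memory, and the choice to return `None` when no single controller owns the whole range are all hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MemoryController:
    name: str
    base: int   # first address served (inclusive)
    limit: int  # end of the served range (exclusive)

    def owns(self, start: int, end: int) -> bool:
        """True if the half-open range [start, end) lies inside this controller."""
        return self.base <= start and end <= self.limit


def select_controller(
    controllers: List[MemoryController], start: int, end: int
) -> Optional[MemoryController]:
    """Pick the controller associated with the operation's address range."""
    for mc in controllers:
        if mc.owns(start, end):
            return mc
    return None  # assumption: caller keeps the kernel on a general-purpose unit


def near_data_gather(memory: List[float], indices: List[int]) -> List[float]:
    """Gather: read the listed sparse elements into a dense list."""
    return [memory[i] for i in indices]


def near_data_scatter(
    memory: List[float], indices: List[int], values: List[float]
) -> None:
    """Scatter: write dense values back to their sparse positions."""
    for i, v in zip(indices, values):
        memory[i] = v


controllers = [MemoryController("mc0", 0, 8), MemoryController("mc1", 8, 16)]
memory = [0.0] * 16
near_data_scatter(memory, [9, 12, 15], [2.5, -1.0, 4.0])

indices = [9, 12, 15]
target = select_controller(controllers, min(indices), max(indices) + 1)
gathered = near_data_gather(memory, indices) if target else None
```

In this example every index falls inside the range served by `mc1`, so the gather is offloaded there. A range that straddles two controllers yields `None`, which a real scheduler might handle by splitting the kernel or running it elsewhere.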

Assignees

Intel Corp

Inventors

Classifications

  • Recurrent networks, e.g. Hopfield networks

  • Combinations of networks

  • Quantised networks; Sparse networks; Compressed networks

  • Convolutional networks [CNN, ConvNet]

  • Distributed learning, e.g. federated learning

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11210760B2 cover?
One embodiment provides for a compute apparatus to perform machine learning operations, the compute apparatus comprising a decode unit to decode a single instruction into a decoded instruction, the decoded instruction to cause the compute apparatus to perform a complex machine learning compute operation.
Who is the assignee on this patent?
Intel Corp
What technology area does this patent fall under?
Primary CPC classification G06T1/20. Mapped technology areas include Physics.
When was this patent published?
Publication date: Dec 28, 2021 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).