Systolic arithmetic on sparse data

US12361600B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12361600-B2
Application number: US-202318322194-A
Country: US
Kind code: B2
Filing date: May 23, 2023
Priority date: Nov 15, 2019
Publication date: Jul 15, 2025
Grant date: Jul 15, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments described herein provided for an instruction and associated logic to enable a processing resource including a tensor accelerator to perform optimized computation of sparse submatrix operations. One embodiment provides a parallel processor comprising a processing cluster coupled with the cache memory. The processing cluster includes a plurality of multiprocessors coupled with a data interconnect, where a multiprocessor of the plurality of multiprocessors includes a tensor core configured to load tensor data and metadata associated with the tensor data from the cache memory, wherein the metadata indicates a first numerical transform applied to the tensor data, perform an inverse transform of the first numerical transform, perform a tensor operation on the tensor data after the inverse transform is performed, and write output of the tensor operation to a memory coupled with the processing cluster.
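The load, inverse-transform, compute, and write sequence in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the patented hardware: the names `StoredTensor`, `load_and_invert`, and `tensor_core_step` are hypothetical, elements are assumed to be 8-bit unsigned integers, the tensor operation is assumed to be a plain matrix multiply, and only the bit-flip and bit-rotate transforms are modeled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

WORD_BITS = 8                      # assumption: 8-bit unsigned elements
MASK = (1 << WORD_BITS) - 1


def bit_flip(x: int) -> int:
    """Invert every bit of the word; this transform is its own inverse."""
    return x ^ MASK


def rotate_left(x: int, r: int = 1) -> int:
    """Forward bit-rotate transform (rotate left by r bits)."""
    return ((x << r) | (x >> (WORD_BITS - r))) & MASK


def rotate_right(x: int, r: int = 1) -> int:
    """Inverse of rotate_left for the same r."""
    return ((x >> r) | (x << (WORD_BITS - r))) & MASK


# Hypothetical metadata encoding: transform name -> inverse function.
INVERSE: Dict[str, Callable[[int], int]] = {
    "none": lambda x: x,
    "bit_flip": bit_flip,
    "bit_rotate": rotate_right,
}


@dataclass
class StoredTensor:
    rows: List[List[int]]   # transformed tensor data as held in the cache
    transform: str          # metadata naming the transform that was applied


def load_and_invert(stored: StoredTensor) -> List[List[int]]:
    """Read the metadata and undo the transform it names."""
    inverse = INVERSE[stored.transform]
    return [[inverse(v) for v in row] for row in stored.rows]


def matmul(a: List[List[int]], b: List[List[int]]) -> List[List[int]]:
    """Stand-in tensor operation: dense matrix multiply."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def tensor_core_step(a_stored, b_stored, memory: dict, key: str) -> None:
    """Load, inverse-transform, compute, then write the output to memory."""
    a = load_and_invert(a_stored)
    b = load_and_invert(b_stored)
    memory[key] = matmul(a, b)


# Usage: operands are stored transformed, with metadata naming the transform.
a_stored = StoredTensor([[bit_flip(v) for v in row] for row in [[1, 0], [0, 2]]], "bit_flip")
b_stored = StoredTensor([[rotate_left(v) for v in row] for row in [[3, 4], [5, 6]]], "bit_rotate")
memory: dict = {}
tensor_core_step(a_stored, b_stored, memory, "out")
# memory["out"] is [[3, 4], [10, 12]]
```

The point of the sketch is the ordering: the metadata is consulted before any arithmetic, so the tensor operation always sees the original values regardless of which transform was used for storage.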

First claim

Opening claim text (preview).

What is claimed is:

1. A parallel processor comprising: a cache memory; and a processing cluster coupled with the cache memory, the processing cluster including a plurality of multiprocessors coupled with a data interconnect, wherein a multiprocessor of the plurality of multiprocessors includes a tensor core configured to: load tensor data and metadata associated with the tensor data from the cache memory, wherein the metadata indicates a first numerical transform applied to the tensor data; perform an inverse transform of the first numerical transform; perform a tensor operation on the tensor data after the inverse transform is performed; and write output of the tensor operation to a memory coupled with the processing cluster.

2. The parallel processor of claim 1, wherein to write output of the tensor operation to the memory coupled with the processing cluster includes to write the output to a main memory of the parallel processor.

3. The parallel processor of claim 2, wherein to write the output of the tensor operation to the main memory of the parallel processor includes to write the output to the cache memory.

4. The parallel processor of claim 1, the tensor core configured to: decompress or decode the tensor data before the inverse transform is applied to the tensor data.

5. The parallel processor of claim 1, the tensor operation performed on the tensor data associated with an inference operation performed via the tensor core.

6. The parallel processor of claim 5, the tensor core configured to compress or encode the output of the tensor operation within the tensor core after performing the numerical transform on the tensor data.

7. The parallel processor of claim 1, the tensor core configured to apply a second numerical transform to the output of the tensor operation and generate metadata to indicate that the second numerical transform is applied to the output of the tensor operation.

8. The parallel processor of claim 7, the tensor core configured to: apply the first numerical transform to at least a portion of the output of the tensor operation to generate first test transform data; apply the second numerical transform to at least a portion of the output of the tensor operation to generate second test transform data; determine compressibility metrics based on analysis of the first test transform data and the second test transform data; and apply the second transform to the output of the tensor operation based on the compressibility metrics.

9. The parallel processor of claim 8, wherein to write the output of the tensor operation to the memory of the processing cluster includes to: compress or encode the output of the tensor operation after the second transform is applied; and write the output of the tensor operation after compression or encoding is applied to the output of the tensor operation.

10. The parallel processor of claim 9, wherein the first numerical transform or the second numerical transform is selected from a set of numerical transforms including a discrete cosine transform, a discrete sine transform, a bit-flip transform, and a bit-rotate transform.

11. A method comprising: performing numerical operations to train a neural network model via a tensor core, including generating a first matrix of weights associated with the neural network model; applying a numerical transform to the first matrix of weights to generate a set of transformed weights and a transform type, wherein the transform type identifies the numerical transform applied to the first matrix of weights, the first matrix of weights is a sparse matrix, and the transformed weights compress to a higher compression ratio than the first matrix of weights; and applying a numerical inverse transform to the transformed weights to generate a second matrix of weights, wherein the numerical inverse transform to perform is identified via the transform type associated with the set of transformed weights.

12. The method of claim 11, further comprising: applying a first numerical transform to at least a portion of the first matrix of weights to generate first test transform data; applying a second numerical transform to at least a portion of the first matrix of weights to generate second test transform data; determining compressibility metrics based on analysis of the first test transform data and the second test transform data; and sending a recommended transform to enable selection of a numerical transform to apply to the first matrix of weights.

13. The method of claim 12, wherein the numerical transform is selected from a set of numerical transforms including a discrete cosine transform, a discrete sine transform, a bit-flip transform, and a bit-rotate transform.

14. A graphics processing system comprising: a memory device; and a graphics processor coupled with the memory device, the graphics processor including a cache memory and a processing cluster coupled with the cache memory, the processing cluster including a plurality of multiprocessors coupled with a data interconnect, wherein a multiprocessor of the plurality of multiprocessors includes a tensor core configured to: load tensor data and metadata associated with the tensor data from the cache memory, wherein the metadata indicates a first numerical transform applied to the tensor data; perform an inverse transform of the first numerical transform; perform a tensor operation on the tensor data after the inverse transform is performed; and write output of the tensor operation to a memory coupled with the processing cluster.

15. The graphics processing system of claim 14, wherein to write output of the tensor operation to the memory coupled with the processing cluster includes to write the output to the memory device.

16. The graphics processing system of claim 15, wherein to write the output of the tensor operation to the memory device.

17. The graphics processing system of claim 14, the tensor core configured to: decompress or decode the tensor data before the inverse transform is applied to the tensor data.

18. The graphics processing system of claim 14, the tensor operation performed on the tensor data associated with an inference operation performed via the tensor core.

19. The graphics processing system of claim 18, the tensor core configured to compress or encode the output of the tensor operation within the tensor core after performing the numerical transform on the tensor data.

20. The graphics processing system of claim 14, the tensor core configured to apply a second numerical transform to the output of the tensor operation and generate metadata to indicate that the second numerical transform is applied to the output of the tensor operation.

Assignees

Intel Corp

Inventors

Classifications

  • Reinforcement learning · CPC title

  • Distributed learning, e.g. federated learning · CPC title

  • Generative networks · CPC title

  • Auto-encoder networks; Encoder-decoder networks · CPC title

  • Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12361600B2 cover?
Embodiments described herein provided for an instruction and associated logic to enable a processing resource including a tensor accelerator to perform optimized computation of sparse submatrix operations. One embodiment provides a parallel processor comprising a processing cluster coupled with the cache memory. The processing cluster includes a plurality of multiprocessors coupled with a data …
Who is the assignee on this patent?
Intel Corp
What technology area does this patent fall under?
Primary CPC classification G06T1/20. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 15, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).