Apparatus and method for determining a sector division ratio of a shared cache memory
US-2015339229-A1 · Nov 26, 2015 · US
US12361600B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12361600-B2 |
| Application number | US-202318322194-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 23, 2023 |
| Priority date | Nov 15, 2019 |
| Publication date | Jul 15, 2025 |
| Grant date | Jul 15, 2025 |
Official abstract text for this publication.
Embodiments described herein provide an instruction and associated logic to enable a processing resource including a tensor accelerator to perform optimized computation of sparse submatrix operations. One embodiment provides a parallel processor comprising a cache memory and a processing cluster coupled with the cache memory. The processing cluster includes a plurality of multiprocessors coupled with a data interconnect, where a multiprocessor of the plurality of multiprocessors includes a tensor core configured to load tensor data and metadata associated with the tensor data from the cache memory, wherein the metadata indicates a first numerical transform applied to the tensor data, perform an inverse transform of the first numerical transform, perform a tensor operation on the tensor data after the inverse transform is performed, and write output of the tensor operation to a memory coupled with the processing cluster.
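The data flow in the abstract (load tensor data with its metadata, undo the recorded transform, run the tensor operation, write the result) can be illustrated with a minimal software sketch. This is not the patented hardware logic. The function names, the dictionary standing in for the cache and memory, the string used as metadata, and the choice of a per-byte bit-rotate as the transform are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Matrix = List[List[int]]


def rotate_left(data: bytes, n: int = 1) -> bytes:
    """Forward transform (assumed): rotate each byte left by n bits, 0 < n < 8."""
    return bytes(((b << n) | (b >> (8 - n))) & 0xFF for b in data)


def rotate_right(data: bytes, n: int = 1) -> bytes:
    """Inverse of rotate_left: rotate each byte right by n bits, 0 < n < 8."""
    return bytes(((b >> n) | (b << (8 - n))) & 0xFF for b in data)


def bit_flip(data: bytes) -> bytes:
    """Invert every bit of every byte; this transform is its own inverse."""
    return bytes(b ^ 0xFF for b in data)


# Hypothetical metadata encoding: transform name -> function that undoes it.
INVERSE = {"bit_rotate": rotate_right, "bit_flip": bit_flip}


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain integer matrix multiply, standing in for the tensor operation."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def tensor_core_step(
    cache: Dict[str, Tuple[bytes, Dict[str, str]]],
    key: str,
    rhs: Matrix,
    rows: int,
    cols: int,
    memory: Dict[str, Matrix],
) -> Matrix:
    stored, metadata = cache[key]                    # load tensor data and metadata
    raw = INVERSE[metadata["transform"]](stored)     # inverse of the recorded transform
    lhs = [list(raw[r * cols:(r + 1) * cols]) for r in range(rows)]
    out = matmul(lhs, rhs)                           # tensor operation on restored data
    memory[key] = out                                # write output to memory
    return out


# Usage: a 2x2 tensor stored in rotated form, multiplied by the identity matrix.
cache = {"w0": (rotate_left(bytes([1, 2, 3, 4])), {"transform": "bit_rotate"})}
memory: Dict[str, Matrix] = {}
result = tensor_core_step(cache, "w0", [[1, 0], [0, 1]], 2, 2, memory)
```

Because the metadata travels with the data, the consumer does not need to know in advance which transform the producer chose.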
Opening claim text (preview).
What is claimed is:
1. A parallel processor comprising: a cache memory; and a processing cluster coupled with the cache memory, the processing cluster including a plurality of multiprocessors coupled with a data interconnect, wherein a multiprocessor of the plurality of multiprocessors includes a tensor core configured to: load tensor data and metadata associated with the tensor data from the cache memory, wherein the metadata indicates a first numerical transform applied to the tensor data; perform an inverse transform of the first numerical transform; perform a tensor operation on the tensor data after the inverse transform is performed; and write output of the tensor operation to a memory coupled with the processing cluster.
2. The parallel processor of claim 1, wherein to write output of the tensor operation to the memory coupled with the processing cluster includes to write the output to a main memory of the parallel processor.
3. The parallel processor of claim 2, wherein to write the output of the tensor operation to the main memory of the parallel processor includes to write the output to the cache memory.
4. The parallel processor of claim 1, the tensor core configured to: decompress or decode the tensor data before the inverse transform is applied to the tensor data.
5. The parallel processor of claim 1, the tensor operation performed on the tensor data associated with an inference operation performed via the tensor core.
6. The parallel processor of claim 5, the tensor core configured to compress or encode the output of the tensor operation within the tensor core after performing the numerical transform on the tensor data.
7. The parallel processor of claim 1, the tensor core configured to apply a second numerical transform to the output of the tensor operation and generate metadata to indicate that the second numerical transform is applied to the output of the tensor operation.
8. The parallel processor of claim 7, the tensor core configured to: apply the first numerical transform to at least a portion of the output of the tensor operation to generate first test transform data; apply the second numerical transform to at least a portion of the output of the tensor operation to generate second test transform data; determine compressibility metrics based on analysis of the first test transform data and the second test transform data; and apply the second transform to the output of the tensor operation based on the compressibility metrics.
9. The parallel processor of claim 8, wherein to write the output of the tensor operation to the memory of the processing cluster includes to: compress or encode the output of the tensor operation after the second transform is applied; and write the output of the tensor operation after compression or encoding is applied to the output of the tensor operation.
10. The parallel processor of claim 9, wherein the first numerical transform or the second numerical transform is selected from a set of numerical transforms including a discrete cosine transform, a discrete sine transform, a bit-flip transform, and a bit-rotate transform.
11. A method comprising: performing numerical operations to train a neural network model via a tensor core, including generating a first matrix of weights associated with the neural network model; applying a numerical transform to the first matrix of weights to generate a set of transformed weights and a transform type, wherein the transform type identifies the numerical transform applied to the first matrix of weights, the first matrix of weights is a sparse matrix, and the transformed weights compress to a higher compression ratio than the first matrix of weights; and applying a numerical inverse transform to the transformed weights to generate a second matrix of weights, wherein the numerical inverse transform to perform is identified via the transform type associated with the set of transformed weights.
12. The method of claim 11, further comprising: applying a first numerical transform to at least a portion of the first matrix of weights to generate first test transform data; applying a second numerical transform to at least a portion of the first matrix of weights to generate second test transform data; determining compressibility metrics based on analysis of the first test transform data and the second test transform data; and sending a recommended transform to enable selection of a numerical transform to apply to the first matrix of weights.
13. The method of claim 12, wherein the numerical transform is selected from a set of numerical transforms including a discrete cosine transform, a discrete sine transform, a bit-flip transform, and a bit-rotate transform.
14. A graphics processing system comprising: a memory device; and a graphics processor coupled with the memory device, the graphics processor including a cache memory and a processing cluster coupled with the cache memory, the processing cluster including a plurality of multiprocessors coupled with a data interconnect, wherein a multiprocessor of the plurality of multiprocessors includes a tensor core configured to: load tensor data and metadata associated with the tensor data from the cache memory, wherein the metadata indicates a first numerical transform applied to the tensor data; perform an inverse transform of the first numerical transform; perform a tensor operation on the tensor data after the inverse transform is performed; and write output of the tensor operation to a memory coupled with the processing cluster.
15. The graphics processing system of claim 14, wherein to write output of the tensor operation to the memory coupled with the processing cluster includes to write the output to the memory device.
16. The graphics processing system of claim 15, wherein to write the output of the tensor operation to the memory device.
17. The graphics processing system of claim 14, the tensor core configured to: decompress or decode the tensor data before the inverse transform is applied to the tensor data.
18. The graphics processing system of claim 14, the tensor operation performed on the tensor data associated with an inference operation performed via the tensor core.
19. The graphics processing system of claim 18, the tensor core configured to compress or encode the output of the tensor operation within the tensor core after performing the numerical transform on the tensor data.
20. The graphics processing system of claim 14, the tensor core configured to apply a second numerical transform to the output of the tensor operation and generate metadata to indicate that the second numerical transform is applied to the output of the tensor operation.
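Claims 8, 9, and 12 describe trying candidate transforms on a portion of the data, comparing compressibility metrics, and recording the chosen transform so the matching inverse can be applied later. The sketch below is one possible software reading of that idea, not the claimed implementation. It assumes the compressed size reported by Python's `zlib` as the metric, a 64-byte sample, and a dictionary as the metadata format; the claims specify none of these. Only the bit-flip and bit-rotate transforms from the claimed set are shown.

```python
import zlib
from typing import Callable, Dict, Tuple

Transform = Callable[[bytes], bytes]


def bit_flip(data: bytes) -> bytes:
    """Invert every bit; the transform is its own inverse."""
    return bytes(b ^ 0xFF for b in data)


def bit_rotate_left(data: bytes) -> bytes:
    """Rotate each byte left by one bit."""
    return bytes(((b << 1) | (b >> 7)) & 0xFF for b in data)


def bit_rotate_right(data: bytes) -> bytes:
    """Rotate each byte right by one bit (inverse of bit_rotate_left)."""
    return bytes(((b >> 1) | (b << 7)) & 0xFF for b in data)


# Hypothetical registry: transform type -> (forward, inverse).
TRANSFORMS: Dict[str, Tuple[Transform, Transform]] = {
    "bit_flip": (bit_flip, bit_flip),
    "bit_rotate": (bit_rotate_left, bit_rotate_right),
}


def compressibility_metrics(sample: bytes) -> Dict[str, int]:
    """Assumed metric: compressed size of each test transform (smaller is better)."""
    return {name: len(zlib.compress(fwd(sample))) for name, (fwd, _) in TRANSFORMS.items()}


def encode(data: bytes, sample_len: int = 64) -> Tuple[bytes, Dict[str, str]]:
    """Pick a transform from a sample, apply it to all data, then compress."""
    metrics = compressibility_metrics(data[:sample_len])
    best = min(metrics, key=metrics.get)
    forward, _ = TRANSFORMS[best]
    return zlib.compress(forward(data)), {"transform": best}


def decode(payload: bytes, metadata: Dict[str, str]) -> bytes:
    """Decompress first, then apply the inverse named in the metadata."""
    _, inverse = TRANSFORMS[metadata["transform"]]
    return inverse(zlib.decompress(payload))


# Usage: a mostly-zero (sparse) byte pattern survives the round trip unchanged.
weights = bytes([0, 0, 3, 0, 0, 0, 7, 0] * 16)
payload, metadata = encode(weights)
restored = decode(payload, metadata)
```

Sampling only a portion of the data keeps the selection step cheap, at the cost of a choice that may not be optimal for the full tensor.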
Reinforcement learning · CPC title
Distributed learning, e.g. federated learning · CPC title
Generative networks · CPC title
Auto-encoder networks; Encoder-decoder networks · CPC title
Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title