Compression of machine learning models via sparsification and quantization

US2025094864A1 · US · A1

Patent metadata
Field: Value
Publication number: US-2025094864-A1
Application number: US-202418602951-A
Country: US
Kind code: A1
Filing date: Mar 12, 2024
Priority date: Sep 14, 2023
Publication date: Mar 20, 2025
Grant date:

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Machine learning is a process that learns a model from a given dataset, where the model can then be used to make a prediction about new data. In order to reduce the size, computation, and latency of a machine learning model, a compression technique can be employed which includes model sparsification and quantization. To limit the extent to which the quality of the model is impacted when uniformly applying sparsification and quantization to all values of the model, the present disclosure provides for a hybrid sparsification and quantization of the model.
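The hybrid idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the method of the disclosure: the magnitude threshold, the symmetric integer quantizer, the 4-bit and 8-bit widths, and the helper names `split_by_magnitude`, `quantize_symmetric`, and `dequantize` are all assumptions made for illustration. The structured sparse pattern mentioned in the claims is omitted here to keep the example short.

```python
import numpy as np

def split_by_magnitude(weights, threshold):
    """Apportion weights into an inlier tensor and an outlier tensor.

    Assumption: a value is an outlier when |w| > threshold. Each tensor keeps
    the original shape and holds zeros where the other tensor owns the value.
    """
    outlier_mask = np.abs(weights) > threshold
    inliers = np.where(outlier_mask, 0.0, weights)
    outliers = np.where(outlier_mask, weights, 0.0)
    return inliers, outliers

def quantize_symmetric(values, bits):
    """Symmetric integer quantization with one scale per tensor (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = float(np.max(np.abs(values)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(values / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to floating point."""
    return q.astype(np.float64) * scale

# Toy weight matrix: mostly small values plus two large outliers.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=(8, 16))
weights[0, 3] = 0.9
weights[5, 7] = -1.1

# Hybrid: inliers at a narrow bit width, outliers at a wider one.
inliers, outliers = split_by_magnitude(weights, threshold=0.1)
q_in, s_in = quantize_symmetric(inliers, bits=4)
q_out, s_out = quantize_symmetric(outliers, bits=8)
hybrid = dequantize(q_in, s_in) + dequantize(q_out, s_out)

# Baseline: every value quantized uniformly to 4 bits with one scale.
q_all, s_all = quantize_symmetric(weights, bits=4)
uniform = dequantize(q_all, s_all)

hybrid_err = float(np.mean((weights - hybrid) ** 2))
uniform_err = float(np.mean((weights - uniform) ** 2))
print(f"hybrid MSE:  {hybrid_err:.2e}")
print(f"uniform MSE: {uniform_err:.2e}")
```

In this toy setup the uniform 4-bit baseline spends its whole range on the two outliers, so the small weights collapse toward zero, while the split lets each tensor use a scale suited to its own range.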

First claim

Opening claim text (preview).

What is claimed is:

1. A method, comprising: at a device, compressing a machine learning model having a plurality of values to reduce at least one of a size of the machine learning model or computation requirements of the machine learning model, by: processing the machine learning model to generate a plurality of sparse data structures including: storing inlier values of the machine learning model in a first data structure, and storing outlier values of the machine learning model in a second data structure, wherein at least one of the first data structure or the second data structure has a structured sparse pattern; and non-uniformly quantizing the machine learning model, including: quantizing the first data structure storing the inlier values to a first bit width, and quantizing the second data structure storing the outlier values to a second bit width that is different from the first bit width.

2. The method of claim 1, wherein the inlier values and the outlier values are weights of the machine learning model.

3. The method of claim 1, wherein the inlier values and the outlier values are determined according to a defined threshold metric.

4. The method of claim 1, wherein the first data structure has a first structured sparse pattern that has less sparsity than a second structured sparse pattern of the second data structure.

5. The method of claim 1, wherein the first data structure has a first structured sparse pattern that is the same as a second structured sparse pattern of the second data structure.

6. The method of claim 1, wherein the first bit width and the second bit width are supported by different hardware accelerators.

7. A method, comprising: at a device: apportioning a plurality of different subsets of values of the machine learning model into a plurality of data structures at least one of which has a defined structured sparse pattern; and changing a data representation of at least one data structure of the plurality of data structures, wherein at least two data structures of the plurality of data structures have different data representations.

8. The method of claim 7, wherein the machine learning model is a deep neural network.

9. The method of claim 7, wherein the machine learning model is a large language model (LLM).

10. The method of claim 7, wherein the values of the machine learning model are weights of the machine learning model.

11. The method of claim 7, wherein the plurality of data structures are tensors.

12. The method of claim 7, wherein at least two of the plurality of data structures have different defined structured sparse patterns.

13. The method of claim 12, wherein the different defined structured sparse patterns include at least: a first defined structured sparse pattern having a first sparsity degree, and a second defined structured sparse pattern having a second sparsity degree, wherein the first sparsity degree is different from the second sparsity degree.

14. The method of claim 7, wherein the plurality of different subsets of values of the machine learning model include: at least one subset comprised of at least a portion of inlier values of the machine learning model, and at least another subset comprised of at least a portion of outlier values of the machine learning model.

15. The method of claim 14, wherein the inlier values and the outlier values are determined according to a defined threshold metric.

16. The method of claim 15, wherein the defined threshold metric is a magnitude of weight.

17. The method of claim 15, wherein the defined threshold metric is an error after quantization for inlier and outlier.

18. The method of claim 15, wherein the defined threshold metric is a product of a corresponding weight and activation.

19. The method of claim 14, wherein at least a portion of the inlier values are stored with a first structured sparse pattern that has less sparsity than a second structured sparse pattern used to store at least a portion of the outlier values.

20. The method of claim 7, wherein the machine learning model is further compressed by: sparsifying the machine learning model by pruning values from the machine learning model, to form a sparse machine learning model, wherein the plurality of different subsets of values of the machine learning model are determined from the sparse machine learning model.

21. The method of claim 20, wherein the values of the machine learning model are selected for pruning according to a defined threshold metric.

22. The method of claim 21, wherein the defined threshold metric is a magnitude of weight.

23. The method of claim 21, wherein the defined threshold metric is an error after pruning.

24. The method of claim 21, wherein the defined threshold metric is a product of a corresponding weight and activation obtained with training or validation data.

25. The method of claim 20, wherein the machine learning model is sparsified to a defined degree of sparsity.

26. The method of claim 20, wherein the machine learning model is sparsified with a defined structured sparse pattern.

27. The method of claim 7, wherein changing the data representation of the at least one data structure includes quantizing the at least one data structure.

28. The method of claim 7, wherein the data representation of the plurality of data structures includes a bit width of the plurality of data structures.

29. The method of claim 7, wherein the data representation of the plurality of data structures includes a data type of the plurality of data structures.

30. The method of claim 7, wherein the different data representations are supported by a single hardware accelerator or multiple different hardware accelerators.

31. The method of claim 7, wherein at least two of the plurality of data structures have different defined structured sparse patterns, and wherein the different defined structured sparse patterns are supported by a single hardware accelerator or multiple different hardware accelerators.

32. A system, comprising: a non-transitory memory storage comprising instructions; and one or more processors in communication with the memory, wherein the one or more processors execute the instructions to at least one of compress a machine learning model or reduce a computation of the machine learning model by: apportioning a plurality of different subsets of values of the machine learning model into a plurality of data structures at least one of which has a defined structured sparse pattern; and changing a data representation of at least one data structure of the plurality of data structures, wherein at least two data structures of the plurality of data structures have different data representations.

33. The system of claim 32, wherein the machine learning model is a deep neural network.

34. The system of claim 32, wherein the machine learning model is a large language model (LLM).

35. The system of claim 32, wherein the values of the machine learning model are weights of the machine learning model.

36. The system of claim 32, wherein at least two of the plurality of d
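The claims refer to pruning with a "defined structured sparse pattern" and to threshold metrics such as weight magnitude or the product of a weight and its activation. The sketch below shows one way such pruning could look, assuming an N:M pattern (keep n values in every group of m consecutive values); the claims do not name a specific pattern. The helper names `prune_n_m` and `weight_activation_scores`, and the use of mean absolute activation over calibration inputs, are hypothetical choices for illustration only.

```python
import numpy as np

def prune_n_m(values, n, m, scores=None):
    """Keep the n highest-scoring entries in each group of m consecutive
    values along the last axis and zero the rest (assumed N:M pattern).

    If no scores are given, weight magnitude is used as the metric.
    """
    if values.shape[-1] % m != 0:
        raise ValueError("last dimension must be divisible by m")
    if scores is None:
        scores = np.abs(values)
    grouped_vals = values.reshape(-1, m)
    grouped_scores = scores.reshape(-1, m)
    # Indices of the n largest scores inside every group.
    keep_idx = np.argsort(grouped_scores, axis=1)[:, m - n:]
    mask = np.zeros(grouped_vals.shape, dtype=bool)
    np.put_along_axis(mask, keep_idx, True, axis=1)
    return np.where(mask, grouped_vals, 0.0).reshape(values.shape)

def weight_activation_scores(weights, calibration_inputs):
    """Score each weight by |weight| times the mean |activation| of the
    input feature it multiplies (hypothetical choice of statistic)."""
    act = np.mean(np.abs(calibration_inputs), axis=0)  # shape: (in_features,)
    return np.abs(weights) * act                       # broadcasts over rows

row = np.array([[1.0, -5.0, 2.0, 0.5]])

# Metric 1: magnitude of weight. The two largest magnitudes survive.
by_magnitude = prune_n_m(row, n=2, m=4)
print(by_magnitude)  # [[ 0. -5.  2.  0.]]

# Metric 2: weight times activation. A large activation on feature 0 and a
# small one on feature 1 change which weights are kept.
inputs = np.array([[10.0, 0.1, 1.0, 1.0]])
by_product = prune_n_m(row, n=2, m=4,
                       scores=weight_activation_scores(row, inputs))
print(by_product)    # [[1. 0. 2. 0.]]
```

The two metrics keep different weights for the same row, which is why the choice of metric is called out separately in the dependent claims.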

Assignees

Nvidia Corp

Inventors

Classifications

  • Combinations of networks · CPC title

  • modifying the architecture, e.g. adding, deleting or silencing nodes or connections · CPC title

  • G06N20/00 · Primary

    Machine learning · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2025094864A1 cover?
Machine learning is a process that learns a model from a given dataset, where the model can then be used to make a prediction about new data. In order to reduce the size, computation, and latency of a machine learning model, a compression technique can be employed which includes model sparsification and quantization. To limit the extent to which the quality of the model is impacted when uniform…
Who is the assignee on this patent?
Nvidia Corp
What technology area does this patent fall under?
Primary CPC classification G06N20/00. Mapped technology areas include Physics.
When was this patent published?
Publication date: Mar 20, 2025 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).