Model quantization for software engineering tasks

US2023222334A1 · US · A1

Patent metadata
Field               Value
Publication number  US-2023222334-A1
Application number  US-202217572459-A
Country             US
Kind code           A1
Filing date         Jan 10, 2022
Priority date       Jan 10, 2022
Publication date    Jul 13, 2023
Grant date          (none listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A deep learning model is quantized during its training to perform a target software engineering task. During training, a portion of the full-precision floating point weights is quantized into INT4 or INT8 data types through scalar quantization or product quantization to make the model more resilient to quantization and to reduce the noise between the quantized and full-precision model outputs. In scalar quantization, each sub-block consists of a single weight that is mapped into a codeword of a codebook. In product quantization, an identity matrix and a codebook of centroids is used to map a quantized weight into its original value.
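
The two quantization schemes named in the abstract can be illustrated with a minimal NumPy sketch. It is not the patent's implementation: the function names (`scalar_quantize`, `product_quantize`, `pq_reconstruct`), the min/max-based uniform codebook, the block size, and the plain Lloyd's k-means loop are all assumptions made for illustration. The lookup structure is called an index matrix here, following the wording of the claims.

```python
import numpy as np


def scalar_quantize(w, n_bits=8):
    """Scalar quantization: each weight maps to one codeword of a uniform codebook."""
    levels = 2 ** n_bits                       # 256 codewords for INT8, 16 for INT4
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / (levels - 1) if hi > lo else 1.0
    codes = np.round((w - lo) / scale).astype(np.int64)  # integer codes in [0, levels - 1]
    codebook = lo + scale * np.arange(levels)            # uniformly spaced values
    return codes, codebook


def nearest(points, centroids):
    """Index of the closest centroid (squared Euclidean distance) for each point."""
    d = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (illustrative, not tuned)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = nearest(points, centroids)
        for j in range(k):
            members = points[assign == j]
            if len(members):                   # leave empty clusters where they are
                centroids[j] = members.mean(axis=0)
    return centroids, nearest(points, centroids)


def product_quantize(w, block=4, k=16):
    """Product quantization: split rows into sub-blocks, cluster them, store indices."""
    rows, cols = w.shape
    assert cols % block == 0, "block size must divide the number of columns"
    sub_blocks = w.reshape(rows * (cols // block), block)
    codebook, assign = kmeans(sub_blocks, k)
    index = assign.reshape(rows, cols // block)  # index matrix: one centroid id per sub-block
    return codebook, index


def pq_reconstruct(codebook, index):
    """Approximate the original matrix by looking up each sub-block's centroid."""
    return codebook[index].reshape(index.shape[0], -1)


rng = np.random.default_rng(0)
weights = rng.normal(size=(8, 16))

codes, uniform_codebook = scalar_quantize(weights, n_bits=8)
print("scalar max abs error:", np.abs(weights - uniform_codebook[codes]).max())

centroids, index_matrix = product_quantize(weights, block=4, k=16)
print("product quantization max abs error:",
      np.abs(weights - pq_reconstruct(centroids, index_matrix)).max())
```

With scalar quantization the stored data is one small integer per weight plus the codebook bounds. With product quantization it is one centroid index per sub-block plus the centroid table, which is why larger blocks compress more at the cost of a larger reconstruction error.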

First claim

Opening claim text (preview).

What is claimed:

1. A system comprising: a processor; and a memory that stores a program configured to be executed by the processor, the program including instructions to perform acts that: obtain a deep learning model having a plurality of layers, each layer having a plurality of weight matrices; train the deep learning model to determine a value for each weight of each of the plurality of weight matrices that minimizes a loss function through application of training samples to each layer of the plurality of layers, wherein each weight matrix includes a first portion and a second portion, wherein the first portion of each weight matrix is quantized with reduced bit-width weights, wherein the second portion includes full-precision floating point values; and upon completion of the training of the deep learning model, quantize each weight matrix of the plurality of weight matrices with reduced bit-width weights.

2. The system of claim 1, wherein the program includes instructions to perform acts that: generate a codebook for each of the plurality of weight matrices, wherein the codebook includes a plurality of uniformly-distributed range of values.

3. The system of claim 1, wherein the program includes instructions to perform acts that: generate a codebook for each of the plurality of weight matrices, wherein the codebook includes a plurality of centroids, wherein each centroid of the plurality of centroids is generated from K-means clustering of weights of a respective weight matrix.

4. The system of claim 3, wherein the program includes instructions to perform acts that: generate an index matrix that maps a weight of a respective weight matrix into a select one of the centroids of the codebook.

5. The system of claim 1, wherein the program includes instructions to perform acts that: randomly select weights in the first portion of each weight matrix to quantized with reduced bit-widths.

6. The system of claim 1, wherein the reduced bit-width weights are fixed-point integers.

7. The system of claim 1, wherein the reduced bit-width weights are INT4 or INT8 data types.

8. The system of claim 1, wherein the deep learning model is a neural transformer model with attention.

9. A computer-implemented method, comprising: obtaining a deep learning model having a plurality of layers, each layer having a plurality of weight matrices; training the deep learning model to learn values for each weight of the plurality of weight matrices that minimize a loss function by: selecting a first portion of each weight matrix at each layer to quantize; quantizing weights of the first portion of each weight matrix with fixed-point integer representations; performing computations at each layer with the fixed-point integer representations; computing an error loss from the computations; determining a full-precision gradient to update the quantized weights using an estimator; determining a full-precision gradient to update unquantized weights using stochastic gradient descent; and updating the values of the weights of each weight matrix based on the full-precision gradient; and upon completion of the training, quantizing each weight of each weight matrix into a fixed-point integer representation.

10. The method of claim 9, further comprising: decomposing each weight matrix into sub-blocks; and randomly choosing a select one of the sub-blocks as the first portion.

11. The method of claim 9, further comprising: generating a codebook for a first weight matrix, wherein the codebook includes a plurality of uniformly-distributed range of values based on an n-bit representation of the fixed-point integer representation; and mapping a weight of the first weight matrix into a value of the codebook.

12. The method of claim 9, further comprising: generating a codebook for a second weight matrix, wherein the codebook includes a plurality of centroids, wherein each centroid of the plurality of centroids is generated from K-means clustering of weights of the second weight matrix.

13. The method of claim 12, further comprising: generating an index matrix to map a weight of the second weight matrix into the select centroid of the codebook.

14. The method of claim 9, wherein the fixed-point integer representations are INT4 or INT8 data types.

15. The method of claim 9, wherein the deep learning model is a neural transformer model with attention.

16. A device comprising: a processor and a memory; wherein the memory includes instructions that when executed on the processor performs actions that: configure a deep learning model with a plurality of layers, each of the plurality of layers having at least one weight matrix, the at least one weight matrix including a plurality of weights; train the deep learning model to learn to generate source code by computing values for each of the plurality of weights that minimizes an error function, wherein during training of the deep learning model: select a first portion of the at least one weight matrix to quantize with integer data types and selecting a second portion of the at least one weight matrix expressed as full-precision floating point values; determine values for weights of the at least one weight matrix through multiple iterations of a forward pass, backward pass, and weight update using the first portion of weights and the second portion of weights; and upon completion of the training, quantizing all weights of the at least one weight matrix to integer data types.

17. The device of claim 16, wherein the memory includes instructions that when executed on the processor performs actions that: generating a codebook for the at least one weight matrix, wherein the codebook includes a plurality of centroids; and computing the plurality of centroids for the at least one weight matrix using K-means clustering.

18. The device of claim 17, wherein the memory includes instructions that when executed on the processor performs actions that: generating an index matrix that maps a quantized weight of the at least one weight matrix into a centroid.

19. The device of claim 16, wherein the quantized weights are INT4 or INT8 data types.

20. The device of claim 16, wherein the deep learning model is a neural transformer model with attention.
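
The training procedure recited in claim 9 can be sketched on a toy problem. The following NumPy example is an illustration under stated assumptions, not the claimed implementation: the model is a single linear layer, the "estimator" is taken to be a straight-through estimator (the claim text does not name one), integer arithmetic is simulated by quantizing and immediately de-quantizing in floating point, and the names `fake_quantize` and `train_step`, the `portion` value, and the learning rate are all hypothetical.

```python
import numpy as np


def fake_quantize(w, n_bits=8):
    """Snap weights to a uniform n-bit grid, returned as floats (simulated integers)."""
    levels = 2 ** n_bits
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / (levels - 1) if hi > lo else 1.0
    return lo + scale * np.round((w - lo) / scale)


def train_step(w, x, y, rng, portion=0.5, lr=0.1, n_bits=8):
    """One iteration: quantize a random portion, run forward, update all weights."""
    # Select the first portion at random; the rest stays full precision.
    mask = rng.random(w.shape) < portion
    w_mixed = np.where(mask, fake_quantize(w, n_bits), w)

    # Forward pass and mean-squared error computed with the mixed weights.
    err = x @ w_mixed - y
    loss = float((err ** 2).mean())

    # Gradient of the loss with respect to the mixed weights.
    grad = 2.0 * x.T @ err / err.size

    # Straight-through estimator (assumption): rounding is treated as the
    # identity in the backward pass, so the same full-precision gradient
    # updates both the quantized and the unquantized positions.
    return w - lr * grad, loss


rng = np.random.default_rng(0)
x = rng.normal(size=(256, 8))
w_true = rng.normal(size=(8, 4))
y = x @ w_true

w = np.zeros((8, 4))
losses = []
for _ in range(300):
    w, loss = train_step(w, x, y, rng)
    losses.append(loss)

# After training, every weight is quantized, mirroring the final step of the claim.
w_final = fake_quantize(w, n_bits=8)
final_loss = float(((x @ w_final - y) ** 2).mean())
print("first loss:", losses[0], "last loss:", losses[-1], "fully quantized loss:", final_loss)
```

Because a different random portion is quantized at each step, the weights are exposed to quantization error throughout training while part of each forward pass still runs at full precision.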

Assignees

Microsoft Technology Licensing, LLC

Inventors

Classifications

  • G06N3/0495 · Primary

    Quantised networks; Sparse networks; Compressed networks · CPC title

  • G06N3/08 · Primary

    Learning methods · CPC title

  • using electronic means · CPC title

  • Selection of the most significant subset of features · CPC title

  • with fixed number of clusters, e.g. K-means clustering · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2023222334A1 cover?
A deep learning model is quantized during its training to perform a target software engineering task. During training, a portion of the full-precision floating point weights is quantized into INT4 or INT8 data types through scalar quantization or product quantization to make the model more resilient to quantization and to reduce the noise between the quantized and full-precision model outputs.…
Who is the assignee on this patent?
Microsoft Technology Licensing, LLC
What technology area does this patent fall under?
Primary CPC classification G06N3/0495. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jul 13, 2023 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).