Incremental network quantization
US-2020380357-A1 · Dec 3, 2020 · US
US11645493B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11645493-B2 |
| Application number | US-201815972054-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 4, 2018 |
| Priority date | May 4, 2018 |
| Publication date | May 9, 2023 |
| Grant date | May 9, 2023 |
Official abstract text for this publication.
Methods and apparatus are disclosed supporting a design flow for developing quantized neural networks. In one example of the disclosed technology, a method includes quantizing a normal-precision floating-point neural network model into a quantized format. For example, the quantized format can be a block floating-point format, where two or more elements of tensors in the neural network share a common exponent. A set of test inputs is applied to a normal-precision floating-point model and the corresponding quantized model, and the respective output tensors are compared. Based on this comparison, hyperparameters or other attributes of the neural networks can be adjusted. Further, quantization parameters determining the widths of data and selection of shared exponents for the block floating-point format can be selected. An adjusted, quantized neural network is retrained and programmed into a hardware accelerator.
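The shared-exponent idea in the abstract can be illustrated with a minimal sketch. It is not taken from the patent: the function names `bfp_quantize` and `bfp_dequantize`, the 8-bit mantissa default, and the rule of deriving the shared exponent from the largest magnitude in the block are all assumptions made for illustration.

```python
import math


def bfp_quantize(values, mantissa_bits=8):
    """Quantize a block of floats to signed integer mantissas plus one shared exponent.

    Assumption: the shared exponent is taken from the largest magnitude in the block.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return [0] * len(values), 0
    # frexp returns (m, e) with max_abs == m * 2**e and 0.5 <= m < 1.
    _, shared_exp = math.frexp(max_abs)
    # One bit is reserved for the sign, so mantissa_bits - 1 bits hold the magnitude.
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    limit = 2 ** (mantissa_bits - 1) - 1
    # Round to the nearest representable step and clamp to the mantissa range.
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in values]
    return mantissas, shared_exp


def bfp_dequantize(mantissas, shared_exp, mantissa_bits=8):
    """Map integer mantissas and the shared exponent back to floats."""
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    return [m * scale for m in mantissas]


block = [0.5, -0.25, 0.125, 3.0]
mantissas, exponent = bfp_quantize(block)
restored = bfp_dequantize(mantissas, exponent)
```

Because every element in the block uses the same exponent, small values sitting next to a large one lose precision. That is why the size of the group sharing an exponent is a parameter worth tuning.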
Opening claim text (preview).
What is claimed is:
1. A method comprising: prior to applying a quantization operation, reshaping or splitting at least one tensor that is included among multiple tensors of a normal-precision neural network model, said reshaping or splitting facilitates handling of shared exponents over granulates that are finer than an entire tensor; performing the quantization operation by quantizing the normal-precision neural network model, which comprises the multiple tensors of normal-precision floating-point numbers, producing a quantized neural network model in a quantized-precision format; evaluating the quantized neural network model by applying input tensors to an input layer of the quantized neural network model, producing quantized output; comparing the quantized output to output generated by applying the input tensors to the normal-precision floating-point model; and based on the comparing, selecting a new quantized-precision format having at least one quantization parameter different than the quantized-precision format, wherein the at least one different quantization parameter comprises, for at least one layer of the quantized neural network, at least a parameter to share an exponent on a per-row basis and/or a parameter to share an exponent on a per-column basis.
2. The method of claim 1, wherein: the quantized-precision format is a block floating-point format where at least two elements of the quantized neural network model share a common exponent.
3. The method of claim 1, further comprising: based on the comparing, retraining the quantized neural network model by adjusting at least one or more training parameters used to train the normal-precision neural network and training the quantized neural network with the adjusted at least one training parameter.
4. The method of claim 3, wherein the adjusted at least one of the training parameters comprises at least one of the following: a batch size, a momentum value, a number of training epochs, or a drop out rate.
5. The method of claim 3, further comprising: producing the normal-precision neural network by training an untrained normal-precision neural network according to one or more training parameters at a selected learning rate; and wherein the adjusting the at least one of the training parameters comprises adjusting a learning rate to be lower than the selected learning rate used to train the untrained normal-precision neural network.
6. The method of claim 1, further comprising: quantizing the normal-precision neural network model to produce a re-quantized neural network model in the new quantized-precision format.
7. The method of claim 6, wherein the at least one different quantization parameter comprises, for at least one layer of the quantized neural network, at least one of: a bit width used to represent bit widths of node weight mantissas, a bit width used to represent bit widths of node weight exponents, a bit width used to represent bit widths of activation value mantissas, a bit width used to represent bit widths of activation value exponents, a tile size for a shared exponent, or a parameter specifying a method of common exponent selection.
8. The method of claim 1, further comprising, based on the comparing, sparsifying at least one weight of the quantized neural network.
9. The method of claim 1, further comprising: based on the comparing, changing a hyperparameter used to train the normal-precision neural network or the quantized-precision network and retraining the quantized-precision neural network with the changed hyperparameter; and wherein the changed hyperparameter includes one of a number of hidden layers in the normal-precision neural network, a node type for a layer of the normal-precision neural network, or a learning rate for training the neural network.
10. The method of claim 1, wherein the normal-precision neural network model is quantized according to a set of one or more quantization parameters, the method further comprising: based on the comparing, adjusting at least one of the quantization parameters; and retraining the quantized neural network model using the adjusted at least one of the quantization parameters.
11. A quantization-enabled system for modeling a neural network comprising tensors representing node weights and edges, the system comprising: one or more processors; and one or more computer readable storage media that store computer-readable instructions that are executable by the one or more processors to cause the system to: prior to applying a quantization operation, reshape or split at least one tensor that is included among multiple tensors of a normal-precision neural network model, said reshaping or splitting facilitates handling of shared exponents over granulates that are finer than an entire tensor; transform the normal-precision neural network model to a block floating-point format neural network model according to a set of quantization parameters, the block floating-point format model including at least one shared exponent; apply input tensors to an input layer of the block floating-point format neural network model, producing first output values; calculate differences between the first output values and second output values generated by applying the input tensors to the normal-precision neural network model; and responsive to the calculated differences, select a new block floating-point format having at least one parameter different than the set of quantization parameters, wherein the at least one different parameter comprises, for at least one layer of the block floating-point format neural network model, a parameter to share an exponent on a per-row basis and/or a parameter to share an exponent on a per-column basis.
12. The system of claim 11, wherein execution of the computer-readable instructions further causes the system to: retrain the normal-precision neural network model by adjusting a hyperparameter and retraining the normal-precision neural network model with the adjusted hyperparameter.
13. The system of claim 11, wherein execution of the computer-readable instructions further causes the system to: retrain the block floating-point format neural network model by adjusting a hyperparameter and retraining the block floating-point format neural network model with the adjusted hyperparameter.
14. The system of claim 11, wherein execution of the computer-readable instructions further causes the system to: perform the quantization operation by quantizing the normal-precision neural network model to produce a re-quantized neural network model in the new block floating-point format.
15. The system of claim 14, wherein the at least one different parameter comprises, for at least one layer of block floating-point format neural network model: a bit width used to represent bit widths of node weight mantissas, a bit width used to represent bit widths of node weight exponents, a bit width used to represent bit widths of activation value mantissas, a bit width used to represent bit widths of activation value exponents, a tile size for a shared exponent, or a parameter specifying a method of common exponent selection.
16. The system of claim 11, further comprising: a hardware accelerator configured to evaluate the block floating-point format neural network model by receiving input tensors, processing operations for nodes of the block floating-point neural network model representing in the block floating-point format, and produce an output tensor; and wherein the processors are configured to configure the hardware accelerator with the block floati
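The compare-and-select step recited in claims 1 and 11 can be sketched for a single weight matrix as follows. This is an illustrative reading, not the patented implementation: the helper names (`quantize_block`, `quantize_matrix`, `matvec`, `select_sharing`), the 4-bit mantissa default, the use of a plain matrix-vector product as the "layer", and the summed squared error as the comparison metric are all assumptions.

```python
import math


def quantize_block(block, mantissa_bits):
    """Round-trip one block through a shared-exponent format (hypothetical helper)."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return list(block)
    _, exp = math.frexp(max_abs)  # max_abs == m * 2**exp with 0.5 <= m < 1
    scale = 2.0 ** (exp - (mantissa_bits - 1))
    limit = 2 ** (mantissa_bits - 1) - 1
    return [max(-limit, min(limit, round(v / scale))) * scale for v in block]


def quantize_matrix(w, mantissa_bits, share):
    """Quantize a matrix with one shared exponent per row or per column."""
    if share == "per_row":
        return [quantize_block(row, mantissa_bits) for row in w]
    if share == "per_column":
        cols = [quantize_block(list(col), mantissa_bits) for col in zip(*w)]
        return [list(row) for row in zip(*cols)]
    raise ValueError(f"unknown sharing mode: {share}")


def matvec(w, x):
    """Stand-in for evaluating one layer: a plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def select_sharing(w, inputs, mantissa_bits=4):
    """Compare quantized output to normal-precision output and pick a sharing mode.

    Assumption: the comparison metric is the summed squared output error.
    """
    errors = {}
    for share in ("per_row", "per_column"):
        wq = quantize_matrix(w, mantissa_bits, share)
        err = 0.0
        for x in inputs:
            ref = matvec(w, x)   # normal-precision output
            out = matvec(wq, x)  # quantized output
            err += sum((r - o) ** 2 for r, o in zip(ref, out))
        errors[share] = err
    best = min(errors, key=errors.get)
    return best, errors


# Rows with very different magnitudes favor one exponent per row.
weights = [[100.0, 96.0], [0.5, 0.75]]
best, errors = select_sharing(weights, inputs=[[1.0, 1.0]])
```

In this toy case each row holds values of similar magnitude, so per-row sharing keeps the small second row intact while per-column sharing rounds it to zero. Transposing the matrix reverses the outcome, which is the kind of layer-dependent choice the claims leave to the comparison.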
Hyperparameter optimisation; Meta-learning; Learning-to-learn · CPC title
Quantised networks; Sparse networks; Compressed networks · CPC title
Supervised learning · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
modifying the architecture, e.g. adding, deleting or silencing nodes or connections · CPC title