Fixed point integer implementations for neural networks
US-2021004686-A1 · Jan 7, 2021 · US
US12355471B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12355471-B2 |
| Application number | US-202218084948-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 20, 2022 |
| Priority date | Aug 4, 2017 |
| Publication date | Jul 8, 2025 |
| Grant date | Jul 8, 2025 |
Official abstract text for this publication.
A method of generating a fixed-point quantized neural network includes analyzing a statistical distribution for each channel of floating-point parameter values of feature maps and a kernel for each channel from data of a pre-trained floating-point neural network, determining a fixed-point expression of each of the parameters for each channel statistically covering a distribution range of the floating-point parameter values based on the statistical distribution for each channel, determining fractional lengths of a bias and a weight for each channel among the parameters of the fixed-point expression for each channel based on a result of performing a convolution operation, and generating a fixed-point quantized neural network in which the bias and the weight for each channel have the determined fractional lengths.
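As a rough illustration of the channel-wise step described in the abstract, the sketch below picks a fractional length for each channel from that channel's largest absolute value, then quantizes the channel to signed fixed point. The function names (`fractional_length`, `quantize_channel`), the 8-bit width, and the choice of the maximum absolute value as the governing statistic are assumptions made for this example. The abstract only says the fixed-point expression should statistically cover the distribution range for each channel.

```python
import math

def fractional_length(values, bit_width=8):
    """Pick a fractional length so the signed range covers max|v| (assumed rule)."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return bit_width - 1
    # Integer bits needed for the magnitude; the sign bit is counted separately.
    int_bits = math.floor(math.log2(max_abs)) + 1
    return bit_width - 1 - int_bits

def quantize_channel(values, bit_width=8):
    """Quantize one channel to signed fixed point; returns (integers, fractional length)."""
    fl = fractional_length(values, bit_width)
    scale = 2.0 ** fl
    q_min, q_max = -(1 << (bit_width - 1)), (1 << (bit_width - 1)) - 1
    # Round to the nearest integer code and saturate at the representable range.
    q = [min(q_max, max(q_min, round(v * scale))) for v in values]
    return q, fl

# Two hypothetical channels with different dynamic ranges.
channels = {"ch0": [0.9, -0.25, 0.5], "ch1": [3.7, -1.2, 0.05]}
for name, vals in channels.items():
    print(name, quantize_channel(vals))
```

With these made-up values, the narrow channel `ch0` gets a fractional length of 7 and the wider channel `ch1` gets 5, which is the point of quantizing per channel rather than per layer: each channel spends its bits where its own values lie.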
Opening claim text (preview).
What is claimed is:
1. A method of generating a fixed-point quantized neural network, the method comprising: analyzing a statistical distribution of floating-point values for each channel of feature maps and a kernel for each channel from data of a pre-trained floating-point neural network; quantizing the floating-point values for each channel to fixed-point values for each channel based on the statistical distribution for each channel; determining fractional lengths of fixed-point expressions of parameters for performing an operation of the quantized fixed-point values, for each channel; and generating a fixed-point quantized neural network in which the parameters for each channel have the determined fractional lengths being different for at least some channels, including performing a channel-wise quantization for each channel included in the pre-trained floating-point neural network, wherein the determining of the fractional lengths comprises determining a fractional length of a bias for each channel based on fractional lengths of input activations and fractional lengths of weights for each channel input to multiply-accumulate (MAC) operations, and the determining of the fractional lengths further comprises determining a fractional length of a weight of the weights for each channel by decreasing the fractional length of the weight by a difference between the fractional length of one of the fixed-point expressions corresponding to a result of one of the MAC operations to which the weight was input and the determined fractional length of the bias.
2. The method of claim 1, wherein the analyzing of the statistical distribution comprises obtaining statistics for each channel of the floating-point values of weights, input activations, and output activations used in each channel during pre-training of the pre-trained floating-point neural network.
3. The method of claim 1, wherein the operation comprises a partial sum operation of a convolution operation between a plurality of channels, the partial sum operation comprises a plurality of multiply-accumulate (MAC) operations and an Add operation, the parameters comprise the bias and the weight for each channel.
4. The method of claim 3, wherein the determining of the fractional length of the bias comprises determining the fractional length of the bias based on a maximum fractional length among fractional lengths of fixed-point expressions corresponding to results of the MAC operations.
5. The method of claim 4, wherein the partial sum operation comprises: a first MAC operation between a first input activation of a first channel of an input feature map of the feature maps and a first weight of a first channel of the kernel; a second MAC operation between a second input activation of a second channel of the input feature map and a second weight of a second channel of the kernel; and an Add operation between a result of the first MAC operation, a result of the second MAC operation, and the bias, and the determining of the fractional length of the bias further comprises: obtaining a first fractional length of a first fixed-point expression corresponding to the result of the first MAC operation; obtaining a second fractional length of a second fixed-point expression corresponding to the result of the second MAC operation; and determining the fractional length of the bias to be a maximum fractional length among the first fractional length and the second fractional length.
6. The method of claim 5, further comprising bit-shifting a fractional length of a fixed-point expression having a smaller fractional length among the first fixed-point expression and the second fixed-point expression based on the determined fractional length of the bias, wherein the fixed-point quantized neural network comprises information about an amount of the bit-shifting.
7. The method of claim 3, wherein the determining of the fractional length of the bias comprises determining the fractional length of the bias to be a minimum fractional length among fractional lengths of fixed-point expressions respectively corresponding to results of the MAC operations.
8. The method of claim 3, wherein the partial sum operation comprises: a first MAC operation between a first input activation of a first channel of an input feature map of the feature maps and a first weight of a first channel of the kernel; a second MAC operation between a second input activation of a second channel of the input feature map and a second weight of a second channel of the kernel; and an Add operation between a result of the first MAC operation, a result of the second MAC operation, and the bias, the determining of the fractional lengths further comprises: obtaining a first fractional length of a first fixed-point expression corresponding to the result of the first MAC operation; and obtaining a second fractional length of a second fixed-point expression corresponding to the result of the second MAC operation, the determining of the fractional length of the bias comprises determining the fractional length of the bias to be a minimum fractional length among the first fractional length and the second fractional length, and the determining of the fractional lengths further comprises tuning a fractional length of the weight input to one of the first MAC operation and the second MAC operation that produces a result having a fixed-point expression having the minimum fractional length by decreasing the fractional length of the weight by a difference between the first fractional length and the second fractional length.
9. The method of claim 1, wherein the statistical distribution for each channel is a distribution approximated by a normal distribution or a Laplace distribution, and the quantizing of the floating-point values comprises determining fixed-point expression of the fixed-point values based on a fractional length for each channel determined based on any one or any combination of any two or more of a mean, a variance, a standard deviation, a maximum value, and a minimum value of the floating-point values for each channel obtained from the statistical distribution for each channel.
10. The method of claim 1, further comprising retraining, after the determining of the fractional lengths is completed, the fixed-point quantized neural network with the determined fractional lengths of the parameters for each channel set as constraints of the fixed-point quantized neural network to fine tune the fixed-point quantized neural network.
11. An apparatus for generating a fixed-point quantized neural network, the apparatus comprising: a memory configured to store at least one program; and a processor configured to execute the at least one program, wherein the processor executing the at least one program configures the processor to: analyze a statistical distribution of floating-point values for each channel of feature maps and a kernel for each channel from data of a pre-trained floating-point neural network, quantize the floating-point values for each channel to fixed-point values for each channel based on the statistical distribution for each channel, determine fractional lengths of fixed-point expressions of parameters for performing an operation of the quantized fixed-point values, for each channel, and generate a fixed-point quantized neural network in which the parameters for each channel have the determined fractional lengths being different for at least some channels, including performing a channel-wise quantization for each channel included in the pre-trained floating-point neural network, wherein the determining of the fractional lengths comprises determining a fracti
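The fractional-length bookkeeping in the claims can be sketched in a few lines. The example below is an assumption-laden reading, not the patented implementation: it takes the fractional length of a MAC result to be the sum of the activation and weight fractional lengths (standard fixed-point arithmetic), then shows two alignment rules. `align_to_max` follows the maximum rule with left bit-shifts of the smaller results, and `align_to_min` follows the minimum rule with weight fractional lengths reduced by the difference. All function names and the sample numbers are hypothetical.

```python
def mac_fractional_lengths(fl_act, fl_wgt):
    # A product of two fixed-point numbers carries the sum of their fractional lengths.
    return [a + w for a, w in zip(fl_act, fl_wgt)]

def align_to_max(fl_act, fl_wgt):
    """Bias takes the maximum MAC fractional length; smaller results are left-shifted."""
    mac_fls = mac_fractional_lengths(fl_act, fl_wgt)
    fl_bias = max(mac_fls)
    shifts = [fl_bias - fl for fl in mac_fls]
    return fl_bias, shifts

def align_to_min(fl_act, fl_wgt):
    """Bias takes the minimum MAC fractional length; each weight's fractional
    length is decreased by the gap between its MAC result and the bias."""
    mac_fls = mac_fractional_lengths(fl_act, fl_wgt)
    fl_bias = min(mac_fls)
    tuned_wgt = [w - (fl - fl_bias) for w, fl in zip(fl_wgt, mac_fls)]
    return fl_bias, tuned_wgt

def partial_sum(acts_q, wgts_q, fl_act, fl_wgt, bias):
    """Integer-only partial sum using the maximum rule; returns (accumulator, fractional length)."""
    fl_bias, shifts = align_to_max(fl_act, fl_wgt)
    acc = round(bias * 2 ** fl_bias)  # bias quantized at the shared fractional length
    for a, w, s in zip(acts_q, wgts_q, shifts):
        acc += (a * w) << s  # shift each product up to the shared fractional length
    return acc, fl_bias

# Channel 1: activation 1.5 (fl 4 -> 24), weight 0.5 (fl 6 -> 32).
# Channel 2: activation 2.0 (fl 3 -> 16), weight -0.25 (fl 5 -> -8).
acc, fl = partial_sum([24, 16], [32, -8], [4, 3], [6, 5], bias=0.125)
print(acc, fl, acc / 2 ** fl)          # 384 10 0.375
print(align_to_min([4, 3], [6, 5]))    # (8, [4, 5])
```

The integer result 384 at fractional length 10 decodes to 0.375, matching the floating-point sum 0.75 - 0.5 + 0.125. Under the maximum rule the per-channel shift amounts (here 0 and 2) would need to be stored with the network, whereas the minimum rule avoids run-time shifts at the cost of fewer fractional bits in some weights.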
Neural networks · CPC title
Sum of products (for applications thereof, see the relevant places, e.g. G06F17/10, H03H17/00) · CPC title
Rounding · CPC title
Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers {(G06F7/4806, G06F7/4824, G06F7/49, G06F7/491, G06F7/544 take precedence)} · CPC title
Backpropagation, e.g. using gradient descent · CPC title