Neural Network Representation Formats
US-2022222541-A1 · Jul 14, 2022 · US
US12293275B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12293275-B2 |
| Application number | US-202117368523-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jul 6, 2021 |
| Priority date | Jul 6, 2021 |
| Publication date | May 6, 2025 |
| Grant date | May 6, 2025 |
Official abstract text for this publication.
A method to implement a fixed-point scale layer in a neural network for data processing is provided in the present disclosure. The method includes: receiving floating-point input data over a channel of a standalone floating-point scale layer, and converting the floating-point input data into fixed-point input data of the standalone floating-point scale layer; obtaining fixed-point quantization parameters in each channel based on the input data and floating-point parameters γ_i, β_i in each channel; converting the standalone floating-point scale layer based on the fixed-point quantization parameters into a fixed-point scale layer for processing the fixed-point input data to generate fixed-point output data; and mapping the fixed-point scale layer to a fixed-point convolution layer, wherein the computation of the convolution is done by matrix multiplication that can be executed on a GEMM engine.
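The mapping from a scale layer to a convolution can be illustrated with a small sketch. It assumes the scale layer computes a per-channel multiply-add (the "filter weight" and "bias" of claim 6) and that the equivalent convolution is a 1x1 convolution whose filter bank holds the per-channel weights on its diagonal. The function names and the diagonal layout are illustrative assumptions, not taken from the patent.

```python
def scale_layer(x, gamma, beta):
    """Per-channel scale: y[c][p] = gamma[c] * x[c][p] + beta[c]."""
    return [[gamma[c] * v + beta[c] for v in row] for c, row in enumerate(x)]


def scale_as_gemm(x, gamma, beta):
    """Same result as a 1x1 convolution written as W @ X + bias."""
    channels = len(x)
    pixels = len(x[0])
    # (C, C) filter bank: gamma on the diagonal, zeros elsewhere (assumed layout).
    weights = [[gamma[r] if r == c else 0 for c in range(channels)]
               for r in range(channels)]
    out = []
    for r in range(channels):
        row = []
        for p in range(pixels):
            # One multiply-accumulate pass over all input channels.
            acc = sum(weights[r][c] * x[c][p] for c in range(channels))
            row.append(acc + beta[r])
        out.append(row)
    return out


# Three channels, four pixels each (H*W flattened), integer-valued.
x = [[1, -2, 3, 4], [0, 5, -6, 7], [8, 9, -10, 11]]
gamma = [2, -3, 5]
beta = [10, 0, -7]
```

Both functions return the same values for this input, which is why hardware that only offers matrix multiplication can still run a standalone scale layer.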
Opening claim text (preview).
What is claimed is:

1. A data processing method, comprising: receiving floating-point input data over a channel of a standalone floating-point scale layer in a neural network, and converting the floating-point input data into fixed-point input data of the standalone floating-point scale layer; obtaining fixed-point quantization parameters in each channel based on input data and two floating-point parameters γ_i, β_i in each channel, wherein the standalone floating-point scale layer comprises a plurality of channels, and the fixed-point quantization parameters are generated separately for each of the plurality of channels; converting the standalone floating-point scale layer based on the fixed-point quantization parameters into a fixed-point scale layer for processing the fixed-point input data to generate fixed-point output data; and mapping the fixed-point scale layer to a fixed-point convolution layer, wherein computation of convolution is done by matrix multiplication executed on a General Matrix Multiplication (GEMM) engine.

2. The data processing method according to claim 1, wherein processing the fixed-point input data to generate the fixed-point output data further comprises: for the fixed-point input data in a size of 16-bit, multiplying the fixed-point input data with a first fixed-point quantization parameter in size S16 to receive a first output; summing up the first output and a second fixed-point quantization parameter in size S47 to receive a second output; right shifting with rounding the second output with an accumulator shift in size U8 to receive a third output; clamping the third output to receive a fourth output in size S32; multiplying the fourth output with an output scale in size U16 to receive a fifth output; right shifting with rounding the fifth output with an output shift in size U8 to receive a sixth output; and clamping the sixth output into the fixed-point output data in size S16 or U16.

3. The data processing method according to claim 2, wherein for the fixed-point input data in the size of 16-bit, a preferred range of the second fixed-point quantization parameter is 30 to 47 bits.

4. The data processing method according to claim 1, wherein processing the fixed-point input data to generate the fixed-point output data further comprises: for the fixed-point input data in a size of 8-bit, multiplying the fixed-point input data with a first fixed-point quantization parameter in size S8 or U8 to receive a first output; summing up the first output and a second fixed-point quantization parameter in size S31 to receive a second output; right shifting with rounding the second output with an accumulator shift in size U8 to receive a third output; clamping the third output to receive a fourth output in size S16; multiplying the fourth output with an output scale in size U16 to receive a fifth output; right shifting with rounding the fifth output with an output shift in size U8 to receive a sixth output; and clamping the sixth output into the fixed-point output data in size S8 or U8.

5. The data processing method according to claim 4, wherein for the fixed-point input data in the size of 8-bit, a preferred range of the second fixed-point quantization parameter is 15 to 31 bits.

6. The data processing method according to claim 1, wherein mapping the fixed-point scale layer to the fixed-point convolution layer further comprises: multiplying the fixed-point input data with a filter weight for the channel in the fixed-point scale layer to receive a product; and summing up the product and a bias for the channel in the fixed-point scale layer.

7. The data processing method according to claim 1, wherein the matrix multiplication is executed on a GEMM engine or a Multiply-Accumulate (MAC) operations array.

8. An apparatus for implementing a neural network, comprising: one or more processors; and a memory configured to store instructions executable by the one or more processors; wherein the one or more processors, upon execution of the instructions, are configured to: receive floating-point input data over a channel of a standalone floating-point scale layer in a neural network, and convert the floating-point input data into fixed-point input data of the standalone floating-point scale layer; obtain fixed-point quantization parameters in each channel based on input data and two floating-point parameters γ_i, β_i in each channel, wherein the standalone floating-point scale layer comprises a plurality of channels, and the fixed-point quantization parameters are generated separately for each of the plurality of channels; convert the standalone floating-point scale layer based on the fixed-point quantization parameters into a fixed-point scale layer in the neural network for processing the fixed-point input data to generate fixed-point output data; and map the fixed-point scale layer to a fixed-point convolution layer, wherein computation of convolution is done by matrix multiplication executed on a GEMM engine.

9. The apparatus of claim 8, wherein the one or more processors are further configured to: for the fixed-point input data in a size of 16-bit, multiply the fixed-point input data with a first fixed-point quantization parameter in size S16 to receive a first output; sum up the first output and a second fixed-point quantization parameter in size S47 to receive a second output; right shift with rounding the second output with an accumulator shift in size U8 to receive a third output; clamp the third output to receive a fourth output in size S32; multiply the fourth output with an output scale in size U16 to receive a fifth output; right shift with rounding the fifth output with an output shift in size U8 to receive a sixth output; and clamp the sixth output into the output data in size S16 or U16.

10. The apparatus of claim 9, wherein for the fixed-point input data in the size of 16-bit, a preferred range of the second fixed-point quantization parameter is 30 to 47 bits.

11. The apparatus of claim 8, wherein the one or more processors are further configured to: for the fixed-point input data in a size of 8-bit, multiply the fixed-point input data with a first fixed-point quantization parameter in size S8 or U8 to receive a first output; sum up the first output and a second fixed-point quantization parameter in size S31 to receive a second output; right shift with rounding the second output with an accumulator shift in size U8 to receive a third output; clamp the third output to receive a fourth output in size S16; multiply the fourth output with an output scale in size U16 to receive a fifth output; right shift with rounding the fifth output with a third parameter in size U8 to receive a sixth output; and clamp the sixth output into the fixed-point output data in size S8 or U8.

12. The apparatus of claim 11, wherein for the fixed-point input data in the size of 8-bit, a preferred range of the second fixed-point quantization parameter is 15 to 31 bits.

13. The apparatus of claim 8, the one or more processors are further configured to: multiply the fixed-point input data with a filter weight for the channel in the fixed-point scale layer to receive a product; and sum up the product and a bias for the channel in the fixed-point scale layer.

14. The apparatus of claim 8, wherein the matrix multiplication is executed on a GEMM engine or a MAC operations array.

15. A non-transitory computer readable storage medium, comprising instructions stored therein to implement a neural network, wherein, upon execution of the instructions by one o
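
The 8-bit processing chain recited in claim 4 can be sketched in a few lines of integer arithmetic. This is a minimal illustration under stated assumptions: `S`/`U` sizes are read as signed/unsigned bit widths, "right shifting with rounding" is modeled as round-half-up (the claims do not name a rounding mode), the output is taken as S8, and all function and parameter names are hypothetical.

```python
def clamp(value, bits, signed=True):
    """Saturate value to the range of a signed or unsigned integer of `bits` bits."""
    if signed:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        lo, hi = 0, (1 << bits) - 1
    return max(lo, min(hi, value))


def rshift_round(value, shift):
    """Arithmetic right shift with round-half-up (assumed rounding mode)."""
    if shift == 0:
        return value
    return (value + (1 << (shift - 1))) >> shift


def scale_8bit(x, mult, bias, acc_shift, out_scale, out_shift):
    """Follow the claim 4 steps for one 8-bit input value in one channel."""
    first = x * mult                         # multiply by first parameter (S8)
    second = first + bias                    # add second parameter (S31)
    third = rshift_round(second, acc_shift)  # accumulator shift (U8)
    fourth = clamp(third, 16)                # saturate to S16
    fifth = fourth * out_scale               # multiply by output scale (U16)
    sixth = rshift_round(fifth, out_shift)   # output shift (U8)
    return clamp(sixth, 8)                   # saturate to S8 output


# Example parameters chosen only for illustration.
result = scale_8bit(x=100, mult=64, bias=1000, acc_shift=4,
                    out_scale=32768, out_shift=18)
print(result)  # 58
```

The 16-bit variant of claim 2 follows the same sequence with wider intermediate clamps (S32 for the fourth output, S16 or U16 for the result).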
Format conversion instructions, e.g. Floating-Point to Integer, decimal conversion · CPC title
Quantised networks; Sparse networks; Compressed networks · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Architecture, e.g. interconnection topology · CPC title
using electronic means · CPC title