Methods and apparatuses for high performance and accuracy fixed-point scale implementation

US12293275B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12293275-B2
Application number  US-202117368523-A
Country             US
Kind code           B2
Filing date         Jul 6, 2021
Priority date       Jul 6, 2021
Publication date    May 6, 2025
Grant date          May 6, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method to implement a fixed-point scale layer in a neural network for data processing is provided in the present disclosure. The method includes: receiving floating-point input data over a channel of a standalone floating-point scale layer, and converting the floating-point input data into fixed-point input data of the standalone floating-point scale layer; obtaining fixed-point quantization parameters in each channel based on the input data and floating-point parameters γ_i, β_i in each channel; converting the standalone floating-point scale layer based on the fixed-point quantization parameters into a fixed-point scale layer for processing the fixed-point input data to generate fixed-point output data; and mapping the fixed-point scale layer to a fixed-point convolution layer and the computation of convolution is done by matrix multiplication that can be executed on a GEMM engine.
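The mapping described in the abstract can be illustrated in floating point before any quantization is applied. The sketch below is only an illustration under stated assumptions: it treats the scale layer as the per-channel affine operation y = γ_i · x + β_i, and it models the equivalent convolution as a 1x1 convolution whose weight matrix is diagonal, so that the whole layer becomes one matrix multiplication plus a bias. The function names `scale_layer` and `scale_as_matmul` and the diagonal-weight formulation are hypothetical choices made for this example, not text taken from the patent.

```python
def scale_layer(x, gamma, beta):
    """Per-channel scale layer: y[c][n] = gamma[c] * x[c][n] + beta[c].

    x is a list of channels, each a list of values (shape C x N).
    """
    return [[gamma[c] * v + beta[c] for v in row] for c, row in enumerate(x)]


def scale_as_matmul(x, gamma, beta):
    """Same computation as a (C x C) diagonal weight matrix times the (C x N) input.

    Assumption for illustration: a 1x1 convolution with diagonal weights,
    which is the shape of work a GEMM or MAC array would execute.
    """
    channels = len(x)
    columns = len(x[0])
    # Diagonal weight matrix: gamma on the diagonal, zeros elsewhere.
    weights = [
        [gamma[c] if c == k else 0.0 for k in range(channels)]
        for c in range(channels)
    ]
    out = []
    for c in range(channels):
        row = []
        for j in range(columns):
            acc = sum(weights[c][k] * x[k][j] for k in range(channels))
            row.append(acc + beta[c])  # the bias plays the role of beta
        out.append(row)
    return out


x = [[1.0, 2.0], [3.0, 4.0]]   # two channels, two values each
gamma = [0.5, 2.0]
beta = [1.0, -1.0]
print(scale_layer(x, gamma, beta))      # [[1.5, 2.0], [5.0, 7.0]]
print(scale_as_matmul(x, gamma, beta))  # [[1.5, 2.0], [5.0, 7.0]]
```

In this formulation γ_i takes the place of the filter weight for channel i and β_i takes the place of the bias, which matches the multiply-then-add structure recited in claim 6.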

First claim

Opening claim text (preview).

What is claimed is:

1. A data processing method, comprising: receiving floating-point input data over a channel of a standalone floating-point scale layer in a neural network, and converting the floating-point input data into fixed-point input data of the standalone floating-point scale layer; obtaining fixed-point quantization parameters in each channel based on input data and two floating-point parameters γ_i, β_i in each channel, wherein the standalone floating-point scale layer comprises a plurality of channels, and the fixed-point quantization parameters are generated separately for each of the plurality of channels; converting the standalone floating-point scale layer based on the fixed-point quantization parameters into a fixed-point scale layer for processing the fixed-point input data to generate fixed-point output data; and mapping the fixed-point scale layer to a fixed-point convolution layer, wherein computation of convolution is done by matrix multiplication executed on a General Matrix Multiplication (GEMM) engine.

2. The data processing method according to claim 1, wherein processing the fixed-point input data to generate the fixed-point output data further comprises: for the fixed-point input data in a size of 16-bit, multiplying the fixed-point input data with a first fixed-point quantization parameter in size S16 to receive a first output; summing up the first output and a second fixed-point quantization parameter in size S47 to receive a second output; right shifting with rounding the second output with an accumulator shift in size U8 to receive a third output; clamping the third output to receive a fourth output in size S32; multiplying the fourth output with an output scale in size U16 to receive a fifth output; right shifting with rounding the fifth output with an output shift in size U8 to receive a sixth output; and clamping the sixth output into the fixed-point output data in size S16 or U16.

3. The data processing method according to claim 2, wherein for the fixed-point input data in the size of 16-bit, a preferred range of the second fixed-point quantization parameter is 30 to 47 bits.

4. The data processing method according to claim 1, wherein processing the fixed-point input data to generate the fixed-point output data further comprises: for the fixed-point input data in a size of 8-bit, multiplying the fixed-point input data with a first fixed-point quantization parameter in size S8 or U8 to receive a first output; summing up the first output and a second fixed-point quantization parameter in size S31 to receive a second output; right shifting with rounding the second output with an accumulator shift in size U8 to receive a third output; clamping the third output to receive a fourth output in size S16; multiplying the fourth output with an output scale in size U16 to receive a fifth output; right shifting with rounding the fifth output with an output shift in size U8 to receive a sixth output; and clamping the sixth output into the fixed-point output data in size S8 or U8.

5. The data processing method according to claim 4, wherein for the fixed-point input data in the size of 8-bit, a preferred range of the second fixed-point quantization parameter is 15 to 31 bits.

6. The data processing method according to claim 1, wherein mapping the fixed-point scale layer to the fixed-point convolution layer further comprises: multiplying the fixed-point input data with a filter weight for the channel in the fixed-point scale layer to receive a product; and summing up the product and a bias for the channel in the fixed-point scale layer.

7. The data processing method according to claim 1, wherein the matrix multiplication is executed on a GEMM engine or a Multiply-Accumulate (MAC) operations array.

8. An apparatus for implementing a neural network, comprising: one or more processors; and a memory configured to store instructions executable by the one or more processors; wherein the one or more processors, upon execution of the instructions, are configured to: receive floating-point input data over a channel of a standalone floating-point scale layer in a neural network, and convert the floating-point input data into fixed-point input data of the standalone floating-point scale layer; obtain fixed-point quantization parameters in each channel based on input data and two floating-point parameters γ_i, β_i in each channel, wherein the standalone floating-point scale layer comprises a plurality of channels, and the fixed-point quantization parameters are generated separately for each of the plurality of channels; convert the standalone floating-point scale layer based on the fixed-point quantization parameters into a fixed-point scale layer in the neural network for processing the fixed-point input data to generate fixed-point output data; and map the fixed-point scale layer to a fixed-point convolution layer, wherein computation of convolution is done by matrix multiplication executed on a GEMM engine.

9. The apparatus of claim 8, wherein the one or more processors are further configured to: for the fixed-point input data in a size of 16-bit, multiply the fixed-point input data with a first fixed-point quantization parameter in size S16 to receive a first output; sum up the first output and a second fixed-point quantization parameter in size S47 to receive a second output; right shift with rounding the second output with an accumulator shift in size U8 to receive a third output; clamp the third output to receive a fourth output in size S32; multiply the fourth output with an output scale in size U16 to receive a fifth output; right shift with rounding the fifth output with an output shift in size U8 to receive a sixth output; and clamp the sixth output into the output data in size S16 or U16.

10. The apparatus of claim 9, wherein for the fixed-point input data in the size of 16-bit, a preferred range of the second fixed-point quantization parameter is 30 to 47 bits.

11. The apparatus of claim 8, wherein the one or more processors are further configured to: for the fixed-point input data in a size of 8-bit, multiply the fixed-point input data with a first fixed-point quantization parameter in size S8 or U8 to receive a first output; sum up the first output and a second fixed-point quantization parameter in size S31 to receive a second output; right shift with rounding the second output with an accumulator shift in size U8 to receive a third output; clamp the third output to receive a fourth output in size S16; multiply the fourth output with an output scale in size U16 to receive a fifth output; right shift with rounding the fifth output with a third parameter in size U8 to receive a sixth output; and clamp the sixth output into the fixed-point output data in size S8 or U8.

12. The apparatus of claim 11, wherein for the fixed-point input data in the size of 8-bit, a preferred range of the second fixed-point quantization parameter is 15 to 31 bits.

13. The apparatus of claim 8, the one or more processors are further configured to: multiply the fixed-point input data with a filter weight for the channel in the fixed-point scale layer to receive a product; and sum up the product and a bias for the channel in the fixed-point scale layer.

14. The apparatus of claim 8, wherein the matrix multiplication is executed on a GEMM engine or a MAC operations array.

15. A non-transitory computer readable storage medium, comprising instructions stored therein to implement a neural network, wherein, upon execution of the instructions by one o
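The six-step integer pipeline recited in claims 2 and 4 (multiply, add, shift with rounding, clamp, multiply by the output scale, shift with rounding, clamp) can be sketched for a single value as follows. This is a minimal sketch under assumptions rather than the patented implementation: the claims do not state the rounding mode, so round-half-up is assumed here, the final clamp is shown for a signed output only, and the helper names `clamp_signed`, `round_shift_right`, and `fixed_point_scale` are hypothetical. The defaults follow the 8-bit case of claim 4 (S16 intermediate, S8 output); passing `mid_bits=32, out_bits=16` gives the widths of the 16-bit case in claim 2.

```python
def clamp_signed(value, bits):
    """Saturate value to the range of a signed integer with the given width."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def round_shift_right(value, shift):
    """Right shift with rounding. ASSUMPTION: round half up (add half, then shift)."""
    if shift == 0:
        return value
    return (value + (1 << (shift - 1))) >> shift  # >> is arithmetic in Python


def fixed_point_scale(x, scale, bias, acc_shift, out_scale, out_shift,
                      mid_bits=16, out_bits=8):
    """One value through the pipeline of claim 4 (8-bit) or claim 2 (16-bit)."""
    first = x * scale                               # first fixed-point parameter
    second = first + bias                           # second parameter (S31 / S47)
    third = round_shift_right(second, acc_shift)    # accumulator shift (U8)
    fourth = clamp_signed(third, mid_bits)          # S16 or S32 intermediate
    fifth = fourth * out_scale                      # output scale (U16)
    sixth = round_shift_right(fifth, out_shift)     # output shift (U8)
    return clamp_signed(sixth, out_bits)            # signed S8 or S16 output


# Illustrative parameter values only; they are not taken from the patent.
print(fixed_point_scale(100, 50, 1000, 4, 16384, 16))   # 94
print(fixed_point_scale(127, 127, 0, 0, 65535, 8))      # 127 (saturated)
```

In the first call the intermediate values are 5000, 6000, 375, 375, 6144000, and 94, so the effective output multiplier is 16384 / 2^16 = 0.25. The second call shows the final clamp saturating an out-of-range result to the S8 maximum.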

Assignees

Kwai Inc, Beijing Dajia Internet Information Tech Co Ltd

Inventors

Classifications

  • Format conversion instructions, e.g. Floating-Point to Integer, decimal conversion · CPC title

  • Quantised networks; Sparse networks; Compressed networks · CPC title

  • Convolutional networks [CNN, ConvNet] · CPC title

  • G06N3/04 (Primary)

    Architecture, e.g. interconnection topology · CPC title

  • G06N3/063 (Primary)

    using electronic means · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12293275B2 cover?
A method to implement a fixed-point scale layer in a neural network for data processing is provided in the present disclosure. The method includes: receiving floating-point input data over a channel of a standalone floating-point scale layer, and converting the floating-point input data into fixed-point input data of the standalone floating-point scale layer; obtaining fixed-point quantization par…
Who is the assignee on this patent?
Kwai Inc, Beijing Dajia Internet Information Tech Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06N3/04. Mapped technology areas include Physics.
When was this patent published?
Publication date May 6, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).