Residual quantization for neural networks
US-2020193273-A1 · Jun 18, 2020 · US
US11676003B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11676003-B2 |
| Application number | US-201816223603-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 18, 2018 |
| Priority date | Dec 18, 2018 |
| Publication date | Jun 13, 2023 |
| Grant date | Jun 13, 2023 |
Official abstract text for this publication.
Technology related to training a neural network accelerator using mixed precision data formats is disclosed. In one example of the disclosed technology, a neural network accelerator is configured to accelerate a given layer of a multi-layer neural network. An input tensor for the given layer can be converted from a normal-precision floating-point format to a quantized-precision floating-point format. A tensor operation can be performed using the converted input tensor. A result of the tensor operation can be converted from the block floating-point format to the normal-precision floating-point format. The converted result can be used to generate an output tensor of the layer of the neural network, where the output tensor is in normal-precision floating-point format.
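The flow described in the abstract (quantize the input, run the tensor operation, convert the result back, finish in normal precision) can be sketched in plain Python. This is an illustrative sketch only, not the patented implementation: the helper names `to_bfp` and `layer_forward`, the 8-bit signed mantissa width, the use of one shared exponent per vector, and the choice of bias plus ReLU as the normal-precision step are all assumptions made for the example.

```python
import math

def to_bfp(values, mantissa_bits=8):
    """Quantize floats to (shared_exponent, integer mantissas).

    Hypothetical helper: every value in `values` shares one exponent,
    and each value is approximated by mantissa * 2**shared_exponent.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0] * len(values)
    # frexp gives max_abs = m * 2**exp with 0.5 <= m < 1.
    _, exp = math.frexp(max_abs)
    shared_exp = exp - (mantissa_bits - 1)
    limit = 2 ** (mantissa_bits - 1) - 1  # largest signed mantissa
    mantissas = []
    for v in values:
        m = round(math.ldexp(v, -shared_exp))
        mantissas.append(max(-limit, min(limit, m)))  # clamp after rounding
    return shared_exp, mantissas

def layer_forward(x, weight_rows, bias, mantissa_bits=8):
    """One dense layer: dot products in BFP, bias and ReLU in normal precision."""
    ex, mx = to_bfp(x, mantissa_bits)
    out = []
    for row, b in zip(weight_rows, bias):
        ew, mw = to_bfp(row, mantissa_bits)       # each weight row has its own exponent
        acc = sum(a * w for a, w in zip(mx, mw))  # integer multiply-accumulate
        y = math.ldexp(acc, ex + ew)              # convert the result back to a float
        out.append(max(0.0, y + b))               # scalar work stays in normal precision
    return out

outputs = layer_forward(
    [1.0, 2.0, -0.5],
    [[0.5, 0.25, 1.0], [2.0, -1.0, 0.0]],
    [0.0, 0.5],
)
```

Because the mantissas are plain integers and the exponents are simply added, the inner loop needs no floating-point hardware; only the final `ldexp`, bias, and activation operate on normal-precision values.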
Opening claim text (preview).
We claim:

1. A computing system comprising: a computer-readable memory storing an operational parameter of a given layer of a neural network; and a hardware accelerator in communication with the computer-readable memory for accelerating tensor operations, the hardware accelerator configured to: receive an input tensor for a given layer of a multi-layer neural network; convert the input tensor from a normal-precision floating-point format to a quantized-precision floating-point format, the quantized-precision floating-point format being a block floating-point format, wherein a first converted input tensor portion corresponding to a first portion of the input tensor comprises a first common exponent for values in the first portion of the input tensor and a first plurality of mantissa values and a second converted tensor portion corresponding to a second portion of the input tensor comprises a second common exponent value for values in the second portion of the input tensor and a second plurality of mantissa values, wherein the first common exponent is different than the second common exponent; perform a tensor operation using the input tensor converted to the quantized-precision floating-point format; convert a result of the tensor operation from the quantized-precision floating-point format to the normal-precision floating-point format to provide a converted result in the normal-precision floating-point format; and in a training iteration of a plurality of iterations of training of the multi-layer neural network, updating the operational parameter of the given layer of the multi-layer neural network stored in the computer-readable memory using the converted result in the normal precision floating-point format, where the operational parameter of the given layer of the neural network is stored in normal-precision floating-point format.
2. The computing system of claim 1, wherein the input tensor is a two-dimensional matrix, and the quantized-precision floating-point format is a block floating-point format where a plurality of mantissa values within a given row share a common exponent, and mantissa values in different rows have different respective exponents.
3. The computing system of claim 1, wherein the input tensor is a convolution filter, and the quantized-precision floating-point format is a block floating-point format where a plurality of mantissa values within a spatial pixel share a common exponent.
4. The computing system of claim 1, wherein the tensor operation is performed during a back-propagation mode of the neural network, the input tensor is an output error term from an adjacent layer to the given layer.
5. The computing system of claim 1, wherein the tensor operation is a dot product computation.
6. The computing system of claim 1, wherein the tensor operation is a convolution.
7. The computing system of claim 1, wherein the tensor operation is performed during a back-propagation phase of training the neural network.
8. The computing system of claim 1, wherein using the converted result in the normal-precision floating-point format to update the operational parameter comprises performing a scalar operation that uses the converted result in the normal-precision floating-point format to generate the operational parameter.
9. The computing system of claim 8, wherein the scalar operation is performed for a single layer of the neural network.
10. The computing system of claim 8, wherein the scalar operation comprises adding a bias to the converted result.
11. The computing system of claim 8, wherein the scalar operation comprises applying an activation function to the converted result.
12. A method, implemented in a computing system, comprising: converting an input tensor for a given layer of a multi-layer neural network from a normal-precision floating-point format to converted values represented in a block floating-point format by (1) for a first portion on the input tensor, selecting a first bounding box including a first set of values expressed in the normal-precision floating-point format and where the block floating-point format uses a first common exponent for converted values of the first set of values; and (2) for a second portion of the input tensor, selecting a second bounding box comprising a second set of values expressed in the normal-precision floating point format and where the block-floating point format uses a second common exponent for converted values of the second set of values, where the second set of values is different from the first set of values and the second common exponent is different from the first common exponent; performing a tensor operation using the converted values in the input tensor converted to the block floating-point format; converting a result of the tensor operation from the block floating-point format to the normal-precision floating-point format; using the converted result in the normal-precision floating-point format to generate an output tensor of the layer of the neural network, where the output tensor is in normal-precision floating-point format; and in a training iteration of a plurality of iterations of training of the multi-layer neural network, updating an operational parameter of the multi-layer neural network using the converted result in the normal precision floating-point format, where the operational parameter of the given layer of the neural network is maintained in normal-precision floating-point format.
13. The method of claim 12, further comprising loading configuration data onto programmable hardware of a neural network accelerator so that the programmable hardware performs the operations of the given layer of a multi-layer neural network.
14. The method of claim 12, further comprising, in a neural network accelerator, initializing weights of input edges of the given layer of the multi-layer neural network.
15. The method of claim 12, wherein the first bounding box is a row of a matrix of the input tensor.
16. The method of claim 12 , first bounding box is a column of a matrix of the input tensor.
17. The method of claim 12, wherein converting the input tensor for the given layer from the normal-precision floating-point format to the block floating-point format comprises: selecting a bounding box for a plurality of elements of the input tensor; identifying a shared exponent for the first set of values within the bounding box of the input tensor; scaling mantissa values of the elements of the input tensor so that integer portions of the scaled mantissas have a selected number of bits for the block floating-point format; removing fractional bits from the scaled integer portions of the mantissas; and rounding the mantissas to produce block floating-point values.
18. The method of claim 12, wherein the multi-layer neural network is a recurrent neural network, further comprising configuring a neural network accelerator to accelerate the given layer of the multi-layer neural network comprises programming hardware to perform a function of a layer of the recurrent neural network.
19. The method of claim 12, performed in a neural network accelerator.
20. One or more non-transitory computer-readable media comprising: computer-executable instructions that, when executed by a computing device, cause the computing device to convert an input tensor for a given layer of a multi-layer neural network from a normal-precision floating-point format to a block floating-point format, by (1) for a first portion of the input tensor, selecting a
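The conversion steps recited in claim 17 (select a bounding box, identify a shared exponent, scale, drop fractional bits, round) and the row and column bounding boxes of claims 15 and 16 can be illustrated with a short sketch. Everything here is an assumption made for illustration rather than the claimed implementation: the function names, the 4-bit signed mantissa, taking the shared exponent from the largest magnitude in the box, and round-half-up as the rounding rule.

```python
import math

def quantize_box(values, mantissa_bits=4):
    """Convert one bounding box of floats to (shared_exponent, mantissas)."""
    # Step 1: the bounding box is `values`, chosen by the caller.
    # Step 2: take the shared exponent from the largest magnitude in the box.
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0] * len(values)
    _, exp = math.frexp(max_abs)            # max_abs = m * 2**exp, 0.5 <= m < 1
    shared_exp = exp - (mantissa_bits - 1)
    limit = 2 ** (mantissa_bits - 1) - 1
    mantissas = []
    for v in values:
        scaled = math.ldexp(v, -shared_exp)  # Step 3: scale to the mantissa range
        m = math.floor(scaled + 0.5)         # Steps 4-5: drop the fraction, rounding half up
        mantissas.append(max(-limit, min(limit, m)))
    return shared_exp, mantissas

def dequantize_box(shared_exp, mantissas):
    """Recover approximate floats: mantissa * 2**shared_exp."""
    return [math.ldexp(m, shared_exp) for m in mantissas]

def quantize_rows(matrix, mantissa_bits=4):
    """Bounding box = one row (as in claim 15)."""
    return [quantize_box(row, mantissa_bits) for row in matrix]

def quantize_columns(matrix, mantissa_bits=4):
    """Bounding box = one column (as in claim 16)."""
    return [quantize_box(list(col), mantissa_bits) for col in zip(*matrix)]

matrix = [[8.0, 4.0, -6.0],
          [0.25, 0.125, 0.5]]
per_row = quantize_rows(matrix)                     # two boxes, two exponents
one_box = quantize_box(matrix[0] + matrix[1])       # single box for comparison
```

In this example the two rows differ in magnitude, so per-row boxes end up with different shared exponents and both rows survive quantization, whereas a single box spanning the whole matrix takes its exponent from the large row and flushes the small row to zero. That is one plausible reason for letting different portions of a tensor carry different common exponents.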
using electronic means · CPC title
Matrix or vector computation {, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization (matrix transposition G06F7/78)} · CPC title
Correlation function computation {including computation of convolution operations (arithmetic circuits for sum of products per se, e.g. multiply-accumulators G06F7/5443; digital filters, e.g. FIR, IIR, adaptive filters H03H17/00)} · CPC title
Combinations of networks · CPC title
Machine learning · CPC title