US11775257B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11775257-B2 |
| Application number | US-202016840847-A |
| Country | US |
| Kind code | B2 |
| Filing date | Apr 6, 2020 |
| Priority date | Jun 5, 2018 |
| Publication date | Oct 3, 2023 |
| Grant date | Oct 3, 2023 |
Official abstract text for this publication.
Techniques for operating on and calculating binary floating-point numbers using an enhanced floating-point number format are presented. The enhanced format can comprise a single sign bit, six bits for the exponent, and nine bits for the fraction. Using six bits for the exponent can provide an enhanced exponent range that facilitates desirably fast convergence of computing-intensive algorithms and low error rates for computing-intensive applications. The enhanced format can employ a specified definition for the lowest binade that enables the lowest binade to be used for zero and normal numbers; and a specified definition for the highest binade that enables it to be structured to have one data point used for a merged Not-a-Number (NaN)/infinity symbol and remaining data points used for finite numbers. The signs of zero and merged NaN/infinity can be “don't care” terms. The enhanced format employs only one rounding mode, which is for rounding toward nearest up.
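The field layout described in the abstract can be illustrated with a small decoder. This is only a sketch under stated assumptions, not the patent's implementation: the exponent bias (`BIAS = 31`) and the choice of the all-ones fraction as the single merged NaN/infinity data point (`MERGED_FRACTION`) are not given in the text above and are picked here for illustration, as is returning `math.nan` for the merged symbol. The names `split_fields` and `decode` are hypothetical.

```python
import math

# Field widths taken from the abstract: 1 sign bit, 6 exponent bits, 9 fraction bits.
EXP_BITS = 6
FRAC_BITS = 9
EXP_MASK = (1 << EXP_BITS) - 1      # 0x3F
FRAC_MASK = (1 << FRAC_BITS) - 1    # 0x1FF

# ASSUMPTION: the bias is not stated in the abstract; 31 is an illustrative choice.
BIAS = 31
# ASSUMPTION: which data point of the highest binade holds the merged
# NaN/infinity symbol is not stated; the all-ones fraction is used here.
MERGED_FRACTION = FRAC_MASK


def split_fields(word):
    """Split a 16-bit word into (sign, exponent, fraction) integers."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("expected a 16-bit unsigned integer")
    sign = word >> (EXP_BITS + FRAC_BITS)
    exponent = (word >> FRAC_BITS) & EXP_MASK
    fraction = word & FRAC_MASK
    return sign, exponent, fraction


def decode(word):
    """Decode a 1/6/9 word to a Python float under the assumptions above."""
    sign, exponent, fraction = split_fields(word)
    # Lowest binade: the all-zero fraction is zero; its sign is a "don't care".
    if exponent == 0 and fraction == 0:
        return 0.0
    # Highest binade: one data point is the merged NaN/infinity symbol,
    # whose sign is also a "don't care". It is modelled here as NaN.
    if exponent == EXP_MASK and fraction == MERGED_FRACTION:
        return math.nan
    # Every other data point, including the rest of the lowest and highest
    # binades, is a normal number with an implicit leading one.
    magnitude = (1 + fraction / (1 << FRAC_BITS)) * 2.0 ** (exponent - BIAS)
    return -magnitude if sign else magnitude
```

For example, with these assumptions `decode(31 << 9)` yields `1.0`, and `decode(0x0001)` is a normal number just above 2^-31 rather than a subnormal, because the lowest binade is used for zero and normal numbers only.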
Opening claim text (preview).
What is claimed is:
1. A system, comprising: a memory that stores computer-executable components; and a processor, operatively coupled to the memory, that executes computer-executable components, the computer-executable components comprising: a calculator component that facilitates operation on and calculation of binary floating-point numbers by the processor in accordance with a defined 16-bit floating-point number format, in connection with execution of a machine learning application, wherein the defined 16-bit floating-point number format utilizes greater than five bits in an exponent field, wherein the defined 16-bit floating-point number format utilizes a first binade to represent zero and normal numbers, wherein the first binade is associated with the exponent field having all zeros, and wherein a normal number of the normal numbers is a finite non-zero floating-point number with a magnitude greater than or equal to a minimum value that is determined as a function of a radix and a minimum exponent associated with the defined 16-bit floating-point number format, and wherein the defined 16-bit floating-point number format is applied to the machine learning application and results in reduced error rates and improved convergence time; an operation management component operatively coupled to the calculator component and the processor, wherein the operation management component: allocates a first portion of operations of the calculator component and associated data to a set of lower precision computation engines; and an enhanced format component that generates the defined 16-bit floating-point number format employed by the processor and the calculator component to calculate the binary floating-point numbers.
2. The system of claim 1, wherein the defined 16-bit floating-point number format utilizes six bits in the exponent field and facilitates the machine learning algorithm and deep learning training algorithms, and wherein the exponent field is adjacent a sign field comprising one bit of data representing a sign of the floating-point number.
3. The system of claim 2, wherein the calculator component generates an arbitrary value or symbol in the sign field of the defined 16-bit floating-point number format to reduce hardware complexity and based on a generation of a zero result for a binary floating-point number of the binary floating-point numbers.
4. The system of claim 1, wherein the operation management component also allocates a second portion of the operations of the calculator component and second associated data to a set of higher precision computation engines.
5. The system of claim 1, wherein the set of lower precision computation engines comprises computation engines comprising 16-bit floating-point units, and wherein the set of higher precision computation engines comprises computation engines comprising 32-bit floating-point units or 64-bit floating-point units.
6. The system of claim 1, wherein the defined 16-bit floating-point number format comprises a 1/6/9 format having a single sign bit, a six bit exponent and a nine bit mantissa, wherein the processor employs the defined 16-bit floating-point number format as an arithmetic computation format as well as a data-interchange format.
7. The system of claim 1, wherein the defined 16-bit floating-point number format defines a sign of zero as being a don't care term for selected ones of defined applications, the defined applications comprising deep learning applications or machine learning applications.
8. The system of claim 1, wherein a data point of the first binade has a fraction of all zeros and represents zero, and other data points of the first binade represent the normal numbers.
9. The system of claim 1, wherein the defined 16-bit floating-point number format utilizes a second binade associated with the exponent field having all ones, wherein the defined floating-point number format employs a reduced set of data points in the second binade to represent an infinity value and a not-a-number value, and wherein the reduced set of data points comprises less data points than a set of data points associated with an entirety of the second binade.
10. The system of claim 1, wherein, in accordance with the defined 16-bit floating-point number format, the calculator component represents a sign of a value of zero as a term that indicates that the sign does not matter with respect to the value of zero, wherein the processor generates an arbitrary value in a sign field of the defined 16-bit floating-point number format to represent the term, and wherein the generation of the arbitrary value utilizes less resources than a determination and a generation of a non-arbitrary value for the sign field.
11. The system of claim 1, wherein, in accordance with the defined 16-bit floating-point number format, the calculator component represents a not-a-number value and an infinity value together as a merged symbol, wherein the calculator component represents a sign of the merged symbol as a term that indicates that the sign does not matter with respect to the merged symbol, wherein the processor generates an arbitrary value in a sign field of the defined floating-point number format to represent the term, and wherein the generation of the arbitrary value utilizes less resources than a determination and a generation of a non-arbitrary value for the sign field.
12. The system of claim 1, wherein, in accordance with the defined 16-bit floating-point number format, the calculator component utilizes only one rounding mode to perform rounding values of the binary floating-point numbers, to facilitate enhancing efficiency of the system by reducing hardware utilized to execute the application and to operate on and calculate the binary floating-point numbers, and wherein the one rounding mode is a round-nearest-up mode.
13. A computer-implemented method, comprising: generating, by a system operatively coupled to a processor, respective numerical fields in a defined 16-bit floating-point number format, wherein the respective numerical fields comprise a sign field, an exponent field, and a mantissa field, wherein the defined 16-bit floating-point number format utilizes greater than five bits in the exponent field, and wherein the defined 16-bit floating-point number format utilizes a first binade to represent zero and normal numbers, wherein the first binade is associated with the exponent field having all zeros, and wherein a normal number of the normal numbers is a finite non-zero floating-point number with a magnitude greater than or equal to a minimum value that is determined as a function of a radix and a minimum exponent associated with the defined 16-bit floating-point number format; calculating, by the system, binary floating-point numbers in accordance with the defined 16-bit floating-point number format, in connection with execution of a deep learning application, wherein the defined 16-bit floating-point number format is applied to the deep learning application and results in reduced error rates and improved convergence time; and allocating, by the system, a first portion of operations of the calculator component and associated data to a set of lower precision computation engines.
14. The computer-implemented method of claim 13, wherein the defined 16-bit floating-point number format utilizes six bits in the exponent field that facilitates machine learning algorithms and deep learning training algorithms, and wherein the exponent field is adjacent a sign field comprising one bit of data representing a sign of the floating-point number.
15. The computer-implemented
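The single round-nearest-up mode named in claim 12 can be sketched on an integer significand: add half a unit in the last place, then truncate, so that ties always go up. This is an illustrative sketch only. It assumes the rounding is applied to the magnitude with the sign handled separately, which the claim text does not spell out, and the helper names `round_nearest_up` and `round_significand` are hypothetical.

```python
FRAC_BITS = 9  # nine fraction bits, as in the 1/6/9 format of claim 6


def round_nearest_up(significand, extra_bits):
    """Drop `extra_bits` low bits of a non-negative integer.

    Rounds to the nearest retained value; an exact tie rounds up.
    ASSUMPTION: "up" is applied to the magnitude; the sign is kept elsewhere.
    """
    if extra_bits <= 0:
        return significand << -extra_bits
    half_ulp = 1 << (extra_bits - 1)
    return (significand + half_ulp) >> extra_bits


def round_significand(significand, exponent, extra_bits):
    """Round a normalized significand (hidden bit included) to 1 + FRAC_BITS bits.

    Returns (rounded_significand, exponent). If rounding carries out of the
    top bit, the significand is shifted right and the exponent incremented.
    """
    rounded = round_nearest_up(significand, extra_bits)
    if rounded >> (FRAC_BITS + 1):  # carry-out past the hidden bit
        rounded >>= 1
        exponent += 1
    return rounded, exponent
```

Because no tie-to-even check or mode selection is needed, the rounding step reduces to one addition and one shift, which is in line with the hardware-reduction rationale stated in claim 12. A round-half-to-even scheme would instead map the tie `0b10000000001` (with one extra bit) to `512`, whereas this sketch gives `513`.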
Mantissa overflow or underflow in handling floating-point numbers · CPC title
Rounding towards positive infinity (G06F7/49957 takes precedence) · CPC title
Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers {(G06F7/4806, G06F7/4824, G06F7/49, G06F7/491, G06F7/544 take precedence)} · CPC title