The general name for the Decimal Point is Radix Point
Computers store the sign, exponent and mantissa of a floating-point number
The Mantissa is also called the Fraction or Significand
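As a minimal sketch (Python, assuming the standard binary32 layout of 1 sign bit, 8 exponent bits and 23 mantissa bits), the three stored fields can be pulled out of a float like this; `float32_fields` is just an illustrative helper, not a standard function:

```python
import struct

def float32_fields(x):
    # Reinterpret the 32-bit pattern of a binary32 value as an unsigned int.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign     = (bits >> 31) & 0x1     # 1 sign bit
    exponent = (bits >> 23) & 0xFF    # 8 exponent bits (biased)
    mantissa = bits & 0x7FFFFF        # 23 mantissa (fraction) bits
    return sign, exponent, mantissa

print(float32_fields(-6.5))  # (1, 129, 5242880): -1.101b x 2^2, exponent stored with bias 127
```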
Biasing
It is the process of offsetting numbers in a series by a fixed value
Assume we have 4 bits to store the exponent of a floating-point number
Using 4 bits we can represent 16 unique values
Exponents can be positive and negative so effectively we can represent exponents ranging from -8 to 7
Next we select a bias equal to the magnitude of the most negative exponent (in our case 8) and add it to every number in the series
This will give us a new series with numbers ranging from 0 to 15
Using the new series, negative exponents can also be stored as positive values
Value # | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Signed Range | -8 | -7 | -6 | -5 | -4 | -3 | -2 | -1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
With Bias | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
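A small sketch of this 4-bit biasing scheme (bias = 8); the encode/decode helper names are mine:

```python
BIAS = 8  # magnitude of the most negative exponent for a 4-bit field

def encode_exponent(actual):
    # Shift the signed exponent into the unsigned range 0..15.
    stored = actual + BIAS
    assert 0 <= stored <= 15, "exponent does not fit in 4 bits"
    return stored

def decode_exponent(stored):
    # Undo the offset to recover the signed exponent.
    return stored - BIAS

print(encode_exponent(-3))      # 5
print(decode_exponent(0b1010))  # 2
```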
Assume we have 10 bits to store a floating-point number
1 bit for the sign, 4 bits for the exponent and 5 bits for the mantissa
| Floating Number | Scientific Notation | Sign | Exponent | Mantissa |
|---|---|---|---|---|
| | | 0 | 0111 | 10100 |
| | | 0 | 1010 | 01101 |
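A sketch of packing and unpacking the 10-bit layout above (bit 9 = sign, bits 8–5 = exponent, bits 4–0 = mantissa); `pack10`/`unpack10` are hypothetical helpers, not part of the notes:

```python
def pack10(sign, exponent, mantissa):
    # Place the three fields in a single 10-bit integer.
    return (sign << 9) | (exponent << 5) | mantissa

def unpack10(bits):
    # Split the 10-bit integer back into (sign, exponent, mantissa).
    return (bits >> 9) & 0x1, (bits >> 5) & 0xF, bits & 0x1F

bits = pack10(0, 0b0111, 0b10100)      # first row of the table above
print(f"{bits:010b}", unpack10(bits))  # 0011110100 (0, 7, 20)
```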
Normalization
The process of representing a floating-point number in scientific notation
Explicit Normalization
Move the radix point to the LHS of the most significant 1 in the bit sequence, so the mantissa takes the form 0.1xxx…
Formula: $(-1)^{S} \times 0.M \times 2^{E - \text{bias}}$, where $S$ is the sign bit, $M$ the stored mantissa bits and $E$ the biased exponent
Trailing bits that do not fit in the mantissa field are dropped since the machine does not have space to store them (here the last 1 is lost)
Converting to Decimal: interpret the stored bits as the fraction $0.M$, multiply by $2^{E - \text{bias}}$ and apply the sign
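A sketch of the explicit-form decode for the 10-bit layout (bias 8, 5 mantissa bits); the helper name and the example bit pattern are mine:

```python
def explicit_to_decimal(sign, exponent, mantissa, mantissa_bits=5, bias=8):
    # Stored bits are read as the fraction 0.M (the leading 1 is stored explicitly).
    fraction = mantissa / 2**mantissa_bits
    return (-1)**sign * fraction * 2**(exponent - bias)

print(explicit_to_decimal(0, 0b0111, 0b10100))  # 0.10100b * 2^-1 = 0.3125
```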
Implicit Normalization
Move the radix point to the RHS of the most significant 1 in the bit sequence, so the mantissa takes the form 1.xxx… and the leading 1 is not stored
Formula: $(-1)^{S} \times 1.M \times 2^{E - \text{bias}}$
Implicit normalization allows values to be stored with higher precision, because the implied leading 1 does not consume a mantissa bit
Converting to Decimal: prepend the implied 1 to the stored bits to get $1.M$, multiply by $2^{E - \text{bias}}$ and apply the sign
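The same decode under implicit normalization, where the implied leading 1 is added back because it is never stored (helper name is mine):

```python
def implicit_to_decimal(sign, exponent, mantissa, mantissa_bits=5, bias=8):
    # Stored bits are read as 1.M: the implied leading 1 is prepended here.
    significand = 1 + mantissa / 2**mantissa_bits
    return (-1)**sign * significand * 2**(exponent - bias)

print(implicit_to_decimal(0, 0b0111, 0b10100))  # 1.10100b * 2^-1 = 0.8125
```

With the same 10 stored bits the implicit form represents 1.10100 × 2⁻¹ instead of 0.10100 × 2⁻¹, which is the extra bit of precision mentioned above.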
IEEE 754 Standard
Name | Common Name | Significant bits | Exponent bits | Exponent Bias |
---|---|---|---|---|
binary16 | Half Precision | 11 | 5 | 15 |
binary32 | Single Precision | 24 | 8 | 127 |
binary64 | Double Precision | 53 | 11 | 1023 |
binary128 | Quadruple Precision | 113 | 15 | 16383 |
binary256 | Octuple Precision | 237 | 19 | 262143 |
Significant Bits: the explicitly stored Mantissa bits plus the implicit leading 1 (the sign bit is counted separately); e.g. binary32 stores 23 mantissa bits, giving 24 significant bits
Programming languages implement Single and Double Precision Floats
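As a sketch of the practical difference, Python's struct module can round a value through the binary32 ('f') and binary64 ('d') layouts; Python's own float is already double precision:

```python
import struct

x = 0.1
as_single = struct.unpack(">f", struct.pack(">f", x))[0]  # rounded to binary32
as_double = struct.unpack(">d", struct.pack(">d", x))[0]  # kept as binary64

print(f"{as_single:.20f}")  # 0.10000000149011611938
print(f"{as_double:.20f}")  # 0.10000000000000000555
```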
When 5 bits are reserved for the exponent we have 32 unique combinations (0-31)
If we consider signed numbers as well then the range becomes -16 to 15
In the IEEE 754 standard the exponent patterns all 0s and all 1s are reserved for special values
With the half-precision bias of 15, the remaining stored values 1-30 give an effective exponent range of -14 to 15
Exponent | Mantissa | Represents |
---|---|---|
All 0s | All 0s | Zero |
All 1s | All 0s | Infinity |
Any other value | Any value | Normal number (Implicit Normal Form) |
All 0s | Non-zero | Subnormal number (Fractional Form) |
All 1s | Non-zero | NaN (used for exception handling) |
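A sketch of the reserved binary32 bit patterns from the table above; `f32_from_bits` is an assumed helper that reinterprets a 32-bit integer as a float:

```python
import struct

def f32_from_bits(bits):
    # Reinterpret a 32-bit unsigned integer as a binary32 float.
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(f32_from_bits(0x00000000))  # exponent all 0s, mantissa all 0s -> 0.0
print(f32_from_bits(0x00000001))  # exponent all 0s, mantissa non-zero -> smallest subnormal (~1.4e-45)
print(f32_from_bits(0x7F800000))  # exponent all 1s, mantissa all 0s -> inf
print(f32_from_bits(0x7FC00000))  # exponent all 1s, mantissa non-zero -> nan
```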
Precision
Decimal Precision: the number of significant decimal digits a binary format can represent, roughly significand bits × log10(2)
Single Precision Floats: 24 × log10(2) ≈ 7.2, i.e. about 7 significant decimal digits
Double Precision Floats: 53 × log10(2) ≈ 15.9, i.e. about 15–16 significant decimal digits
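A quick check of the digit counts above, using the rule of thumb decimal digits ≈ significand bits × log10(2):

```python
import math

for name, significand_bits in [("single", 24), ("double", 53)]:
    digits = significand_bits * math.log10(2)
    print(f"{name}: {digits:.2f} decimal digits")  # ~7.22 for single, ~15.95 for double
```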
What is the difference between float and double? - Stack Overflow