Mass of an electron: 9 x 10e-28 grams. Mass of the sun: 2 x 10e33 grams. The range exceeds 10e60.
No way we can fit a range of 10e60 values in 8 bits, or even 32 bits. It would take about 200 bits.
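As a rough check of the 200-bit estimate, here is a small Python sketch. It assumes we want a plain binary integer with one distinct step per unit across the whole range; the variable names are illustrative only.

```python
import math

# Masses quoted above, in grams.
electron_mass_g = 9e-28
sun_mass_g = 2e33

# Ratio between the largest and smallest value we want to represent.
dynamic_range = sun_mass_g / electron_mass_g

# Counting N distinct steps with a plain binary integer takes about log2(N) bits.
bits_for_masses = math.ceil(math.log2(dynamic_range))
bits_for_1e60 = math.ceil(60 * math.log2(10))

print(bits_for_masses, bits_for_1e60)
```

Both figures land at roughly 200 bits, far more than fits in a single machine word.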
So, we handle this in a computer the same way we do ourselves: scientific notation. Just translate the number into binary and represent it with a sign, an exponent, and a mantissa (fraction). We'll put all three parts in a single word (single precision), or in two words (double precision). The base will be fixed for all numbers, so there is no need to represent it explicitly.
The mantissa is typically represented in sign magnitude and normalized so that all of the positions can be used for precision.
Example using base 10:
6875 x 10e3 represented as 6.875 x 10e6 or .6875 x 10e7
The term floating-point is used because the position of the radix point is adjusted during computation to retain the normal form.
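The base-10 normalization above can be sketched in Python. This is only an illustration: the function name normalize_base10 is made up for this example, it assumes a finite decimal string as input, and it uses the fraction-style normal form (.6875 x 10e7) from the example.

```python
from decimal import Decimal

def normalize_base10(text):
    """Return (sign, digits, exponent) with value = (-1)**sign * 0.digits * 10**exponent."""
    sign, digits, exponent = Decimal(text).as_tuple()
    if not any(digits):
        return sign, "0", 0  # zero has no normalized form
    # Moving the radix point to the front of the digit string
    # raises the exponent by the number of digits.
    return sign, "".join(map(str, digits)), exponent + len(digits)

print(normalize_base10("6875E3"))
print(normalize_base10("6.875E6"))
```

Both inputs describe the same number, so both give the same digits and the same exponent.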
| sign bit | 7-bit exponent | 24-bit fractional mantissa |
Here is an intuitive way to convert a decimal value into a binary fraction.
Consider what the positions mean:
  d1    d0   .  d-1      d-2     d-3     d-4     d-5     d-6     d-7     d-8
  2e1   2e0  .  2e-1     2e-2    2e-3    2e-4    2e-5    2e-6    2e-7    2e-8
  2     1    .  1/2      1/4     1/8     1/16    1/32    1/64    1/128   1/256
             .  128/256  64/256  32/256  16/256  8/256   4/256   2/256   1/256
In general, the fractional places will range from 1/2 to 1/(2eN), where N is the number of bits.
Procedure: express everything as a fraction of 2eN (here, 256), then use our conversion strategy for integers.
Convert .625: .625 = X/256; .625 * 256 = 160; .625 = 160/256.
We want to find which of the digits should be 1, such that those numerators
sum to 160.
160 = 128 + 32. So, .625 decimal = .10100000 binary.
Another example: .6875 using 4 bits.
The positions:  .  1/2   1/4   1/8   1/16
                .  8/16  4/16  2/16  1/16
Express .6875 as a fraction of 16: .6875 = 11/16, and 11 = 8 + 2 + 1.
So, .6875 in 4 bits is: .1011
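The procedure used in both examples can be sketched as follows. The function name fraction_to_binary is hypothetical, and the sketch assumes a value in the range 0 <= value < 1; any bits beyond the requested width are simply truncated.

```python
def fraction_to_binary(value, n_bits):
    """Return the first n_bits binary fraction digits of 0 <= value < 1."""
    if not 0 <= value < 1:
        raise ValueError("value must be a fraction in [0, 1)")
    # Express the value as numerator / 2**n_bits, e.g. .625 = 160/256.
    numerator = int(value * 2 ** n_bits)
    # Convert the numerator like any integer, padded to n_bits digits.
    return "." + format(numerator, f"0{n_bits}b")

print(fraction_to_binary(0.625, 8))
print(fraction_to_binary(0.6875, 4))
```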
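Putting the integer and fraction parts together: 35.6875 decimal = 100011.1011 binary = .1000111011 * 2e+6.
In our representation above (mantissa padded with zeros to 24 bits):
| 0 | 0000110 | 100011101100000000000000 |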
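Here is a minimal sketch of filling the three fields for the value 35.6875 (100011.1011 in binary). It relies on Python's math.frexp, which returns a fraction m with 0.5 <= m < 1 and an exponent e such that the value equals m * 2**e, which is exactly the normalized form used here. The function name encode_fields is made up, and storing negative exponents in two's complement is an assumption of this sketch.

```python
import math

EXP_BITS = 7
MANT_BITS = 24

def encode_fields(value):
    """Return (sign, exponent, mantissa) bit strings for a nonzero value."""
    sign = "1" if value < 0 else "0"
    # abs(value) = m * 2**e with 0.5 <= m < 1, i.e. a fraction .1xxxx
    m, e = math.frexp(abs(value))
    if not -(2 ** (EXP_BITS - 1)) <= e < 2 ** (EXP_BITS - 1):
        raise OverflowError("exponent does not fit in 7 bits")
    # Assumption: negative exponents are stored in two's complement.
    exponent = format(e & (2 ** EXP_BITS - 1), f"0{EXP_BITS}b")
    # Keep the first 24 fraction bits; anything beyond is truncated.
    mantissa = format(int(m * 2 ** MANT_BITS), f"0{MANT_BITS}b")
    return sign, exponent, mantissa

print(" | ".join(encode_fields(35.6875)))
```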
Obviously, there is a tradeoff between size (the more bits for the exponent, the larger the range of numbers) and accuracy (the more bits for the mantissa, the more significant digits can be represented).
In 7 bits, with an implied base of 2, we are limited to exponents between -64 and +63. This is not enough of a range.
If we can help it, we don't want to reduce the size of the mantissa field, because then we will lose accuracy.
What can we do?
Change the implied base to a larger power of 2 (4, 8, or 16, say).
Suppose the base is 16. The range is then roughly 10e-76 to 10e76, while still using 7 bits for the exponent.
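The effect of the base on the range can be estimated with a few lines of Python. This sketch looks only at the exponent factor (base raised to -64 or +63) and ignores the mantissa, so the results are order-of-magnitude figures and can differ by a decade or so from the rounded values quoted above. The helper name decimal_decades is illustrative.

```python
import math

EXP_BITS = 7
E_MIN = -(2 ** (EXP_BITS - 1))      # -64
E_MAX = 2 ** (EXP_BITS - 1) - 1     # +63

def decimal_decades(base):
    """Return log10 of base**E_MIN and base**E_MAX."""
    return E_MIN * math.log10(base), E_MAX * math.log10(base)

for base in (2, 16):
    low, high = decimal_decades(base)
    print(f"base {base:2d}: about 10e{low:.0f} to 10e{high:.0f}")
```

With base 2 the same 7 bits only reach about 19 decimal orders of magnitude in each direction, which is why the larger base is attractive.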
Let's consider our example.
Unnormalized, we had 100011.1011
Normalize it: .001000111011 * 16e2
Our floating point representation:
| 0 | 0000010 | 001000111011000000000000 |
Shifting the mantissa to perform normalization must now take place in steps of 4 bits. A representation is considered to be normalized if any of the leading 4 bits of its mantissa is 1. This is called hexadecimal normalization.
So, what is the tradeoff of using the larger base? Extra range at the expense of precision. Even though 24 bits are still used for the mantissa, hex normalization allows the three leading bits of a mantissa to be 0s. So, in some cases, only 21 significant bits are retained.
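A small sketch of hexadecimal normalization and of the precision it can cost. The mantissa is held as a 24-bit integer and the exponent is a power of 16; both function names are made up for this illustration.

```python
MANT_BITS = 24

def hex_normalize(mantissa, exponent):
    """Shift a 24-bit mantissa left one hex digit at a time until its
    leading hex digit is nonzero; the base-16 exponent drops by 1 per shift."""
    if mantissa == 0:
        return mantissa, exponent  # zero cannot be normalized
    while mantissa >> (MANT_BITS - 4) == 0:
        mantissa = (mantissa << 4) & (2 ** MANT_BITS - 1)
        exponent -= 1
    return mantissa, exponent

def significant_bits(mantissa):
    """Bits from the first 1 to the end of the 24-bit field."""
    return mantissa.bit_length()

# 35.6875 written as .0023B0 (hex) * 16e4, then normalized.
m, e = hex_normalize(0x0023B0, 4)
print(hex(m), e, significant_bits(m))
```

The normalized mantissa here starts with two 0 bits, so it carries 22 significant bits. A mantissa whose leading hex digit is 1 (for example 0x100000) carries only 21.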
Another option: use base 2 and a phantom bit
If the base is 2, then the most significant bit of a normalized mantissa
is always 1. So, why store it? This way, you can get 24 bits
of precision in 23 bits, and use the extra bit for the exponent.
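The phantom bit idea can be sketched like this, using the binary-normalized mantissa .1000111011 padded to 24 bits. The helper names are hypothetical.

```python
MANT_BITS = 24

def hide_leading_bit(mantissa):
    """Drop the always-1 top bit of a binary-normalized 24-bit mantissa."""
    if mantissa >> (MANT_BITS - 1) != 1:
        raise ValueError("mantissa is not binary-normalized")
    return mantissa & (2 ** (MANT_BITS - 1) - 1)

def restore_leading_bit(stored):
    """Put the implied 1 back in front of the 23 stored bits."""
    return stored | (1 << (MANT_BITS - 1))

full = 0b100011101100000000000000   # .1000111011 padded to 24 bits
stored = hide_leading_bit(full)
print(format(stored, "023b"))
print(restore_leading_bit(stored) == full)
```

Only 23 bits are stored, yet all 24 bits of precision are recovered when the value is read back.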
Storing the exponent in excess (biased) notation simplifies the circuitry needed for comparing exponents.
Example: -0.00000101... * 16e3
Unnormalized: | 1 | 0000011 | 00000101.... |
Normalized, base-16 representation:
| 1 | 0000010 | 01010000.... |
Represented in excess-64, base-16 representation:
| 1 | 1000010 | 01010000.... |
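The excess-64 step in this example amounts to adding 64 to the true exponent before storing it. A minimal sketch, with made-up function names, also shows that the stored fields sort in the same order as the exponents they stand for, so they can be compared as plain unsigned numbers:

```python
BIAS = 64
EXP_BITS = 7

def to_excess64(exponent):
    """Encode a true exponent in -64..63 as a 7-bit excess-64 field."""
    if not -BIAS <= exponent < BIAS:
        raise ValueError("exponent out of range for 7 bits")
    return format(exponent + BIAS, f"0{EXP_BITS}b")

def from_excess64(field):
    """Recover the true exponent from a 7-bit excess-64 field."""
    return int(field, 2) - BIAS

print(to_excess64(2))

# Biased fields keep the ordering of the exponents they encode.
exponents = [-64, -3, 0, 2, 63]
fields = [to_excess64(e) for e in exponents]
print(fields == sorted(fields))
```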
Recall in base 10: 2.94 * 10e2 + 4.31 * 10e4 = .0294 * 10e4 + 4.31 * 10e4 = 4.3394 * 10e4. The exponents must be made equal before the mantissas can be added.
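The base-10 addition above can be sketched as align, add, renormalize. This is only an illustration of the idea, not a description of any particular hardware: it uses Python's Fraction type to keep the arithmetic exact, follows the d.ddd normal form of the example, and the function name add_sci is made up.

```python
from fractions import Fraction

def add_sci(m1, e1, m2, e2, base=10):
    """Add m1 * base**e1 and m2 * base**e2; return (m, e) with 1 <= |m| < base."""
    # Make operand 1 the one with the larger exponent.
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    # Align: move the smaller operand's radix point left.
    m2 = m2 / Fraction(base) ** (e1 - e2)
    m, e = m1 + m2, e1
    # Renormalize the result if the sum left the normal form.
    while abs(m) >= base:
        m, e = m / base, e + 1
    while m != 0 and abs(m) < 1:
        m, e = m * base, e - 1
    return m, e

m, e = add_sci(Fraction("2.94"), 2, Fraction("4.31"), 4)
print(float(m), e)
```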
Multiply (Divide) :
For the IEEE 754 standard formats below, the main criterion was to maximize precision while retaining a sufficiently large range.
Single precision (32 bits): 1 sign bit | 8-bit exponent | 23-bit mantissa
Double precision (64 bits): 1 sign bit | 11-bit exponent | 52-bit mantissa
Binary normalization (implied base is 2)
A phantom bit is used, so that 24 (53) bits of precision are represented using 23 (52) bits.
The 8-bit exponent is in excess-127 representation (the stored value is E' = E + 127), where E' = 0 and E' = 255 are used to represent special values, such as exact 0 and infinity.
E:    -127 |  -126 ...  127 |  128
      +127 |  +127 ... +127 | +127
E':      0 |     1 ...  254 |  255
Actually, a more precise version is the following. Recall that $FF is 255 when read as an unsigned value and -1 when read as a two's complement value:
E:    -128  -127 |  -126 ...  127 |
      +127  +127 |  +127 ... +127 |
E':     -1     0 |     1 ...  254 |
Where -1 ($FF) and 0 are the end values used to represent special values.
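To see the single-precision fields in practice, here is a sketch that uses Python's struct module, whose "f" format packs a value as a 32-bit IEEE 754 single. The function names decode_single and rebuild are made up for this example. Note that in this format the phantom bit sits in front of the radix point, so the mantissa reads 1.fraction.

```python
import struct

def decode_single(value):
    """Split a float, rounded to 32-bit single precision, into (sign, E', fraction)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31
    e_stored = (bits >> 23) & 0xFF       # E' = E + 127
    fraction = bits & 0x7FFFFF           # 23 stored bits, phantom bit omitted
    return sign, e_stored, fraction

def rebuild(sign, e_stored, fraction):
    """Recompute the value for ordinary exponents, 1 <= E' <= 254."""
    if not 1 <= e_stored <= 254:
        raise ValueError("E' = 0 and E' = 255 encode special values")
    mantissa = 1 + fraction / 2 ** 23    # restore the phantom bit: 1.fraction
    return (-1) ** sign * mantissa * 2.0 ** (e_stored - 127)

sign, e_stored, fraction = decode_single(35.6875)
print(sign, format(e_stored, "08b"), format(fraction, "023b"))
print(rebuild(sign, e_stored, fraction))
```

For 35.6875 = 1.000111011 * 2e5 the stored exponent is 5 + 127 = 132. Decoding 0.0 gives E' = 0 and decoding infinity gives E' = 255, the two reserved end values.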