Floating-point numbers

In many calculations, the range of numbers is very large.  Examples:

Mass of an electron: 9x10e-28 grams  to Mass of the sun 2x10e33;  range exceeds10e60

No way we can fit a range of 10e60 values in 8 bits, or even 32 bits. It would take about 200 bits.

 So, we handle this in a computer the same way we do ourselves: scientific notation. Just translate the number into binary and represent it with a sign, an exponent, and a mantissa (fraction). We'll put all three parts in a single word (single precision), or in two words (double precision). The base will be fixed  for all numbers, so there is no need to represent it explicitly.

Basic Format

The assumed base of the exponent is a power of 2, not 10.   The exponent is 1's, 2's, or excess format (later).

The mantissa is typically represented in sign magnitude and normalized so that all of the positions can be used for precision.

Example using base 10:

        6875 x 10e3 represented as 6.875 x 10e6 or .6875 x 10e7

The term floating-point is used because the position of the radix point is adjusted during computation to retain the normal form.

Example format:

|  sign bit |  7-bit exponent |  24-bit fractional mantissa |

Converting decimal fractions into binary fractions

An important difference between real numbers and their representation on the computer is their density. Real numbers form a continuum; between any two real numbers no matter how close there is a real number between them. On a computer, we are using a fixed number of bits, say n, for the fraction field (the mantissa). Only 2eN different bit patterns, so only 2eN different fractions can be represented. Truncation or rounding is performed.

Here is an intuitive way to convert a decimal value into a binary fraction.

Consider what the positions mean:

      d1     d0  .   d-1          d-2           d-3          d-4          d-5         d-6        d-7            d-8

     2e1   2e0 .  2e-1         2e-2        2e-3        2e-4         2e-5      2e-6        2e-7         2e-8
     2       1      .  1/2            1/4            1/8            1/16          1/32      1/64        1/128        1/256
                      .  128/256    64/256    32/256       16/256     8/256    4/256      2/256        1/256

In general, the fractional places will range from 1/2 to 1/(2eN), where N is the number of bits.

Procedure:  express everything as a fraction of 2eN (here, 256), then use our conversion strategy for integers.

Convert .625:      .625 = X/256;    .625 * 256 = 160;  .625 = 160/256.

We want to find which of the digits should be 1, such that those numerators sum to 160.
160 = 128 + 32.   So, .625 decimal = .10100000 binary.

Another example:  .6875 using 4 bits.

The positions:  . 1/2         1/4         1/8         1/16
                            . 8/16       4/16       2/16       1/16

Express .6875 as a fraction of 16:  .6875  = 11/16.

So,  .6875 in 4 bits is:  .1011

So, suppose we want to convert 35.6875 decimal to the floating point representation above.

100011.1011 = .1000111011 * 2e+6.

In our representation above:

| 0  |  0000110  |  1000111011 |

Representing Larger Ranges

Hexadecimal Normalization

Obviously, there is a tradeoff between size (the more bits for the exponent, the larger the range of numbers) and accuracy (the more bits for the mantissa, the more significant digits can be represented).

In 7 bits, with an implied base of 2, we are limited to exponents between -64 to +63.  This is not enough of a range.

If we can help it, we don't want to reduce the size of the mantissa field, because then we will lose accuracy.

What can we do?

Change the implied base to a larger power of 2 (4, 8, or 16, say).

Suppose the base is 16.   Range  roughly 10e-76 to 10e76, while still using 7 bits for the mantissa.

Let's consider our example.

Unnormalized, we had     100011.1011
Normalize it:                 .001000111011 * 16e2

Our floating point representation:
    0  |  0000010  |  001000111011000000000000 |

Shifting the mantissa mantissa to perform normalization must take place in steps of 4-bit shifts.  A representation is now considered to be normalized if any of the leading 4 bits of its mantissa is 1.   This is called Hexadecimal normalization.

So, what is the tradeoff of using the larger base? Extra range at the expense of precision. Even though 24 bits are still used for the mantissa, hex normalization allows the three leading bits of a mantissa to be 0s. So, in some cases, only 21 signficant bits are retained. equires I bits.

Another option:  use base 2 and a phantom bit

If the base is 2, then the most signifiant bit of a normalized mantissa is always 1.  So, why store it?  This way, you can get 24 bits of precision in 23 bits, and use the extra bit for the exponent.

Excess Notation

Instead of representing the exponent in a signed 2's-complement format, represent it in excess 2e(N-1), where N is the number of bits in the exponent. So, in our case, with 7 bits, we have excess-64. An exponent having the signed value E is represented by the value E' = E + 64. So, what is the range? 0 through 127.

This simplifies the circuitry needed for comparing exponents.

Example:   -0.00000101... * 16e3

    Unnormalized:  | 1  |  0000011  |  00000101.... |

    Normalized, base-16 representation:

    | 1  |  0000010  |  01010000.... |

Represented in excess-64, base-16 representation:

   | 1  |  1000010 |   01010000...|


In our example format,   underflow occurs if the exponent < -64; overflow occurs if the exponent > 63.   Events like these are often called exceptions.  A typical way to handle them is to raise an interrupt when they occur.

Floating Point Arithmetic

General procedures for addition, subtraction, multiplication, and division.
(Assuming hex normalized, 24-bit mantissa, base 16 and 7-bit signed exponent in excess 64 format).

Recall in base 10: 2.94 * 10e2 + 4.31 * 10e4  =  4.3394 * 10e4.


  1. Choose the number with the smaller exponent and shift its mantissa right (in 4-bit steps) a number of steps equal to the difference in exponents.
  2. Set the exponent of the result equal to the larger exponent
  3. Perform add/sub on the mantissas and determine the sign of the result
  4. Normalize the resulting value if necessary and use the first 24 bits after the radix point, truncated.

Multiply (Divide) :

  1. Add (sub) the exponents and subtract (add) 64

  2. s + 64 - (y + 64) = s-y + 0
    s + 64 + y + 64 = s + y + 2* 64
  3.  Multiply the mantissas and determine the sign of the result
  4.  Normalize the resulting value if nec. and use the first 24 bits.

IEEE Floating Point Standard

Standard to support portability.  The standard includes the format, operations, and exception handling.  We will consider the data format.

The main criterion was to maximize precision while retaining a sufficiently large range.

Single precision   (32 bits):  1 sign bit |  8-bit exponent   |  23-bit mantissa
Double precision (64 bits):  1 sign bit |  11-bit exponent |  52-bit mantissa

Binary normalization (implied base is 2)
A phantom bit is used, so that 24 (53) bits of precision are represented using 23 (52) bits.
The 8-bit exponent is in excess-127 representation, where E'= 0 and E'=255 are used to represent special values, such as exact 0 and infinity.

E:       -127    |    -126         127   |      ???
            +127   |     +127     +127    |   +127
E'               0   |           1        254     |    255

Actually, a more precise version is the following.  Recall that $FF_us = 255 and $FF_2c = -1:

E:       -128     -127    |    -126        127   |
            +127     +127   |     +127     +127   |
E'           -1             0   |           1        254    |

Where -1 ($FF) and 0 are the end values used to represent special values.