Learners Only: Floating point representation

The floating-point representation of a number has two parts. The first part represents a signed, fixed-point number called the mantissa. The second part designates the position of the decimal (or binary) point and is called the exponent. The fixed-point mantissa may be a fraction or an integer. For example, the decimal number +6132.789 is represented in floating point with a fraction and an exponent as follows:

Fraction Exponent

+0.6132789 +04

This representation is equivalent to the scientific notation +0.6132789*10^+4.

Floating point is always interpreted to represent a number in the following form:

m * r^e

Only the mantissa m and the exponent e are physically represented in the register. The radix r and the radix-point position of the mantissa are always assumed.

A floating-point binary number is represented in a similar manner, except that it uses base 2 for the exponent. For example, the binary number +1001.11 is represented with an 8-bit fraction and a 6-bit exponent as follows:

Fraction Exponent

01001110 000100

The fraction has a 0 in the leftmost position to denote positive. The binary point of the fraction follows the sign bit but is not shown in the register. The exponent holds the binary equivalent of +4. The floating-point number is equivalent to

m * 2^e = +(.1001110)_2 * 2^+4
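The register layout above can be checked with a short Python sketch. The function name `decode_float` is hypothetical, and the sketch assumes the exponent is stored as a sign bit followed by a magnitude, which the text does not state (it only shows the positive exponent +4).

```python
from fractions import Fraction

def decode_float(fraction_bits: str, exponent_bits: str) -> Fraction:
    """Decode a sign-plus-fraction mantissa and an exponent into m * 2^e."""
    sign = -1 if fraction_bits[0] == "1" else 1
    # The binary point is assumed to sit right after the sign bit,
    # so the remaining bits form a pure fraction.
    magnitude = Fraction(int(fraction_bits[1:], 2), 2 ** (len(fraction_bits) - 1))
    # Assumption: the exponent is a sign bit followed by a magnitude.
    exp_sign = -1 if exponent_bits[0] == "1" else 1
    exponent = exp_sign * int(exponent_bits[1:], 2)
    return sign * magnitude * Fraction(2) ** exponent

value = decode_float("01001110", "000100")
print(value, float(value))  # 39/4 9.75
```

The result 9.75 is the decimal value of the binary number 1001.11, as expected.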

A floating-point number is said to be normalized if the most significant digit of the mantissa is nonzero. For example, the decimal number 350 is normalized but 00035 is not. Regardless of where the position of the radix point is assumed to be in the mantissa, the number is normalized only if its leftmost digit is nonzero. For example, the 8-bit binary number 00011010 is not normalized because of the three leading 0’s. The number can be normalized by shifting it three positions to the left and discarding the leading 0’s to obtain 11010000. The three shifts multiply the number by 2^3 = 8. Normalized numbers provide the maximum possible precision for the floating-point number.
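The left-shift normalization described above can be sketched in Python as follows. The helper name `normalize` is hypothetical, and the mantissa is treated as a plain bit string without a sign bit.

```python
def normalize(bits: str) -> tuple:
    """Shift left until the leading bit is 1; return (normalized bits, shift count)."""
    if "1" not in bits:
        return bits, 0  # an all-zero mantissa cannot be normalized
    shift = len(bits) - len(bits.lstrip("0"))
    # Drop the leading 0's and pad with 0's on the right to keep the width.
    return bits[shift:] + "0" * shift, shift

print(normalize("00011010"))  # ('11010000', 3)
```

Because each left shift multiplies the mantissa by 2, a floating-point unit would decrease the exponent by the shift count so that the represented value stays the same.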

Fixed Point Representation

A number may have a binary point. The position of the binary point is needed to represent fractions, integers, or mixed integer-fraction numbers. There are two ways of specifying the position of the binary point in a register: by giving it a fixed position or by employing a floating-point representation. The fixed-point method assumes that the binary point is always fixed in one position. The two positions most widely used are

(1) a binary point in the extreme left of the register to make the stored number a fraction, and

(2) a binary point in the extreme right of the register to make the stored number an integer.
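As a rough illustration of the two conventions, the same register contents can be read either way. This sketch ignores the sign bit and treats the register as an unsigned magnitude, which is a simplification; the function names are hypothetical.

```python
def as_integer(bits: str) -> int:
    """Binary point at the extreme right: the bits form an integer."""
    return int(bits, 2)

def as_fraction(bits: str) -> float:
    """Binary point at the extreme left: the bits form a fraction below 1."""
    return int(bits, 2) / 2 ** len(bits)

bits = "1101"
print(as_integer(bits))   # 13
print(as_fraction(bits))  # 0.8125
```

The stored bits are identical in both cases; only the assumed position of the binary point changes the value.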

Integer Representation

When an integer binary number is positive, the sign is represented by 0 and the magnitude by a positive binary number. When the number is negative, the sign is represented by 1 but the rest of the number may be represented in one of three possible ways:

Signed-magnitude representation.

Signed-1’s complement representation.

Signed-2’s complement representation.

As an example, consider the signed number 14 stored in an 8-bit register. +14 is represented by a sign bit of 0 in the leftmost position followed by the binary equivalent of 14: 00001110. There are three different ways to represent -14 with eight bits.

In signed-magnitude representation        1 0001110

In signed-1’s complement representation   1 1110001

In signed-2’s complement representation   1 1110010
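The three bit patterns for -14 can be reproduced with a small Python sketch. The function names are hypothetical, and no range checking is done, so the caller is assumed to pass values that fit in the chosen width.

```python
def signed_magnitude(value: int, bits: int = 8) -> str:
    """Sign bit followed by the magnitude in bits - 1 positions."""
    sign = "1" if value < 0 else "0"
    return sign + format(abs(value), f"0{bits - 1}b")

def ones_complement(value: int, bits: int = 8) -> str:
    """Negative numbers: complement every bit of the positive pattern."""
    if value >= 0:
        return format(value, f"0{bits}b")
    return format(~abs(value) & (2**bits - 1), f"0{bits}b")

def twos_complement(value: int, bits: int = 8) -> str:
    """Negative numbers: 1's complement plus 1 (same as masking to the width)."""
    return format(value & (2**bits - 1), f"0{bits}b")

print(signed_magnitude(-14))  # 10001110
print(ones_complement(-14))   # 11110001
print(twos_complement(-14))   # 11110010
```

All three functions return 00001110 for +14, since the representations differ only for negative numbers.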

Arithmetic Addition

The addition of two numbers in the signed-magnitude system follows the rules of ordinary arithmetic. If the signs are the same, we add the two magnitudes and give the sum the common sign. If the signs are different, we subtract the smaller magnitude from the larger and give the result the sign of the larger magnitude.
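The signed-magnitude rule can be sketched in Python as below. Representing each number as a `(sign, magnitude)` pair and returning +0 when the magnitudes cancel are choices made for this illustration, not something the text specifies.

```python
def add_signed_magnitude(a: tuple, b: tuple) -> tuple:
    """Add two (sign, magnitude) pairs, where sign 0 means plus and 1 means minus."""
    sign_a, mag_a = a
    sign_b, mag_b = b
    if sign_a == sign_b:
        # Same signs: add the magnitudes and keep the common sign.
        return sign_a, mag_a + mag_b
    if mag_a == mag_b:
        return 0, 0  # assumption: a zero result is given a plus sign
    # Different signs: subtract the smaller magnitude from the larger
    # and take the sign of the larger magnitude.
    if mag_a > mag_b:
        return sign_a, mag_a - mag_b
    return sign_b, mag_b - mag_a

print(add_signed_magnitude((0, 25), (1, 37)))  # (1, 12)
```

Here (+25) + (-37) gives -(37 - 25) = -12. Note that the procedure needs a comparison and possibly a subtraction.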

By contrast, the rule for adding numbers in the signed-2’s complement system does not require a comparison or subtraction, only addition and complementation. Add the two numbers, including their sign bits, and discard any carry out of the sign (leftmost) bit position. If the sum is negative, it is already in 2’s complement form.

Ex:

 +6   00000110         -6   11111010

+13   00001101        +13   00001101

---   --------        ---   --------

+19   00010011         +7   00000111

 +6   00000110         -6   11111010

-13   11110011        -13   11110011

---   --------        ---   --------

 -7   11111001        -19   11101101
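The four additions in the example can be verified with a minimal Python sketch. The helper names are hypothetical; masking the sum to the register width plays the role of discarding the carry out of the sign bit.

```python
def add_twos_complement(a_bits: str, b_bits: str) -> str:
    """Add two equal-width 2's complement bit strings, discarding the end carry."""
    n = len(a_bits)
    total = int(a_bits, 2) + int(b_bits, 2)
    return format(total & (2**n - 1), f"0{n}b")  # the mask drops the carry out

def to_int(bits: str) -> int:
    """Interpret a bit string as a signed 2's complement number."""
    value = int(bits, 2)
    return value - 2 ** len(bits) if bits[0] == "1" else value

result = add_twos_complement("11111010", "11110011")  # (-6) + (-13)
print(result, to_int(result))  # 11101101 -19
```

No comparison of signs or magnitudes is needed; the same addition handles all four sign combinations.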

Arithmetic Subtraction

Subtraction of two signed binary numbers when negative numbers are in 2’s complement form is simple: take the 2’s complement of the subtrahend (including the sign bit) and add it to the minuend (including the sign bit). A carry out of the sign bit position is discarded.

This procedure stems from the fact that a subtraction operation can be changed to an addition operation if the sign of the subtrahend is changed. This is demonstrated by the following relationships:

(+A) - (+B) = (+A) + (-B)

(+A) - (-B) = (+A) + (+B)

Consider the subtraction (-6) - (-13) = +7.

In binary with eight bits, this is written as 11111010 - 11110011.

The subtraction is changed to addition by taking the 2’s complement of the subtrahend (-13) to give (+13).

In binary this is 11111010 + 00001101 = 100000111.

Removing the end carry, we obtain the correct answer 00000111 (+7).
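The worked subtraction can be sketched in Python as follows. The function names `negate` and `subtract` are hypothetical, and both operands are assumed to be bit strings of the same width.

```python
def negate(bits: str) -> str:
    """Return the 2's complement of a bit string of the same width."""
    n = len(bits)
    return format(-int(bits, 2) & (2**n - 1), f"0{n}b")

def subtract(minuend: str, subtrahend: str) -> str:
    """Subtract by adding the 2's complement of the subtrahend."""
    n = len(minuend)
    total = int(minuend, 2) + int(negate(subtrahend), 2)
    return format(total & (2**n - 1), f"0{n}b")  # discard the end carry

print(negate("11110011"))                # 00001101
print(subtract("11111010", "11110011"))  # 00000111
```

The first line shows -13 turning into +13, and the second reproduces (-6) - (-13) = +7.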

Overflow

When two numbers of n digits each are added and the sum occupies n + 1 digits, we say that an overflow has occurred. An overflow is a problem in digital computers because the width of a register is finite. A result that contains n + 1 bits cannot be accommodated in a register with a standard length of n bits. For this reason, many computers detect the occurrence of an overflow, and when it occurs, a corresponding flip-flop is set which can then be checked by the user.

The detection of an overflow after the addition of two binary numbers depends on whether the numbers are considered to be signed or unsigned. When two unsigned numbers are added, an overflow is detected from the end carry out of the most significant position. In the case of signed numbers, the leftmost bit always represents the sign, and negative numbers are in 2’s complement form. When two signed numbers are added, the sign bit is treated as part of the number and the end carry does not indicate an overflow. Instead, an overflow is detected by comparing the carry into the sign bit position with the carry out of it: if the two carries differ, an overflow has occurred.

Ex:

Carries:  0 1              Carries:  1 0

 +70      01000110          -70      10111010

 +80      01010000          -80      10110000

----      --------         ----      --------

+150      10010110         -150      01101010
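One common way to detect signed overflow is to compare the carry into the sign bit with the carry out of it. In the example, the carries row lists the carry out of the sign bit first and the carry into it second. The sketch below computes both carries; the function name and the returned tuple layout are choices made for this illustration.

```python
def add_with_overflow(a_bits: str, b_bits: str) -> tuple:
    """Return (sum bits, carry into sign bit, carry out of sign bit, overflow flag)."""
    n = len(a_bits)
    a, b = int(a_bits, 2), int(b_bits, 2)
    low_mask = 2 ** (n - 1) - 1  # selects every bit except the sign bit
    # Carry into the sign position comes from adding the lower n - 1 bits.
    carry_in = ((a & low_mask) + (b & low_mask)) >> (n - 1)
    # Carry out of the sign position is bit n of the full sum.
    carry_out = (a + b) >> n
    result = format((a + b) & (2**n - 1), f"0{n}b")
    return result, carry_in, carry_out, carry_in != carry_out

print(add_with_overflow("01000110", "01010000"))  # ('10010110', 1, 0, True)
print(add_with_overflow("10111010", "10110000"))  # ('01101010', 0, 1, True)
```

Both sums overflow because 150 and -150 do not fit in eight signed bits, whose range is -128 to +127. For unsigned operands, the carry out alone would be the overflow indicator.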

Monday 26 December 2011
