Representation: bits to the right of the binary point have weights 1/2, 1/4, 1/8, ...
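A minimal sketch of how those fractional weights add up. The helper name `binary_fraction_value` is made up for illustration; it uses exact fractions so no rounding hides the result.

```python
from fractions import Fraction

def binary_fraction_value(bits: str) -> Fraction:
    """Exact value of a binary string such as '101.11' (hypothetical helper)."""
    whole, _, frac = bits.partition(".")
    value = Fraction(int(whole, 2) if whole else 0)
    # The i-th bit after the binary point has weight 1 / 2^i.
    for i, bit in enumerate(frac, start=1):
        if bit == "1":
            value += Fraction(1, 2 ** i)
    return value

five_and_three_quarters = binary_fraction_value("101.11")  # 4 + 1 + 1/2 + 1/4
seven_eighths = binary_fraction_value("0.111")             # 1/2 + 1/4 + 1/8
```

Only values of the form x / 2^k can be written exactly this way, so a decimal such as 0.1 has no finite binary representation.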
Float: (-1)^s * M * 2^E
Sign bit (s): whether the number is negative (s = 1) or positive (s = 0)
Exponent (E): weights the value by a power of 2
Significand (M): fractional value, normally in [1.0, 2.0)
Single precision: 32 bits = 1 + 8 + 23
7 decimal digits
10^(+-38)
Double precision: 64 bits = 1 + 11 + 52
16 decimal digits
10^(+-308)
Normalized: exponent neither all 0s nor all 1s
Bias(constant) = 2^{k-1} - 1
exp: the actual bits stored as exponent
[1, 2^{k}-2]
float: [1, 254]
double: [1, 2046]
other: [1, 2^k - 2]
increases as the magnitude of the represented number increases
E = exp(unsigned) - Bias
[-(2^{k-1}-2), 2^{k-1}-1]
float: [-126, 127]
double: [-1022, 1023]
frac: the actual bits stored as significand
M = 1.frac
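The normalized case can be sketched for single precision as follows. The function names are illustrative, and Python's `struct` module is used only to get at the 32 bits; this is a sketch, not a full decoder.

```python
import struct

def decode_float32(x: float):
    """Split x (rounded to single precision) into the fields s, exp, frac."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31              # 1 sign bit
    exp = (bits >> 23) & 0xFF   # 8 exponent bits, read as unsigned
    frac = bits & 0x7FFFFF      # 23 fraction bits
    return s, exp, frac

def normalized_value(s: int, exp: int, frac: int) -> float:
    """Rebuild the value of a normalized float from its fields."""
    assert 1 <= exp <= 254, "only valid for normalized numbers"
    bias = 2 ** (8 - 1) - 1     # 127 for k = 8
    E = exp - bias
    M = 1 + frac / 2 ** 23      # M = 1.frac
    return (-1) ** s * M * 2.0 ** E

s, exp, frac = decode_float32(15213.0)   # 15213 = 1.1101101101101_2 * 2^13
rebuilt = normalized_value(s, exp, frac)
```

For 15213.0 the stored exponent is 13 + 127 = 140, and the fraction bits are 1101101101101 followed by ten 0s.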
Denormalized: exponent all 0s
E(constant) = 1 - Bias = 1 - (2^{k-1} - 1) = -(2^{k-1}-2)
float: exp = 0 would naively give E = 0 - 127 = -127
instead E is computed as if exp = 1, so E = 1 - 127 = -126, the same as the smallest normalized exponent
M = 0.frac
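A small sketch of why E = 1 - Bias is used: it makes the largest denormalized value sit exactly one step below the smallest normalized value. The helper `float32_from_fields` is a made-up name and assumes single precision.

```python
import struct

def float32_from_fields(s: int, exp: int, frac: int) -> float:
    """Assemble a single-precision value from raw field values."""
    bits = (s << 31) | (exp << 23) | frac
    (x,) = struct.unpack(">f", struct.pack(">I", bits))
    return x

smallest_denorm = float32_from_fields(0, 0, 1)         # M = 2^-23, E = -126
largest_denorm = float32_from_fields(0, 0, 0x7FFFFF)   # M = 1 - 2^-23, E = -126
smallest_norm = float32_from_fields(0, 1, 0)           # M = 1.0, E = -126
gap = smallest_norm - largest_denorm                   # one denormalized step
```

The gap equals the smallest denormalized value, so the spacing stays uniform across the boundary.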
Special: exponent all 1s
infinity: exp = all 1s and frac = all 0s
NaN: exp = all 1s and frac != all 0s
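The three cases can be told apart from the exponent field alone, with frac only needed to separate infinity from NaN. A sketch for single precision, with illustrative function names:

```python
import math
import struct

def float32_bits(x: float) -> int:
    """Raw 32-bit pattern of x rounded to single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits

def classify(bits: int) -> str:
    """Name the encoding case of a 32-bit pattern."""
    exp = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    if exp == 0xFF:                       # exponent all 1s
        return "infinity" if frac == 0 else "NaN"
    if exp == 0:                          # exponent all 0s (includes +0 and -0)
        return "denormalized"
    return "normalized"

kinds = [classify(float32_bits(v)) for v in (1.0, 0.0, math.inf, math.nan)]
```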
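bit patterns can almost be compared as unsigned integers
but must first compare sign bits
must treat -0 and +0 as equal
NaN is unordered: as an unsigned integer its bit pattern is greater than infinity's, but every ordered floating-point comparison involving NaN is false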
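The comparison rules above can be sketched as a less-than test that works only on bit patterns. This assumes single precision; `float32_less_than` is a made-up name and is not how hardware implements comparison.

```python
import math
import struct

def float32_bits(x: float) -> int:
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits

def is_nan_bits(bits: int) -> bool:
    return (bits >> 23) & 0xFF == 0xFF and bits & 0x7FFFFF != 0

def float32_less_than(a: int, b: int) -> bool:
    """a < b, decided from the two 32-bit patterns alone."""
    if is_nan_bits(a) or is_nan_bits(b):
        return False                      # NaN is unordered
    a_sign, b_sign = a >> 31, b >> 31
    a_mag, b_mag = a & 0x7FFFFFFF, b & 0x7FFFFFFF
    if a_mag == 0 and b_mag == 0:
        return False                      # -0 and +0 compare equal
    if a_sign != b_sign:
        return a_sign == 1                # negative < positive
    if a_sign == 0:
        return a_mag < b_mag              # plain unsigned comparison
    return a_mag > b_mag                  # both negative: order flips

samples = [-math.inf, -1.5, -0.0, 0.0, 2.0 ** -149, 1.0, 1.5, math.inf, math.nan]
```

Every value in `samples` is exactly representable in single precision, so the bit-level result can be checked against Python's own `<`.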
Half: the bits being rounded off are exactly 100...0 (a 1 followed by all 0s)
Even: the least significant kept bit is 0
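DO:
if what gets rounded off (starting at the round bit) starts with 0: less than half, round down
else if it is exactly half (100...0): round to even
else: more than half, round up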
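A sketch of round-to-nearest-even on a non-negative integer, dropping its low bits. The function name and the integer encoding of binary fractions are assumptions made for the example.

```python
def round_half_even(value: int, drop: int) -> int:
    """Drop the low `drop` bits (drop >= 1) of value, rounding to nearest even."""
    kept = value >> drop
    dropped = value & ((1 << drop) - 1)
    half = 1 << (drop - 1)          # the pattern 100...0
    if dropped < half:              # dropped bits start with 0
        return kept
    if dropped > half:              # more than half
        return kept + 1
    return kept + (kept & 1)        # exactly half: make the last kept bit 0

# Binary fractions with 5 fraction bits, rounded to 2 fraction bits (drop 3).
examples = {
    0b1000011: round_half_even(0b1000011, 3),  # 10.00011 -> 10.00
    0b1000110: round_half_even(0b1000110, 3),  # 10.00110 -> 10.01
    0b1011100: round_half_even(0b1011100, 3),  # 10.11100 -> 11.00
    0b1010100: round_half_even(0b1010100, 3),  # 10.10100 -> 10.10
}
```

The last two inputs are exactly halfway; one rounds up and one rounds down, which is what keeps the rounding statistically unbiased.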
Sign of a product: s1 ^ s2
Significand of a product: M1 * M2
Exponent of a product: E1 + E2
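A sketch of multiplication done field by field, for nonzero finite doubles whose product neither overflows nor underflows. The helpers `to_sme` and `multiply` are made up for the example, and the step that adds the two exponents is standard floating-point multiplication rather than something these notes spell out. Rounding of the significand product is left to Python's own double arithmetic.

```python
import math

def to_sme(x: float):
    """Split a nonzero finite x into (s, M, E) with 1 <= M < 2."""
    s = 1 if math.copysign(1.0, x) < 0 else 0
    m, e = math.frexp(abs(x))       # abs(x) == m * 2**e with 0.5 <= m < 1
    return s, m * 2, e - 1

def multiply(x: float, y: float) -> float:
    s1, M1, E1 = to_sme(x)
    s2, M2, E2 = to_sme(y)
    s = s1 ^ s2                     # sign: XOR of the sign bits
    M = M1 * M2                     # significand product, in [1, 4)
    E = E1 + E2                     # exponents add
    if M >= 2:                      # renormalize so that 1 <= M < 2
        M /= 2
        E += 1
    return (-1) ** s * math.ldexp(M, E)

product = multiply(1.5, -2.5)
```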
Arithmetic
closed: Yes, but may generate infinity and NaN
commutative: Yes
associative: No (rounding and overflow make the result depend on grouping)
has additive inverse: Yes, except for infinity and NaN
monotonicity: a >= b implies a + c >= b + c, except for infinity and NaN
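A few of these properties can be observed directly with Python floats, which are IEEE double precision on common platforms. The specific constants are just convenient choices.

```python
import math

big = 1e20
left = (3.14 + big) - big         # 3.14 is rounded away when added to 1e20
right = 3.14 + (big - big)        # 1e20 cancels first, so 3.14 survives

overflow = 1e308 * 10             # closed, but the result is infinity
undefined = math.inf - math.inf   # closed, but the result is NaN
same_either_way = (0.1 + 0.2) == (0.2 + 0.1)   # commutative
```

Here `left` is 0.0 while `right` is 3.14, so the grouping changes the answer.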
Casting
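double / float -> int: truncates the fractional part (rounds toward zero); result not defined when out of range or NaN
int -> double: exact, as long as the int has at most 53 bits
int -> float: rounds according to the rounding mode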
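These conversions can be approximated in Python, with two caveats stated up front: Python's `int` is unbounded rather than 32 bits, and `to_float32` is a made-up helper that uses `struct` to round a value to single precision.

```python
import struct

def to_float32(x: float) -> float:
    """Round x to the nearest single-precision value."""
    (y,) = struct.unpack(">f", struct.pack(">f", float(x)))
    return y

# double -> int: the fractional part is dropped (toward zero).
truncated_pos = int(2.9)
truncated_neg = int(-2.9)

# int -> double: every 32-bit int fits in the 53-bit significand.
round_trip = int(float(2 ** 31 - 1))

# int -> float: only 24 significand bits, so large ints get rounded.
rounded_down = to_float32(2 ** 24 + 1)   # halfway case, rounds to even
rounded_up = to_float32(2 ** 24 + 3)     # halfway case, rounds to even
```

Both single-precision inputs sit exactly halfway between two representable values, so round-to-nearest-even sends one down and the other up.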