Lecture 003

Limitation of Naive Representation

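Representation: bits to the right of the binary point carry weights 1/2, 1/4, 1/8, ...

Limitations
- can only represent numbers of the form x/2^k exactly, not 1/3, 1/5, or 1/10 (these have repeating bit patterns)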
- can't control the precision
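The first limitation can be checked directly. The sketch below assumes Python, whose `float` is an IEEE 754 double on mainstream platforms; `exact_value` and `is_dyadic` are illustrative helper names, not something from the lecture.

```python
from fractions import Fraction

def exact_value(x: float) -> Fraction:
    """Illustrative helper: the exact rational value stored in the float x."""
    return Fraction(x)

def is_dyadic(q: Fraction) -> bool:
    """Illustrative helper: True if q has the form x / 2^k."""
    d = q.denominator
    return d & (d - 1) == 0          # a power of two has a single 1 bit

# 0.375 = 1/4 + 1/8 is stored exactly; 1/10 is not.
print(exact_value(0.375))                     # 3/8
print(exact_value(0.1) == Fraction(1, 10))    # False
print(is_dyadic(exact_value(0.1)))            # True: the stored value is still x / 2^k
```

The value stored for 0.1 is the nearest number of the form x/2^k, which is why it differs slightly from 1/10.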

IEEE 754 Standard Floating Point

Float: (-1)^s * M * 2^E
- Sign bit (s): whether the number is positive or negative
- Exponent (E): the power of 2 that scales the value
- Significand (M): a fractional value, normally in [1.0, 2.0)

Single precision: 32 bits = 1 + 8 + 23

precision: about 7 decimal digits
range: about 10^(±38)

Double precision: 64 bits = 1 + 11 + 52

precision: about 16 decimal digits
range: about 10^(±308)
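To make the 1 + 8 + 23 and 1 + 11 + 52 layouts concrete, here is a minimal sketch that splits a value into its three stored fields. It assumes Python's standard `struct` module for access to the raw bits; `split_fields` is a hypothetical helper name.

```python
import struct

def split_fields(x: float, double: bool = False):
    """Illustrative helper: return (s, exp, frac) as unsigned integers."""
    if double:
        bits = struct.unpack(">Q", struct.pack(">d", x))[0]
        k, n = 11, 52                 # exponent bits, fraction bits
    else:
        bits = struct.unpack(">I", struct.pack(">f", x))[0]
        k, n = 8, 23
    s = bits >> (k + n)               # top bit
    exp = (bits >> n) & ((1 << k) - 1)
    frac = bits & ((1 << n) - 1)
    return s, exp, frac

print(split_fields(1.0))          # (0, 127, 0)
print(split_fields(-2.0, True))   # (1, 1024, 0)
```

The middle field is the stored (biased) exponent, not E itself.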

Representation of Float

Normalized

Normalized: exponent neither all 0s nor all 1s

Exponent (E)

Bias (constant) = 2^{k-1} - 1, where k is the number of exponent bits (float: 127, double: 1023)

exp: the actual bits stored as exponent

[1, 2^{k}-2]
float: [1, 254]
double: [1, 2046]
other (k exponent bits): [1, 2^k - 2]
increases as the magnitude of the represented number increases

E = exp(unsigned) - Bias

[-(2^{k-1}-2), 2^{k-1}-1]
float: [-126, 127]
double: [-1022, 1023]

Significand (M)

frac: the actual bits stored as significand

increases as the magnitude of the represented number increases (for a fixed exponent)

M = 1.frac

float/double: [1.0, 2.0)
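The normalized rules can be sketched as a decoder for single precision. This is an illustration under the assumption that Python's `struct` module is used to obtain the bits; `decode_normalized` is a hypothetical name, and inputs that single precision cannot hold exactly are rounded by `struct.pack` first.

```python
import struct

def decode_normalized(x: float):
    """Illustrative helper: decode a normalized float into (s, E, M, value)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = bits >> 31
    exp = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    if exp == 0 or exp == 0xFF:
        raise ValueError("not a normalized value")
    bias = 2 ** (8 - 1) - 1           # 127 for k = 8
    E = exp - bias
    M = 1 + frac / 2 ** 23            # 1.frac
    value = (-1) ** s * M * 2.0 ** E
    return s, E, M, value

print(decode_normalized(6.5))    # (0, 2, 1.625, 6.5)
```

Here 6.5 = 1.625 * 2^2, so exp is stored as 2 + 127 = 129.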

Denormalized

Denormalized: exponent all 0s

Exponent (E)

E(constant) = 1 - Bias = 1 - (2^{k-1} - 1) = -(2^{k-1}-2)

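for float: with exp = 0, the normalized formula would give E = -127
instead E is fixed at -126, as if exp = 1, so denormalized values connect smoothly to the smallest normalized values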

Significand (M)

M = 0.frac
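A small sketch of the denormalized rules for single precision, showing that the largest denormalized value sits just below the smallest normalized one. `decode_denormalized` is a hypothetical helper; the bit patterns are written out by hand.

```python
import struct

def decode_denormalized(bits: int) -> float:
    """Illustrative helper: value of a 32-bit pattern whose exp field is all 0s."""
    s = bits >> 31
    exp = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    assert exp == 0, "exponent field must be all 0s"
    E = 1 - 127                       # 1 - Bias = -126
    M = frac / 2 ** 23                # 0.frac, no implied leading 1
    return (-1) ** s * M * 2.0 ** E

smallest = decode_denormalized(0x00000001)    # 2^-149
largest = decode_denormalized(0x007FFFFF)     # just below 2^-126
smallest_normalized = struct.unpack(">f", struct.pack(">I", 0x00800000))[0]

print(smallest == 2.0 ** -149)                     # True
print(largest + smallest == smallest_normalized)   # True: evenly spaced up to 2^-126
```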

Special

Special: exponent all 1s

Infinity: exp = all 1s and frac = all 0s
NaN: exp = all 1s and frac != all 0s
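The three cases (normalized, denormalized, special) depend only on the exponent field, which a short classifier can show. This is a sketch for single precision; `classify` and `float_bits` are illustrative names.

```python
import math
import struct

def float_bits(x: float) -> int:
    """Illustrative helper: the 32-bit single-precision pattern of x."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def classify(bits: int) -> str:
    """Illustrative helper: classify a 32-bit pattern by its exponent field."""
    exp = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    if exp == 0xFF:                   # all 1s: special values
        return "infinity" if frac == 0 else "NaN"
    if exp == 0:                      # all 0s: denormalized (includes +0 and -0)
        return "denormalized"
    return "normalized"

print(classify(float_bits(math.inf)))   # infinity
print(classify(float_bits(math.nan)))   # NaN
print(classify(float_bits(0.0)))        # denormalized
print(classify(float_bits(1.0)))        # normalized
```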

Value Comparison

Properties of Representation

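float +0 has the same bit pattern as integer 0 (all bits 0); -0 differs only in the sign bit
can almost use unsigned integer comparison
but must first compare sign bits, and the order reverses when both numbers are negative
must treat -0 = 0
NaN bit patterns compare greater than infinity as unsigned integers, but NaN is unordered: every comparison with NaN except != is false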
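The comparison rules can be sketched by mapping each bit pattern to an integer that sorts the same way as the value. This is one possible approach, not necessarily what hardware does; `sort_key` and `less_than` are hypothetical names, and NaN is deliberately left out of the ordering.

```python
import math
import struct

def float_bits(x: float) -> int:
    """Illustrative helper: the 32-bit single-precision pattern of x."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def sort_key(x: float) -> int:
    """Illustrative helper: integer with the same ordering as x (x must not be NaN)."""
    bits = float_bits(x)
    if bits >> 31:                     # negative: larger magnitude means smaller value
        return -(bits & 0x7FFFFFFF)    # also maps -0 to 0, the same key as +0
    return bits

def less_than(a: float, b: float) -> bool:
    """a < b computed from bit patterns only."""
    return sort_key(a) < sort_key(b)

values = [1.5, -2.0, 0.0, -0.5, math.inf, -math.inf, 3.0]
print(sorted(values, key=sort_key) == sorted(values))   # True
print(less_than(-0.0, 0.0), less_than(0.0, -0.0))       # False False
print(0x7FC00000 > float_bits(math.inf))                # True: as unsigned bit patterns
print(math.nan > math.inf)                              # False: NaN is unordered
```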

Rounding

Rounding to Even in Binary

Half: indicated by 10000000000... in the bits being removed
Even: indicated by 0 as the least significant bit

DO:

if the removed bits start with 0 (round bit = 0)
- round down (less than half)
else
- if the removed bits are exactly 1 (round bit) followed by 00000... (sticky bit = 0)
  - exactly half: round to even
    - if the lowest kept bit (guard bit) is 0, keep it (round down)
    - if the lowest kept bit (guard bit) is 1, round up so that it becomes 0 (the carry may propagate)
- else
  - round up (more than half)
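The procedure above can be sketched on unsigned integers, treating the low bits as the part to be removed. `round_to_even` is a hypothetical helper; the test values encode binary fractions with five fractional bits that are rounded to two.

```python
def round_to_even(value: int, drop: int) -> int:
    """Illustrative helper: discard the lowest `drop` bits (drop >= 1) of an
    unsigned integer with round-half-to-even. Result is in units of 2**drop."""
    kept = value >> drop
    removed = value & ((1 << drop) - 1)
    half = 1 << (drop - 1)            # the pattern 100...0 in the removed bits
    if removed < half:                # removed bits start with 0: round down
        return kept
    if removed > half:                # more than half: round up
        return kept + 1
    return kept + (kept & 1)          # exactly half: make the lowest kept bit 0

print(bin(round_to_even(0b1000011, 3)))   # 0b1000  (10.00011 -> 10.00)
print(bin(round_to_even(0b1000110, 3)))   # 0b1001  (10.00110 -> 10.01)
print(bin(round_to_even(0b1011100, 3)))   # 0b1100  (10.11100 -> 11.00)
print(bin(round_to_even(0b1010100, 3)))   # 0b1010  (10.10100 -> 10.10)
```

The third case shows why "flipping" the lowest kept bit is not enough: rounding 10.11 up carries into the integer part.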

Multiplication

Sign: s1 ^ s2
Significand: M1 * M2
Exponent: E1 + E2

if M >= 2: shift M right by one (M >>= 1, unsigned) and increment E (E += 1); if E overflows, the result is infinity
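A minimal sketch of these steps on (s, E, M) triples. It assumes both inputs are normalized with M in [1.0, 2.0), and it leaves out rounding M to the stored precision and the overflow check on E; `fp_multiply` and `to_value` are hypothetical names.

```python
def fp_multiply(a, b):
    """Illustrative helper: multiply two values given as (s, E, M)."""
    s1, E1, M1 = a
    s2, E2, M2 = b
    s = s1 ^ s2
    M = M1 * M2           # product lies in [1.0, 4.0)
    E = E1 + E2
    if M >= 2.0:          # renormalize: shift right by one, bump the exponent
        M /= 2.0
        E += 1
    return s, E, M

def to_value(t):
    """Illustrative helper: (-1)^s * M * 2^E."""
    s, E, M = t
    return (-1) ** s * M * 2.0 ** E

# 6.0 = 1.5 * 2^2 and -3.0 = -1.5 * 2^1; the product is -18.0 = -1.125 * 2^4
product = fp_multiply((0, 2, 1.5), (1, 1, 1.5))
print(product, to_value(product))    # (1, 4, 1.125) -18.0
```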

Addition

Properties

Arithmetic

closed: Yes, but may generate infinity and NaN
commutative: Yes
associative: No (because of rounding and overflow)
has additive inverse: Yes, except for infinity and NaN
monotonicity (a >= b implies a + c >= b + c): Yes, except for infinity and NaN
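These properties are easy to observe with doubles. The operands below are chosen for illustration only; any pair with a large enough gap in magnitude shows the same effect.

```python
import math

big, small = 1e20, 3.14

left = (small + big) - big      # small is absorbed when added to big first
right = small + (big - big)     # big cancels first, so small survives
print(left, right)              # 0.0 3.14

# Commutativity still holds for these operands.
print(small + big == big + small)          # True

# Closed, but the result may be infinity or NaN.
print(1e308 + 1e308)                       # inf
print(math.isnan(math.inf - math.inf))     # True
```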

Casting

double / float -> int
- truncates the fractional part (rounds toward 0)
- undefined when out of range or NaN (generally set to TMin)
int -> double
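- exact conversion as long as int has a word size of at most 53 bits
int -> float
- will round (according to the rounding mode) when the value needs more than 24 significant bits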
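The conversions can be tried out in Python, with two caveats: Python integers are unbounded, so the out-of-range case does not arise in the same way (Python raises an exception for `int(math.nan)` or `int(math.inf)` instead of returning TMin), and single precision has to be simulated, here with the hypothetical helper `to_single` built on `struct`.

```python
import struct

def to_single(x: float) -> float:
    """Illustrative helper: round a double to single precision and back."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# double -> int truncates toward zero.
print(int(2.7), int(-2.7))                    # 2 -2

# int -> double is exact while the integer fits in 53 significant bits.
print(int(float(2**53)) == 2**53)             # True
print(int(float(2**53 + 1)) == 2**53 + 1)     # False: the value was rounded

# int -> float rounds once more than 24 significant bits are needed.
print(to_single(float(2**24 + 1)))            # 16777216.0
```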
