Lecture 003

Limitation of Naive Representation

Representation: bits to the right of the binary point carry weights 1/2, 1/4, 1/8, ..., so only values of the form x / 2^k can be represented exactly
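As a rough illustration of this limitation, the sketch below sums the weights 1/2, 1/4, 1/8, ... for a string of fraction bits. The helper name `frac_bits_to_value` is made up for this example and does not come from the lecture.

```python
from fractions import Fraction

def frac_bits_to_value(bits: str) -> Fraction:
    """Value of 0.b1b2b3... in binary, where bit i has weight 1/2**i."""
    terms = (Fraction(int(b), 2 ** i) for i, b in enumerate(bits, start=1))
    return sum(terms, Fraction(0))

# 0.101 in binary is 1/2 + 1/8
print(frac_bits_to_value("101"))  # 5/8

# 1/10 has no finite binary expansion: this prefix of 0.000110011... falls short
print(frac_bits_to_value("0001100110011") < Fraction(1, 10))  # True
```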

IEEE 754 Standard Floating Point

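Float: (-1)^s * M * 2^E

Sign Bit (s): whether a number is positive or negative

Exponent (E): power of 2 that scales the value

Significand (M): fractional value, in [1, 2) for normalized numbers and in [0, 1) for denormalized numbers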

Single precision: 32 bits = 1 (s) + 8 (exp) + 23 (frac)

Double precision: 64 bits = 1 (s) + 11 (exp) + 52 (frac)
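A minimal sketch of how the single-precision layout can be inspected, assuming Python's standard `struct` module is used to get at the raw bits. The function name `float32_fields` is hypothetical.

```python
import struct

def float32_fields(x: float) -> tuple[int, int, int]:
    """Split x (rounded to single precision) into (s, exp, frac)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31               # 1 sign bit
    exp = (bits >> 23) & 0xFF    # 8 exponent bits
    frac = bits & 0x7FFFFF       # 23 fraction bits
    return s, exp, frac

print(float32_fields(1.0))    # (0, 127, 0)
print(float32_fields(-0.75))  # (1, 126, 4194304)
```

Double precision works the same way with the `">d"` / `">Q"` formats and shifts of 63 and 52.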

Representation of Float

Normalized

Normalized: exponent neither all 0s nor all 1s

Exponent (E)

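Bias (constant) = 2^{k-1} - 1, where k is the number of exponent bits (127 for single precision, 1023 for double precision)

exp: the actual bits stored as exponent

E = exp - Bias, with exp read as an unsigned integer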

Significand (M)

frac: the actual bits stored as significand

M = 1.frac
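The normalized rules can be sketched as a small decoder. This is only an illustration: the function name and the parameter `n` (number of fraction bits) are assumptions, and the defaults correspond to single precision.

```python
def decode_normalized(s: int, exp: int, frac: int, k: int = 8, n: int = 23) -> float:
    """Value of a normalized float with k exponent bits and n fraction bits."""
    assert 0 < exp < (1 << k) - 1, "exp must be neither all 0s nor all 1s"
    bias = (1 << (k - 1)) - 1      # Bias = 2^(k-1) - 1
    E = exp - bias                 # E = exp - Bias
    M = 1 + frac / (1 << n)        # M = 1.frac (implied leading 1)
    return (-1) ** s * M * 2.0 ** E

print(decode_normalized(0, 127, 0))        # 1.0
print(decode_normalized(1, 126, 1 << 22))  # -0.75
```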

Denormalized

Denormalized: exponent all 0s

Exponent (E)

E(constant) = 1 - Bias = 1 - (2^{k-1} - 1) = -(2^{k-1}-2)

Significand (M)

M = 0.frac
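A matching sketch for the denormalized case, under the same assumptions (hypothetical function name, `n` fraction bits, single-precision defaults). Since exp is all 0s, only s and frac are needed.

```python
def decode_denormalized(s: int, frac: int, k: int = 8, n: int = 23) -> float:
    """Value of a denormalized float (exp field all 0s)."""
    bias = (1 << (k - 1)) - 1
    E = 1 - bias                 # constant exponent, -126 for single precision
    M = frac / (1 << n)          # M = 0.frac (no implied leading 1)
    return (-1) ** s * M * 2.0 ** E

# Smallest positive single-precision value: 2^-23 * 2^-126
print(decode_denormalized(0, 1) == 2.0 ** -149)  # True
# frac = 0 gives zero, and the sign bit still matters
print(decode_denormalized(1, 0))                 # -0.0
```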

Special

Special: exponent all 1s

Infinity: exp = all 1s and frac = all 0s

NaN: exp = all 1s and frac != all 0s
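Putting the three cases side by side, a classifier over the stored fields might look like the sketch below. The function name `classify` and its string labels are made up for illustration.

```python
def classify(exp: int, frac: int, k: int = 8) -> str:
    """Name the category of a float from its exp and frac fields (k exponent bits)."""
    all_ones = (1 << k) - 1
    if exp == all_ones:
        return "infinity" if frac == 0 else "NaN"
    if exp == 0:
        return "denormalized"
    return "normalized"

print(classify(0xFF, 0))  # infinity
print(classify(0xFF, 1))  # NaN
print(classify(0, 5))     # denormalized
print(classify(127, 0))   # normalized
```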

Value Comparison

All Float in Line


Float table


Spacing


Properties of Representation

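  1. float +0 has the same all-zero bit pattern as integer 0; -0 differs only in the first (sign) bit
  2. can almost use unsigned integer comparison on the bit patterns
  3. but must first compare sign bit
  4. must consider -0 = 0
  5. NaN is unordered: its bit pattern is larger than that of infinity, but every ordered comparison involving NaN is false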
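The comparison properties can be checked with a small experiment. The sketch below maps a single-precision bit pattern to an integer key whose order is meant to match the float order; `sort_key` is a hypothetical helper, and NaN is deliberately left out because it has no place in the ordering.

```python
import struct

def sort_key(x: float) -> int:
    """Map a non-NaN float32 to an integer ordered the same way as the floats."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    if bits >> 31:                     # sign bit set: bigger magnitude is smaller
        return -(bits & 0x7FFFFFFF)
    return bits                        # non-negative: unsigned order already matches

values = [-2.5, -0.0, 0.0, 1e-40, 1.0, 3.5, float("inf")]
# Integer comparison of the keys agrees with float comparison for every pair
print(all((a < b) == (sort_key(a) < sort_key(b)) for a in values for b in values))  # True
# -0 and +0 get the same key, matching -0 = 0
print(sort_key(-0.0) == sort_key(0.0))  # True
```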

Rounding

Different Rounding Modes


Rounding to Even in Binary

Half: indicated by 10000000000...

Even: indicated by 0 as least significant bit
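One way to sketch round-to-even on binary values is to work on integers and drop the low bits. The function `round_to_even` and its parameter `drop` are illustrative names; the inputs below stand for binary fractions with 5 fraction bits rounded to 2.

```python
def round_to_even(value: int, drop: int) -> int:
    """Drop the lowest `drop` bits of a non-negative integer: nearest, ties to even."""
    assert value >= 0 and drop >= 1
    kept = value >> drop
    rest = value & ((1 << drop) - 1)   # the discarded bits
    half = 1 << (drop - 1)             # the pattern 100...0
    if rest > half or (rest == half and kept & 1):
        kept += 1                      # above half, or exactly half with an odd result
    return kept

print(bin(round_to_even(0b1000011, 3)))  # 0b1000  (10.00011 -> 10.00, below half)
print(bin(round_to_even(0b1000110, 3)))  # 0b1001  (10.00110 -> 10.01, above half)
print(bin(round_to_even(0b1011100, 3)))  # 0b1100  (10.11100 -> 11.00, half, odd rounds up)
print(bin(round_to_even(0b1010100, 3)))  # 0b1010  (10.10100 -> 10.10, half, already even)
```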

DO:

Round Table


Multiplication

Sign: s1 ^ s2

Significand: M1 * M2
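A rough sketch of multiplication on already-decoded fields. The sign and significand steps follow the rules above; the exponent sum E1 + E2 and the renormalization step are the usual textbook treatment and are assumptions here, since the notes do not spell them out. Rounding M to fit the frac field and overflow handling are omitted, and `multiply_fields` is a made-up name.

```python
def multiply_fields(s1: int, M1: float, E1: int,
                    s2: int, M2: float, E2: int) -> tuple[int, float, int]:
    """Multiply (-1)^s1 * M1 * 2^E1 by (-1)^s2 * M2 * 2^E2, with 1 <= M1, M2 < 2."""
    s = s1 ^ s2          # Sign: s1 ^ s2
    M = M1 * M2          # Significand: M1 * M2, lands in [1, 4)
    E = E1 + E2          # assumed: exponents add
    if M >= 2:           # assumed: shift right once so that 1 <= M < 2 again
        M /= 2
        E += 1
    return s, M, E

# 3.0 * -6.0 = -18.0, i.e. (1.5 * 2^1) * (-1.5 * 2^2) = -1.125 * 2^4
print(multiply_fields(0, 1.5, 1, 1, 1.5, 2))  # (1, 1.125, 4)
```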

Addition

Addition


Properties

Arithmetic

Casting
