# Lecture 003

## Limitation of Naive Representation

Representation: fractional binary numbers, where the bits to the right of the binary point have weights 1/2, 1/4, 1/8, ...

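Limitations:

• can only represent numbers of the form x/2^k exactly, so 1/3, 1/5, and 1/10 have no finite representation
• can't control the precision: the binary point sits at a fixed position within the available bits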
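To make the first limitation concrete, here is a minimal Python sketch that tests whether a rational number has the form x/2^k. The helper name `is_dyadic` is an illustrative assumption and does not come from the lecture.

```python
from fractions import Fraction

def is_dyadic(q: Fraction) -> bool:
    """Return True if q equals x / 2**k for some integers x and k >= 0."""
    d = q.denominator              # Fraction always stores q in lowest terms
    return (d & (d - 1)) == 0      # power-of-two test on the denominator

# 5/8 is 0.101 in binary, so a finite bit string represents it exactly.
print(is_dyadic(Fraction(5, 8)))      # True
# 1/10 has a factor of 5 in its denominator, so its binary expansion never ends.
print(is_dyadic(Fraction(1, 10)))     # False
# The double closest to 0.1 is therefore only an approximation of 1/10.
print(Fraction(0.1) == Fraction(1, 10))   # False
```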

## IEEE 754 Standard Floating Point

Float: (-1)^s * M * 2^E

• Sign bit (s): whether the number is positive or negative
• Exponent (E): weights the value by a power of 2
• Significand (M): a fractional value, in [1.0, 2.0) for normalized numbers

Single precision: 32 bits = 1 (s) + 8 (exp) + 23 (frac)

• about 7 decimal digits of precision

• magnitudes of roughly 10^(±38)

Double precision: 64 bits = 1 (s) + 11 (exp) + 52 (frac)

• about 16 decimal digits of precision

• magnitudes of roughly 10^(±308)
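The 1 + 8 + 23 layout can be inspected directly. The following is a small sketch using Python's standard `struct` module; the helper name `float_fields` is an assumption made for illustration.

```python
import struct

def float_fields(x: float):
    """Split the single-precision encoding of x into (s, exp, frac)."""
    # Pack as a big-endian 32-bit float, then reinterpret the bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31               # 1 sign bit
    exp = (bits >> 23) & 0xFF    # 8 exponent bits
    frac = bits & 0x7FFFFF       # 23 fraction bits
    return s, exp, frac

print(float_fields(1.0))     # (0, 127, 0)
print(float_fields(-0.75))   # (1, 126, 4194304)
```

Note that `exp` here is the stored (biased) field, not E itself.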

## Representation of Float

### Normalized

Normalized: exponent neither all 0s nor all 1s

#### Exponent (E)

Bias (constant) = 2^{k-1} - 1, where k is the number of exponent bits (float: 127, double: 1023)

exp: the unsigned value of the bits stored in the exponent field

• in general, for k exponent bits: [1, 2^{k}-2] (all 0s and all 1s are reserved)

• float: [1, 254]

• double: [1, 2046]

• increases as the represented number increases

E = exp(unsigned) - Bias

• [-(2^{k-1}-2), 2^{k-1}-1]

• float: [-126, 127]

• double: [-1022, 1023]

#### Significand (M)

frac: the bits stored in the fraction field

• increases as the represented number increases

M = 1.frac (the leading 1 is implied, not stored)

• float/double: [1.0, 2.0)
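Putting E = exp - Bias and M = 1.frac together, a normalized single-precision bit pattern can be decoded by hand. This is a sketch under the assumption of a 32-bit float (k = 8); `decode_normalized` is a hypothetical helper name.

```python
BIAS = 2 ** (8 - 1) - 1   # k = 8 exponent bits for float, so Bias = 127

def decode_normalized(bits: int) -> float:
    """Decode a normalized 32-bit float encoding using (-1)^s * M * 2^E."""
    s = bits >> 31
    exp = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    assert 1 <= exp <= 254, "exp of all 0s or all 1s is not normalized"
    E = exp - BIAS              # actual exponent, in [-126, 127]
    M = 1 + frac / 2 ** 23      # 1.frac, in [1.0, 2.0)
    return (-1) ** s * M * 2.0 ** E

print(decode_normalized(0x3FC00000))   # 1.5
print(decode_normalized(0xC1200000))   # -10.0
```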

### Denormalized

Denormalized: exponent all 0s

#### Exponent (E)

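E (constant) = 1 - Bias = 1 - (2^{k-1} - 1) = -(2^{k-1}-2)

• the normalized formula would give exp = 0, E = 0 - Bias = -127 (float)

• instead E is computed as if exp = 1, so E = -126 (float), which keeps the step to the smallest normalized number smooth

#### Significand (M)

M = 0.frac (no implied leading 1)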
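The effect of fixing E at 1 - Bias can be checked numerically. The sketch below assumes single precision and uses a hypothetical helper `decode_denormalized`; it shows that the gap between the largest denormalized value and the smallest normalized value equals the denormalized spacing.

```python
import struct

BIAS = 127
E_DENORM = 1 - BIAS   # -126 for float

def decode_denormalized(frac: int) -> float:
    """Value of a positive float whose exp field is all 0s."""
    M = frac / 2 ** 23            # 0.frac, no implied leading 1
    return M * 2.0 ** E_DENORM

def bits_to_float(bits: int) -> float:
    """Reinterpret a 32-bit pattern as a single-precision float."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

smallest = decode_denormalized(1)              # 2**-149
largest = decode_denormalized(0x7FFFFF)        # just below 2**-126
smallest_normalized = 2.0 ** -126              # exp = 1, frac = 0

print(smallest == bits_to_float(0x00000001))        # True
print(smallest_normalized - largest == smallest)    # True
```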

### Special

Special: exponent all 1s

• Infinity: exp = all 1s and frac = all 0s
• NaN: exp = all 1s and frac != all 0s
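The three cases (normalized, denormalized, special) depend only on the exp and frac fields. A minimal classifier sketch for 32-bit patterns follows; the function name `classify` and its return strings are assumptions for illustration.

```python
import struct

def classify(bits: int) -> str:
    """Classify a 32-bit float encoding by its exp and frac fields."""
    exp = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    if exp == 0xFF:                       # exponent all 1s: special
        return "infinity" if frac == 0 else "NaN"
    if exp == 0:                          # exponent all 0s (includes +0 and -0)
        return "denormalized"
    return "normalized"

def bits_to_float(bits: int) -> float:
    """Reinterpret a 32-bit pattern as a single-precision float."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

for pattern in (0x7F800000, 0xFF800000, 0x7FC00000, 0x00000001, 0x3F800000):
    print(hex(pattern), classify(pattern), bits_to_float(pattern))
```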

## Properties of Representation

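1. Float +0 has the same bit pattern as integer 0 (all bits 0); -0 differs only in the sign bit
2. Floats can almost be compared with unsigned integer comparison of their bit patterns

• must first compare the sign bits

• must treat -0 = 0

• NaN is problematic: its bit pattern is greater than that of infinity under unsigned comparison, but IEEE ordered comparisons involving NaN are always false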
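The comparison rules above can be sketched with integer operations only. This assumes single precision and non-NaN inputs that are exactly representable as floats; `float_bits` and `less_than` are hypothetical helper names.

```python
import struct

def float_bits(x: float) -> int:
    """Bit pattern of x as a single-precision float."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def less_than(a: float, b: float) -> bool:
    """a < b for non-NaN values, using only integer compares on bit patterns."""
    ba, bb = float_bits(a), float_bits(b)
    sa, sb = ba >> 31, bb >> 31                    # sign bits
    ma, mb = ba & 0x7FFFFFFF, bb & 0x7FFFFFFF      # magnitudes as unsigned ints
    if ma == 0 and mb == 0:
        return False          # -0 and +0 compare equal
    if sa != sb:
        return sa == 1        # a negative value is below a positive one
    if sa == 0:
        return ma < mb        # both positive: unsigned order matches
    return ma > mb            # both negative: unsigned order is reversed

print(less_than(1.0, 1.5))      # True
print(less_than(-0.0, 0.0))     # False
print(less_than(-2.5, -1.0))    # True
```

Because the exponent field sits above the fraction field and both grow with the represented value, the magnitude bits sort in the same order as the magnitudes themselves.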

## Rounding

### Rounding to Even in Binary

• Half: the removed bits are exactly 10000000000...
• Even: the least significant kept bit is 0

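DO:

• if the removed bits are less than half (round bit is 0), then round down
• else if the removed bits are exactly half, i.e. 1 (round bit) followed by 00000... (sticky bits), round to even:
• if the lowest kept bit (guard bit) is 0, keep it (round down)
• if the lowest kept bit (guard bit) is 1, round up (add 1 at the guard position, so it becomes 0)
• else (more than half), round up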
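The procedure can be sketched on non-negative integers, where dropping the lowest bits plays the role of discarding precision. The function name `round_to_even` and its parameters are assumptions for illustration.

```python
def round_to_even(value: int, drop: int) -> int:
    """Drop the lowest `drop` bits of a non-negative int, rounding to nearest, ties to even."""
    if drop == 0:
        return value
    kept = value >> drop                     # bits that survive; lowest one is the guard bit
    removed = value & ((1 << drop) - 1)      # round bit followed by sticky bits
    half = 1 << (drop - 1)                   # pattern 100...0
    if removed < half:                       # round bit is 0: below half
        return kept
    if removed > half:                       # round bit 1 and some sticky bit 1: above half
        return kept + 1
    return kept + (kept & 1)                 # exactly half: round up only if guard bit is 1

# Rounding 5 fractional bits down to 2 (drop = 3):
print(bin(round_to_even(0b1000011, 3)))   # 0b1000  (10.00011 -> 10.00, below half)
print(bin(round_to_even(0b1000110, 3)))   # 0b1001  (10.00110 -> 10.01, above half)
print(bin(round_to_even(0b1011100, 3)))   # 0b1100  (10.11100 -> 11.00, tie, guard was 1)
print(bin(round_to_even(0b1010100, 3)))   # 0b1010  (10.10100 -> 10.10, tie, guard was 0)
```

Adding 1 rather than flipping the guard bit matters in the third case, where the carry propagates into higher bits.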

## Multiplication

• Sign: s1 ^ s2
• Significand: M1 * M2
• Exponent: E1 + E2

Fix-up:

• if M >= 2, M(unsigned) >>= 1, E += 1
• if E overflows, the result is infinity
• round M to fit the frac precision
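The steps can be traced on raw bit patterns. The sketch below is a simplified illustration, not a full IEEE implementation: it assumes single precision, normalized inputs, and a result that does not underflow, and the names `f2b`, `b2f`, and `fmul_bits` are hypothetical.

```python
import struct

def f2b(x: float) -> int:
    """Bit pattern of x rounded to single precision."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def b2f(bits: int) -> float:
    """Value of a 32-bit single-precision pattern."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def fmul_bits(a: int, b: int) -> int:
    """Multiply two normalized single-precision encodings (sketch only)."""
    s = (a >> 31) ^ (b >> 31)                          # sign: s1 ^ s2
    exp_a, exp_b = (a >> 23) & 0xFF, (b >> 23) & 0xFF
    assert 1 <= exp_a <= 254 and 1 <= exp_b <= 254, "normalized inputs only"
    E = (exp_a - 127) + (exp_b - 127)                  # exponent: E1 + E2
    # Significands as 24-bit integers with the implied leading 1 made explicit.
    m = ((a & 0x7FFFFF) | 0x800000) * ((b & 0x7FFFFF) | 0x800000)
    if m >= 1 << 47:                                   # M >= 2: shift right, bump E
        shift, E = 24, E + 1
    else:
        shift = 23
    kept = m >> shift                                  # 24-bit significand
    removed = m & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    if removed > half or (removed == half and kept & 1):
        kept += 1                                      # round to even
    if kept == 1 << 24:                                # rounding carried into a new bit
        kept, E = kept >> 1, E + 1
    if E > 127:                                        # exponent overflow: infinity
        return (s << 31) | (0xFF << 23)
    assert E >= -126, "underflow is not handled in this sketch"
    return (s << 31) | ((E + 127) << 23) | (kept & 0x7FFFFF)

print(b2f(fmul_bits(f2b(1.5), f2b(2.5))))    # 3.75
print(b2f(fmul_bits(f2b(1e20), f2b(1e20))))  # inf
```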

## Properties

Arithmetic

• closed: yes, but may generate infinity and NaN

• commutative: yes

• associative: no (rounding and overflow depend on the order of operations)

• has additive inverse: yes, except for infinity and NaN

• monotonicity (a >= b implies a + c >= b + c): yes, except for infinity and NaN
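A few of these properties can be observed with ordinary Python floats, which are IEEE doubles on common platforms. The specific values below are illustrative choices, not taken from the lecture.

```python
inf = float("inf")

# Closed, but the result may be infinity or NaN.
print(1e308 * 10)        # inf
print(inf - inf)         # nan

# Commutative.
print(1e20 + 3.14 == 3.14 + 1e20)    # True

# Not associative: 3.14 is lost when it is first added to -1e20.
a, b, c = 1e20, -1e20, 3.14
print((a + b) + c)       # 3.14
print(a + (b + c))       # 0.0

# Additive inverse exists for finite values, but not for infinity.
print(2.5 + (-2.5))      # 0.0
print(inf + (-inf))      # nan
```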

Casting

double / float -> int

• truncates the fractional part (rounds toward 0)
• undefined when out of range or NaN (generally set to TMin)

int -> double

• exact conversion as long as the int has a word size of at most 53 bits

int -> float

• will round when the int needs more than 24 significant bits
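The casting rules can be approximated in Python, with two caveats stated as assumptions: Python's `float` is a double, so single precision is emulated through `struct` (the helper `to_single` is hypothetical), and Python's `int()` raises an exception for NaN and infinity instead of producing TMin, so that case is not shown.

```python
import struct

def to_single(x: float) -> float:
    """Round a double to single precision and widen it back."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# double -> int: the fractional part is truncated (rounding toward 0).
print(int(3.99), int(-3.99))             # 3 -3

# int -> double: exact up to 53 significant bits, rounded beyond that.
print(float(2 ** 53) == 2 ** 53)         # True
print(float(2 ** 53 + 1) == 2 ** 53 + 1) # False

# int -> float: only 24 significant bits are available, so 2**24 + 1 is rounded.
print(to_single(float(2 ** 24)))         # 16777216.0
print(to_single(float(2 ** 24 + 1)))     # 16777216.0
```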
