# Lecture 012

## Measuring Failure

Hard errors: a damaged component that experiences a fail-stop failure (bad solder, defective DRAM bank)

Soft errors: a flipped signal or bit, caused by an external source or a faulty component (cosmic radiation, alpha particles)

Mean Time to Failure (MTTF)

Mean Time to Repair (MTTR)

Mean Time Between Failures (MTBF) = MTTF + MTTR

Availability = MTTF / (MTTF + MTTR)

• airlines: 99.9993%

• 911 phone service: 99.994%

• standard phone: 99.99%

• internet: 95%~99.6%
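To get a feel for what these percentages mean, the availability formula above can be turned into expected downtime per year. This is a minimal sketch; the function names and the 999-hour / 1-hour component are made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a 365-day year


def availability(mttf: float, mttr: float) -> float:
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected downtime per year for a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR * 60


# Hypothetical component: fails every 999 hours on average, 1 hour to repair.
print(availability(999, 1))

# "Four nines" (99.99%) allows a little under an hour of downtime per year.
print(downtime_minutes_per_year(0.9999))
```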

Availability Service Level Objective (SLO): an availability threshold that your system targets

Availability Service Level Agreement (SLA): an availability threshold that you guarantee to customers

Failure correlation: statistics suggest that failures of physically separated components are not correlated

Design considerations for fault tolerance

• The probability of failure of each component

• The cost of failure

• The cost of implementing fault tolerance

## Error Detection

Goals:

• detect failure

• correct failure

Error Detecting Code: general scheme

1. Imagine a network transmission scenario
2. The sender sends data $D$ together with a hash $f(D)$
3. On receipt, the receiver recomputes $f(D')$ over the received data $D'$ and checks that it equals the received hash
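The general scheme can be sketched as follows. The notes do not say which $f$ is used; SHA-256 from Python's `hashlib` is an arbitrary choice here, and `send` / `receive` are hypothetical helper names.

```python
import hashlib


def f(data: bytes) -> bytes:
    """The check function f(D); SHA-256 is an assumption, not the notes' choice."""
    return hashlib.sha256(data).digest()


def send(data: bytes):
    """Sender side: transmit the data together with f(D)."""
    return data, f(data)


def receive(data: bytes, tag: bytes) -> bool:
    """Receiver side: recompute f over what arrived and compare with the tag."""
    return f(data) == tag


data, tag = send(b"hello world")
print(receive(data, tag))            # intact transmission passes the check
print(receive(b"hello w0rld", tag))  # corrupted data fails the check
```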

Single-bit parity: detects any odd number of flipped bits but misses any even number, so it cannot reliably detect multi-bit burst errors

1. Given 7 data bits
2. Append 1 parity bit at the end, computed as the sum of the 7 data bits (mod 2)
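A small sketch of the parity scheme, representing bits as a Python list of 0/1 values (the function names are illustrative). It also shows why two flipped bits go unnoticed: the sum mod 2 is unchanged.

```python
def encode(bits7):
    """Append a parity bit so the 8-bit word has an even number of 1s."""
    return bits7 + [sum(bits7) % 2]


def check(bits8):
    """A received word passes if its bits still sum to 0 (mod 2)."""
    return sum(bits8) % 2 == 0


word = encode([1, 0, 1, 1, 0, 0, 1])

one_flip = word.copy()
one_flip[2] ^= 1          # single-bit error

two_flips = one_flip.copy()
two_flips[3] ^= 1         # second error right next to the first (a 2-bit burst)

print(check(word))        # passes
print(check(one_flip))    # detected
print(check(two_flips))   # passes, so the error is missed
```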

Checksum: a little better than single-bit parity since it uses more check bits

• Simple to implement

• Relatively weak detection

• Still tricked by typical error patterns - e.g. burst errors
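As an illustration of the weakness, here is a sketch using an 8-bit additive checksum (sum of all bytes mod 256). The notes do not specify a particular checksum, so this variant is an assumption. Because addition is commutative, swapping two bytes leaves the checksum unchanged.

```python
def checksum8(data: bytes) -> int:
    """8-bit additive checksum: sum of all bytes modulo 256 (assumed variant)."""
    return sum(data) % 256


original = b"pay 12"
changed = b"pay 13"   # one byte altered
swapped = b"pay 21"   # two adjacent bytes exchanged

print(checksum8(original) != checksum8(changed))  # the change is detected
print(checksum8(original) == checksum8(swapped))  # the swap is not
```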

Cyclic Redundancy Check (CRC)

• Can detect all burst errors shorter than r+1 bits, where r is the number of CRC check bits

• Efficient streaming implementation in hardware
• x86 instruction to calculate CRC
• used in ethernet and hard drives
• // TODO 3 slides
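To make the burst-error guarantee concrete, here is a bit-by-bit sketch of an 8-bit CRC, so $r = 8$. The generator polynomial $x^8 + x^2 + x + 1$ (0x07), the zero initial value, and the helper `flip_burst` are choices made for this example and are not taken from the lecture; real hardware and library implementations are table-driven or use dedicated instructions.

```python
def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8: remainder of data * x^8 divided by the generator (mod 2)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:                       # top bit set: subtract (XOR) the generator
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def flip_burst(data: bytes, start_bit: int, pattern: int, width: int = 8) -> bytes:
    """XOR a `width`-bit error pattern into data, starting at bit `start_bit`."""
    total_bits = 8 * len(data)
    shift = total_bits - start_bit - width
    value = int.from_bytes(data, "big") ^ (pattern << shift)
    return value.to_bytes(len(data), "big")


msg = b"\x12\x34\x56\x78"
good = crc8(msg)

# Try every nonzero error pattern of up to 8 bits at every bit offset.
all_detected = all(
    crc8(flip_burst(msg, start, pattern)) != good
    for start in range(8 * len(msg) - 8 + 1)
    for pattern in range(1, 256)
)
print(all_detected)
```

Every burst of at most 8 bits changes the CRC here because the generator has degree 8 and a nonzero constant term, so no such error pattern is a multiple of it.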

## Error Correction

Two schemes of error recovery

• redundancy (forward recovery)

• retry (backward recovery)

### Error Correcting Codes (ECC)

Courses

• 15-853 Algorithms in the real world

• 15-848 Practical information and coding theory for computer systems

### Replication and Voting

Triple modular redundancy:

• Send the same request to 3 different instances of the system

• Compare the answers, take the majority

Widely used in space applications, which have both the budget and a high failure rate due to cosmic rays.
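A minimal sketch of the voting step, modelling each replica as a Python callable (`vote`, `tmr`, and the faulty replica are all hypothetical). If no answer has a strict majority, the sketch raises an error instead of guessing.

```python
from collections import Counter


def vote(answers):
    """Return the answer given by a strict majority of replicas."""
    winner, count = Counter(answers).most_common(1)[0]
    if count * 2 <= len(answers):
        raise ValueError("no majority among replicas")
    return winner


def tmr(replicas, request):
    """Send the same request to every replica and take the majority answer."""
    return vote([replica(request) for replica in replicas])


def healthy(x):
    return x * x


def faulty(x):
    return x * x + 1   # simulates a replica hit by a soft error


print(tmr([healthy, faulty, healthy], 3))  # the faulty answer is outvoted
```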

### Retry

We separate detection from correction: first detect the error, then retry the transmission.
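A sketch of detect-then-retry, assuming each frame carries a CRC-32 of its payload (computed with `zlib.crc32`) and the sender simply retransmits until the check passes. The frame layout, `send_with_retry`, and the simulated `flaky_channel` are illustrative assumptions, not a real protocol.

```python
import zlib


def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def parse_frame(frame: bytes):
    """Return the payload if the CRC matches, otherwise None (error detected)."""
    payload, crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    return payload if zlib.crc32(payload) == crc else None


def send_with_retry(payload: bytes, channel, max_attempts: int = 5):
    """Retransmit until a frame arrives intact; returns (payload, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        received = parse_frame(channel(make_frame(payload)))
        if received is not None:
            return received, attempt
    raise RuntimeError("gave up after %d attempts" % max_attempts)


def flaky_channel(failures: int):
    """Simulated channel that flips one bit in the first `failures` frames."""
    state = {"left": failures}

    def channel(frame: bytes) -> bytes:
        if state["left"] > 0:
            state["left"] -= 1
            return bytes([frame[0] ^ 0x01]) + frame[1:]
        return frame

    return channel


print(send_with_retry(b"hello", flaky_channel(2)))
```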

## Redundant Array of Inexpensive Disks (RAID)

Hard drive: a sequence of small data sectors (4 KB each), stored on spinning disks

RAID: use multiple disks to form a single logical disk

Definitions

• Reliability: # of disk failures we can tolerate

• Latency: time to process Read/Write requests

• Throughput: bandwidth for R/W requests

We assume random reads and writes, and the same throughput and latency for reads and writes on every disk.

RAID Levels:

• RAID 0: Data striping without redundancy

  • Interleave a file's data across multiple disks (tolerates no disk failure)
  • Parallel reads and writes across multiple disks

• RAID 1: Mirroring of independent disks

  • Make two or more copies of the same data (tolerates 1 disk failure)
  • Every write must go to all copies; a read can be served by any copy

• RAID 2: ...

• RAID 3: ...

• RAID 4: Data striping plus parity disk

  • Ensure $D_1 \oplus D_2 \oplus D_3 = D_p$, so that $D_2 \oplus D_3 \oplus D_p = D_1$ recovers $D_1$ if it fails ($D_p$ is the parity disk)
  • Every write must also update the parity disk, so the parity disk can easily become a bottleneck
  • Because of that bottleneck, adding disks does not improve write throughput

• RAID 5: Data striping plus striped (rotating) parity

  • Distribute the parity blocks across all disks instead of using a dedicated parity disk

• RAID 6: ...
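The parity idea behind RAID 4 and RAID 5 can be sketched with byte strings standing in for disk blocks, assuming the parity block is the bytewise XOR of the data blocks. `xor_blocks` and the sample block contents are made up for the example.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks and their parity block.
d1, d2, d3 = b"\x0f\xaa", b"\xf0\x55", b"\x33\x33"
dp = xor_blocks(d1, d2, d3)

# If disk 1 fails, its block is rebuilt from the survivors and the parity.
recovered = xor_blocks(d2, d3, dp)
print(recovered == d1)

# A small write to d2 also has to touch the parity block:
# new parity = old parity XOR old data XOR new data.
new_d2 = b"\x00\xff"
new_dp = xor_blocks(dp, d2, new_d2)
print(new_dp == xor_blocks(d1, new_d2, d3))
```

The second half shows why the parity disk becomes a bottleneck in RAID 4: every write, whichever data disk it targets, also updates the same parity disk.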

Mean Time To First Data Loss (MTTDL): calculated from the disk MTTF

Using a continuous-time Markov chain, the MTTDL is approximately $MTTF_{disk} / n$ for $n$ disks when any single disk failure loses data, as in RAID 0 (the more disks, the sooner one of them fails).
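One way to see where $MTTF_{disk} / n$ comes from is sketched below. It rests on two assumptions that are not stated in the notes: disk lifetimes $T_1, \dots, T_n$ are independent and exponentially distributed with rate $\lambda = 1 / MTTF_{disk}$, and data is lost as soon as the first disk fails (no redundancy).

$$
P\left(\min_i T_i > t\right) = \prod_{i=1}^{n} P(T_i > t) = e^{-n \lambda t}
\quad\Longrightarrow\quad
MTTDL = E\left[\min_i T_i\right] = \frac{1}{n \lambda} = \frac{MTTF_{disk}}{n}
$$

The time to the first failure is again exponential, with rate $n \lambda$, which is the single transition out of the "all disks healthy" state in the Markov chain.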

// TODO: calculation

// TODO: rest of slides
