Lecture 012 - RAID

Measuring Failure

Hard errors: a damaged component that experiences a fail-stop failure (e.g. bad solder, defective DRAM bank)

Soft errors: a flipped signal or bit, caused by an external source or a faulty component (e.g. cosmic radiation, alpha particles)

Mean Time to Failure (MTTF)

Mean Time to Repair (MTTR)

Mean Time Between Failures (MTBF) = MTTF + MTTR

Availability = MTTF / (MTTF + MTTR)

airlines: 99.9993%
911 phone service: 99.994%
standard phone: 99.99%
internet: 95%~99.6%
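
To make the availability formula concrete, here is a minimal sketch that computes availability from MTTF and MTTR and converts an availability figure into expected downtime per year. The function names and the 1000 h / 1 h component are illustrative assumptions, not values from the lecture.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected downtime in minutes over a 365-day year."""
    return (1.0 - avail) * 365 * 24 * 60


# Hypothetical component: fails every 1000 h on average, takes 1 h to repair.
a = availability(1000.0, 1.0)
print(f"availability = {a:.5f}")

# Downtime budget implied by the "standard phone" figure of 99.99%.
print(f"{downtime_minutes_per_year(0.9999):.2f} minutes of downtime per year")
```

Reading the listed percentages as a downtime budget makes them easier to compare: each extra "nine" cuts the allowed downtime by a factor of ten.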

Availability Service Level Objective (SLO): An availability threshold which your system targets

Availability Service Level Agreement (SLA): An availability threshold that you guarantee for customers

Failure correlation: failures are assumed to be uncorrelated if the components are physically separated (based on statistics)

Fault-Tolerant Design Considerations

The probability of failure of each component
The cost of failure
The cost of implementing fault tolerance

Error Detection: timeout, parity, checksum

Error Correction: retry

Error Detection

Goals:

detect failure
correct failure

Error Detecting Code: general scheme

imagine a network transmission scenario
sender sends data $D$ together with a hash $f(D)$
receiver recomputes the hash of the received data $D'$ and checks that $f(D')$ equals the received $f(D)$
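
The general scheme can be sketched as follows. This is a minimal illustration that assumes SHA-256 plays the role of $f$; the notes do not name a specific function, and `send` and `receive` are hypothetical names.

```python
import hashlib


def send(data: bytes) -> tuple[bytes, bytes]:
    """Sender side: transmit the data together with f(D)."""
    return data, hashlib.sha256(data).digest()


def receive(data: bytes, tag: bytes) -> bool:
    """Receiver side: recompute f over the received data and compare."""
    return hashlib.sha256(data).digest() == tag


packet, tag = send(b"hello world")
print(receive(packet, tag))  # intact packet passes the check

# Flip one bit of the first byte to simulate corruption in transit.
corrupted = bytes([packet[0] ^ 0x01]) + packet[1:]
print(receive(corrupted, tag))  # corrupted packet fails the check
```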

Single Bit Parity: detects any odd number of flipped bits, but cannot reliably detect multi-bit burst errors (an even number of flips goes unnoticed)

given 7 data bits
we append 1 bit at the end calculated as the sum of 7 bits (mod 2)
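
A small sketch of the 7+1 bit parity scheme, showing one detected error and one missed error. The bit pattern and function names are made up for illustration.

```python
def add_parity(bits: list[int]) -> list[int]:
    """Append a parity bit equal to the sum of the data bits mod 2."""
    return bits + [sum(bits) % 2]


def parity_ok(word: list[int]) -> bool:
    """A valid word (data + parity) always has an even number of 1s."""
    return sum(word) % 2 == 0


word = add_parity([1, 0, 1, 1, 0, 0, 1])  # 7 data bits

one_flip = word.copy()
one_flip[2] ^= 1          # single-bit error

two_flips = one_flip.copy()
two_flips[5] ^= 1         # second error cancels the first in the parity sum

print(parity_ok(word), parity_ok(one_flip), parity_ok(two_flips))
```

The two-flip case passes the check even though the data is wrong, which is why a single parity bit is unreliable against multi-bit errors.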

Checksum: a little better than Single Bit Parity since there are more check bits

Simple to implement
Relatively weak detection
Still tricked by typical error patterns - e.g. burst errors
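
As an illustration of the weakness, here is a sketch using an 8-bit additive checksum. The notes do not say which checksum is meant, so this particular choice (sum of bytes mod 256) is an assumption.

```python
def checksum8(data: bytes) -> int:
    """8-bit additive checksum: sum of all bytes modulo 256."""
    return sum(data) % 256


original = b"\x10\x20\x30\x40"
single = b"\x11\x20\x30\x40"        # one byte changed
swapped = b"\x20\x10\x30\x40"       # first two bytes reordered
compensating = b"\x11\x1f\x30\x40"  # +1 on one byte, -1 on the next

for name, msg in [("single", single), ("swapped", swapped),
                  ("compensating", compensating)]:
    detected = checksum8(msg) != checksum8(original)
    print(f"{name}: detected={detected}")
```

Only the single-byte change is caught here; reordering and compensating changes leave the sum unchanged.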

Cyclic Redundancy Check (CRC):

treat data $D$ as polynomial coefficients
- choose $r+1$ bits as the coefficients of a generator polynomial $G$ (agreed on in advance)
- append $r$ CRC bits $R$ to the packet
- $R$ is chosen so that the packet $(D, R)$ is divisible by the generator $G$ (using mod-2 arithmetic)
Can detect all burst errors of fewer than $r+1$ bits
Widely adopted
- Efficient streaming implementation in hardware
- x86 instruction to calculate CRC
- used in ethernet and hard drives
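
The CRC computation can be sketched as bit-by-bit polynomial long division over mod-2 arithmetic. This is a slow teaching version, not how hardware or library implementations work; the 6-bit message and the generator `1001` (so $r = 3$) are small illustrative values, and the code assumes the generator's leading bit is 1.

```python
def mod2_remainder(bits: list[int], gen: list[int]) -> list[int]:
    """Remainder of bits / gen using XOR (mod-2) long division."""
    work = bits.copy()
    r = len(gen) - 1
    for i in range(len(work) - r):
        if work[i]:  # leading bit set: subtract (XOR) the generator here
            for j, g in enumerate(gen):
                work[i + j] ^= g
    return work[-r:]


def crc_bits(data: str, generator: str) -> str:
    """CRC bits R: remainder of D shifted left by r bits, divided by G."""
    gen = [int(b) for b in generator]
    padded = [int(b) for b in data] + [0] * (len(gen) - 1)
    return "".join(str(b) for b in mod2_remainder(padded, gen))


def crc_valid(codeword: str, generator: str) -> bool:
    """A received (D, R) is accepted only if it is divisible by G."""
    gen = [int(b) for b in generator]
    return not any(mod2_remainder([int(b) for b in codeword], gen))


G = "1001"
D = "101110"
codeword = D + crc_bits(D, G)

# Simulate a 3-bit burst error (positions 2..4), which is no longer than r.
burst = "".join(
    str(int(b) ^ 1) if 2 <= i <= 4 else b for i, b in enumerate(codeword)
)
print(codeword, crc_valid(codeword, G), crc_valid(burst, G))
```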

Error Correction

Two schemes of error recovery

redundancy (forward recovery)
retry (backward recovery)

Error Correcting Codes (ECC)

Correcting one bit error with Two Dimensional Bit Parity
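
A minimal sketch of two-dimensional parity: keep one parity bit per row and one per column, then locate a single flipped data bit at the intersection of the failing row and failing column. It assumes the parity bits themselves arrive intact and that at most one data bit is flipped; the function names and the 3x4 data grid are illustrative.

```python
def encode(rows: list[list[int]]) -> tuple[list[int], list[int]]:
    """Return (row parities, column parities) for a grid of bits."""
    row_par = [sum(r) % 2 for r in rows]
    col_par = [sum(c) % 2 for c in zip(*rows)]
    return row_par, col_par


def correct(rows: list[list[int]], row_par: list[int],
            col_par: list[int]) -> list[list[int]]:
    """Flip the bit at the single failing (row, column), if there is one."""
    bad_rows = [i for i, r in enumerate(rows) if sum(r) % 2 != row_par[i]]
    bad_cols = [j for j, c in enumerate(zip(*rows))
                if sum(c) % 2 != col_par[j]]
    fixed = [r.copy() for r in rows]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        fixed[bad_rows[0]][bad_cols[0]] ^= 1
    return fixed


data = [[1, 0, 1, 1],
        [0, 1, 1, 0],
        [1, 1, 0, 0]]
row_par, col_par = encode(data)

damaged = [r.copy() for r in data]
damaged[1][2] ^= 1  # single-bit error in row 1, column 2

repaired = correct(damaged, row_par, col_par)
print(repaired == data)
```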

Courses

15-750 Graduate Algorithms
15-853 Algorithms in the real world
15-848 Practical information and coding theory for computer systems

Replication and Voting

Triple modular redundancy:

Send the same request to 3 different instances of the system
Compare the answers, take the majority

Widely used in space applications, which have the budget for it and a high failure rate due to cosmic rays.
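
A sketch of the voting step, assuming each replica is a plain Python callable that returns a hashable answer. The `vote` and `tmr` helpers and the toy replicas are hypothetical.

```python
from collections import Counter


def vote(answers: list):
    """Return the strict-majority answer, or raise if there is none."""
    value, count = Counter(answers).most_common(1)[0]
    if count * 2 <= len(answers):
        raise ValueError("no majority among replicas")
    return value


def tmr(replicas: list, request):
    """Send the same request to every replica and take the majority."""
    return vote([replica(request) for replica in replicas])


def good(x: int) -> int:
    return x * 2


def faulty(x: int) -> int:
    return x * 2 + 1  # simulates a replica hit by a soft error


print(tmr([good, faulty, good], 21))
```

With three replicas, one faulty answer is outvoted; if all three disagree, the sketch raises instead of guessing.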

Retry

We separate "detection" from "correction": first detect the error, then retry the transmission.
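
A sketch of detect-then-retry, using `zlib.crc32` from the Python standard library for detection. The flaky channel is a hypothetical stand-in for a real network: it deterministically corrupts the first few transmissions so the behavior is reproducible.

```python
import zlib


def make_flaky_channel(failures: int):
    """Hypothetical channel that corrupts the first `failures` sends."""
    state = {"left": failures}

    def channel(payload: bytes, crc: int) -> tuple[bytes, int]:
        if state["left"] > 0:
            state["left"] -= 1
            # Invert the first byte to simulate corruption in transit.
            return bytes([payload[0] ^ 0xFF]) + payload[1:], crc
        return payload, crc

    return channel


def send_with_retry(data: bytes, channel, max_attempts: int = 5):
    """Retransmit until the receiver's CRC check passes."""
    crc = zlib.crc32(data)
    for attempt in range(1, max_attempts + 1):
        received, received_crc = channel(data, crc)
        if zlib.crc32(received) == received_crc:  # detection step
            return received, attempt
        # Check failed: correction is simply another transmission.
    raise RuntimeError("gave up after max_attempts transmissions")


result, attempts = send_with_retry(b"block", make_flaky_channel(2))
print(result, attempts)
```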

Redundant Array of Inexpensive Disks (RAID)

Hard drive: a sequence of small data sectors (4 KB each), stored on spinning disks

RAID: Use multiple disks to form single logical disk

Definitions

Reliability: # of disk failures we can tolerate
Latency: time to process Read/Write requests
Throughput: bandwidth for R/W requests

We assume random reads and writes, and that all disks have the same throughput and latency for both reads and writes.

RAID Levels:

RAID 0: Data striping without redundancy
- Interleave the data of a file across multiple disks (no fault tolerance)
- Parallel read and write across multiple disks
- Poor reliability
RAID 1: Mirroring of independent disks
- make two or more copies of the same data (tolerate 1 disk failure)
- need to write in both, can read in either
- Poor capacity
RAID 2: ...
RAID 3: ...
RAID 4: Data striping plus parity disk
- ensure $D_1 \oplus D_2 \oplus D_3 = D_p$, so that $D_2 \oplus D_3 \oplus D_p = D_1$ recovers $D_1$ if it fails ($D_p$ is the parity disk, $\oplus$ is XOR)
- on write, we also need to update the parity disk (the parity disk can easily become a bottleneck)
- on read, we read from the data disks
- Adding disks does not provide any write performance gain, since every write must still update the single parity disk
RAID 5: Data striping plus striped (rotating) parity
- distribute the parity blocks across all disks instead of using a dedicated parity disk
- Good compromise choice
RAID 6: ...
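
The parity idea behind RAID 4 and RAID 5 can be sketched with byte-wise XOR. The block contents are made-up values, and `raid5_parity_disk` shows only one possible rotation, chosen here as an assumption; real layouts vary.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column)
                 for column in zip(*blocks))


# RAID 4 style stripe: three data blocks plus one parity block.
d1, d2, d3 = b"\x0f\xa0", b"\x33\x0b", b"\x55\xc4"
dp = xor_blocks([d1, d2, d3])

# If the disk holding d1 fails, XOR of the survivors rebuilds it.
recovered = xor_blocks([d2, d3, dp])
print(recovered == d1)


def small_write_parity(old_data: bytes, new_data: bytes,
                       old_parity: bytes) -> bytes:
    """New parity after overwriting one data block (read-modify-write)."""
    return xor_blocks([old_data, new_data, old_parity])


def raid5_parity_disk(stripe: int, n_disks: int) -> int:
    """Disk index holding parity for a stripe (illustrative rotation)."""
    return (n_disks - 1 - stripe) % n_disks


print([raid5_parity_disk(s, 4) for s in range(4)])
```

The small-write helper shows why the parity disk is busy on every write in RAID 4: each data update also rewrites parity. Rotating the parity location spreads that load over all disks.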

Mean Time To First Data Loss (MTTDL): calculated from MTTF

Sequential $n$ devices: $MTTDL_n = MTTF_1 / n$
Parallel $n$ devices: $MTTDL_n = \sum_{i = 1}^{n} \frac{MTTF_1}{i}$
$k$ parity, $n$ data: $MTTDL_n = \sum_{i = n}^{n+k} \frac{MTTF_1}{i}$
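
A sketch that evaluates the three cases numerically, with the parity case summing $i$ from $n$ to $n+k$ (data is lost at the $(k+1)$-th failure among $n+k$ disks). It assumes independent, exponentially distributed disk lifetimes and no repair, which is what these sums correspond to. The 100,000 h MTTF is an arbitrary example value.

```python
def mttdl_serial(mttf: float, n: int) -> float:
    """n devices, any single failure loses data (e.g. striping)."""
    return mttf / n


def mttdl_parallel(mttf: float, n: int) -> float:
    """n replicas, data is lost only when all n have failed."""
    return sum(mttf / i for i in range(1, n + 1))


def mttdl_parity(mttf: float, n_data: int, k_parity: int) -> float:
    """n data + k parity disks, data is lost at the (k+1)-th failure."""
    return sum(mttf / i for i in range(n_data, n_data + k_parity + 1))


MTTF = 100_000.0  # hours, illustrative
print(mttdl_serial(MTTF, 4))       # 4-disk stripe
print(mttdl_parallel(MTTF, 2))     # 2-way mirror
print(mttdl_parity(MTTF, 4, 1))    # 4 data + 1 parity
```

The parity formula reduces to the sequential case for $k = 0$ and to the parallel case for $n = 1$, which is a quick consistency check on the summation bounds.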

From a continuous-time Markov chain, we calculate that MTTDL is around $MTTF_{disk} / n$ with $n$ disks (the larger $n$, the more likely a failure). The derivation of MTTDL using a Markov chain can be found here.

restoring redundancy after failure: reconstruct when the first drive fails

Modes
- Normal mode
- Degraded mode: some disk unavailable
- Rebuild mode: reconstructing lost disk’s contents onto spare
Tradeoff
- Rebuild is important for reliability
- Foreground activity is important for performance
Methods
- mirroring: just read a good copy
- parity: read from all drives and compute
