Hard errors: a damaged component that fails and stops working (fail-stop), e.g. bad solder or a defective DRAM bank
Soft errors: a flipped signal or bit, caused by an external source or a faulty component (e.g. cosmic radiation, alpha particles)
Mean Time to Failure (MTTF)
Mean Time to Repair (MTTR)
Mean Time Between Failures (MTBF) = MTTF + MTTR
Availability = MTTF / (MTTF + MTTR)
airlines: 99.9993%
911 phone service: 99.994%
standard phone: 99.99%
internet: 95%~99.6%
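The availability formula above can be turned into expected downtime, which makes figures such as "99.99%" easier to compare. The following is a minimal sketch; the function names and the 1000-hour / 1-hour component are illustrative assumptions, not values from these notes.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected downtime over a 365-day year for a given availability."""
    minutes_per_year = 365 * 24 * 60
    return (1.0 - avail) * minutes_per_year


# Hypothetical component: fails every 1000 hours on average, 1 hour to repair.
a = availability(1000.0, 1.0)
print(f"availability = {a:.5f}")
print(f"downtime at 99.99% = {downtime_minutes_per_year(0.9999):.2f} min/year")
```

Under these assumptions, 99.99% availability corresponds to roughly 53 minutes of downtime per year.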
Availability Service Level Objective (SLO): an availability threshold that your system targets
Availability Service Level Agreement (SLA): an availability threshold that you guarantee to customers
Failure correlation: failures are assumed to be uncorrelated if the components are physically separated (based on statistics)
Fault-Tolerant Design Considerations
The probability of failure of each component
The cost of failure
The cost of implementing fault tolerance
Error Detection: timeout, parity, checksum
Error Correction: retry
Goals:
detect failure
correct failure
Error Detecting Code: general scheme
Single Bit Parity: detects any odd number of flipped bits but misses any even number, so it cannot reliably detect multi-bit burst errors
Checksum: slightly better than Single Bit Parity since it uses more check bits
Simple to implement
Relatively weak detection
Still tricked by typical error patterns - e.g. burst errors
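The weaknesses of parity and checksums listed above can be seen on a three-byte message. This is a sketch only: the 8-bit additive checksum is one simple illustrative choice, and the function names and byte values are made up for the example.

```python
def parity_bit(data: bytes) -> int:
    """Even parity: 1 if the number of set bits in data is odd, else 0."""
    return sum(bin(b).count("1") for b in data) % 2


def additive_checksum(data: bytes) -> int:
    """Illustrative 8-bit checksum: sum of all bytes modulo 256."""
    return sum(data) % 256


original = b"\x12\x34\x56"

# One flipped bit (0x12 -> 0x13): parity and checksum both change.
one_flip = b"\x13\x34\x56"
# Two flipped bits in one byte (0x12 -> 0x11): parity is unchanged.
two_flips = b"\x11\x34\x56"
# Burst across two adjacent bytes (+1 then -1): the byte sum is unchanged
# and an even number of bits flipped, so both schemes miss it.
burst = b"\x13\x33\x56"

for name, msg in [("one_flip", one_flip), ("two_flips", two_flips), ("burst", burst)]:
    parity_ok = parity_bit(msg) != parity_bit(original)
    sum_ok = additive_checksum(msg) != additive_checksum(original)
    print(f"{name}: parity detects={parity_ok}, checksum detects={sum_ok}")
```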
Cyclic Redundancy Check (CRC):
treat the bits of data D as the coefficients of a polynomial
Can detect all burst errors shorter than r+1 bits, where r is the number of CRC bits
wide adoption
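To make the polynomial view of CRC concrete, here is a minimal bitwise sketch with r = 8 check bits. The generator x^8 + x^2 + x + 1 (0x07) is an assumption chosen for illustration; the notes do not name a specific polynomial, and production code would normally use a table-driven or library implementation.

```python
def crc8(data: bytes, poly: int = 0x07) -> int:
    """Remainder of data * x^8 divided by the generator, over GF(2).

    poly holds the low 8 coefficients of the degree-8 generator
    x^8 + x^2 + x + 1; the leading x^8 term is implicit.
    """
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                # Top bit set: shift it out and subtract (XOR) the generator.
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


message = b"fault tolerance"
frame = message + bytes([crc8(message)])  # sender appends the 8 check bits

# A frame with a valid CRC leaves remainder 0 at the receiver.
assert crc8(frame) == 0

# Flip a 5-bit burst inside one byte; a burst of at most 8 bits is caught.
corrupted = bytearray(frame)
corrupted[3] ^= 0b00111110
assert crc8(bytes(corrupted)) != 0
```

The receiver only needs to recompute the remainder over the whole frame and compare it with zero.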
Two schemes of error recovery
redundancy (forward recovery)
retry (backward recovery)
Courses
15-750 Graduate Algorithms
15-853 Algorithms in the real world
15-848 Practical information and coding theory for computer systems
Triple modular redundancy:
Send the same request to 3 different instances of the system
Compare the answers, take the majority
Widely used in space applications, where budgets are large and the failure rate is high due to cosmic rays.
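The send-to-three-and-vote scheme described above can be sketched as follows. The replicas here are plain Python functions standing in for system instances, and `healthy` / `faulty` are hypothetical names used only for illustration.

```python
from collections import Counter
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def majority_vote(answers: Sequence[T]) -> T:
    """Return the answer given by a strict majority of replicas."""
    value, count = Counter(answers).most_common(1)[0]
    if count * 2 <= len(answers):
        raise RuntimeError("no majority: replicas disagree")
    return value


def tmr(replicas: Sequence[Callable[[int], int]], request: int) -> int:
    """Send the same request to every replica and take the majority answer."""
    return majority_vote([replica(request) for replica in replicas])


# Hypothetical replicas: two healthy, one returning a corrupted result.
def healthy(x: int) -> int:
    return x * x


def faulty(x: int) -> int:
    return (x * x) ^ 0b100  # simulates a flipped bit in the result


print(tmr([healthy, faulty, healthy], 7))  # 49
```

With three replicas the vote masks one faulty answer; if all three disagree, the sketch raises instead of guessing.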
We separate "detection" from "correction": first detect the error, then retry the transmission.
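A minimal sketch of detect-then-retry (backward recovery): the sender appends a CRC-32 from Python's standard `zlib` module, and the receiver asks for a retransmission whenever the check fails. `FlakyChannel` and the other names are hypothetical, introduced only to simulate a link that corrupts the first few frames.

```python
import zlib
from typing import Optional


def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 so the receiver can detect corruption."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_frame(frame: bytes) -> Optional[bytes]:
    """Return the payload if the CRC matches, otherwise None."""
    payload, received = frame[:-4], int.from_bytes(frame[-4:], "big")
    return payload if zlib.crc32(payload) == received else None


class FlakyChannel:
    """Hypothetical channel that flips one bit in the first bad_sends frames."""

    def __init__(self, bad_sends: int) -> None:
        self.bad_sends = bad_sends
        self.sends = 0

    def transmit(self, frame: bytes) -> bytes:
        self.sends += 1
        if self.sends <= self.bad_sends:
            damaged = bytearray(frame)
            damaged[0] ^= 0x01
            return bytes(damaged)
        return frame


def send_with_retry(channel: FlakyChannel, payload: bytes, max_tries: int = 5) -> bytes:
    """Retransmit until a frame passes the check or max_tries is reached."""
    frame = make_frame(payload)
    for _ in range(max_tries):
        received = check_frame(channel.transmit(frame))
        if received is not None:
            return received
    raise TimeoutError("gave up after repeated corrupted frames")


channel = FlakyChannel(bad_sends=2)
print(send_with_retry(channel, b"hello"), "after", channel.sends, "sends")
```

Bounding the number of retries keeps a persistent (hard) failure from turning into an endless loop.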
Hard drive: a sequence of small data sectors (4 KB each), stored on spinning disks
RAID: Use multiple disks to form single logical disk
Definitions
Reliability: # of disk failures we can tolerate
Latency: time to process Read/Write requests
Throughput: bandwidth for R/W requests
We assume random reads and writes. We also assume that every disk has the same throughput and latency for reads and writes.
RAID Levels:
RAID 0: Data striping without redundancy
RAID 1: Mirroring of independent disks
RAID 2: ...
RAID 3: ...
RAID 4: Data striping plus parity disk
RAID 5: Data striping plus striped (rotating) parity
RAID 6: ...
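The parity used by RAID 4 and RAID 5 is commonly a bytewise XOR across the blocks of a stripe, which lets any single lost block be rebuilt from the survivors. The sketch below assumes XOR parity and tiny 4-byte blocks; `raid5_parity_disk` shows one possible rotation and is an illustrative assumption, not the layout of any particular controller.

```python
from functools import reduce
from typing import Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def raid5_parity_disk(stripe: int, n_disks: int) -> int:
    """Assumed rotation: parity moves one disk to the left on each stripe."""
    return (n_disks - 1 - stripe) % n_disks


# One stripe across three data disks (hypothetical 4-byte blocks).
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)  # stored on the parity disk

# Disk 1 fails: XOR of the surviving blocks and the parity rebuilds its block.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]

# With 4 disks, parity lands on disks 3, 2, 1, 0, 3, ... for stripes 0, 1, 2, ...
print([raid5_parity_disk(s, 4) for s in range(5)])
```

Rotating the parity block spreads parity writes over all disks instead of concentrating them on a single parity disk as in RAID 4.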
Mean Time To First Data Loss (MTTDL): calculated from MTTF
Sequential n devices: MTTDL_n = \frac{MTTF_1}{n}
Parallel n devices: MTTDL_n = \sum_{i = 1}^{n} \frac{MTTF_1}{i}
k parity, n data: MTTDL_n = \sum_{i = n}^{n+k} \frac{MTTF_1}{i}
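The three MTTDL cases can be checked numerically. This sketch assumes independent, exponentially distributed failures with no repair, and reads the k-parity sum as running from i = n to n + k (data is lost at the (k+1)-th failure among n + k devices). The 100,000-hour MTTF is a made-up figure.

```python
def mttdl_sequential(mttf: float, n: int) -> float:
    """Striping with no redundancy: the first of n failures loses data."""
    return mttf / n


def mttdl_parallel(mttf: float, n: int) -> float:
    """n mirrored copies: data is lost only when all n have failed."""
    return sum(mttf / i for i in range(1, n + 1))


def mttdl_parity(mttf: float, n_data: int, k_parity: int) -> float:
    """n data + k parity devices: data is lost at the (k+1)-th failure."""
    return sum(mttf / i for i in range(n_data, n_data + k_parity + 1))


MTTF_1 = 100_000.0  # hypothetical single-device MTTF in hours

print(mttdl_sequential(MTTF_1, 4))  # 4 striped disks
print(mttdl_parallel(MTTF_1, 2))    # 2-way mirror
print(mttdl_parity(MTTF_1, 4, 1))   # 4 data disks + 1 parity disk
```

Setting k = 0 reduces the parity case to the sequential one, and n = 1 reduces it to the parallel one, which is a quick consistency check on the summation bounds.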
Using a continuous-time Markov chain, MTTDL comes out to roughly MTTF_{disk} / n for n disks (the more disks, the sooner a failure is likely). A derivation of MTTDL using a Markov chain can be found here.
restoring redundancy after failure: reconstruct when the first drive fails
Modes
Tradeoff
Methods