Lecture 012

Measuring Failure

Disk Failure is High (Schroeder '07)

Hard errors: a damaged component that fails permanently, i.e. fail-stop (bad solder, defective DRAM bank)

Soft errors: a flipped signal or bit, caused by an external source or a faulty component (cosmic radiation, alpha particles)

Failure Measurements

Mean Time to Failure (MTTF)

Mean Time to Repair (MTTR)

Mean Time Between Failures (MTBF) = MTTF + MTTR

Availability in Perspective

Availability = MTTF / (MTTF + MTTR)
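
To put the formula in perspective, here is a minimal sketch that turns MTTF and MTTR into an availability figure and an expected yearly downtime. The function names and the numbers (999 hours to failure, 1 hour to repair) are assumptions chosen for illustration, not measurements from the lecture.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_per_year_hours(avail: float) -> float:
    """Expected downtime over a 365-day year for a given availability."""
    return (1.0 - avail) * 365 * 24


# Hypothetical system: fails every 999 hours on average, 1 hour to repair.
a = availability(mttf_hours=999.0, mttr_hours=1.0)
print(f"availability  = {a:.3f}")                              # 0.999
print(f"downtime/year = {downtime_per_year_hours(a):.2f} h")   # 8.76 h
```

Under these assumed numbers, "three nines" of availability still allows roughly 8.76 hours of downtime per year.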

Availability Service Level Objective (SLO): an availability threshold that your system targets

Availability Service Level Agreement (SLA): an availability threshold that you guarantee to customers

Failure distribution in time

Failure correlation: statistically, failures are not correlated when the components are physically separated

Fault-Tolerant Design Considerations

Error Detection

Goals:

Error Detecting Code: general scheme

  1. imagine a network transmission situation
  2. the sender sends data D together with a hash f(D)
  3. on receipt, the receiver recomputes f over the received data and checks that it matches the received hash
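
The general scheme above can be sketched as follows. SHA-256 stands in for f here purely as an assumption (the lecture does not name a specific function), and `send` / `receive` are hypothetical helper names.

```python
import hashlib


def send(data: bytes) -> tuple[bytes, bytes]:
    """Sender side: transmit the data D together with its hash f(D)."""
    return data, hashlib.sha256(data).digest()


def receive(data: bytes, digest: bytes) -> bool:
    """Receiver side: recompute f over the received data and compare."""
    return hashlib.sha256(data).digest() == digest


payload, tag = send(b"hello world")
ok_clean = receive(payload, tag)

# Simulate a single flipped bit in transit.
corrupted = bytes([payload[0] ^ 0x01]) + payload[1:]
ok_corrupted = receive(corrupted, tag)

print(ok_clean, ok_corrupted)   # True False
```

Note that the receiver never compares D against f(D) directly; it compares f of what it received against the hash that was sent.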

Single Bit Parity: cannot reliably detect multiple bit burst errors

  1. given 7 data bits
  2. we append 1 bit at the end, calculated as the sum of the 7 data bits (mod 2)
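
A small sketch of the two steps above, also showing why an even number of flipped bits slips through. The helper names `add_parity` and `parity_ok` are hypothetical.

```python
def add_parity(bits: list[int]) -> list[int]:
    """Append one parity bit: the sum of the 7 data bits mod 2."""
    assert len(bits) == 7
    return bits + [sum(bits) % 2]


def parity_ok(word: list[int]) -> bool:
    """A valid 8-bit word always contains an even number of 1s."""
    return sum(word) % 2 == 0


word = add_parity([1, 0, 1, 1, 0, 0, 1])   # four 1s, so the parity bit is 0

one_flip = word.copy()
one_flip[2] ^= 1                            # single-bit error: detected

two_flips = word.copy()
two_flips[2] ^= 1
two_flips[3] ^= 1                           # two-bit burst: goes unnoticed

print(parity_ok(word), parity_ok(one_flip), parity_ok(two_flips))
# True False True
```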

Checksum: slightly better than Single Bit Parity since it uses more check bits

Cyclic Redundancy Check (CRC)

Error Correction

Two schemes of error recovery

Error Correcting Codes (ECC)

Correcting one bit error with Two Dimensional Bit Parity
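
As a rough illustration of two-dimensional parity: keep one parity bit per row and one per column, and a single flipped data bit shows up as exactly one bad row and one bad column, whose intersection is the bit to flip back. This sketch assumes the parity bits themselves arrive intact, and `encode` / `correct` are hypothetical names.

```python
def encode(rows: list[list[int]]) -> tuple[list[int], list[int]]:
    """Return (row parities, column parities) for a grid of data bits."""
    row_par = [sum(r) % 2 for r in rows]
    col_par = [sum(c) % 2 for c in zip(*rows)]
    return row_par, col_par


def correct(rows, row_par, col_par):
    """Fix at most one flipped data bit in place; return its (row, col) or None."""
    new_rows, new_cols = encode(rows)
    bad_rows = [i for i, (a, b) in enumerate(zip(new_rows, row_par)) if a != b]
    bad_cols = [j for j, (a, b) in enumerate(zip(new_cols, col_par)) if a != b]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        rows[i][j] ^= 1          # flip the offending bit back
        return i, j
    return None


data = [[1, 0, 1],
        [0, 1, 1],
        [1, 1, 0]]
row_par, col_par = encode(data)

received = [row.copy() for row in data]
received[1][2] ^= 1              # one bit flipped in transit
fixed_at = correct(received, row_par, col_par)

print(fixed_at, received == data)   # (1, 2) True
```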

Replication and Voting

Triple modular redundancy:

Widely used in space applications, which have both the budget for it and a high failure rate due to cosmic rays.
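
A minimal sketch of the voting idea behind triple modular redundancy: run the same computation on three replicas and take the majority answer, so one faulty replica is outvoted. The functions `vote`, `compute` and `faulty_compute` are made up for illustration; real systems typically vote in hardware.

```python
from collections import Counter


def vote(a, b, c):
    """Return the majority value among three replica outputs."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: all three replicas disagree")
    return value


def compute(x: int) -> int:
    return x * x


def faulty_compute(x: int) -> int:
    return compute(x) ^ 0x4      # hypothetical replica hit by a bit flip


result = vote(compute(7), faulty_compute(7), compute(7))
print(result)   # 49
```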

Retry

We separate "detection" from "correction": first detect the error, then retry the transmission.
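
A sketch of detect-then-retry, assuming CRC-32 (via Python's `zlib.crc32`) as the detection code and assuming the checksum itself reaches the receiver intact. The flaky channel and the function names are hypothetical.

```python
import zlib


def transmit_with_retry(data: bytes, channel, max_attempts: int = 5) -> bytes:
    """Resend until the received data matches its CRC, or give up."""
    checksum = zlib.crc32(data)
    for _ in range(max_attempts):
        received = channel(data)
        if zlib.crc32(received) == checksum:    # detection step
            return received
        # correction step: nothing to repair, simply send again
    raise IOError(f"gave up after {max_attempts} attempts")


def make_flaky_channel(failures: int):
    """Hypothetical channel that corrupts the first `failures` transmissions."""
    state = {"left": failures}

    def channel(data: bytes) -> bytes:
        if state["left"] > 0:
            state["left"] -= 1
            return bytes([data[0] ^ 0xFF]) + data[1:]
        return data

    return channel


result = transmit_with_retry(b"payload", make_flaky_channel(failures=2))
print(result)   # b'payload'
```

The design choice here is that the code only needs to detect errors, not locate them; correction is delegated to sending the data again.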

Redundant Array of Inexpensive Disks (RAID)

Hard drive: a sequence of small data sectors (4KB each), stored on spinning disks

RAID: use multiple disks to form a single logical disk

Definitions

We assume random reads and writes, and the same read/write throughput and latency for all disks.

Single Disk Analysis

RAID Levels:

RAID 0

RAID 1

RAID Parity Disk
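
As an illustration of the parity-disk idea, here is a sketch assuming the parity block is the bytewise XOR of the data blocks, which is the usual construction for parity-based RAID levels. The block contents and the helper name `xor_blocks` are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Three hypothetical data disks, one block each.
data_disks = [b"\x01\x02\x03", b"\x10\x20\x30", b"\xaa\xbb\xcc"]
parity_disk = xor_blocks(data_disks)

# Disk 1 fails: rebuild its block from the surviving data disks plus parity.
survivors = [data_disks[0], data_disks[2], parity_disk]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_disks[1])   # True
```

Because XOR is its own inverse, any single missing block equals the XOR of all the others, so one extra disk protects against one disk failure.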

RAID 4

RAID 5

RAID Summary

Mean Time To First Data Loss (MTTDL): calculated from MTTF

Using a continuous-time Markov chain, we find that MTTDL is approximately MTTF_{disk} / n for n disks (the more disks, the more likely one of them fails).
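
The MTTF_{disk} / n figure can be sanity-checked with a small Monte Carlo sketch. It rests on assumptions the notes do not spell out: disk lifetimes are independent and exponentially distributed, and there is no redundancy, so the first disk failure already loses data. The function name and the 1000-hour MTTF are hypothetical.

```python
import random


def simulate_mttdl(n_disks: int, mttf_disk: float, trials: int = 100_000,
                   seed: int = 0) -> float:
    """Average time until the first of n independent disks fails."""
    rng = random.Random(seed)
    rate = 1.0 / mttf_disk                      # failure rate of one disk
    total = 0.0
    for _ in range(trials):
        # Data is lost as soon as the earliest disk failure occurs.
        total += min(rng.expovariate(rate) for _ in range(n_disks))
    return total / trials


mttf_disk = 1000.0   # hypothetical per-disk MTTF, in hours
for n in (1, 2, 4, 8):
    print(n, round(simulate_mttdl(n, mttf_disk)), "vs", mttf_disk / n)
```

Under these assumptions the simulated averages should land close to MTTF_{disk} / n, since the minimum of n independent exponential lifetimes with rate 1/MTTF is itself exponential with rate n/MTTF.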

// TODO: calculation

// TODO: rest of slides
