Review 002


Mean Time to Failure (MTTF), Mean Time to Repair (MTTR), Mean Time Between Failure (MTBF) = MTF + MTTR, Availability = MTTF / (MTTF + MTTR)

Error Detection: timeout, parity, checksum, Cyclic Redundancy Check (CRC: treat data as polynomial coefficients)

Error Correction: retry, Error Correcting Codes (ECC: two dimensional parity bits), replication

RAID Summery

RAID Summery

Mean Time To First Data Loss (MTTDL): calculate from MTTF

RAID Levels:

MPI, Hadoop

MapReduce: map, compute, sort, reduce

Message Passing Interface (MPI): Standardized communication protocol (lower level used in HPC), hardware-specific code, coupled parallel tasks, good for iterative, application-level failure tolerance, memory-based, graph computation

MapReduce: simple model and messaging system, good for loosely-coupled, coarse-grain parallel tasks, failure handling handled at framework, not application. mostly disk-based. bad at graph computation, bad at simulation where communication is frequent

Hadoop Project: HDFS Fault Tolerance + MapReduce Programming Environment (Coarse-Grained Parallelism)


Google File System (GFS): large files: \geq 100 MB. large streaming reads: \geq 1 MB, large sequential write, many concurrent appends (files used as producer-consumer queues), (want atomic appends without synchronization overhead), high throughput is more important than low latency (although erasure code / parity chunk is better)

Apache Hadoop Distributed File System (HDFS): open-sourced version of GFS (widely deployed in companies)

Master Server: one master, but duplicated master data

Client Server:

Consistency: GFS applications designed to accommodate the relaxed consistency model (Concurrent writes can be overwritten, ordered by a primary), Record appends is at least once, Possible duplicates


Resilient Distributed Datasets (RDDs): intermediate results during each update, they must be deterministic functions of input

Apache Spark Deployment:

Bulk Synchronous Parallel (BSP) model: any distributed system can be emulated as local network + message passing

Spark is bad for: fine-grain update and share state, non-batch workload. dataset don't fit to memory, don't have disk, if you need SIMD, GPU

Bounded stale state by N steps: instead of synchronization at every step, and instead of no synchronization at all, we can bound the number of steps per synchronization

Asynchronous: Bulk synchronous parallel (BSP), easier to implement correctly. easier to scale. faster per sample


Root: high TTL, Iterative (instead of Recursive DNS Query)

DNS-based Routing

Load-balancer: round robin load balancer, static partition, hash-based partition, consistent hashing (key always hash to next server in a circle, problem: 2 server very close, or 1 server leave. solution: virtual node)


Records: key (optional), value, timestamp


Authentication, Confidentiality, Integrity, Avaliability - Tools:


Example Quorums under Byzantine Faults: (with fail-stop, we only need 2f+1 nodes, for liveness, Quorum size is at most N - f, so for Quorum size of N-f, overlap region can be calculated with (N-f) + (N-f) - N. We want this to be \geq f + 1 as stated above, giving N \geq 3f + 1

Impossibility: No solution with fewer than 3f + 1 generals can cope with f traitors.

Algorithm: using a Replicated State Machine (RSM) with 3f+1 replicas

  1. client send opcode to Primary
  2. Primary broadcast Pre-prepare<viewNumber, sequenceNumber, opcode> (and put it in log)
  3. Replicas receive broadcast, determine if Pre-prepare<> is valid, if so, broadcast Prepare<replicaID, viewNumber, sequenceNumber, opcode> (and put it in log), wait until 2f+1 Prepare<> from other replicas doing the same thing
    1. crypto signature is valid
    2. having the same viewNumber in message compared to stored
    3. has not accepted other Pre-prepare<> with the same viewNumber
    4. has not accepted the same sequenceNumber
    5. Above ensure if <opcode1, viewNumber, sequenceNumber, replicaID1> in []log, then there is no <opcode2, viewNumber, sequenceNumber, replicaID2> in []log
    6. Above ensure: All honest nodes that are prepared have the same opcode
    7. Above ensure: At least f+1 honest nodes have sent Prepare<> and Pre-prepare<>
  4. Replicas received 2f+1 Prepare<>, send Commit<replicaID, viewNumber, sequenceNumber, opcode>, wait until 2f+1 Commit<> from other replicas doing the same thing
  5. Replicas received 2f+1 Commit<> (if we assume viewNumber can change, then we only need f+1), (put it in log), send result to client
  6. Client waits for f+1 matching replies before commit

Table of Content