Limitation of MapReduce:
Good at: simplified data analysis on large, unreliable clusters; easy for users to write programs.
Bad at: iterative computation over the same data.
Response time:
L1 cache: 1 ns
Main memory: 100 ns
SSD: 350 µs
Same-datacenter RTT: 500 µs
A round trip within the same datacenter costs about as much as an SSD read, and reading from disk instead of memory is roughly a 100x difference.
So to speed things up, we need to keep our computation in memory. But how do we do fault tolerance when memory contents are lost on failure?
Instead of traditional file sharing, Spark moves data between nodes peer-to-peer.
Traditional fault-tolerance approaches are too expensive (10~100x slowdown):
Logging to persistent storage
Replicating data across nodes (ideally also to persistent storage)
Checkpointing (checkpoints need to be stored persistently)
Lineage: the sequence of operations that was applied to a partition of data before the failure. Re-running an individual operation on a specific partition is cheap.
Storing lineage is much faster than checkpointing, since there is far less data. But if the lineage grows very long, the RDD can be manually checkpointed to HDFS.
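A minimal sketch of that escape hatch, assuming a SparkContext named sc (the HDFS path is elided just as above):
// sketch: truncate a long lineage by checkpointing to reliable storage
sc.setCheckpointDir("hdfs://...")          // directory where checkpoint files go
var rdd = sc.parallelize(1 to 1000)
for (_ <- 1 to 100) rdd = rdd.map(_ + 1)   // lineage grows by one step per iteration
rdd.checkpoint()                           // mark the RDD for checkpointing
rdd.count()                                // the next action writes the checkpoint and cuts the lineage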
Resilient Distributed Datasets (RDDs): the intermediate results of each step; they must be deterministic functions of their input.
RDD operations: coarse-grained operations (map, groupBy, filter, sample, flatMap) that apply the same computation to many data items at once.
Transformations (map, filter, sample, groupByKey, sortByKey, union, join, cross) are lazily evaluated to save computation.
Actions (count, sum, reduce, save, collect) trigger the actual execution.
Immutability: RDDs are immutable upon creation.
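A small sketch of what laziness means in practice, assuming a SparkContext named sc; nothing below touches the data until the action at the end:
val nums = sc.parallelize(1 to 1000000)   // RDD defined, nothing computed yet
val squares = nums.map(n => n.toLong * n) // transformation: only recorded in the lineage
val evens = squares.filter(_ % 2 == 0)    // still nothing computed
val howMany = evens.count()               // action: the whole pipeline runs now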
// typical MapReduce on Spark
val lines = spark.textFile("hdfs://...")
val lineLengths = lines.map(s => s.length)             // transformation: not executed yet
val totalLength = lineLengths.reduce((a, b) => a + b)  // action: runs the pipeline
lineLengths.persist()                                  // keep lineLengths in memory for reuse by later actions
// log mining: cache the error messages in memory, then run multiple queries
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2))
val cachedMsgs = messages.persist()            // first action materializes and caches this
cachedMsgs.filter(_.contains("foo")).count()
// ...
cachedMsgs.filter(_.contains("bar")).count()   // served from the in-memory cache
// iterative regression: data is loaded once and reused across iterations
var w = Vector.random(D - 1)
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(/* math equation */).reduce(_ + _)
  w -= gradient
}
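The map body was elided above ("math equation"); purely as an illustration, a common instantiation is logistic regression, where each data point contributes a gradient term like the following (Point, dot, and pointGradient are names invented here, not from the lecture):
// hypothetical per-point gradient of the logistic loss, for labels y in {-1, +1}
case class Point(x: Array[Double], y: Double)

def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (ai, bi) => ai * bi }.sum

def pointGradient(w: Array[Double], p: Point): Array[Double] = {
  val scale = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y
  p.x.map(_ * scale)                       // gradient contribution of this point
}
// data.map(p => pointGradient(w, p)).reduce(elementWiseAdd) would then play the
// role of the map/reduce in the loop above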
My own notes about immutability: immutability is a property of operations, not of the data itself. Imagine a register storing 0x01. A mutable operation is 0x01.add(0x01), which results in 0x02. An immutable-style operation is 0x01.set(0x02) or 0x01.set(read(0x01) + 0x01). The first requires us to somehow know the previous value, and the second can't handle concurrent writes.
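At the RDD level the same idea looks like this (a sketch, assuming a SparkContext named sc): transformations never modify an RDD in place, they return a new one.
val xs = sc.parallelize(Seq(1, 2, 3))
val ys = xs.map(_ + 1)                  // returns a NEW RDD; xs itself is untouched
println(xs.collect().mkString(", "))    // 1, 2, 3
println(ys.collect().mkString(", "))    // 2, 3, 4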
Apache Spark deployment:
Master server: lineage, scheduling
Cluster manager: resource allocation (Mesos, YARN, K8s)
Workers: executors isolate concurrent tasks, caches persist RDDs
This design needs a lot of memory and has high overhead (persistent data structures).
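A hedged sketch of how these roles show up on the application side; the values below are illustrative, not from the lecture:
import org.apache.spark.{SparkConf, SparkContext}

// the driver asks a cluster manager (here YARN) for executors with given resources
val conf = new SparkConf()
  .setAppName("example")
  .setMaster("yarn")                       // or "mesos://...", "k8s://...", "local[*]"
  .set("spark.executor.memory", "4g")      // memory per executor
  .set("spark.executor.cores", "2")        // concurrent task slots per executor
val sc = new SparkContext(conf)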
Bulk Synchronous Parallel (BSP) model: any distributed system can be emulated as rounds of local computation plus message passing.
With about 200 lines of code, Spark can implement Pregel (graph computation), MapReduce (general computation), and Hive (database).
Spark is bad for:
fine-grained updates and shared state, non-batch workloads
datasets that don't fit in memory
workloads that need SIMD or GPUs
ML nature:
lots of data
lots of parameters
lots of iterations
Overhead:
network communication
synchronization barrier
Optimize for:
custom hardware with many GPUs on one motherboard
Make sure:
the loss function is normalized with respect to the total (global) batch size, not the per-GPU batch size (a sketch follows this list)
data is shuffled every epoch (otherwise effects get correlated when distributed) and reads are optimized to be sequential
batch norm is done per GPU (that is already enough), not across the whole dataset / minibatch
weight decay and momentum are handled correctly (apply them separately from the loss function)
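A sketch of the first point, with invented names: when K workers each sum gradients over a local mini-batch of size B, the combined gradient must be divided by the global batch size K*B, not by B.
// illustrative only: combine per-worker gradient SUMS, then divide by the global batch size
def combineGradients(perWorkerSums: Seq[Array[Double]], localBatch: Int): Array[Double] = {
  val dim = perWorkerSums.head.length
  val total = Array.fill(dim)(0.0)
  for (g <- perWorkerSums; i <- 0 until dim) total(i) += g(i)
  val globalBatch = perWorkerSums.length * localBatch   // K * B, not B
  total.map(_ / globalBatch)
}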
Asynchronous Training: don't lock for backprop
Synchronous:
more stable results
(maybe) faster convergence
Bounded stale state by N steps: instead of synchronizing at every step, and instead of never synchronizing at all, bound the number of steps any worker may run ahead before synchronizing (a sketch follows this list).
Asynchronous (as opposed to bulk synchronous parallel, BSP):
easier to implement correctly
easier to scale
faster per sample
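A minimal single-process sketch of the bounded-staleness idea above (all names invented here): a worker may start step t only if the slowest worker has finished at least t - N steps.
import java.util.concurrent.atomic.AtomicLongArray

// each worker calls waitToStart(t) before step t and finished(workerId) afterwards
class StalenessGate(numWorkers: Int, maxStaleness: Int) {
  private val stepsDone = new AtomicLongArray(numWorkers)   // steps completed per worker

  def finished(workerId: Int): Unit = stepsDone.incrementAndGet(workerId)

  def waitToStart(step: Long): Unit = {
    def slowest: Long = (0 until numWorkers).map(i => stepsDone.get(i)).min
    while (step > slowest + maxStaleness) Thread.sleep(1)   // block until within the bound
  }
}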
TensorFlow worked badly for synchronous parallelism (as of 2017); PyTorch is good at single-machine synchronous parallelism.