Limitation of MapReduce:
Good at: simplified data analysis on large, unreliable clusters; easy for users to write programs.
Bad at: iterative computation over the same data.
Response time:
L1 cache: 1 ns
Main memory: 100 ns
SSD: 350 µs
Same-datacenter RTT: 500 µs
A round trip within the same datacenter costs about as much as an SSD read, and reading from disk instead of memory is roughly a 100x difference.
So to speed things up, we need to keep our computation in memory. But how do we do fault tolerance when memory contents are lost on failure?
Instead of traditional file sharing, Spark moves data between nodes peer-to-peer.
Traditional fault-tolerance approaches are too expensive (10~100x slowdown):
Logging to persistent storage
Replicating data across nodes (ideally also to persistent storage)
Checkpointing (checkpoints need to be stored persistently)
Lineage: the sequence of operations that was applied to a partition of data before the failure. Re-running an individual operation on a specific partition is cheap.
Storing lineage is much faster than checkpointing, since there is far less data. But if the lineage grows very long, the RDD can be manually checkpointed to HDFS.
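A minimal sketch of that escape hatch, assuming a SparkContext named sc (the HDFS path is elided just as above):
// sketch: truncate a long lineage by checkpointing to reliable storage
sc.setCheckpointDir("hdfs://...")          // directory where checkpoint files go
var rdd = sc.parallelize(1 to 1000)
for (_ <- 1 to 100) rdd = rdd.map(_ + 1)   // lineage grows by one step per iteration
rdd.checkpoint()                           // mark the RDD for checkpointing
rdd.count()                                // the next action writes the checkpoint and cuts the lineage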
Resilient Distributed Datasets (RDDs): the intermediate results of each step; they must be deterministic functions of their input.
RDD operations: coarse-grained operations (map, groupBy, filter, sample, flatMap) that apply the same computation to many data items at once.
Transformations (map, filter, sample, groupByKey, sortByKey, union, join, cross) are lazily evaluated to save computation.
Actions (count, sum, reduce, save, collect) trigger the actual execution.
Immutability: RDDs are immutable upon creation.
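A small sketch of what laziness means in practice, assuming a SparkContext named sc; nothing below touches the data until the action at the end:
val nums = sc.parallelize(1 to 1000000)   // RDD defined, nothing computed yet
val squares = nums.map(n => n.toLong * n) // transformation: only recorded in the lineage
val evens = squares.filter(_ % 2 == 0)    // still nothing computed
val howMany = evens.count()               // action: the whole pipeline runs now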
// typical MapReduce on Spark
val lines = spark.textFile("hdfs://...")
val lineLengths = lines.map(s => s.length)             // transformation: not executed yet
val totalLength = lineLengths.reduce((a, b) => a + b)  // action: runs the pipeline
lineLengths.persist()                                  // keep lineLengths in memory for reuse by later actions
// log mining: cache the error messages in memory, then run multiple queries
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2))
val cachedMsgs = messages.persist()            // first action materializes and caches this
cachedMsgs.filter(_.contains("foo")).count()
// ...
cachedMsgs.filter(_.contains("bar")).count()   // served from the in-memory cache
// iterative regression: data is loaded once and reused across iterations
var w = Vector.random(D - 1)
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(/* math equation */).reduce(_ + _)
  w -= gradient
}
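The map body was elided above ("math equation"); purely as an illustration, a common instantiation is logistic regression, where each data point contributes a gradient term like the following (Point, dot, and pointGradient are names invented here, not from the lecture):
// hypothetical per-point gradient of the logistic loss, for labels y in {-1, +1}
case class Point(x: Array[Double], y: Double)

def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (ai, bi) => ai * bi }.sum

def pointGradient(w: Array[Double], p: Point): Array[Double] = {
  val scale = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y
  p.x.map(_ * scale)                       // gradient contribution of this point
}
// data.map(p => pointGradient(w, p)).reduce(elementWiseAdd) would then play the
// role of the map/reduce in the loop above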
My own notes about immutability: immutability is a property of operations, not of the data itself. Imagine a register storing 0x01. A mutable operation is 0x01.add(0x01), which results in 0x02. An immutable-style operation is 0x01.set(0x02) or 0x01.set(read(0x01) + 0x01). The first requires us to somehow know the previous value, and the second can't handle concurrent writes.
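At the RDD level the same idea looks like this (a sketch, assuming a SparkContext named sc): transformations never modify an RDD in place, they return a new one.
val xs = sc.parallelize(Seq(1, 2, 3))
val ys = xs.map(_ + 1)                  // returns a NEW RDD; xs itself is untouched
println(xs.collect().mkString(", "))    // 1, 2, 3
println(ys.collect().mkString(", "))    // 2, 3, 4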
Apache Spark deployment:
Master server: lineage, scheduling
Cluster manager: resource allocation (Mesos, YARN, K8s)
Workers: executors isolate concurrent tasks, caches persist RDDs
This design needs a lot of memory and has high overhead (persistent data structures).
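A hedged sketch of how these roles show up on the application side; the values below are illustrative, not from the lecture:
import org.apache.spark.{SparkConf, SparkContext}

// the driver asks a cluster manager (here YARN) for executors with given resources
val conf = new SparkConf()
  .setAppName("example")
  .setMaster("yarn")                       // or "mesos://...", "k8s://...", "local[*]"
  .set("spark.executor.memory", "4g")      // memory per executor
  .set("spark.executor.cores", "2")        // concurrent task slots per executor
val sc = new SparkContext(conf)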
Bulk Synchronous Parallel (BSP) model: any distributed system can be emulated as rounds of local computation plus message passing.
With about 200 lines of code, Spark can implement Pregel (graph computation), MapReduce (general computation), and Hive (database).
Spark is bad for:
fine-grained updates and shared state, non-batch workloads
datasets that don't fit in memory
workloads that need SIMD or GPUs
ML nature:
lots of data
lots of parameters
lots of iterations
Overhead:
network communication
synchronization barrier
Optimize for:
custom hardware with many GPUs on one motherboard
Make sure:
the loss function is normalized with respect to the total (global) batch size, not the per-GPU batch size (a sketch follows this list)
data is shuffled every epoch (otherwise effects get correlated when distributed) and reads are optimized to be sequential
batch norm is done per GPU (that is already enough), not across the whole dataset / minibatch
weight decay and momentum are handled correctly (apply them separately from the loss function)
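A sketch of the first point, with invented names: when K workers each sum gradients over a local mini-batch of size B, the combined gradient must be divided by the global batch size K*B, not by B.
// illustrative only: combine per-worker gradient SUMS, then divide by the global batch size
def combineGradients(perWorkerSums: Seq[Array[Double]], localBatch: Int): Array[Double] = {
  val dim = perWorkerSums.head.length
  val total = Array.fill(dim)(0.0)
  for (g <- perWorkerSums; i <- 0 until dim) total(i) += g(i)
  val globalBatch = perWorkerSums.length * localBatch   // K * B, not B
  total.map(_ / globalBatch)
}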
Asynchronous Training: don't lock for backprop
Synchronous:
more stable results
(maybe) faster convergence
Bounded stale state by N steps: instead of synchronizing at every step, and instead of never synchronizing at all, bound the number of steps any worker may run ahead before synchronizing (a sketch follows this list).
Asynchronous (as opposed to bulk synchronous parallel, BSP):
easier to implement correctly
easier to scale
faster per sample
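A minimal single-process sketch of the bounded-staleness idea above (all names invented here): a worker may start step t only if the slowest worker has finished at least t - N steps.
import java.util.concurrent.atomic.AtomicLongArray

// each worker calls waitToStart(t) before step t and finished(workerId) afterwards
class StalenessGate(numWorkers: Int, maxStaleness: Int) {
  private val stepsDone = new AtomicLongArray(numWorkers)   // steps completed per worker

  def finished(workerId: Int): Unit = stepsDone.incrementAndGet(workerId)

  def waitToStart(step: Long): Unit = {
    def slowest: Long = (0 until numWorkers).map(i => stepsDone.get(i)).min
    while (step > slowest + maxStaleness) Thread.sleep(1)   // block until within the bound
  }
}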
TensorFlow worked badly for synchronous parallelism (as of 2017); PyTorch is good at single-machine synchronous parallelism.