Lecture 015 - Spark

Spark

Inspiration for Spark

Limitation of MapReduce:

Important Response Time

Memory, Network, Disk

Humanized Latency

Response Time:

A round trip (RTT) within the same datacenter takes about as long as an SSD read. Reading from disk is roughly 100x slower than reading from memory.

Computation Models Other Than MapReduce

So to speed things up, we need to keep our computation in memory. But how do we do fault tolerance? Memory contents are lost after a failure.

P2P in Spark

Instead of traditional client-server file sharing, Spark distributes data among workers peer-to-peer (BitTorrent-style).

Spark's Fault Tolerance

Traditional fault-tolerance approaches are too expensive: a 10~100x slowdown.

Lineage

Lineage: the sequence of operations that was applied to a partition of data before the failure. Re-running an individual operation on a specific partition is cheap.

Storing lineage is much faster than checkpointing, since there is far less data to write. But if the lineage grows very large, we can manually checkpoint to HDFS.
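
As a rough sketch of that escape hatch (assuming a SparkContext named sc and a made-up HDFS path, neither of which comes from the lecture), an iterative job might checkpoint every few iterations to keep the lineage short:

// sketch: manual checkpointing to cut a long lineage
sc.setCheckpointDir("hdfs://namenode/checkpoints") // assumed checkpoint location on HDFS
var rdd = sc.parallelize(1 to 1000000)
for (i <- 1 to 100) {
  rdd = rdd.map(_ + 1)     // lineage grows by one step every iteration
  if (i % 10 == 0) {
    rdd.checkpoint()       // write this RDD to HDFS and truncate its lineage
    rdd.count()            // the checkpoint is materialized on the next action
  }
}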

Resilient Distributed Datasets (RDDs): the intermediate results of each transformation; each RDD must be a deterministic function of its input, so lost partitions can be recomputed from lineage.

// typical MapReduce-style job on Spark
val lines = spark.textFile("hdfs://...")
val lineLengths = lines.map(s => s.length)
lineLengths.persist()   // keep the intermediate RDD in memory for later reuse
val totalLength = lineLengths.reduce((a, b) => a + b)
// log mining: cache the error messages once, then run several interactive queries on them
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2))
val cachedMsgs = messages.persist()
cachedMsgs.filter(_.contains("foo")).count()
// ...
cachedMsgs.filter(_.contains("bar")).count()
// iterative regression algorithm: only the small vector w changes each iteration,
// while the training data can stay cached in memory across iterations
var w = Vector.random(D - 1)
for (i <- 1 to ITERATIONS) {
  // "math equation" is a placeholder for the per-point gradient with respect to w
  val gradient = data.map("math equation").reduce(_ + _)
  w -= gradient
}

My own notes about immutability: immutability applies to operations, never to data. Imagine a register storing 0x01. A mutable operation is 0x01.add(0x01), which results in 0x02. An immutable operation is 0x01.set(0x02) or 0x01.set(read(0x01) + 0x01). The first requires us to somehow already know the previous value, and the second can't handle concurrent writes.
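
A quick sketch of that read-modify-write race (my own illustration with plain JVM threads, nothing Spark-specific): two threads doing set(read(x) + 1) concurrently lose updates, while an atomic add does not.

// sketch: why set(read(x) + 1) breaks under concurrent writes
import java.util.concurrent.atomic.AtomicInteger

var register: Int = 0                 // plain "set" semantics, no atomicity
val threads = (1 to 2).map { _ =>
  new Thread(() => {
    for (_ <- 1 to 100000) {
      val old = register              // read
      register = old + 1              // set(read + 1): another thread can interleave here
    }
  })
}
threads.foreach(_.start())
threads.foreach(_.join())
println(register)                     // usually < 200000: concurrent writes were lost

val counter = new AtomicInteger(0)    // an atomic read-modify-write ("add") avoids the race
counter.addAndGet(1)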

Apache Spark Deployment

Spark dependency graph computation

Apache Spark Deployment:

The design needs a lot of memory and has high overhead (persistent data structures).

Bulk Synchronous Parallel (BSP) model: any distributed system can be emulated as local computation plus message passing, with barrier synchronization between supersteps.

Spark can implement Pregel (graph computation), MapReduce (general computation), and Hive (database queries) in about 200 lines of code.
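
For example, the classic MapReduce word count collapses into a few RDD operations (a sketch that reuses the spark handle and placeholder HDFS paths from the snippets above):

// word count: the canonical MapReduce job expressed as RDD transformations
val counts = spark.textFile("hdfs://...")   // placeholder input path
  .flatMap(line => line.split(" "))         // map phase: emit one record per word
  .map(word => (word, 1))
  .reduceByKey(_ + _)                       // reduce phase: sum the counts for each word
counts.saveAsTextFile("hdfs://...")         // placeholder output path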

Spark is bad for workloads that need fine-grained, asynchronous updates to shared state, such as distributed deep learning (covered next).

Distributed Deep Learning

ML nature:

Overhead:

Optimize for

  1. minimize training time
  2. maximize throughput
  3. maximize concurrency
  4. minimize data transfer
  5. maximize batch size
  6. maximize network depth
  7. minimize latency

Custom hardware with many GPUs on one motherboard.

An 8k mini-batch works well for ImageNet. When the batch is too large, training no longer gets the regularizing benefit of SGD's noise (overfitting).

Make sure

HOGWILD Algorithm

HOGWILD Architecture: blue servers compute gradients and send them to green servers, which hold the network parameters and apply the updates.

Asynchronous Training: don't lock for backprop
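
A minimal single-machine sketch of that idea (my own illustration, not the lecture's code): worker threads read the shared weights and write gradient updates back without taking any lock, accepting stale or partially overwritten values.

// sketch: lock-free (HOGWILD-style) updates to shared parameters
val weights = new Array[Double](1000)             // shared parameters, no lock around them
def gradient(w: Array[Double]): Array[Double] =   // placeholder gradient computation
  w.map(x => 0.01 * x)

val workers = (1 to 4).map { _ =>
  new Thread(() => {
    for (_ <- 1 to 1000) {
      val g = gradient(weights)                   // read possibly-stale weights
      for (i <- weights.indices)
        weights(i) -= 0.1 * g(i)                  // unsynchronized write-back
    }
  })
}
workers.foreach(_.start())
workers.foreach(_.join())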

Asynchronous Training: tradeoff

Synchronous:

Bounded stale state by N steps

Bounded stale state by N steps: instead of synchronizing at every step, and instead of never synchronizing at all, we bound the number of steps allowed between synchronizations.
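
A sketch of that rule (stale-synchronous style, my own illustration): a worker may run ahead of the slowest worker by at most N steps before it has to wait.

// sketch: bound staleness to N steps (a real implementation would use atomic counters)
val N = 4                                  // staleness bound
val numWorkers = 8
val steps = Array.fill(numWorkers)(0)      // steps(i) = current step count of worker i

def beforeStep(worker: Int): Unit = {
  while (steps(worker) - steps.min > N)    // more than N steps ahead of the slowest worker?
    Thread.sleep(1)                        // wait for the stragglers to catch up
  steps(worker) += 1                       // otherwise proceed with this step
}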

Asynchronous vs. bulk synchronous parallel (BSP)

TensorFlow worked poorly for synchronous parallelism (as of 2017); PyTorch is good for single-machine synchronous parallelism.

Challenges remaining as of 2022
