In a typical cluster, machines are heterogeneous and code is written at a high level.
Hadoop MapReduce API
Requires: Mapper & Reducer classes (Java)
Mapper: code generates a sequence of (k, v) pairs, given instructions on how to partition the data
Sort: MapReduce’s built-in aggregation by key
Reducer: given a key plus an iterator that generates that key's sequence of values
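A minimal in-memory sketch of the Mapper → Sort → Reducer pipeline above (a toy word count in Python, not the actual Hadoop Java API; the function names are made up for illustration):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Mapper: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    # Reducer: given a key and an iterator over its values, emit one aggregate.
    return (key, sum(values))

def map_reduce(lines):
    # Map phase: generate all (k, v) pairs.
    pairs = [kv for line in lines for kv in mapper(line)]
    # Sort phase: group pairs by key (the framework's built-in aggregation).
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one reducer call per distinct key.
    return [reducer(k, (v for _, v in group))
            for k, group in groupby(pairs, key=itemgetter(0))]

print(map_reduce(["the cat", "the dog"]))
# [('cat', 1), ('dog', 1), ('the', 2)]
```

In real Hadoop, each phase runs on many machines and the sort/shuffle moves data between them; here everything is one process.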
MapReduce: Simplified Data Processing on Large Clusters
Running Big Jobs:
Choose a distributed file system: Google's GFS or Hadoop's HDFS (built-in reliability via replication)
Break the work into tasks
Load the data into the file system
Run the program
Retrieve the results from the file system
MapReduce Provides Coarse-Grained Parallelism
Computation done by independent processes
File-based communication
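A sketch of this style of parallelism on a single machine standing in for a cluster: independent worker processes that communicate only through files, never shared memory (the file names and the toy workload are invented for illustration):

```python
import multiprocessing as mp
import os
import tempfile

def worker(task_id, numbers, out_dir):
    # Each task computes independently and writes its result to its own file.
    with open(os.path.join(out_dir, f"part-{task_id}"), "w") as f:
        f.write(str(sum(numbers)))

def run(chunks):
    out_dir = tempfile.mkdtemp()
    procs = [mp.Process(target=worker, args=(i, chunk, out_dir))
             for i, chunk in enumerate(chunks)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()  # tasks are independent; we only wait for completion
    # Final step: read every part file back from the file system.
    total = 0
    for i in range(len(chunks)):
        with open(os.path.join(out_dir, f"part-{i}")) as f:
            total += int(f.read())
    return total

if __name__ == "__main__":
    print(run([[1, 2], [3, 4], [5]]))  # 15
```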
Hadoop's Fault Tolerance
Dynamically scheduled: if a node fails, the manager node detects it (via heartbeat) and migrates its tasks to other nodes.
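A toy heartbeat-based failure detector in this spirit (hedged: real Hadoop's manager/worker protocol is more involved; the timeout value and node/task names here are made up):

```python
TIMEOUT = 10.0  # seconds without a heartbeat before a node is declared dead

def dead_nodes(last_heartbeat, now):
    """last_heartbeat: dict mapping node id -> timestamp of last heartbeat."""
    return [node for node, t in last_heartbeat.items() if now - t > TIMEOUT]

def reschedule(tasks, dead, live):
    # Migrate every task that was running on a dead node to a live node.
    return {task: (live[0] if node in dead else node)
            for task, node in tasks.items()}

now = 100.0
beats = {"n1": 99.0, "n2": 85.0}   # n2 has missed its heartbeats
dead = set(dead_nodes(beats, now))  # {'n2'}
print(reschedule({"t1": "n1", "t2": "n2"}, dead, ["n1"]))
# {'t1': 'n1', 't2': 'n1'}
```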
The Hadoop project matters because big jobs used to be infeasible: the probability that at least one node fails grows as the number of nodes increases. Hadoop gets jobs done by:
- breaking work into many short-lived tasks
- using disk storage to hold intermediate results, so a failed task can be rescheduled
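The failure-probability argument can be made concrete: if each node independently fails during a job with probability p, the chance that at least one of n nodes fails is 1 - (1 - p)^n, which approaches 1 as n grows (the p = 0.01 figure below is made up for illustration):

```python
def p_any_failure(p, n):
    # Probability that at least one of n nodes fails, assuming
    # independent per-node failure probability p.
    return 1 - (1 - p) ** n

for n in (1, 100, 1000):
    print(n, round(p_any_failure(0.01, n), 3))
# 1 0.01
# 100 0.634
# 1000 1.0
```

So at 1000 commodity nodes, a failure somewhere during the job is essentially certain, which is why the framework must handle it.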
Advantage of clusters:
can read/write large datasets
can dynamically schedule tasks
can use consumer-grade components
can have heterogeneous nodes
Disadvantage of clusters:
higher overhead
lower raw performance per node
Stragglers: tasks that take a long time to execute due to bugs, flaky hardware, or poor partitioning. These are detected, and buggy ones raise an error.
When most tasks have finished, the remaining (non-buggy) tasks are rescheduled on other nodes to reduce overall run time.
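A sketch of that speculative rescheduling policy (a simplification: Hadoop's real speculative execution tracks progress rates, and the 90% threshold here is an assumption):

```python
def pick_backups(task_progress, threshold=0.9):
    """task_progress: dict task -> fraction complete in [0, 1].
    Return the stragglers to duplicate once the job is mostly done."""
    done = sum(1 for p in task_progress.values() if p >= 1.0)
    if done / len(task_progress) < threshold:
        return []  # too early: most tasks are still running normally
    # Back up each unfinished task on another node; whichever
    # copy finishes first wins.
    return [t for t, p in task_progress.items() if p < 1.0]

progress = {f"t{i}": 1.0 for i in range(9)}  # 9 of 10 tasks done
progress["t9"] = 0.2                         # one straggler
print(pick_backups(progress))  # ['t9']
```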
MPI vs. MapReduce
Low-level vs. high-level programming model
MPI:
complex programming model, hardware-specific code
allows tightly coupled parallel tasks; good for iterative computation
failure handling done at the application level
memory-based
good at graph computation and simulation, where communication is frequent
MapReduce:
simple programming model and messaging system
good for loosely coupled, coarse-grained parallel tasks
failure handling done by the framework, not the application
mostly disk-based
bad at graph computation and simulation, where communication is frequent
Example: calculate the popularity (rank) of a social-network account
MapReduce: a poor fit (iterative and communication-heavy; every iteration goes through disk)
Google Pregel (MPI on graphs)
receive message: neighbors’ ranks
send message: our own rank (to all neighbors)
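A toy Pregel-style superstep for this rank computation (illustrative, not Pregel's real API; the damping factor and the tiny graph are assumptions):

```python
DAMPING = 0.85  # assumed damping factor for the rank update

def superstep(graph, ranks):
    """graph: dict vertex -> list of out-neighbors."""
    # Send phase: each vertex sends rank/out-degree to all its neighbors.
    inbox = {v: [] for v in graph}
    for v, neighbors in graph.items():
        for u in neighbors:
            inbox[u].append(ranks[v] / len(neighbors))
    # Receive phase: each vertex updates its rank from received messages.
    n = len(graph)
    return {v: (1 - DAMPING) / n + DAMPING * sum(msgs)
            for v, msgs in inbox.items()}

graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
ranks = {v: 1 / 3 for v in graph}
for _ in range(20):
    ranks = superstep(graph, ranks)
print({v: round(r, 2) for v, r in ranks.items()})
```

All communication happens through explicit messages between supersteps, which is what makes this MPI-like.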
CMU Graphlab (shared state model)
emulate all nodes on the same machine
iterate over neighbors and access their ranks directly (since very few vertices have high degrees, iterating over neighbors is usually cheap)
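The same rank update in a GraphLab-style shared-state form (illustrative, not GraphLab's real API): the update function reads neighbors' ranks directly from shared state instead of receiving messages:

```python
DAMPING = 0.85  # assumed damping factor, as above

def update(v, in_neighbors, graph, ranks):
    # Read each in-neighbor's current rank straight from shared state.
    total = sum(ranks[u] / len(graph[u]) for u in in_neighbors[v])
    ranks[v] = (1 - DAMPING) / len(graph) + DAMPING * total

graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}          # out-edges
in_neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["b"]}   # in-edges
ranks = {v: 1 / 3 for v in graph}
for _ in range(20):
    for v in graph:  # iterate over vertices, reading neighbor state in place
        update(v, in_neighbors, graph, ranks)
print({v: round(r, 2) for v, r in ranks.items()})
```

Updates happen in place on shared state, as if all vertices lived on one machine; no message passing is needed.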