Typical HPC Machine: higher end than usual clusters
high-end processors, a lot of RAM
specialized network
RAID-based disk array
fast network with high bandwidth
Example: TaihuLight with 10,649,600 cores, 1,310,720 GB memory, 93,014.6 TFlop/s. Oak Ridge Frontier with 8,730,112 cores, many specialized for tensor computation (1,102,000 TFlop/s)
HPC Programming Model
Processes communicate and synchronize via message passing
low-level, hardware-dependent code, often written in Fortran
relies on a small number of software packages
Typical HPC Operation
long-lived processes
care about spatial locality across machines
all programs and data kept in memory (no disks)
good at physical simulation (fields), which is parallelisable
barrier: a synchronization point where every process must arrive before any proceeds; it expresses a dependency in the computational graph
Message Passing Interface (MPI): standardized communication protocol for programming parallel computers, with functions for:
Virtual topology
Synchronization
Communication
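A minimal sketch of these primitives, using the Java bindings that ship with Open MPI (the `mpi` package; exact signatures vary across versions, so treat this as illustrative rather than definitive). Rank 0 starts a token around a ring of processes (communication), and everyone meets at a barrier (synchronization).

```java
import mpi.*;

// Token ring: rank 0 starts a token around the ring (point-to-point
// communication), then all processes meet at a barrier (synchronization).
// Run with at least two processes, e.g. `mpirun -np 4 java Ring`.
public class Ring {
  public static void main(String[] args) throws MPIException {
    MPI.Init(args);
    int rank = MPI.COMM_WORLD.getRank();   // this process's id
    int size = MPI.COMM_WORLD.getSize();   // total number of processes
    int[] token = new int[1];

    if (rank == 0) {
      token[0] = 42;
      MPI.COMM_WORLD.send(token, 1, MPI.INT, 1, 0);
      MPI.COMM_WORLD.recv(token, 1, MPI.INT, size - 1, 0);
    } else {
      MPI.COMM_WORLD.recv(token, 1, MPI.INT, rank - 1, 0);
      MPI.COMM_WORLD.send(token, 1, MPI.INT, (rank + 1) % size, 0);
    }

    MPI.COMM_WORLD.barrier();   // nobody proceeds until everyone arrives
    MPI.Finalize();
  }
}
```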
But application writers usually don't need such low-level control as HPC offers...
Instead of using processes, we abstract communication as actors.
We can have multiple actors, each with its own mailbox, in one application. They are not constrained by physical locality (see the sketch below).
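A minimal actor sketch in plain Java (the `Actor` class and its `send`/`run` methods are invented here for illustration, not a real framework): each actor is a thread draining its own mailbox, and senders never block.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal actor: a thread that drains its own mailbox, one message at a time.
class Actor implements Runnable {
  private final BlockingQueue<String> mailbox = new LinkedBlockingQueue<>();
  private final String name;

  Actor(String name) { this.name = name; }

  // Asynchronous: the sender enqueues and never blocks.
  void send(String msg) { mailbox.add(msg); }

  @Override
  public void run() {
    try {
      while (true) {
        String msg = mailbox.take();   // wait for the next message
        if (msg.equals("stop")) return;
        System.out.println(name + " got: " + msg);
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}

public class ActorDemo {
  public static void main(String[] args) throws InterruptedException {
    Actor a = new Actor("greeter");
    Thread t = new Thread(a);
    t.start();
    a.send("hello");
    a.send("stop");
    t.join();
  }
}
```

Real actor frameworks (Erlang, Akka) add location transparency on top of this, which is what frees actors from physical locality.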
Typical Cluster
medium performance processors
some memory
a few disks
10~100 GB/s network
The network is slow relative to how much data is stored. Therefore we want to move data as little as possible (move computation to the data instead).
In a typical cluster: application programs are written in terms of high-level data operations, and the runtime system controls scheduling, load balancing...
To compute word frequency in a book: map each word to a (word, 1) pair, shuffle the pairs by key, then reduce by summing the counts for each word (see the word-count code below).
Before each step finishes, the framework typically persists the previous step's state to disk for failure recovery.
Hadoop Project: HDFS Fault Tolerance + MapReduce Programming Environment
In a typical cluster, machines are heterogeneous and code is written at a high level.
Hadoop MapReduce API
Requires using the Mapper & Reducer classes (Java)
Mapper: code generates a sequence of (k, v) pairs, given instructions on how to partition the data
Sort: MapReduce’s built-in aggregation by key
Reducer: given a key + an iterator that generates the sequence of values for that key
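As a concrete sketch, the standard word-count job written against this API (this follows the stock Hadoop tutorial example; the job-driver boilerplate is omitted):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Mapper: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: the framework has already sorted/grouped by key, so we just
  // sum the counts the iterator hands us.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}
```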
Run Big Projects:
MapReduce Provides Coarse-Grained Parallelism
Computation done by independent processes
File-based communication
Dynamically scheduled: if a node fails, the manager node detects it (via heartbeats) and migrates its tasks to other nodes (see the sketch below).
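A hypothetical sketch of that manager-side failure detector (class, names, and timeout are invented for illustration): workers report heartbeats, and anything silent for too long is declared dead and its tasks handed back to the scheduler.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical manager-side failure detector: workers report heartbeats;
// anything silent longer than TIMEOUT_MS is declared dead and its tasks
// are handed back to the scheduler.
public class HeartbeatMonitor {
  private static final long TIMEOUT_MS = 10_000;
  private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

  // Called whenever a worker's heartbeat message arrives.
  public void onHeartbeat(String workerId) {
    lastSeen.put(workerId, System.currentTimeMillis());
  }

  // Called periodically by the manager node.
  public void sweep() {
    long now = System.currentTimeMillis();
    for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
      if (now - e.getValue() > TIMEOUT_MS) {
        lastSeen.remove(e.getKey());
        reschedule(e.getKey());    // migrate the dead worker's tasks
      }
    }
  }

  private void reschedule(String deadWorker) {
    System.out.println("migrating tasks of " + deadWorker);
  }
}
```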
The Hadoop project is important because big jobs used to be impossible: the probability of at least one failure increases with the number of nodes. Hadoop gets jobs done by: - breaking work into many short-lived tasks - using disk storage to hold intermediate results, so failed tasks can be rescheduled
Advantage of clusters:
can read/write large datasets
can dynamically schedule tasks
can have consumer-grade components
can have heterogeneous nodes
Disadvantage of clusters:
higher overhead
lower raw performance per node
Stragglers: tasks that take a long time to execute due to bugs, flaky hardware, or poor partitioning. Buggy tasks are detected and raise an error.
When most tasks have finished, we reschedule the remaining (non-buggy) tasks on other nodes to reduce the overall run time (speculative execution; see the sketch below).
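A minimal local simulation of this speculative-execution idea (the `Speculator` class is invented for illustration; a real framework launches the backup copy on a different node): run two copies of a task and keep whichever finishes first. This is safe only because MapReduce tasks are idempotent.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Local simulation of speculative execution: launch a backup copy of a
// straggling task and keep whichever attempt finishes first.
public class Speculator {
  public static <T> T runWithBackup(Callable<T> task) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    try {
      // invokeAny returns the first successful result and cancels the rest.
      return pool.invokeAny(List.of(task, task));
    } finally {
      pool.shutdownNow();   // abandon the slower duplicate
    }
  }
}
```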
MPI:
complex programming model, hardware-specific code
allow tightly-coupled parallel tasks, good for iterative computation
failure handling left to the application
memory-based
good at graph computation and simulation where communication is frequent
MapReduce:
simple model and messaging system
good for loosely-coupled, coarse-grain parallel tasks
failure handling done by the framework, not the application
mostly disk-based
bad at graph computation, bad at simulation where communication is frequent
Example: calculate popularity of a social network account
MapReduce: sucks
Google Pregel (MPI-style message passing on graphs; see the vertex-program sketch below)
CMU GraphLab (shared-state model)
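A hypothetical Pregel-style vertex program for the popularity example (the `Vertex` base class and method names here mimic Pregel's published compute/message-passing interface, but the exact names are assumptions, not a real library): each account repeatedly receives popularity mass from its followers, exactly the frequent fine-grained communication MapReduce handles badly.

```java
// Hypothetical Pregel-style API: the framework calls compute() on every
// vertex once per superstep, delivering the messages sent in the previous
// superstep; an implicit barrier separates supersteps.
abstract class Vertex {
  double value;                                   // current popularity score
  abstract void compute(Iterable<Double> messages, int superstep);
  void sendMessageToAllNeighbors(double msg) { /* provided by framework */ }
  void voteToHalt() { /* provided by framework */ }
  int numOutEdges() { return 1; /* provided by framework */ }
}

class PopularityVertex extends Vertex {
  static final int MAX_SUPERSTEPS = 30;

  @Override
  void compute(Iterable<Double> messages, int superstep) {
    if (superstep > 0) {
      double sum = 0;
      for (double m : messages) sum += m;         // mass flowing in from followers
      value = 0.15 + 0.85 * sum;                  // PageRank-style update
    }
    if (superstep < MAX_SUPERSTEPS) {
      sendMessageToAllNeighbors(value / numOutEdges());
    } else {
      voteToHalt();                               // converged; stop this vertex
    }
  }
}
```

Each superstep is one round of message passing with a barrier between rounds, which is why the model reads as "MPI on graphs".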