Typical HPC Machine: higher end than usual clusters
high-end processors, a lot of RAM
specialized network
RAID-based disk array
fast network with high bandwidth
Example: TaihuLight with 10,649,600 cores, 1,310,720 GB memory, 93,014.6 TFlop/s. Oak Ridge Frontier with 8,730,112 cores, many specialized for tensor computation (1,102,000 TFlop/s)
HPC Programming Model
Processes communicate and synchronize via message passing
low-level, hardware-dependent code, often written in Fortran
relies on a small number of software packages
Typical HPC Operation
long-lived processes
care about spatial locality across machines
all programs and data kept in memory (no disks)
good at physical simulation (fields), which is parallelisable
barrier: a synchronization point where every process must arrive before any proceeds; it expresses a dependency in the computational graph
Message Passing Interface (MPI): standardized communication protocol for programming parallel computers, with functions for:
Virtual topology
Synchronization
Communication
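A minimal sketch of these primitives, using the Java bindings that ship with Open MPI (the `mpi` package; exact signatures vary across versions, so treat this as illustrative rather than definitive). Rank 0 starts a token around a ring of processes (communication), and everyone meets at a barrier (synchronization).

```java
import mpi.*;

// Token ring: rank 0 starts a token around the ring (point-to-point
// communication), then all processes meet at a barrier (synchronization).
// Run with at least two processes, e.g. `mpirun -np 4 java Ring`.
public class Ring {
  public static void main(String[] args) throws MPIException {
    MPI.Init(args);
    int rank = MPI.COMM_WORLD.getRank();   // this process's id
    int size = MPI.COMM_WORLD.getSize();   // total number of processes
    int[] token = new int[1];

    if (rank == 0) {
      token[0] = 42;
      MPI.COMM_WORLD.send(token, 1, MPI.INT, 1, 0);
      MPI.COMM_WORLD.recv(token, 1, MPI.INT, size - 1, 0);
    } else {
      MPI.COMM_WORLD.recv(token, 1, MPI.INT, rank - 1, 0);
      MPI.COMM_WORLD.send(token, 1, MPI.INT, (rank + 1) % size, 0);
    }

    MPI.COMM_WORLD.barrier();   // nobody proceeds until everyone arrives
    MPI.Finalize();
  }
}
```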
But application writers usually don't need such low-level control as HPC offers...
Instead of using processes, we abstract communication as actors.
We can have multiple actors, each with its own mailbox, in one application. They are not constrained by physical locality (see the sketch below).
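A minimal actor sketch in plain Java (the `Actor` class and its `send`/`run` methods are invented here for illustration, not a real framework): each actor is a thread draining its own mailbox, and senders never block.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal actor: a thread that drains its own mailbox, one message at a time.
class Actor implements Runnable {
  private final BlockingQueue<String> mailbox = new LinkedBlockingQueue<>();
  private final String name;

  Actor(String name) { this.name = name; }

  // Asynchronous: the sender enqueues and never blocks.
  void send(String msg) { mailbox.add(msg); }

  @Override
  public void run() {
    try {
      while (true) {
        String msg = mailbox.take();   // wait for the next message
        if (msg.equals("stop")) return;
        System.out.println(name + " got: " + msg);
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}

public class ActorDemo {
  public static void main(String[] args) throws InterruptedException {
    Actor a = new Actor("greeter");
    Thread t = new Thread(a);
    t.start();
    a.send("hello");
    a.send("stop");
    t.join();
  }
}
```

Real actor frameworks (Erlang, Akka) add location transparency on top of this, which is what frees actors from physical locality.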
Typical Cluster
medium performance processors
some memory
a few disks
10~100 GB/s network
The network is slow relative to how much data is stored. Therefore we want to move data as little as possible (move computation to the data instead).
In a typical cluster: application programs are written in terms of high-level data operations, and the runtime system controls scheduling, load balancing...
To compute word frequency in a book: map each word to a (word, 1) pair, shuffle the pairs by key, then reduce by summing the counts for each word (see the word-count code below).
Before each step finishes, the framework typically persists the previous step's state to disk for failure recovery.
Hadoop Project: HDFS Fault Tolerance + MapReduce Programming Environment
In a typical cluster, machines are heterogeneous and code is written at a high level.
Hadoop MapReduce API
Requires using the Mapper & Reducer classes (Java)
Mapper: code generates a sequence of (k, v) pairs, given instructions on how to partition the data
Sort: MapReduce’s built-in aggregation by key
Reducer: given a key + an iterator that generates the sequence of values for that key
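As a concrete sketch, the standard word-count job written against this API (this follows the stock Hadoop tutorial example; the job-driver boilerplate is omitted):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Mapper: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: the framework has already sorted/grouped by key, so we just
  // sum the counts the iterator hands us.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}
```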
Run Big Projects:
MapReduce Provides Coarse-Grained Parallelism
Computation done by independent processes
File-based communication
Dynamically scheduled: if a node fails, the manager node detects it (via heartbeats) and migrates its tasks to other nodes (see the sketch below).
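A hypothetical sketch of that manager-side failure detector (class, names, and timeout are invented for illustration): workers report heartbeats, and anything silent for too long is declared dead and its tasks handed back to the scheduler.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical manager-side failure detector: workers report heartbeats;
// anything silent longer than TIMEOUT_MS is declared dead and its tasks
// are handed back to the scheduler.
public class HeartbeatMonitor {
  private static final long TIMEOUT_MS = 10_000;
  private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

  // Called whenever a worker's heartbeat message arrives.
  public void onHeartbeat(String workerId) {
    lastSeen.put(workerId, System.currentTimeMillis());
  }

  // Called periodically by the manager node.
  public void sweep() {
    long now = System.currentTimeMillis();
    for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
      if (now - e.getValue() > TIMEOUT_MS) {
        lastSeen.remove(e.getKey());
        reschedule(e.getKey());    // migrate the dead worker's tasks
      }
    }
  }

  private void reschedule(String deadWorker) {
    System.out.println("migrating tasks of " + deadWorker);
  }
}
```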
The Hadoop project is important because big jobs used to be impossible: the probability of at least one failure increases with the number of nodes. Hadoop gets jobs done by: - breaking work into many short-lived tasks - using disk storage to hold intermediate results, so failed tasks can be rescheduled
Advantage of clusters:
can read/write large datasets
can dynamically schedule tasks
can have consumer-grade components
can have heterogeneous nodes
Disadvantage of clusters:
higher overhead
lower raw performance per node
Stragglers: tasks that take a long time to execute due to bugs, flaky hardware, or poor partitioning. Buggy tasks are detected and raise an error.
When most tasks have finished, we reschedule the remaining (non-buggy) tasks on other nodes to reduce the overall run time (speculative execution; see the sketch below).
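A minimal local simulation of this speculative-execution idea (the `Speculator` class is invented for illustration; a real framework launches the backup copy on a different node): run two copies of a task and keep whichever finishes first. This is safe only because MapReduce tasks are idempotent.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Local simulation of speculative execution: launch a backup copy of a
// straggling task and keep whichever attempt finishes first.
public class Speculator {
  public static <T> T runWithBackup(Callable<T> task) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    try {
      // invokeAny returns the first successful result and cancels the rest.
      return pool.invokeAny(List.of(task, task));
    } finally {
      pool.shutdownNow();   // abandon the slower duplicate
    }
  }
}
```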
MPI:
complex programming model, hardware-specific code
allow tightly-coupled parallel tasks, good for iterative computation
failure handling left to the application
memory-based
good at graph computation and simulation where communication is frequent
MapReduce:
simple model and messaging system
good for loosely-coupled, coarse-grain parallel tasks
failure handling done by the framework, not the application
mostly disk-based
bad at graph computation, bad at simulation where communication is frequent
Example: calculate popularity of a social network account
MapReduce: sucks
Google Pregel (MPI-style message passing on graphs; see the vertex-program sketch below)
CMU GraphLab (shared-state model)
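A hypothetical Pregel-style vertex program for the popularity example (the `Vertex` base class and method names here mimic Pregel's published compute/message-passing interface, but the exact names are assumptions, not a real library): each account repeatedly receives popularity mass from its followers, exactly the frequent fine-grained communication MapReduce handles badly.

```java
// Hypothetical Pregel-style API: the framework calls compute() on every
// vertex once per superstep, delivering the messages sent in the previous
// superstep; an implicit barrier separates supersteps.
abstract class Vertex {
  double value;                                   // current popularity score
  abstract void compute(Iterable<Double> messages, int superstep);
  void sendMessageToAllNeighbors(double msg) { /* provided by framework */ }
  void voteToHalt() { /* provided by framework */ }
  int numOutEdges() { return 1; /* provided by framework */ }
}

class PopularityVertex extends Vertex {
  static final int MAX_SUPERSTEPS = 30;

  @Override
  void compute(Iterable<Double> messages, int superstep) {
    if (superstep > 0) {
      double sum = 0;
      for (double m : messages) sum += m;         // mass flowing in from followers
      value = 0.15 + 0.85 * sum;                  // PageRank-style update
    }
    if (superstep < MAX_SUPERSTEPS) {
      sendMessageToAllNeighbors(value / numOutEdges());
    } else {
      voteToHalt();                               // converged; stop this vertex
    }
  }
}
```

Each superstep is one round of message passing with a barrier between rounds, which is why the model reads as "MPI on graphs".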