Here is a good note on GFS keypoints.
Google File System (GFS): a file system by Google that departs sharply from traditional designs and influenced several later cluster file system designs
Apache Hadoop Distributed File System (HDFS): open-source file system modeled on GFS (widely deployed in companies)
Environment: warehouse-scale computer built out of a large number of interconnected commodity servers
servers are on racks
each rack has a switch
a main switch connects to every rack switch (a hierarchy of switches)
Communicating within a rack has lower latency, higher bandwidth, and no contention for bandwidth, compared with communicating across racks.
Since there are 100,000+ servers and 1,000,000 disks, failures are expected. (App bugs, OS bugs, Disk failures, Memory failures, Network failures, Power supply failures, Human error)
Unavailability event: a server is unresponsive for more than 15 min; this happens on average 0.005-0.01 times per day.
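A back-of-the-envelope reading of these numbers. The note does not say what the rate is measured against, so this sketch assumes it is per server per day and that servers fail independently; both are assumptions.

```python
# Assumption: the 0.005-0.01 rate applies to each server, per day.
SERVERS = 100_000  # fleet size mentioned in the note

def expected_events_per_day(servers: int, rate_per_server_per_day: float) -> float:
    """Expected number of unavailability events across the fleet in one day."""
    return servers * rate_per_server_per_day

low = expected_events_per_day(SERVERS, 0.005)
high = expected_events_per_day(SERVERS, 0.01)
print(f"about {low:.0f} to {high:.0f} unavailability events per day")
```

Under that reading, hundreds of servers drop out every day, which is why failure handling has to be automatic.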
GFS Workload Assumptions:
large files: ≥ 100 MB
large streaming reads: ≥ 1 MB
large sequential writes
many concurrent appends (files used as producer-consumer queues)
(want atomic appends without synchronization overhead)
The MapReduce model rarely changes data on disk (it only reads, computes, and writes logs), so data on disk is effectively immutable. The file system should therefore be optimized for sequential reads.
Design Goals:
high data and system availability
automatic failure handling
low synchronization overhead (exploit parallelism)
high throughput is more important than low latency
file system should be co-designed with the application (application-specific file system, e.g. for MapReduce)
Hadoop and HDFS/GFS are smart enough to preserve data locality (same server or same rack) when transferring data to the next step.
Master Server stores and does:
Metadata: held in RAM for fast file system operations
Migrates chunks between chunkservers (to balance load and disk space usage across chunkservers)
Controls consistency management
Garbage collects orphaned chunks (when a file is deleted)
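Holding all metadata in RAM is feasible because chunks are large. A rough estimate, assuming at most 64 bytes of master metadata per chunk (the upper bound reported in the GFS paper) and a hypothetical 1 PiB of stored data; it ignores namespace metadata and replicas.

```python
# Assumptions: <= 64 bytes of master metadata per chunk (upper bound from the
# GFS paper) and 1 PiB of stored data (a made-up figure for illustration).
METADATA_BYTES_PER_CHUNK = 64
STORED_BYTES = 2**50  # 1 PiB

def master_metadata_bytes(chunk_size: int) -> int:
    """Chunk metadata the master must hold in RAM for STORED_BYTES of data."""
    num_chunks = STORED_BYTES // chunk_size
    return num_chunks * METADATA_BYTES_PER_CHUNK

for label, chunk_size in [("64 MiB", 64 * 2**20), ("1 MiB", 2**20), ("4 KiB", 4 * 2**10)]:
    gib = master_metadata_bytes(chunk_size) / 2**30
    print(f"{label} chunks: {gib:,.0f} GiB of chunk metadata")
```

With 64 MiB chunks the estimate fits in the RAM of a single machine; with KB-sized blocks it would not.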
Chunkserver and Client:
chunkserver stores file chunks of 64 MB on local disk as files in an ext4 Linux file system (along with each chunk's version number and checksum)
client has no file system interface at the operating-system level, only a user-level API that does not support all POSIX file system features
client requests: read, write, atomic append, snapshot
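read, step 1: client sends (file name, chunk index) to the master and gets back the chunk ID and the replica locations
read, step 2: client sends (chunk ID, byte range) to the closest chunkserver holding a replica
("closest" determined by IP address or ping)
Chunks replicated on a configurable number of chunkservers (default: 3, a balance between fault tolerance and storage space)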
No data caching: streaming reads (read once) and append writes (write once) don't benefit from caching, and leaving it out keeps the client simple (caching makes things messy)
chunkservers send heartbeats to the master (so the master can detect crashed servers)
when a client requests data, it asks the master for the chunk's address (metadata), then goes to the chunkserver at that address for the data
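The read path can be sketched as two steps: metadata from the master, then data from a chunkserver. This is a minimal in-memory sketch; the `Master`, `Chunkserver`, and `gfs_read` names are hypothetical, the first listed replica stands in for the "closest" one, and reads that cross a chunk boundary are not handled.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks

class Master:
    """Hypothetical master: holds only metadata, never file data."""
    def __init__(self):
        # (file name, chunk index) -> (chunk ID, list of chunkserver addresses)
        self.table = {}

    def lookup(self, file_name, chunk_index):
        return self.table[(file_name, chunk_index)]

class Chunkserver:
    """Hypothetical chunkserver: maps chunk ID to the chunk's bytes."""
    def __init__(self):
        self.chunks = {}

    def read(self, chunk_id, start, length):
        return self.chunks[chunk_id][start:start + length]

def gfs_read(master, servers, file_name, offset, length):
    """Read `length` bytes at `offset`; the range must fit inside one chunk."""
    chunk_index = offset // CHUNK_SIZE                          # which chunk of the file
    chunk_id, replicas = master.lookup(file_name, chunk_index)  # step 1: ask the master
    closest = replicas[0]                                       # stand-in for "closest"
    return servers[closest].read(chunk_id, offset % CHUNK_SIZE, length)  # step 2: data

# Usage: a file whose second chunk (index 1) has replicas on "cs-b" and "cs-a".
master, servers = Master(), {"cs-a": Chunkserver(), "cs-b": Chunkserver()}
master.table[("/logs/events", 1)] = ("chunk-42", ["cs-b", "cs-a"])
servers["cs-b"].chunks["chunk-42"] = b"hello gfs"
result = gfs_read(master, servers, "/logs/events", CHUNK_SIZE + 6, 3)
print(result)
```

The master is touched only for the small metadata lookup, so file data never flows through it.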
GFS's Chunk replication: each chunk is replicated on multiple "chunkservers"
Master replication: write log and checkpoints replicated on multiple machines
each chunk has checksum
checksum verified for every read and write
checksum verified periodically for inactive chunks
chunk size is 64 MB (other file systems usually use blocks of a few KB, much smaller)
Colossus reduces the chunk size from 64 MB to 1 MB.
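The checksum checks listed above can be sketched with CRC32. The 64 KB checksum granularity and the function names are assumptions for illustration; a real chunkserver keeps its checksums separately from the data and persists them.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # assumption: one checksum per 64 KB block of a chunk

def block_checksums(chunk: bytes) -> list:
    """Compute one CRC32 per BLOCK_SIZE block of the chunk."""
    return [zlib.crc32(chunk[i:i + BLOCK_SIZE]) for i in range(0, len(chunk), BLOCK_SIZE)]

def verified_read(chunk: bytes, checksums: list, start: int, length: int) -> bytes:
    """Return chunk[start:start+length] after verifying every block the range touches."""
    end = min(start + length, len(chunk))
    if start >= end:
        return b""
    for block in range(start // BLOCK_SIZE, (end - 1) // BLOCK_SIZE + 1):
        data = chunk[block * BLOCK_SIZE:(block + 1) * BLOCK_SIZE]
        if zlib.crc32(data) != checksums[block]:
            raise IOError(f"checksum mismatch in block {block}")
    return chunk[start:end]

def scrub(chunk: bytes, checksums: list) -> list:
    """Periodic check for inactive chunks: return indices of corrupted blocks."""
    return [i for i, c in enumerate(block_checksums(chunk)) if c != checksums[i]]
```

On a mismatch, the reader would be sent to another replica and the corrupted replica replaced from a good one.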
Chunkserver:
Heartbeat: if a chunkserver's heartbeats stop, the master re-replicates its chunks on other chunkservers
Version Number: stored at the master and at each chunkserver, and incremented whenever a new lease is granted. A chunkserver that was down during the increment holds an outdated version, which the master detects when it checks with that chunkserver. Outdated chunks are garbage collected.
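Stale-replica detection by version number can be sketched as follows. `VersionTracker` is a hypothetical name, and each replica is modeled as a plain dict from chunk ID to the version it holds.

```python
class VersionTracker:
    """Hypothetical master-side view of chunk versions."""
    def __init__(self):
        self.latest = {}  # chunk ID -> latest version known to the master

    def grant_lease(self, chunk_id, reachable_replicas):
        """Bump the version on a new lease; only reachable replicas learn it."""
        version = self.latest.get(chunk_id, 0) + 1
        self.latest[chunk_id] = version
        for replica in reachable_replicas:
            replica[chunk_id] = version
        return version

    def is_stale(self, chunk_id, replica):
        """A replica is stale if its version is behind the master's."""
        return replica.get(chunk_id, 0) < self.latest.get(chunk_id, 0)

# Usage: replica c is down while the second lease is granted.
tracker = VersionTracker()
a, b, c = {}, {}, {}
tracker.grant_lease("chunk-1", [a, b, c])
tracker.grant_lease("chunk-1", [a, b])  # c missed this update
stale = [name for name, r in [("a", a), ("b", b), ("c", c)]
         if tracker.is_stale("chunk-1", r)]
print(stale)
```

The stale replica is never handed to clients and becomes a candidate for garbage collection.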
Master:
stores and (atomically) updates all metadata about chunkservers (a single point of failure)
appends each update to the WAL (write-ahead log) on disk and on backup servers, sequentially
checkpoint: the log can't grow too long, because the master must replay the entire log to rebuild file system state that exists only in memory, so checkpoints are added to bound the replay
when fail:
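A minimal sketch of how a master could rebuild its in-memory state after a failure: load the latest checkpoint, then replay only the log records written after it. The operation names (`create`, `add_chunk`, `delete`) are made up for illustration.

```python
def apply(state: dict, op: tuple) -> None:
    """Apply one logged metadata operation to the in-memory state."""
    kind, path, *args = op
    if kind == "create":
        state[path] = []
    elif kind == "add_chunk":
        state[path].append(args[0])
    elif kind == "delete":
        state.pop(path, None)

def recover(checkpoint: dict, log_tail: list) -> dict:
    """Rebuild state from the latest checkpoint plus the log written after it."""
    state = {path: list(chunks) for path, chunks in checkpoint.items()}
    for op in log_tail:
        apply(state, op)
    return state

# Usage: the full log, and a checkpoint taken after its first three records.
log = [("create", "/a"), ("add_chunk", "/a", "c1"), ("create", "/b"),
       ("add_chunk", "/b", "c2"), ("delete", "/a")]
checkpoint = recover({}, log[:3])
recovered = recover(checkpoint, log[3:])
print(recovered == recover({}, log))
```

Replaying from the checkpoint gives the same state as replaying the whole log, but touches only the tail.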
Consistency: GFS applications are designed to accommodate the relaxed consistency model
Changes to metadata (namespace) are atomic.
Changes to data are ordered by a primary.
Concurrent writes to the same region can overwrite one another.
Record append is at-least-once, at an offset of GFS's choosing.
Applications must cope with possible duplicates
A failed append can leave replicas inconsistent
Chunk replication guarantees high availability; checksums guarantee data integrity
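One way an application can cope with at-least-once record append is to put a unique ID in every record and drop repeats when reading. A minimal sketch; the record layout and function names are assumptions, and the retry is simulated by appending the same record again.

```python
def append_with_retries(log: list, record_id: str, payload: str, failed_attempts: int) -> None:
    """Simulate a retried append: each failed attempt may still leave a copy behind."""
    for _ in range(failed_attempts + 1):
        log.append((record_id, payload))

def read_unique(log: list) -> list:
    """Consumer side: keep the first copy of each record ID, skip duplicates."""
    seen = set()
    payloads = []
    for record_id, payload in log:
        if record_id in seen:
            continue
        seen.add(record_id)
        payloads.append(payload)
    return payloads

# Usage: the second record is appended three times because two attempts "failed".
log = []
append_with_retries(log, "r1", "start", failed_attempts=0)
append_with_retries(log, "r2", "work", failed_attempts=2)
append_with_retries(log, "r3", "done", failed_attempts=0)
unique = read_unique(log)
print(len(log), unique)
```

The file system stays simple and the cost of duplicates is paid by readers, which fits the producer-consumer workload above.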
Colossus: next generation GFS
Eliminates the master node as a single point of failure: multiple/distributed masters
Improved storage efficiency: employs erasure coding instead of 3 replicas (10 data chunks need only 4 parity chunks)
"A Piggybacking Design Framework for Read-and Download-efficient Distributed Storage Codes", IEEE ISIT 2013, IEEE Transactions on Information Theory, 2017.
"A "Hitchhiker's" Guide to Fast and Efficient Data Reconstruction in Erasure-coded Data Centers", ACM SIGCOMM 2014.
"Erasure Coding in Windows Azure Storage", USENIX ATC, 2012.
"On the locality of codeword symbols", IEEE Transactions on Information Theory, 2012.
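The storage saving of the 10 data + 4 parity layout over 3 replicas is simple arithmetic. The "survives loss of" figure assumes an MDS code such as Reed-Solomon, where any 4 of the 14 chunks can be lost; that is an assumption, since the note does not name the code.

```python
def storage_overhead(data_units: int, redundant_units: int) -> float:
    """Raw bytes stored per byte of user data."""
    return (data_units + redundant_units) / data_units

schemes = {
    "3-way replication": (1, 2),                 # 1 original + 2 extra copies
    "erasure code, 10 data + 4 parity": (10, 4),
}
for name, (data, redundant) in schemes.items():
    # Assumes an MDS code: any `redundant` units can be lost without data loss.
    print(f"{name}: {storage_overhead(data, redundant):.1f}x raw storage, "
          f"survives loss of {redundant} units")
```

The erasure-coded layout stores less than half the raw bytes while tolerating more lost units; the price is more expensive reconstruction, which is the subject of the papers listed above.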
Q: What is the difference between a database and a file system?
A: A file system stores files, while a database stores more structured data. But people use HDFS like Amazon S3. A file system assumes reads of gigantic data, whereas a database reads specific records.
In a real working environment, you rarely redesign a system from the ground up, because you can't shut down the entire system. Also, many people have touched the project and the system is complex, so you only know the contract of your own part (the microservice you are in charge of). End-to-end system design is therefore impossible.
Difficulties to deal with:
failure due to hardware is fine: we restart the system
a malformed request can trigger bugs that propagate to the entire system
since you can't shut down every server at once, changes must be incremental and non-breaking, rolled out one server at a time (so always keep existing fields, don't change a field's type, and make changes both forward and backward compatible)
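The compatibility rule can be illustrated with two reader versions that must coexist during a rollout. The message fields (`user_id`, `name`, `email`) and decoder names are hypothetical.

```python
def decode_v1(message: dict) -> dict:
    """Old reader: uses the fields it knows and ignores any others (forward compatible)."""
    return {"user_id": message["user_id"], "name": message["name"]}

def decode_v2(message: dict) -> dict:
    """New reader: the added field is optional with a default (backward compatible)."""
    return {
        "user_id": message["user_id"],
        "name": message["name"],
        "email": message.get("email", ""),  # new field, never required
    }

old_message = {"user_id": 7, "name": "ada"}
new_message = {"user_id": 7, "name": "ada", "email": "ada@example.com"}

# Every reader/writer pairing works, so servers can be upgraded one at a time.
print(decode_v1(new_message))
print(decode_v2(old_message))
```

Removing `name` or changing the type of `user_id` would break one of the pairings, which is why existing fields are kept as they are.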
Verification: test the system by writing to both the old and the testing server while reading only from the old server. The old and testing versions are compared for months before deployment.
Progressive deployment: send a small amount of traffic to the test server. If it fails, fall back to the old server.
Migration/Upgrade: database migration between versions works like an Ethereum update: the upgrade is triggered only when all nodes are ready to upgrade.
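Progressive deployment with fallback can be sketched as a small router. The servers are modeled as plain callables and `handle` is a hypothetical name; a real setup would do this in a load balancer or proxy.

```python
import random

def handle(request, old_server, test_server, canary_fraction):
    """Send a fraction of traffic to the test server; fall back to the old one on failure."""
    if random.random() < canary_fraction:
        try:
            return test_server(request)
        except Exception:
            pass  # test server failed: fall through to the old server
    return old_server(request)

# Usage: hypothetical servers, one of them broken.
def old_server(request):
    return "old:" + request

def broken_test_server(request):
    raise RuntimeError("bug in new version")

# Even with all traffic routed to the broken test server, callers still get answers.
print(handle("ping", old_server, broken_test_server, canary_fraction=1.0))
```

In practice `canary_fraction` starts small and is raised only while the test server keeps answering correctly.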
Ongoing hot ideas and tools: service mesh, ClickHouse database