Lecture 014 - GFS

Here is a good note on GFS keypoints.

Cluster Filesystem

Google File System (GFS): a very different file system built by Google; it influenced several later cluster file system designs

Apache Hadoop Distributed File System (HDFS): open-source file system modeled on GFS (widely deployed in companies)

Operation Environment

Environment: a warehouse-scale computer built out of a large number of interconnected commodity servers

servers are mounted on racks
each rack has its own switch
a main switch connects the rack switches (a hierarchy of switches)

Communicating within a rack has lower latency, higher bandwidth, no contention for bandwidth.

Since there are 100,000+ servers and 1,000,000 disks, failures are expected. (App bugs, OS bugs, Disk failures, Memory failures, Network failures, Power supply failures, Human error)

Unavailability event: a server is unresponsive for > 15 min; this happens on average $0.005$ to $0.01$ times per day.

GFS Workload Assumptions:

large files: $\geq 100 MB$
large streaming reads: $\geq 1 MB$
large sequential writes
many concurrent appends (files used as producer-consumer queues)
(want atomic appends without synchronization overhead)

The MapReduce model does not change on-disk data often (it only reads, computes, and writes logs), so data on disk is effectively immutable. The file system should therefore be optimized for sequential reads.

Design Goal:

high data and system availability
automatic failure handling
low synchronization overhead (exploit parallelism)
high throughput is more important than low latency
filesystem should be co-designed with application (application-specific filesystem, e.g. map-reduce)

Hadoop and HDFS/GFS are smart enough to preserve data locality (same server or same rack) when transferring data for the next step.

GFS Architecture

The master server stores and does the following:

Metadata: held in RAM for fast file system operations
- namespace
- access control information
- mapping files to chunks
- location of chunks in chunkservers
Migrates chunks between chunkservers (to balance disk utilization and disk space usage across chunkservers)
Controls consistency management
Garbage collects orphaned chunks (when a file is deleted)

Physical daisy chain (the following paragraph is about the logical daisy chain)

GFS Daisy-Chains: flow of data is decoupled from flow of control. This is to fully utilize each machine's network bandwidth, avoid network bottlenecks and high-latency links

Chunkservers and Clients:

chunkservers store file chunks of 64 MB on disk as ordinary files in the local Linux file system (e.g. ext4), together with a version number and checksum; GFS has no file system interface at the operating-system level, only a user-level API, and does not support all POSIX file system features
clients issue read, write, atomic append, and snapshot requests
- Read:
  - Client sends the master read(file name, chunk index), where chunk index = byte offset / chunk size (integer division)
  - Master’s reply: chunk ID (handle), chunk version number, locations of replicas
  - Client sends a request to the “closest” chunkserver holding a replica: read(chunk ID, byte range) ("closest" is determined by IP address or ping)
  - The chunkserver holding the data replies with the data
- Write:
  - the master decides the placement of new chunks (two replicas within a single rack, the third on a different rack: a tradeoff between access time and safety)
  - each chunk has a "primary" replica, designated by a (renewable) lease (usually 60 s) granted by the master
  - to write, the client asks the master for the primary and secondary replicas of each chunk and caches the response, then
    - the client sends (async) the data to the replicas in a daisy chain (each replica forwards the data to the next replica in the chain as it receives it)
    - the client sends (async) the write's metadata (chunk handle, offset) to the primary, which establishes the order in which the data is applied at every replica (the primary is the synchronization point)
    - the primary sends control messages to all other replicas so that the data is applied in the same order everywhere, and waits for their responses
    - (see image above)
- Append:
  - Reason for append being a separate operation:
    - Large files used as queues between multiple producers and consumers
    - We need append to be atomic to avoid synchronization overhead
  - Common case: data fit into one chunk
    - Primary appends data to own chunk replica
    - Primary tells secondaries to do same at same byte offset (byte offset created at primary server) in their chunk replicas // QUESTION: why at the same byte and also not consistent?
    - Primary replies with success to client
  - Rare case: data won’t fit in last chunk
    - Primary write padding into current chunk, start a new chunk
    - Primary instructs other replicas to do same
    - Primary replies to client, "retry on next chunk"
  - If the append fails at any replica, the client retries
    - After a retry, some replicas will contain duplicated data
    - We need to make sure the data is written at least once (as an atomic unit, at the same offset on every replica)
    - Replicas are not consistent // QUESTION: same as above
Chunks are replicated on a configurable number of chunkservers (default: 3, a balance between fault tolerance and storage space)
No caching of file data: streaming reads (read once) and appends (write once) don't benefit from caching, and leaving it out keeps the client simple
Chunkservers send heartbeats to the master (so that crashed chunkservers are detected)
When a client requests data, it asks the master for the chunk's location (metadata) and then goes to the chunkserver at that address for the data
- so control messages go to the master server
- data messages go directly to the chunkservers
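
The read path described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Master` class, its `lookup` method, the `Replica` type, and the "same server, then same rack" distance rule are hypothetical stand-ins, not GFS's real API.

```python
from dataclasses import dataclass

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as in the notes


@dataclass(frozen=True)
class Replica:
    server: str
    rack: str


class Master:
    """Hypothetical in-memory master: (file, chunk index) -> (handle, version, replicas)."""

    def __init__(self):
        self.chunks = {}

    def add_chunk(self, file_name, index, handle, version, replicas):
        self.chunks[(file_name, index)] = (handle, version, list(replicas))

    def lookup(self, file_name, index):
        # Control traffic only: the master never serves file data.
        return self.chunks[(file_name, index)]


def chunk_index(byte_offset):
    """The client converts a byte offset into a chunk index before asking the master."""
    return byte_offset // CHUNK_SIZE


def closest_replica(replicas, client_server, client_rack):
    """Assumed policy: same server first, then same rack, then any other rack."""
    def distance(replica):
        if replica.server == client_server:
            return 0
        if replica.rack == client_rack:
            return 1
        return 2
    return min(replicas, key=distance)


def plan_read(master, file_name, byte_offset, length, client_server, client_rack):
    """Return (replica, handle, start, end) for a read that stays inside one chunk."""
    handle, _version, replicas = master.lookup(file_name, chunk_index(byte_offset))
    start = byte_offset % CHUNK_SIZE
    end = min(start + length, CHUNK_SIZE)  # never read past the chunk boundary
    return closest_replica(replicas, client_server, client_rack), handle, start, end
```

The client would then send `read(handle, start..end)` directly to the chosen chunkserver, so the master stays off the data path.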

Fault Tolerance

GFS's Chunk replication: each chunk is replicated on multiple "chunkservers"

Master replication: the write-ahead log and checkpoints are replicated on multiple machines
- there is only one master server to query
- but the information is replicated
each chunk has checksum
checksum verified for every read and write
checksum verified periodically for inactive chunks
chunk size is 64 MB (blocks in other file systems are usually a few KB, much smaller)
- Big chunk size advantage:
  - reduces the amount of metadata, since there is a lot of data
  - lowers the load on the master (fewer requests) and the network overhead
- Small chunk size advantage:
  - less retry overhead when replicating a chunk fails
  - reads and writes can be spread across more chunkservers, reducing per-server load and increasing throughput
  - better disk utilization: less fragmentation
  - easy integrity check (When a corrupted chunk is detected, it is easier to fix the chunk)
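
A back-of-the-envelope calculation makes the metadata side of this tradeoff concrete. The 64 bytes of metadata per chunk and the 1 PB of stored data are illustrative assumptions, not figures from the lecture.

```python
MB = 1024 ** 2
PB = 1024 ** 5


def chunk_metadata_bytes(total_data_bytes, chunk_size_bytes, metadata_per_chunk=64):
    """Return (number of chunks, bytes of master metadata) for a given chunk size.

    metadata_per_chunk is an assumed constant per chunk entry.
    """
    num_chunks = -(-total_data_bytes // chunk_size_bytes)  # ceiling division
    return num_chunks, num_chunks * metadata_per_chunk


if __name__ == "__main__":
    for label, size in (("64 MB", 64 * MB), ("1 MB", 1 * MB), ("4 KB", 4 * 1024)):
        chunks, meta = chunk_metadata_bytes(PB, size)
        print(f"{label:>5} chunks: {chunks:>15,} entries, {meta / 1024 ** 3:,.0f} GiB of metadata")
```

Under these assumptions, 1 PB needs $2^{24}$ chunk entries (about 1 GiB of metadata) with 64 MB chunks, which fits in the RAM of a single master, but $2^{30}$ entries (64 GiB) with 1 MB chunks.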

Colossus reduces the chunk size from 64 MB to 1 MB.

Chunkserver:

Heartbeat: if a chunkserver stops sending heartbeats, the master reassigns its chunks
Version number: stored at the master and at each chunkserver, and incremented whenever a new lease is granted. A replica on a failed server misses the increment, so its version is found to be outdated when the master checks with that chunkserver. Outdated chunks are garbage collected.
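
The version-number mechanism can be sketched as follows. The dictionaries and the function names `grant_lease` and `find_stale_replicas` are hypothetical; the sketch assumes the master bumps the version on every lease grant and that only reachable replicas learn the new version.

```python
def grant_lease(master_versions, replica_versions, chunk_id, live_servers):
    """Bump the chunk version at the master and on every reachable replica."""
    new_version = master_versions[chunk_id] + 1
    master_versions[chunk_id] = new_version
    for server in live_servers:
        replica_versions[server][chunk_id] = new_version
    return new_version


def find_stale_replicas(master_versions, replica_versions):
    """Return (server, chunk_id) pairs whose version lags behind the master's."""
    stale = []
    for server, chunks in replica_versions.items():
        for chunk_id, version in chunks.items():
            if version < master_versions.get(chunk_id, 0):
                stale.append((server, chunk_id))
    return sorted(stale)


if __name__ == "__main__":
    master = {"c1": 1}
    replicas = {"s1": {"c1": 1}, "s2": {"c1": 1}, "s3": {"c1": 1}}
    # s3 is down while the lease is granted, so it keeps version 1.
    grant_lease(master, replicas, "c1", live_servers=["s1", "s2"])
    print(find_stale_replicas(master, replicas))
```

When `s3` comes back and reports its chunks, its copy of `c1` is flagged as stale and becomes a candidate for garbage collection.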

Master:

stores and atomically updates all metadata (single point of failure)
appends each update to the WAL (write-ahead log) on disk and on backup servers, sequentially
checkpoint: the log can't grow too long, because the master has to replay the entire log to rebuild file system state that exists only in memory, so we add checkpoints
on failure:
- replay the log from disk (directory, file-to-chunk-ID mapping)
- ask chunkservers which chunks they hold (location of chunks)
- find the maximum version number among all chunkservers (version number)
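
The three recovery steps can be sketched as a single function. The checkpoint layout, the log entry format `(sequence number, operation, arguments)`, and the operation names are assumptions made for illustration only.

```python
import copy


def recover_master(checkpoint, log, chunkserver_reports):
    """Rebuild master state from a checkpoint, the log tail, and chunkserver reports.

    checkpoint: {"last_seq": int, "namespace": {file: [chunk IDs]}}
    log: list of (seq, op, args) tuples in sequence order
    chunkserver_reports: {server: {chunk ID: version}}
    """
    # Step 1: start from the checkpoint and replay only the log entries after it.
    namespace = copy.deepcopy(checkpoint["namespace"])
    for seq, op, args in log:
        if seq <= checkpoint["last_seq"]:
            continue
        if op == "create":
            namespace[args["file"]] = []
        elif op == "add_chunk":
            namespace[args["file"]].append(args["chunk"])
        elif op == "delete":
            namespace.pop(args["file"], None)

    # Step 2 and 3: chunk locations are not logged; ask the chunkservers,
    # and take the maximum reported version of each chunk as current.
    versions = {}
    for chunks in chunkserver_reports.values():
        for chunk, version in chunks.items():
            versions[chunk] = max(version, versions.get(chunk, 0))

    locations = {chunk: [] for chunk in versions}
    for server, chunks in chunkserver_reports.items():
        for chunk, version in chunks.items():
            if version == versions[chunk]:  # replicas with older versions are stale
                locations[chunk].append(server)
    for servers in locations.values():
        servers.sort()
    return namespace, versions, locations
```

The checkpoint bounds how much of the log has to be replayed, which is exactly why the log is not allowed to grow without limit.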

Consistency: GFS applications designed to accommodate the relaxed consistency model

Changes to metadata (namespace) are atomic.
Changes to data are ordered by a primary.
Concurrent writes can be overwritten.
Record append is at-least-once, at an offset of GFS’s choosing.
Applications must cope with possible duplicates
Failed append can cause inconsistency
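
A toy simulation shows how at-least-once record append produces duplicates and padding, and how a reader can cope. Everything here is a simplified assumption: replicas are Python lists of record slots, `None` stands for padding, failures are injected by hand, and each record carries an application-chosen unique ID.

```python
def record_append(replicas, record, fail_on=()):
    """One append attempt; replicas[0] plays the primary. Returns (ok, offset)."""
    offset = len(replicas[0])  # the primary picks the offset
    ok = True
    for i, replica in enumerate(replicas):
        while len(replica) < offset:  # pad so the record lands at the same offset
            replica.append(None)
        if i in fail_on:  # injected failure at a secondary
            ok = False
            continue
        replica.append(record)
    return ok, offset


def append_with_retry(replicas, record, failures=()):
    """Client side: retry until one attempt succeeds on every replica."""
    attempt = 0
    while True:
        fail_on = failures[attempt] if attempt < len(failures) else ()
        ok, offset = record_append(replicas, record, fail_on)
        attempt += 1
        if ok:
            return offset


def read_unique(replica):
    """Reader side: skip padding and drop duplicates by record ID."""
    seen, payloads = set(), []
    for slot in replica:
        if slot is None:
            continue
        record_id, payload = slot
        if record_id not in seen:
            seen.add(record_id)
            payloads.append(payload)
    return payloads


if __name__ == "__main__":
    replicas = [[], [], []]
    # The first attempt fails at replica 2, so the client retries once.
    offset = append_with_retry(replicas, ("r1", "hello"), failures=[{2}])
    print(offset, replicas, [read_unique(r) for r in replicas])
```

After the retry, the replicas are not byte-for-byte identical (two hold a duplicate, one holds padding), yet the record sits at the same returned offset on all of them and every reader recovers the same payloads.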

Chunk replication guarantees high availability; checksums guarantee data integrity
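
The integrity check can be sketched with Python's standard `zlib.crc32`. Splitting a chunk into 64 KB blocks with one CRC32 each is an assumption made for this sketch; the notes only say that each chunk has a checksum.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # assumed checksum granularity


def block_checksums(data: bytes):
    """Compute one CRC32 per fixed-size block of the chunk."""
    return [zlib.crc32(data[i:i + BLOCK_SIZE]) for i in range(0, len(data), BLOCK_SIZE)]


def corrupted_blocks(data: bytes, checksums):
    """Return the indices of blocks whose current checksum differs from the stored one."""
    current = block_checksums(data)
    if len(current) != len(checksums):
        raise ValueError("chunk length changed; checksums do not line up")
    return [i for i, (now, stored) in enumerate(zip(current, checksums)) if now != stored]


if __name__ == "__main__":
    chunk = bytes(200_000)              # a small stand-in for a 64 MB chunk
    stored = block_checksums(chunk)     # computed at write time
    damaged = bytearray(chunk)
    damaged[70_000] ^= 0xFF             # flip bits inside the second block
    print(corrupted_blocks(bytes(damaged), stored))
```

A chunkserver that finds a corrupted block would report it, and the chunk can then be restored from another replica.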

Post-GFS

Colossus: next generation GFS

Eliminates master node as single point of failure: Multiple/distributed masters
Improved storage efficiency: Employs erasure coding instead of 3 replicas (with 10 data chunks, we only need 4 parity chunks)
- Traditional erasure code: Reed-Solomon code
- Recent research: Apache Hadoop Distributed File System (HDFS) v3.0, Microsoft Azure

"A Piggybacking Design Framework for Read-and Download-efficient Distributed Storage Codes", IEEE ISIT 2013, IEEE Transactions on Information Theory, 2017.

"A "Hitchhiker's" Guide to Fast and Efficient Data Reconstruction in Erasure-coded Data Centers", ACM SIGCOMM 2014.

"Erasure Coding in Windows Azure Storage", USENIX ATC, 2012.

"On the Locality of Codeword Symbols", IEEE Transactions on Information Theory, 2012.
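
The storage saving from erasure coding mentioned above can be checked with a short calculation. The function name is hypothetical; the loss-tolerance figure for the erasure code assumes an MDS code such as Reed-Solomon, where any 10 of the 14 chunks suffice to rebuild the data.

```python
def storage_overhead(data_units, redundant_units):
    """Raw bytes stored per byte of user data."""
    return (data_units + redundant_units) / data_units


# 3-way replication: 1 original + 2 extra copies, survives the loss of 2 copies.
replication = storage_overhead(1, 2)

# 10 data chunks + 4 parity chunks, survives the loss of any 4 chunks (MDS assumption).
erasure_10_4 = storage_overhead(10, 4)

if __name__ == "__main__":
    print(f"3x replication: {replication:.1f}x raw storage")
    print(f"10+4 erasure code: {erasure_10_4:.1f}x raw storage")
```

So the 10+4 scheme stores 1.4 bytes per user byte instead of 3, at the cost of more expensive reconstruction when a chunk is lost, which is the problem the papers listed above address.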

Chris: What a Real System Looks Like

Q: What is the difference between database and filesystem?

A: A file system stores files; a database stores more structured data. But people use HDFS like Amazon S3. A file system assumes reads of gigantic amounts of data, whereas a database reads specific records.

In a real working environment, you rarely redesign a system from the ground up, because you can't shut the entire system down. Also, many people have touched the project and the system is complex, so you only know the contract of your own part (the microservice you are in charge of). End-to-end system design is therefore practically impossible.

Difficulty to deal with:

failures due to hardware are fine: we restart the system
malformed requests trigger bugs that propagate through the entire system
since you can't shut down any server, changes have to be incremental and non-breaking, rolled out one server at a time (so you always keep existing fields, never change the type of a field, and make every change forward and backward compatible)
Verification: you test a new version by writing to both the old and the test server while reading only from the old server. The old and test versions are compared for months before deployment.
Progressive deployment: send a small amount of traffic to the test server. If it fails, fall back to the old server.
Migration/Upgrade: database migration between versions works like an Ethereum upgrade. The upgrade is only triggered once all nodes are ready for it.
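
The verification step (write to both, read from old, compare) can be sketched as a thin wrapper. All class names here are hypothetical, and the "bug" in the candidate store is invented purely to show what the comparison catches.

```python
class DictStore:
    """Stand-in for the old, trusted server."""

    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)


class TruncatingStore(DictStore):
    """Hypothetical buggy candidate: silently truncates long values."""

    def put(self, key, value):
        self.data[key] = value[:8]


class ShadowedStore:
    """Writes go to both systems; reads are served only by the old one."""

    def __init__(self, old, candidate):
        self.old = old
        self.candidate = candidate
        self.keys = set()

    def put(self, key, value):
        self.old.put(key, value)
        self.candidate.put(key, value)
        self.keys.add(key)

    def get(self, key):
        return self.old.get(key)  # clients never see the candidate's answers

    def mismatches(self):
        """Keys on which the candidate disagrees with the old system."""
        return sorted(k for k in self.keys if self.old.get(k) != self.candidate.get(k))


if __name__ == "__main__":
    store = ShadowedStore(DictStore(), TruncatingStore())
    store.put("a", "short")
    store.put("b", "a much longer value")
    print(store.get("b"), store.mismatches())
```

Because reads never touch the candidate, a bug in the new version shows up only in the mismatch report and never reaches users.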

Ongoing hot ideas and tools: service mesh, ClickHouse database
