Lecture 014 - GFS

Here is a good note covering the key points of GFS.

Cluster Filesystem

Google File System (GFS): a very different file system from Google that influenced several later cluster file system designs

Apache Hadoop Distributed File System (HDFS): an open-source implementation of the GFS design, widely deployed in industry

Operating Environment

Environment: a warehouse-scale computer built out of a large number of interconnected commodity servers

Communication within a rack has lower latency, higher bandwidth, and no contention for bandwidth (cross-rack traffic must share uplinks).

Since there are 100,000+ servers and 1,000,000+ disks, failures are the norm: app bugs, OS bugs, disk failures, memory failures, network failures, power-supply failures, and human error.

Unavailability event: a server unresponsive for more than 15 minutes; each server averages 0.005-0.01 such events per day.
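At this scale those per-server rates add up quickly; a quick back-of-the-envelope check using the 100,000-server figure above:

```python
# Expected unavailability events per day across the whole fleet,
# using the per-server rates quoted above.
servers = 100_000
rate_low, rate_high = 0.005, 0.01   # events per server per day

print(f"low:  {servers * rate_low:,.0f} events/day")    # 500
print(f"high: {servers * rate_high:,.0f} events/day")   # 1,000
```

That is an unavailability event somewhere in the fleet roughly every one to three minutes, so failure recovery must be fully automatic.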

GFS Workload Assumptions:

MapReduce-style jobs do not modify data on disk in place (they read, compute, and write out results/logs); data on disk is effectively immutable. So the file system is designed to be optimized for large sequential reads (and appends) rather than random updates.

Design Goal:

Hadoop with HDFS (like MapReduce with GFS) is smart enough to preserve data locality, scheduling the next step's computation on the same server, or at least the same rack, as its input data to avoid cross-rack transfers. See the sketch below.
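A minimal illustration of locality-aware scheduling (hypothetical function and inputs, not Hadoop's actual scheduler API), assigning a task to workers in node-local > rack-local > remote order:

```python
def pick_worker(replica_servers, free_workers, rack_of):
    """Choose a worker for a task that reads one data block.

    replica_servers: servers holding a replica of the block
    free_workers:    workers with a free task slot
    rack_of:         dict mapping server name -> rack id
    Preference: same server > same rack > anywhere.
    """
    # 1. Node-local: a free worker that already stores the data.
    for w in free_workers:
        if w in replica_servers:
            return w
    # 2. Rack-local: a free worker in a rack that holds a replica.
    replica_racks = {rack_of[s] for s in replica_servers}
    for w in free_workers:
        if rack_of[w] in replica_racks:
            return w
    # 3. Remote: any free worker (pays a cross-rack transfer).
    return free_workers[0]

rack_of = {"s1": "r1", "s2": "r1", "s3": "r2"}
print(pick_worker({"s3"}, ["s2", "s3"], rack_of))  # -> 's3' (node-local wins)
```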

GFS Architecture

High-level GFS Architecture

Master server stores and does: it keeps all metadata in memory, including the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk's replicas (learned from chunkservers via heartbeats); it also grants chunk leases, garbage-collects orphaned chunks, and migrates/re-replicates chunks as needed.

Physical daisy chain (the paragraph below describes the logical flow)

GFS daisy-chains writes: the flow of data is decoupled from the flow of control. Data is pushed linearly along a chain of chunkservers, each forwarding to the nearest machine that has not yet received it, to fully utilize every machine's outbound network bandwidth and avoid network bottlenecks and high-latency links.
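A back-of-the-envelope comparison under assumed link numbers: with pipelining, the ideal time to push B bytes through R replicas is about B/T + R·L (T = per-link throughput, L = per-hop latency, per the GFS paper), versus R·B/T if the primary sent every copy itself:

```python
B = 64 * 1024 * 1024      # one 64 MB chunk, in bytes
T = 12.5 * 1024 * 1024    # 100 Mbps link ≈ 12.5 MB/s (assumed)
L = 0.001                 # per-hop forwarding latency: 1 ms (assumed)
R = 3                     # replicas in the chain

pipelined = B / T + R * L     # daisy chain: forward while receiving
star      = R * B / T         # primary uploads all R copies itself

print(f"daisy chain: {pipelined:.2f} s")  # ~5.12 s
print(f"star:        {star:.2f} s")       # ~15.36 s
```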

Client: contacts the master only for metadata, asking which chunkservers hold a given chunk (and caching the answer), then reads and writes file data directly with the chunkservers; bulk data never flows through the master.
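A toy version of the lookup step, with the master's tables modeled as plain dicts (illustrative names and structures, not the real ones):

```python
CHUNK = 64 * 1024 * 1024   # 64 MB chunks

# Toy master metadata: path -> chunk handles; handle -> chunkservers.
file_to_chunks  = {"/logs/web.0": ["c_001", "c_002"]}
chunk_locations = {"c_001": ["cs-17", "cs-42", "cs-88"],
                   "c_002": ["cs-03", "cs-42", "cs-71"]}

def locate(path, offset):
    """Translate (path, byte offset) into (chunk handle, replicas,
    offset within the chunk). This lookup is the only traffic that
    touches the master; the data itself is then read directly from
    one of the chunkservers, and the answer is cached client-side."""
    idx = offset // CHUNK
    handle = file_to_chunks[path][idx]
    return handle, chunk_locations[handle], offset % CHUNK

print(locate("/logs/web.0", 70 * 1024 * 1024))
# -> ('c_002', ['cs-03', 'cs-42', 'cs-71'], 6291456)
```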

Fault Tolerance

GFS chunk replication: each chunk is replicated on multiple "chunkservers" (three by default), placed across racks so a single rack failure cannot destroy all copies.
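A simplified sketch of rack-aware replica placement (assumed policy details; real GFS also weighs disk utilization and recent creation counts on each server):

```python
def place_replicas(servers, rack_of, n=3):
    """Pick n chunkservers for a new chunk, spreading replicas
    across racks for fault tolerance."""
    chosen, used_racks = [], set()
    for s in servers:                     # first pass: one per rack
        if rack_of[s] not in used_racks:
            chosen.append(s)
            used_racks.add(rack_of[s])
            if len(chosen) == n:
                return chosen
    for s in servers:                     # fewer racks than n: top up
        if s not in chosen:
            chosen.append(s)
            if len(chosen) == n:
                break
    return chosen

rack_of = {"cs1": "r1", "cs2": "r1", "cs3": "r2", "cs4": "r3"}
print(place_replicas(["cs1", "cs2", "cs3", "cs4"], rack_of))
# -> ['cs1', 'cs3', 'cs4']  (three distinct racks)
```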

Colossus reduces the chunk size from GFS's 64 MB to 1 MB; this greatly increases the chunk count, which is affordable because Colossus distributes its metadata instead of keeping it all on a single master.
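Rough numbers on why chunk size drives master memory (the 10 PB data volume is an assumption; the ~64 bytes of metadata per chunk is the figure the GFS paper cites):

```python
# Chunk count, and hence master metadata, scales inversely with chunk size.
data = 10 * 2**50            # 10 PB of file data (assumed)
meta_per_chunk = 64          # bytes per chunk (approx., per the GFS paper)

for name, chunk in [("GFS, 64 MB chunks", 64 * 2**20),
                    ("Colossus, 1 MB chunks", 1 * 2**20)]:
    chunks = data // chunk
    gib = chunks * meta_per_chunk / 2**30
    print(f"{name}: {chunks:,} chunks, ~{gib:,.0f} GiB of metadata")
# GFS:      ~168 million chunks, ~10 GiB   (fits in one master's RAM)
# Colossus: ~10.7 billion chunks, ~640 GiB (needs distributed metadata)
```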

Chunkserver: failures are detected via heartbeats to the master; when a chunk's replica count drops, the master re-replicates it from the surviving copies onto other chunkservers.

Master: its operation log and checkpoints are replicated on multiple machines, so a replacement master can be started from the replicated state; read-only "shadow" masters can keep serving metadata while the primary is down.

Consistency: GFS applications are designed to accommodate the relaxed consistency model; e.g., record append is at-least-once, so readers must tolerate duplicates and padding.
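Concretely, a retried append can leave padding or duplicate records in a chunk, and readers filter them out using writer-assigned unique record IDs. A reader-side sketch (the record format here is illustrative):

```python
def read_records(records):
    """Reader-side handling of record-append semantics (sketch).

    `records` yields (record_id, payload) pairs, or None where the
    chunk contains padding from a failed append. Duplicates from
    retried appends are dropped by tracking ids already seen.
    """
    seen = set()
    for rec in records:
        if rec is None:          # padding / garbage region: skip
            continue
        rec_id, payload = rec
        if rec_id in seen:       # duplicate from a retried append
            continue
        seen.add(rec_id)
        yield payload

stream = [("a", b"x"), None, ("b", b"y"), ("a", b"x")]  # dup of "a"
print(list(read_records(stream)))  # [b'x', b'y']
```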

Chunk replication guarantees high availability; checksums guarantee data integrity.
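A sketch of the integrity check: GFS checksums each 64 KB block of a chunk with a 32-bit checksum and re-verifies on every read (zlib's CRC-32 stands in here for the actual checksum function):

```python
import zlib

BLOCK = 64 * 1024   # GFS checksums each 64 KB block of a chunk

def checksum_blocks(chunk: bytes):
    """32-bit CRC per 64 KB block, stored when the block is written."""
    return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

def verify(chunk: bytes, sums):
    """Chunkservers re-checksum blocks on every read; a mismatch means
    silent disk corruption, so the data is re-read from another replica."""
    return checksum_blocks(chunk) == sums

data = bytes(200 * 1024)            # 200 KB of zeros
sums = checksum_blocks(data)
corrupted = b"\x01" + data[1:]      # flip one byte
print(verify(data, sums), verify(corrupted, sums))   # True False
```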

Post-GFS

Colossus: the next-generation GFS. Related readings on erasure-coded storage:

"A Piggybacking Design Framework for Read-and Download-efficient Distributed Storage Codes", IEEE ISIT 2013, IEEE Transactions on Information Theory, 2017.

"A "Hitchhiker's" Guide to Fast and Efficient Data Reconstruction in Erasure-coded Data Centers", ACM SIGCOMM 2014.

"Erasure Coding in Windows Azure Storage", USENIX ATC, 2012.

“On the locality of codeword symbols”, Transactions on Information Theory, 2012.

Erasure Code vs Replication
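The trade-off in numbers, plus a toy single-parity demo (XOR parity is the simplest erasure code; production systems such as those in the papers above use full Reed-Solomon codes):

```python
from functools import reduce

# Storage overhead vs. fault tolerance for standard schemes:
#   3-way replication: 3.0x storage, survives the loss of 2 copies
#   Reed-Solomon RS(k, m): (k+m)/k storage, survives any m losses
for k, m in [(6, 3), (10, 4)]:
    print(f"RS({k},{m}): {(k + m) / k:.2f}x storage, survives {m} losses")

# Minimal erasure-coding demo: one XOR parity block (a toy RS(k, 1)).
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

blocks = [b"data0", b"data1", b"data2"]
parity = reduce(xor, blocks)

lost = blocks[1]                                     # pretend a disk died
rebuilt = reduce(xor, [blocks[0], blocks[2], parity])
print(rebuilt == lost)   # True: survivors + parity recover the block
```

Erasure coding roughly halves the storage overhead of 3x replication, but rebuilding a lost block requires reading k surviving blocks over the network; reducing that repair traffic is exactly what the Piggybacking and Hitchhiker papers above address.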

Chris: What a Real System Looks Like

Q: What is the difference between a database and a file system?

A: A file system stores files; a database stores more structured data. But in practice people use HDFS like a blob store, similar to Amazon S3. A file system assumes large streaming reads of gigantic data, whereas a database reads specific records.

In a real working environment, you rarely redesign a system from the ground up, because you simply can't shut down the entire system. Also, many people have touched the project and the system is complex, so you only know your own part's contract (the microservice you are in charge of). End-to-end system design is therefore rarely possible in practice.

Difficulties to deal with:

Ongoing hot ideas and tools: service mesh, ClickHouse database
