Lecture 010 - Fault Tolerance

Some of the content of this note comes from 20 - Database Logging Schemes (CMU Databases Systems / Fall 2019) and 21 - ARIES Database Recovery (CMU Databases Systems / Fall 2019)

Fault Tolerance

Failure Model

Partial Failures: the system keeps working to some degree, but is not reliable

Transient: appears once and then disappears
Intermittent: occurs, vanishes, and reappears with no real pattern
Permanent: persists until it is repaired

Dependability:

Availability: ready to use at any given moment (percentage of time up)
Reliability: runs continuously without interruption (a system with planned, predictable downtime can still be highly reliable)
Safety: no permanent loss of state
Maintainability (Recovery): easy to repair quickly
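As a small illustration of availability as "percentage of time up", the sketch below computes it from observed uptime and downtime. The function name and the sample numbers are assumptions made for this example only.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Fraction of the observed time during which the system was ready to use."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_hours / total


# Hypothetical year of 8760 hours with 8.76 hours of downtime.
yearly = availability(8760 - 8.76, 8.76)
print(f"{yearly:.4%}")
```

Note that this number says nothing about reliability: many very short outages and one long outage can produce the same availability.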

Transaction Failure: the transaction can't commit because of constraints, locks, or invariants.

Software Failure: software bug, segmentation fault

Hardware Failure: wire disconnection, flipping bits

Redundancy

Masking Failures by Redundancy:

Information Redundancy: recover partial state using extra data (Hamming Codes)
Time Redundancy: repeat an operation even though a single execution would be enough when nothing fails (e.g. retry)
Physical Redundancy: add duplicate hardware
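To make information redundancy concrete, here is a minimal sketch of a Hamming(7,4) code: 4 data bits are sent with 3 extra parity bits, and any single flipped bit can be located and corrected. The function names are illustrative, and the bit layout (parity bits at positions 1, 2 and 4) is one common convention, assumed here.

```python
def hamming74_encode(data):
    """Encode 4 data bits as 7 bits laid out as [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4  # 1-based position of the error, 0 if none
    if position:
        c[position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


word = [1, 0, 1, 1]
sent = hamming74_encode(word)
received = list(sent)
received[5] ^= 1  # one bit flips in transit
assert hamming74_decode(received) == word
```

The receiver repairs the word without asking for a resend, at the cost of 3 extra bits per 4 data bits.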

Recovery: can be broken into two types

Backward Recovery: return to a previously saved, correct checkpoint state (e.g. resend packets)
- checkpoints can be expensive
Forward Recovery: move directly to a correct new state using the information currently available (e.g. erasure codes)
- all potential errors need to be accounted for in every message
- harder to implement
- needs mathematical structure
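A minimal sketch of forward recovery with the simplest possible erasure code: one XOR parity block is sent alongside the data blocks, so the receiver can rebuild any single lost block from what did arrive, without rolling back or asking for a resend. The helper names are hypothetical, and real erasure codes (e.g. Reed-Solomon) tolerate more than one loss.

```python
def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def add_parity(data_blocks):
    """Append one parity block that is the XOR of all data blocks."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(received):
    """Rebuild the single missing block (marked as None) from the others."""
    missing = [i for i, block in enumerate(received) if block is None]
    if len(missing) != 1:
        raise ValueError("this sketch can repair exactly one lost block")
    present = [block for block in received if block is not None]
    repaired = list(received)
    repaired[missing[0]] = xor_blocks(present)
    return repaired


blocks = [b"abcd", b"efgh", b"ijkl"]
sent = add_parity(blocks)
arrived = [sent[0], None, sent[2], sent[3]]  # the second block was lost
assert recover(arrived)[:3] == blocks
```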

Checkpoint

Checkpoint: a snapshot of the system state that we can roll back to

Checkpointing assumes reliable storage
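A minimal sketch of checkpoint and rollback for a single process. The class and its fields are made up for illustration, and the in-memory copy stands in for the reliable storage that a real checkpoint would be written to.

```python
import copy


class CheckpointedState:
    def __init__(self):
        self.state = {"balance": 0}
        # Assumption: this copy stands in for a snapshot on reliable storage.
        self._saved = copy.deepcopy(self.state)

    def checkpoint(self):
        """Snapshot the current state."""
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        """Backward recovery: return to the last snapshot."""
        self.state = copy.deepcopy(self._saved)


node = CheckpointedState()
node.state["balance"] = 100
node.checkpoint()
node.state["balance"] = -999  # a fault corrupts the state
node.rollback()
assert node.state == {"balance": 100}
```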

Independent vs. Coordinated

Independent checkpoint: each node takes its own checkpoint (cascading rollback can be used to eventually find a state that is consistent across all machines)
Coordinated checkpoint: all nodes are synchronized when a checkpoint is taken

2-phase Blocking Protocol

coordinator multicasts CHECKPOINT to request a checkpoint
participants stop running the application, save their state, and reply with ACK
after all states are saved, the coordinator sends DONE and the participants continue the application

The 2-phase Blocking Protocol has potential failure cases:
- a message is received by a participant between the time CHECKPOINT is sent and the time its state is saved
- a message is sent by a participant between the time CHECKPOINT is sent and the time its state is saved
- and other...
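The failure-free path of the 2-phase blocking protocol can be sketched as below. This is a single-process simulation with no real network; the class and method names are hypothetical, and it assumes the coordinator sends DONE only after every participant has replied ACK.

```python
class Participant:
    def __init__(self, name, state):
        self.name = name
        self.state = dict(state)
        self.saved = None
        self.paused = False

    def on_checkpoint(self):
        self.paused = True             # stop running the application
        self.saved = dict(self.state)  # save local state
        return "ACK"

    def on_done(self):
        self.paused = False            # continue the application

    def apply(self, key, value):
        if self.paused:
            raise RuntimeError(f"{self.name} is blocked during a checkpoint")
        self.state[key] = value


def coordinated_checkpoint(participants):
    # Phase 1: multicast CHECKPOINT and collect the replies.
    acks = [p.on_checkpoint() for p in participants]
    if not all(ack == "ACK" for ack in acks):
        return False
    # Phase 2: multicast DONE so everyone resumes.
    for p in participants:
        p.on_done()
    return True


nodes = [Participant("n1", {"x": 1}), Participant("n2", {"y": 2})]
ok = coordinated_checkpoint(nodes)
nodes[0].apply("x", 10)  # allowed again after DONE
```

Because every participant is paused between its ACK and DONE, no application state changes while the snapshots are taken. The sketch does not model the in-flight messages described above.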

Logging and Recovery

DBMS: database management system. It needs to ensure that

once the DBMS announces that something is committed, it persists
if the DBMS has not announced a commit, nothing is changed
these two promises still hold even if the DBMS crashes in the middle of a transaction
announcing a commit means that the database program returns to the function caller (the caller should generally wait for the COMMIT announcement to ensure consistency)

Transaction: can run in parallel with others, starts with START, ends with COMMIT. Must be implemented to give the illusion of being "atomic".

We use UNDO and REDO to implement the promises above.

Database Model

read from disk to memory
do calculation in memory
write to disk from memory

It is impractical to operate directly on disk just to survive crashes, because disks are slow. So we store just enough data on disk to recover to a valid state after a crash.

Two transactions sharing an in-memory buffer pool for the database

Buffer Pool: an in-memory pool of database pages that can be flushed to disk

steal policy: whether dirty data from an uncommitted transaction is allowed to replace the current data on disk before that transaction commits
force policy: whether all of a transaction's changes must be written to disk before its commit is announced

The word "steal" means that a transaction $p$ sharing the buffer pool with other transactions can force them to write their dirty data to disk, so that $p$ can "steal" buffer pool space from them.

The word "force" means that a commit forces the writes to disk.

No-steal and force: makes things reliable, but it requires that the buffer pool (the write sets of all running transactions) does not exceed physical memory.

when a transaction $p$ commits, make a copy of the affected page
in the new copy: UNDO other transactions' uncommitted changes
write the copy, which now contains only $p$'s changes, to disk
if another transaction aborts, UNDO its changes only in the in-memory buffer pool

The simple approach above is rarely implemented in database systems. A better approach is Shadow Paging.

Shadow Paging

An implementation of no-steal and force: maintain two separate copies of the DB on disk

master: contains only changes from committed transactions
shadow: holds dirty changes; writes are only allowed on shadow pages
commit: atomically switch the shadow to become the new master
abort: discard the shadow pages
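A minimal in-memory sketch of the master/shadow idea. The class name is hypothetical, the whole page table is copied on `begin` for simplicity (a real system would copy pages on write), and a single Python reference assignment stands in for the atomic switch of the on-disk root pointer.

```python
class ShadowPagedDB:
    def __init__(self, pages):
        self.master = dict(pages)  # committed data only
        self.shadow = None         # exists only while a transaction is running

    def begin(self):
        self.shadow = dict(self.master)

    def write(self, page_id, value):
        if self.shadow is None:
            raise RuntimeError("no active transaction")
        self.shadow[page_id] = value  # the master is never modified in place

    def read_committed(self, page_id):
        return self.master[page_id]

    def commit(self):
        # One reference swap: the shadow becomes the new master.
        self.master, self.shadow = self.shadow, None

    def abort(self):
        self.shadow = None  # just discard the shadow


db = ShadowPagedDB({"p1": "old"})
db.begin()
db.write("p1", "new")
before_commit = db.read_committed("p1")  # still "old"
db.commit()
after_commit = db.read_committed("p1")   # now "new"
```

Recovery after a crash is trivial in this scheme: the master never contains uncommitted data, so there is nothing to UNDO or REDO.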

Shadow paging favors consistency over performance:
- copying the database is expensive
- it supports only one writer transaction at a time, or transactions committed together in a batch
- data becomes fragmented, which leads to random I/O (main reason)

So in practice, almost nobody uses shadow paging either (SQLite abandoned it around 2010).

Write-Ahead Logging (WAL)

Write-Ahead Logging: maintain a log file, separate from the data file, that records every change made to the database

the log is assumed to live on stable storage
the log has the information needed to perform UNDO and REDO to restore the database
the log isn't meant to be human-readable (although it can be)
the log records are written to disk before the commit is announced (they can be buffered in memory first, as long as they reach disk before the commit)

It implements steal + no-force: uncommitted changes are allowed to reach disk because we can UNDO them using the log, and committed changes need not be forced to the data file because we can REDO them from the log.

Log contains:

a <BEGIN> is written before every transaction starts
a <COMMIT> is written when a transaction commits, before the commit is announced
a <DATA> is written for every change to a single object, containing
- Transaction ID
- Object ID
- Before Value (for UNDO)
- After Value (for REDO)
a <CHECKPOINT> is written once in a while, stopping all transactions, to keep the log short. (You still need to look before the <CHECKPOINT> line for changes that were uncommitted at that point)
during recovery (REDO or UNDO), we also write to the log so that we can handle a failure during recovery too
a <TXN-END> is written after <COMMIT> has been successfully written to disk (so if recovery sees this tag, it can safely ignore the transaction); unlike <COMMIT>, it does not need to be flushed immediately.
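The record types above can be put together in a toy sketch of write-ahead logging with a simple recovery pass. The class, tuple layout, and crash scenario are assumptions for illustration; checkpoints, <TXN-END>, and logging during recovery are left out.

```python
class WalDB:
    def __init__(self):
        self.log = []   # stands in for the log file on stable storage
        self.disk = {}  # data file; may hold uncommitted values (steal)

    def begin(self, txn):
        self.log.append(("BEGIN", txn))

    def write(self, txn, obj, value):
        before = self.disk.get(obj)
        # Write ahead: the log record is appended before the data changes.
        self.log.append(("DATA", txn, obj, before, value))
        self.disk[obj] = value

    def commit(self, txn):
        self.log.append(("COMMIT", txn))


def recover(log, disk):
    """REDO committed transactions forward, then UNDO the rest backward."""
    committed = {rec[1] for rec in log if rec[0] == "COMMIT"}
    for rec in log:
        if rec[0] == "DATA" and rec[1] in committed:
            disk[rec[2]] = rec[4]          # after value
    for rec in reversed(log):
        if rec[0] == "DATA" and rec[1] not in committed:
            if rec[3] is None:
                disk.pop(rec[2], None)     # the object did not exist before
            else:
                disk[rec[2]] = rec[3]      # before value
    return disk


db = WalDB()
db.begin("T1"); db.write("T1", "A", 1); db.commit("T1")
db.begin("T2"); db.write("T2", "B", 5)  # crash happens before T2 commits

# Hypothetical disk image at the crash: T2's change was stolen to disk,
# while T1's committed change was never forced to the data file.
crashed_disk = {"B": 5}
recovered = recover(db.log, crashed_disk)
```

After recovery the committed value of A is back and the uncommitted value of B is gone, which is what steal + no-force requires from the log.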

WAL prioritizes runtime performance where shadow paging prioritizes simple consistency, and it is much more widely used.

One reason we don't make the log look like a copy of the buffer pool is so that we can write the log sequentially. We keep a single log for all transactions.

When a log page in memory is full, we switch to a new in-memory log page while the previous one is being written to disk.

A read-only transaction generally does not need any log records.

Performance Tradeoff: No-force and steal is preferred in most applications

Shadow paging trades runtime performance for recovery performance (copying data is slow, but recovery only reads and switches a pointer), while logging trades recovery performance for runtime performance (appending to the log is fast, but replaying it is slower).

Actual Logging Implementation

Physical Logging: record the low-level byte changes made to a specific location (like git diff)
Logical Logging: record the high-level operations (UPDATE, DELETE, INSERT)
Physiological Logging: a hybrid of the two (most commonly used)
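To contrast the first two schemes, the sketch below logs the same update once physically (one record per changed row, with before and after values) and once logically (one record describing the operation). The record layouts and the toy table are invented for this example.

```python
rows = {1: 10, 2: 20, 3: 5}  # toy table: row id -> value

# Physical: what changed and where, one record per changed row.
physical_log = [
    {"row": 1, "before": 10, "after": 11},
    {"row": 2, "before": 20, "after": 21},
]

# Logical: the operation itself, roughly "UPDATE ... SET v = v + 1 WHERE v >= 10".
logical_log = [{"op": "UPDATE", "delta": 1, "where_min": 10}]


def redo_physical(table, log):
    out = dict(table)
    for rec in log:
        out[rec["row"]] = rec["after"]
    return out


def redo_logical(table, log):
    out = dict(table)
    for rec in log:
        for row, value in list(out.items()):
            if value >= rec["where_min"]:
                out[row] = value + rec["delta"]
    return out


assert redo_physical(rows, physical_log) == redo_logical(rows, logical_log)
```

The logical log is much smaller, but replaying it means re-executing the operation. The physical redo is also idempotent (applying it twice gives the same table), while this logical redo is not.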

Algorithms for Recovery and Isolation Exploiting Semantics (ARIES)

Name	Where	Definition
flushedLSN	Memory	last LSN in the log on disk (changes on every flush)
pageLSN	Page	latest LSN that updated the page (changes on every record)
recLSN	Page	first LSN that made the page dirty since its last flush (changes on every flush)
lastLSN	Transaction	latest LSN written by the txn (changes on every record of the txn)
MasterRecord	Disk	LSN of the latest checkpoint (changes on every checkpoint)
prevLSN	Transaction	pointer to the previous LSN of the same transaction, followed when walking backward
undoNext	Transaction	the next LSN to undo during recovery (the prevLSN of the record just undone)

Transaction Table (TT, also written ATT for Active Transaction Table below): kept in memory, stores

transactionID of each transaction that is still active (not yet ended)
lastLSN of that transaction

Dirty Page Table (DPT): kept in memory, stores

pageID: ID of a dirty page
recLSN: LSN of the first record that made the page dirty

Compensation Log Record (CLR): the log record written for each undo action

undoNext: the LSN of the next record to undo for that transaction
CLRs are never undone; we only REDO them during recovery

ABORT

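Announce ABORT immediately (if the transaction's changes have not reached disk, nothing needs fixing there; if they have, for example because another transaction stole the page, the log lets us revert them. To make the result of the abort visible to readers, the following steps are needed)
Locate the lastLSN of the transaction
UNDO its updates in reverse order by following prevLSN
for each update that is undone, write a CLR with undoNext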
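The abort steps above can be sketched as follows. The `LogRecord` layout and the `abort` helper are hypothetical; LSNs are simply list indexes in this toy log, and the buffer pool is a plain dict.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogRecord:
    lsn: int
    txn: str
    kind: str                      # "UPDATE", "ABORT", "CLR" or "TXN-END"
    prev_lsn: Optional[int] = None
    obj: Optional[str] = None
    before: Optional[int] = None
    after: Optional[int] = None
    undo_next: Optional[int] = None


def abort(log, pages, txn, last_lsn):
    """Undo txn by walking its prevLSN chain, writing one CLR per undone update."""
    def append(**fields):
        record = LogRecord(lsn=len(log), txn=txn, **fields)
        log.append(record)
        return record.lsn

    last = append(kind="ABORT", prev_lsn=last_lsn)
    cur = last_lsn
    while cur is not None:
        rec = log[cur]
        if rec.kind == "UPDATE":
            pages[rec.obj] = rec.before          # restore the before value
            last = append(kind="CLR", prev_lsn=last, obj=rec.obj,
                          before=rec.after, after=rec.before,
                          undo_next=rec.prev_lsn)  # what to undo next
        cur = rec.prev_lsn
    append(kind="TXN-END", prev_lsn=last)


log = [
    LogRecord(0, "T1", "UPDATE", None, "A", 1, 2),
    LogRecord(1, "T1", "UPDATE", 0, "B", 5, 6),
]
pages = {"A": 2, "B": 6}
abort(log, pages, "T1", last_lsn=1)
```

Each CLR records where the undo should continue, so a crash in the middle of an abort never causes the same update to be undone twice.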

Checkpoint for Clearing Log:

Traditional Checkpoint: halt all incoming transactions, wait until all active transactions finish, then write CHECKPOINT (the wait might take a very long time)
Better Checkpoint: pause transactions and store the ATT and DPT status in the checkpoint, so we need not wait for them to finish
Fuzzy Checkpoint: write CHECKPOINT_BEGIN and CHECKPOINT_END, and don't pause transactions

Recovery Phases:

Analysis: scan the log forward from the last checkpoint to rebuild the ATT (TT) and DPT
1. TXN-END: remove txn from ATT
2. UPDATE: add txn to ATT with status UNDO if it is not already there; if the page is not already in the DPT, add it with recLSN = LSN
3. COMMIT: change txn status to COMMIT
4. CHECKPOINT_END: merge the checkpoint's ATT/DPT information into the current ATT/DPT
5. resulting ATT: all transactions that were active at the crash
6. resulting DPT: dirty pages that might not have made it to disk
Redo: redo everything (even for aborted transactions), restoring the exact state recorded in the log at the moment of the crash
1. Redo every record unless the affected page is not in the DPT, or
2. the affected page is in the DPT, but the record's LSN is less than the page's recLSN (the change already made it to disk)
Undo: undo the transactions (only a portion of all transactions) that had not yet committed at the crash
1. When a transaction has been completely undone, sync to disk so that it does not need to be undone again after the next crash:
- write log to disk
- write page to disk
- append <TXN-END> to the log on disk
2. If you encounter a CLR in the log (for a transaction without <TXN-END>) during UNDO, don't UNDO it, because it was already redone in the redo phase. Instead, follow the undoNext field in the record and continue UNDO from there.
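The three phases can be sketched end to end on a toy log. This is a simplified illustration, not the full ARIES algorithm: it assumes no checkpoints (analysis scans the whole log), uses list indexes as LSNs, keeps one value per page, and skips pageLSN checks. The helper names and the crash scenario are hypothetical.

```python
def rec(lsn, txn, kind, page=None, before=None, after=None,
        prev_lsn=None, undo_next=None):
    return dict(lsn=lsn, txn=txn, kind=kind, page=page, before=before,
                after=after, prev_lsn=prev_lsn, undo_next=undo_next)


def analysis(log):
    att, dpt = {}, {}  # att: txn -> status/lastLSN, dpt: page -> recLSN
    for r in log:
        if r["kind"] == "TXN-END":
            att.pop(r["txn"], None)
            continue
        entry = att.setdefault(r["txn"], {"status": "UNDO", "lastLSN": r["lsn"]})
        entry["lastLSN"] = r["lsn"]
        if r["kind"] == "COMMIT":
            entry["status"] = "COMMIT"
        elif r["kind"] in ("UPDATE", "CLR"):
            dpt.setdefault(r["page"], r["lsn"])  # first LSN that dirtied the page
    return att, dpt


def redo(log, dpt, pages):
    for r in log:
        if r["kind"] not in ("UPDATE", "CLR"):
            continue
        if r["page"] not in dpt or r["lsn"] < dpt[r["page"]]:
            continue  # the change already made it to disk
        pages[r["page"]] = r["after"]


def undo(log, att, pages):
    todo = {t: e["lastLSN"] for t, e in att.items() if e["status"] == "UNDO"}
    while todo:
        txn = max(todo, key=todo.get)  # always undo the largest LSN first
        r = log[todo[txn]]
        if r["kind"] == "UPDATE":
            pages[r["page"]] = r["before"]
            log.append(rec(len(log), txn, "CLR", r["page"], r["after"], r["before"],
                           prev_lsn=att[txn]["lastLSN"], undo_next=r["prev_lsn"]))
            att[txn]["lastLSN"] = len(log) - 1
            nxt = r["prev_lsn"]
        elif r["kind"] == "CLR":
            nxt = r["undo_next"]  # already redone, so jump past what it undid
        else:
            nxt = r["prev_lsn"]
        if nxt is None:
            log.append(rec(len(log), txn, "TXN-END", prev_lsn=att[txn]["lastLSN"]))
            del todo[txn]
        else:
            todo[txn] = nxt


log = [
    rec(0, "T1", "UPDATE", "A", 1, 2),
    rec(1, "T2", "UPDATE", "B", 5, 6),
    rec(2, "T1", "COMMIT", prev_lsn=0),
    rec(3, "T1", "TXN-END", prev_lsn=2),
    rec(4, "T2", "UPDATE", "C", 7, 8, prev_lsn=1),
]
pages = {"A": 1, "B": 6, "C": 7}  # hypothetical disk image at the crash

att, dpt = analysis(log)
losers = sorted(att)              # transactions that must be undone
redo(log, dpt, pages)
undo(log, att, pages)
```

In this scenario T1 committed, so its update to A survives, while both of T2's updates are rolled back and T2 gets CLRs and a <TXN-END> in the log. Running `analysis` on the log again afterwards finds no active transactions, so a second crash would not repeat the undo.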

WAL/ARIES can work with 2PC, where an additional log is needed to capture the 2PC behavior.
