Lecture 010 - Fault Tolerance

Some of the content of this note comes from 20 - Database Logging Schemes (CMU Databases Systems / Fall 2019) and 21 - ARIES Database Recovery (CMU Databases Systems / Fall 2019)

Fault Tolerance

Failure Model

Partial Failures: somehow working, but not reliable


Types of Failure

Types of Failure

Transection Failure: can't commit due to constraints, locks, or invariant.

Software Failure: software bug, segmentation fault

Hardware Failure: wire disconnection, flipping bits


Masking Failures by Redundancy:

Redundancy in Electrical Circuits

Redundancy in Electrical Circuits

Recovery: can break into two types


Checkpoint: snapshot the state of the system to roll back

Checkpoint assume reliable storage

Independent vs. Coordinated

2-phase Blocking Protocol

2-phase Blocking Protocol

2-phase Blocking Protocol

2-phase Blocking Protocol has potential failure: - a message is received by participant in the middle of CHECKPOINT sent and state save - a message is sent by participant in the middle of CHECKPOINT sent and state save - and other...

Logging and Recovery

DBMS: database management system. Need to ensure

Transaction: Can be parallel, start with START, end with COMMIT. Must be implemented to give "atomic" allusion.

We use UNDO and REDO to implement above promises.

Database Model

  1. read from disk to memory
  2. do calculation in memory
  3. write to disk from memory

It is impossible to keep everything on disk to avoid crashes because they are slow. So we store enough data on disk to recover to a valid state after a crash.

Two transactions sharing a in-memory buffer pool for database

Two transactions sharing a in-memory buffer pool for database

Buffer Pool: in-memory buffer pool of pages, can flush to disk

The word "steal" means, for a transaction process p sharing the buffer pool with other transactions, force other transactions to write their dirty data to disk, so that p can "steal" more space of buffer pool from others.

The word "force" means the commit force writing on disk.

No Steal and Foce

No Steal and Foce

No steal and force: makes things reliable. But it requires buffer pool (write sets) not exceed physical memory.

  1. when a transaction process p commit, make a copy
  2. in new copy: UNDO other transactions' non-committed changes
  3. write the data in the copy, containing only p's changes to disk
  4. if other transaction abort, then UNDO only in memory buffer pool

The simple approach above is rarely implemented in database system. A better approach is Shadow Paging

Shadow Paging

Implementing implement of no steal and force: maintaining two separate copies of DB on disk

Shadow paging ensure consistent rather than performance: - copy database is expensive - transaction can either batch copy-on-write or one at a time - fragmented data and random I/O (main reason)

So in reality, nobody use shadlow paging too (SQLite abandoned around 2010)

Write-Ahead Logging (WAL)

Write-Ahead Logging: maintain a log file separate from data file that contains all database change commands

It implement steal + no-force because we allow uncommitted change to write to disk because we can UNDO using log.

Database Logging

Database Logging

Log contains:

WAL ensures performance more than consistent. It is more widely used.

One reason why we don't make log look like a copy of buffer pool is so that we can perform sequential write on log. We keep one log for all transactions.

When a log page in memory is full, we switch to a new log page in memory while waiting the previous log page to write to disk.

If you have read-only transaction, you generally don't need to keep a log.

Performance Tradeoff: No-force and steal is preferred in most applications

Performance Tradeoff: No-force and steal is preferred in most applications

Shadow page ensures recovery performance is better than runtime performance (copy data is slow, read and switch data pointer is fast) while log ensures runtime performance is better than recovery performance (writing log is fast, but log replay is slower)

Logging Schemes

Logging Schemes

Actual Logging Implementation

Algorithms for Recovery and Isolation Exploiting Semantics (ARIES)

Name Where Definition
flushedLSN Memory Last LSN in log on disk (change every flush)
pageLSN page lastest LSN to page (change every record)
recLSN page first LSN that is dirty (change every flush)
lastLSN Transaction lastest LSN to txn (every txn's record)
MasterRecord Disk latest checkpoint (change every checkpoint)
prevLSN Transaction LSN pointer during reverse, per transaction
undoNext Transaction what to undo next during recovery (prevLSN)

Transaction Table (TT): in memory, store

Dirty Page Table (DPT): in memory store

Compensation Log Record during ABORT

Compensation Log Record during ABORT

Compensation Log Record (CLR): the log of undo


  1. Announce ABORT immediately (Since if not yet written to disk, we are fine. If so by other transaction, we can revert easily. But if you want to read the result of abort, you need the following steps)
  2. Locate lastLSN for transaction
  3. UNDO update using prevLSN
  4. when UNDO, write CLR with undoNext

Checkpoint for Clearing Log:

Recovery Phrases

Recovery Phrases

Analysis Phrase

Analysis Phrase

Recovery Phrases:

  1. Analysis: scan through database to build TT and DPT
    1. TNX-END: remove txn from ATT
    2. UPDATE, UNDO: add txn to ATT (if not already in, set recLSN = LSN)
    3. COMMIT: change status to COMMIT
    4. CHECKPOINT_END: add ATT/DPT infomation of checkpoint to current ATT/DPT
    5. ATT: all transaction that is active during crash
    6. DPT: dirty pages that might not have made it to disk
  2. Redo: redo everything (even for aborted transaction), restore to exact state when log saved before crash
    1. Redo unless: affected page is not in DPT, or
    2. affected page is in DPT, but LSN less than recLSN (already made into disk)
  3. Undo: undo for transactions (only a portion of all transaction) that has net yet committed
    1. When a transaction completely UNDO: sync to disk and don't need to UNDO again for next crash 1. write log to disk 2. write page to disk 3. append CLR: <TXN-END> to disk
    2. If you have a CLR in log (that is not already <TNX-END>) during UNDO, don't UNDO it. Because you already REDO it in redo phrase. Instead, go to undoNext field in the record and start UNDO from there.

WAL/ARISE can work with 2PC where we need an additional log to capture 2PC behavior

Table of Content