Some of the content of this note comes from 20 - Database Logging Schemes (CMU Databases Systems / Fall 2019) and 21 - ARIES Database Recovery (CMU Databases Systems / Fall 2019)
Partial Failures: somehow working, but not reliable
Transient: appear once and then disappear
Intermittent: occur, vanish, and reappear with no real pattern
Permanent: require repair
Dependability:
Availability: ready to use in every moment (percent time up)
Reliability: run continuously without interruption (predictable interruption implies high reliability)
Safety: no permanent loss of state
Maintainability (Recovery): easy to repair quickly
Transaction Failure: can't commit due to constraints, locks, or invariants.
Software Failure: software bug, segmentation fault
Hardware Failure: wire disconnection, flipping bits
Masking Failures by Redundancy:
Information Redundancy: recover partial state using extra data (Hamming Codes)
Time Redundancy: repeat an operation when a single attempt would normally be enough (e.g. retry)
Physical Redundancy: add duplicate hardware
Recovery: can be broken into two types
Backward Recovery: return to previous saved correct checkpoint state (e.g. resend packets)
Forward Recovery: go to correct new state directly using current information (e.g. erasure code)
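As a small illustration of forward recovery, the sketch below uses a single XOR parity block (about the simplest erasure code) to rebuild one lost block from the surviving blocks, with no rollback. The function names `xor_blocks` and `recover_missing` are made up for this note, and the sketch assumes equal-length blocks and at most one lost block.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def recover_missing(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors plus the parity block."""
    return xor_blocks(list(surviving_blocks) + [parity])

data = [b"abcd", b"efgh", b"ijkl"]
parity = xor_blocks(data)                               # stored alongside the data
rebuilt = recover_missing([data[0], data[2]], parity)   # pretend block 1 was lost
```

Because XOR is its own inverse, XOR-ing the parity with every surviving block leaves exactly the missing block.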
Checkpoint: a snapshot of the system state to roll back to
Checkpoints assume reliable storage
Independent vs. Coordinated
Independent checkpoint: each node takes its own checkpoint (cascaded rollback can be implemented to eventually find a state that is consistent for all machines)
Coordinated checkpoint: all nodes are synchronized when a checkpoint is taken
2-phase Blocking Protocol
- coordinator multicasts CHECKPOINT, requesting a checkpoint
- participants stop the application, save their state, and ACK
- after all states are saved, the coordinator sends DONE
- participants continue the application
The 2-phase Blocking Protocol can still fail:
- a message is received by a participant between CHECKPOINT being sent and its state being saved
- a message is sent by a participant between CHECKPOINT being sent and its state being saved
- and others...
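The blocking protocol above can be sketched as a single-process simulation. This is only an illustration: the `Participant` class, its method names, and the `saved` attribute (standing in for reliable storage) are assumptions made for this note, and real message passing, timeouts, and failures are left out.

```python
import copy

class Participant:
    def __init__(self, name, state):
        self.name = name
        self.state = state
        self.blocked = False
        self.saved = None              # stands in for reliable storage

    def on_checkpoint(self):
        """Phase 1: stop the application, save state, reply ACK."""
        self.blocked = True
        self.saved = copy.deepcopy(self.state)
        return "ACK"

    def on_done(self):
        """Phase 2: the coordinator said DONE, so resume the application."""
        self.blocked = False

    def apply(self, key, value):
        """Application work; refused while a checkpoint is in progress."""
        if self.blocked:
            raise RuntimeError(f"{self.name} is blocked by a checkpoint")
        self.state[key] = value

def coordinated_checkpoint(participants):
    """Coordinator: multicast CHECKPOINT, wait for every ACK, then multicast DONE."""
    acks = [p.on_checkpoint() for p in participants]
    if all(ack == "ACK" for ack in acks):
        for p in participants:
            p.on_done()
        return True
    return False

nodes = [Participant("a", {"x": 1}), Participant("b", {"y": 2})]
ok = coordinated_checkpoint(nodes)
nodes[0].apply("x", 5)                 # allowed again after DONE
```

Here all participants are blocked during the same window, which is what makes the saved states mutually consistent in this simplified setting.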
DBMS: database management system. It needs to ensure:
- once the DBMS announces that something is committed, it persists
- if the DBMS has not announced a commit, nothing is changed
- these two promises still hold even if it crashes in the middle of a transaction
Announcing commit means the database program returns to the function caller (the caller should generally wait for the COMMIT announcement to ensure consistency).
Transaction: can run in parallel, starts with START, ends with COMMIT. Must be implemented to give the illusion of being "atomic".
We use UNDO and REDO to implement the above promises.
Database Model
It is impractical to keep everything on disk at all times just to survive crashes, because disks are slow. So we store just enough data on disk to recover to a valid state after a crash.
Buffer Pool: in-memory buffer pool of pages, can flush to disk
steal policy: whether a transaction's uncommitted (dirty) data in memory is allowed to overwrite the current data on disk before that transaction commits
force policy: whether all of a transaction's changes must be written to disk before the commit is announced
The word "steal" comes from a transaction p that shares the buffer pool with other transactions: p forces the others to write their dirty data to disk so that p can "steal" their buffer pool space.
The word "force" means that a commit forces the writes to disk.
No steal and force: makes things reliable, but it requires that the write sets in the buffer pool do not exceed physical memory.
- no UNDO of other transactions' non-committed changes is needed on disk
- UNDO only happens in the in-memory buffer pool
The simple approach above is rarely implemented in database systems. A better approach is Shadow Paging.
An implementation of no steal and force: maintain two separate copies of the DB on disk
master: contain only changes from committed transactions
shadow: dirty changes, only allow write to shadow page
commit: atomically switch between shadow and master
abort: discard the shadow page
Shadow paging favors consistency over performance:
- copying the database is expensive
- a transaction can either batch its copy-on-write or do it one page at a time
- fragmented data and random I/O (main reason)
So in reality, nobody uses shadow paging either (SQLite abandoned it around 2010).
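To make the master/shadow idea concrete, here is a minimal sketch that keeps the whole database in one JSON file and uses an atomic file rename as the "switch". The class `ShadowPagedDB` and its file names are invented for this note; a real system shadows individual pages and swaps a root pointer rather than copying the full database.

```python
import json
import os
import tempfile

class ShadowPagedDB:
    def __init__(self, directory):
        self.master = os.path.join(directory, "master.json")
        self.shadow = os.path.join(directory, "shadow.json")
        if not os.path.exists(self.master):
            self._write(self.master, {})

    def _write(self, path, data):
        with open(path, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())       # make sure the bytes reach disk

    def read(self):
        """Readers only ever see the master copy (committed data)."""
        with open(self.master) as f:
            return json.load(f)

    def begin(self):
        """Copy master to shadow; this full copy is the expensive part."""
        self._write(self.shadow, self.read())

    def put(self, key, value):
        """Dirty changes go to the shadow copy only."""
        with open(self.shadow) as f:
            data = json.load(f)
        data[key] = value
        self._write(self.shadow, data)

    def commit(self):
        os.replace(self.shadow, self.master)   # atomic switch: shadow becomes master

    def abort(self):
        if os.path.exists(self.shadow):
            os.remove(self.shadow)             # discard the shadow copy

with tempfile.TemporaryDirectory() as d:
    db = ShadowPagedDB(d)
    db.begin(); db.put("x", 1); db.commit()
    db.begin(); db.put("x", 2); db.abort()     # aborted change never reaches master
    result = db.read()
```

A crash before `commit` leaves the master untouched, so no UNDO or REDO is needed, which matches the no steal + force behavior described above.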
Write-Ahead Logging: maintain a log file, separate from the data file, that contains all changes made to the database
- the log lives on stable storage
- the log has enough information to perform UNDO and REDO to restore the database
- the log isn't meant to be human-readable (though it can be)
- log records are written to disk before the changed page and before the commit is announced (they can be buffered in memory first, as long as they reach disk before the commit)
- it implements steal + no-force: uncommitted changes are allowed to reach disk because we can UNDO them using the log
Log contains:
- a <BEGIN> is written before every transaction starts
- a <COMMIT> is written before every transaction commits
- a <DATA> is written for every change to a single object, containing
  - the old value (for UNDO)
  - the new value (for REDO)
- a <CHECKPOINT> is written once in a while, stopping all transactions, to keep the log short. (You still need to check for uncommitted changes before the <CHECKPOINT> line.)
- during recovery (REDO or UNDO), we also write to the log so that we can handle a failure during recovery too
- a <TXN-END> is written after <COMMIT> is successfully written to disk (so in recovery, if we see this tag, we can safely ignore this transaction). It does not need to be flushed right away the way <COMMIT> does.
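The sketch below shows how the old/new values in <DATA> records are enough to repair the disk after a crash under steal + no-force. It is a simplified value-logging recovery, not ARIES: the tuple layout of the records and the `recover` function are assumptions for this note, and it assumes no two transactions write the same key concurrently.

```python
def recover(log, disk):
    """REDO committed transactions, then UNDO uncommitted ones.

    Assumed record shapes (illustrative only):
      ("BEGIN", txn), ("COMMIT", txn), ("DATA", txn, key, old, new)
    """
    committed = {rec[1] for rec in log if rec[0] == "COMMIT"}

    # REDO: replay forward, because no-force means committed changes may be missing on disk.
    for rec in log:
        if rec[0] == "DATA" and rec[1] in committed:
            _, _, key, _, new = rec
            disk[key] = new

    # UNDO: walk backward, because steal means uncommitted changes may be on disk.
    for rec in reversed(log):
        if rec[0] == "DATA" and rec[1] not in committed:
            _, _, key, old, _ = rec
            if old is None:
                disk.pop(key, None)    # the key did not exist before the change
            else:
                disk[key] = old
    return disk

log = [
    ("BEGIN", "T1"), ("DATA", "T1", "a", 1, 10), ("COMMIT", "T1"),
    ("BEGIN", "T2"), ("DATA", "T2", "b", 2, 20),           # T2 never committed
]
# At crash time: T1's change was not flushed yet, T2's dirty page was stolen to disk.
disk = recover(log, {"a": 1, "b": 20})
```

After recovery the committed change to `a` is present and the uncommitted change to `b` is gone, which are the two promises from the DBMS section.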
WAL favors runtime performance, which is why it is more widely used.
One reason why we don't make log look like a copy of buffer pool is so that we can perform sequential write on log. We keep one log for all transactions.
When a log page in memory is full, we switch to a new log page in memory while waiting for the previous log page to be written to disk.
If you have read-only transaction, you generally don't need to keep a log.
Shadow paging makes recovery faster than normal operation (copying data is slow, reading and switching a pointer is fast), while logging makes normal operation faster than recovery (writing the log is fast, but log replay is slower).
Actual Logging Implementation
Physical Logging: record byte-level changes in the log (like git diff)
Logical Logging: record high-level changes in the log (UPDATE, DELETE, INSERT)
Physiological Logging: hybrid of the two (most commonly used)
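A rough way to see the difference between the schemes is to look at what a record has to name in order to be replayed. The record fields below (`offset`, `slot`, `before`, `after`) are illustrative assumptions, not the layout of any particular DBMS.

```python
def redo_physical(page: bytearray, rec):
    """Physical: overwrite exact bytes at an exact offset within the page."""
    off = rec["offset"]
    page[off:off + len(rec["after"])] = rec["after"]

def redo_physiological(slots: list, rec):
    """Physiological: name the page and a slot, but not the byte layout inside it."""
    slots[rec["slot"]] = rec["after"]

page = bytearray(b"hello world")
redo_physical(page, {"offset": 6, "before": b"world", "after": b"WORLD"})

slots = ["alice", "bob"]
redo_physiological(slots, {"page": 1, "slot": 1, "before": "bob", "after": "carol"})

# A logical record would only carry the statement itself, for example
# {"query": "UPDATE ..."}, and replaying it means re-running that statement.
```

The physiological record survives a reorganization of bytes inside the page, while the physical one does not, which is one reason the hybrid is popular.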
Name | Where | Definition |
---|---|---|
flushedLSN | Memory | last LSN in the log on disk (changes on every flush) |
pageLSN | Page | LSN of the latest update to the page (changes on every record) |
recLSN | Page | LSN of the first record that made the page dirty since the last flush (changes on every flush) |
lastLSN | Transaction | latest LSN written by the txn (changes on every record of the txn) |
MasterRecord | Disk | LSN of the latest checkpoint (changes on every checkpoint) |
prevLSN | Transaction | LSN of the txn's previous record, followed when walking backward |
undoNext | Transaction | next LSN to undo during recovery (the prevLSN of the undone record) |
Transaction Table (TT, written ATT for "active transaction table" below): in memory, stores
- transactionID of each transaction that is active (not committed)
- lastLSN of that transactionID
Dirty Page Table (DPT): in memory, stores
- pageID: ID of a dirty page
- recLSN: first LSN that made the page dirty
Compensation Log Record (CLR): the log record of an undo
- undoNext: the LSN of the next record to undo
- a CLR is never undone; we only REDO CLRs during recovery
ABORT
- write ABORT to the log immediately (if the changes have not been written to disk yet, we are fine; if another transaction already pushed them to disk, we can revert them easily; but to make the result of the abort visible, the following steps are needed)
- look up lastLSN for the transaction
- UNDO each update, walking backward using prevLSN
- for each UNDO, write a CLR with undoNext
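The abort steps above can be sketched as follows. This is a simplified model, not a definitive ARIES implementation: the log is a Python list whose index doubles as the LSN, and the record fields (`type`, `txn`, `prevLSN`, `key`, `before`, `after`, `undoNext`) are assumptions made for this note.

```python
def abort(log, db, txn_id, last_lsn):
    """Roll back one transaction, writing a CLR for every undone update."""
    def append(rec):
        rec["lsn"] = len(log)          # the list index doubles as the LSN
        log.append(rec)
        return rec["lsn"]

    prev = append({"type": "ABORT", "txn": txn_id, "prevLSN": last_lsn})
    lsn = last_lsn
    while lsn is not None:             # walk the txn's records backward via prevLSN
        rec = log[lsn]
        if rec["type"] == "UPDATE":
            db[rec["key"]] = rec["before"]            # UNDO the change
            prev = append({
                "type": "CLR", "txn": txn_id, "prevLSN": prev,
                "key": rec["key"], "after": rec["before"],
                "undoNext": rec["prevLSN"],           # what to undo after this one
            })
        lsn = rec["prevLSN"]
    append({"type": "TXN-END", "txn": txn_id, "prevLSN": prev})

log = [
    {"lsn": 0, "type": "BEGIN", "txn": "T1", "prevLSN": None},
    {"lsn": 1, "type": "UPDATE", "txn": "T1", "prevLSN": 0, "key": "x", "before": 1, "after": 2},
    {"lsn": 2, "type": "UPDATE", "txn": "T1", "prevLSN": 1, "key": "y", "before": 5, "after": 6},
]
db = {"x": 2, "y": 6}
abort(log, db, "T1", last_lsn=2)
```

Each CLR's undoNext points past the update it compensates, so a crash in the middle of the abort can resume from there instead of undoing the same update twice.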
Checkpoint for Clearing Log:
Traditional Checkpoint: wait until all transactions finish, halt all incoming transactions, and write CHECKPOINT (the wait might take a very long time)
Better Checkpoint: pause transactions and store the ATT and DPT status in the checkpoint
Fuzzy Checkpoint: write CHECKPOINT_BEGIN and CHECKPOINT_END, don't pause transactions
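Recovery Phases:
Analysis: scan the log forward from the last checkpoint to rebuild the ATT and DPT
- TXN-END: remove txn from ATT
- UPDATE, UNDO: add txn to ATT if it is not already there; for an UPDATE whose page is not already in the DPT, add the page and set recLSN = LSN
- COMMIT: change status to COMMIT
- CHECKPOINT_END: add the ATT/DPT information of the checkpoint to the current ATT/DPT
At the end of analysis:
- ATT: all transactions that were active during the crash
- DPT: dirty pages that might not have made it to disk
Redo: reapply every logged change, except when
- the page is not in the DPT, or
- the page is in the DPT, but the record's LSN is less than the page's recLSN (the change already made it to disk)
Undo: roll back every transaction that was still active at the crash, then sync to disk so that the UNDO does not have to be repeated after the next crash
1. write log to disk
2. write page to disk
3. append CLR: <TXN-END> to disk
If you see a CLR in the log (whose transaction is not already <TXN-END>) during UNDO, don't UNDO it, because you already REDO it in the redo phase. Instead, go to the undoNext field in the record and start UNDO from there.
WAL/ARIES can work with 2PC, where we need an additional log to capture the 2PC behavior.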
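The three recovery phases described above can be put together in a compact sketch. It is deliberately simplified and should be read as an illustration under assumptions: there is no checkpoint (analysis starts at LSN 0), the list index is the LSN, each page holds a single value, and the record fields and the `recover` function are invented for this note.

```python
def recover(log, pages):
    # Analysis: rebuild the ATT and DPT by scanning forward.
    att, dpt = {}, {}
    for rec in log:
        txn, kind = rec["txn"], rec["type"]
        if kind == "TXN-END":
            att.pop(txn, None)
            continue
        entry = att.setdefault(txn, {"status": "UNDO"})
        entry["lastLSN"] = rec["lsn"]
        if kind == "COMMIT":
            entry["status"] = "COMMIT"
        if kind in ("UPDATE", "CLR"):
            dpt.setdefault(rec["page"], rec["lsn"])    # recLSN = first dirtying LSN

    # Redo: repeat history from the smallest recLSN, skipping what is already on disk.
    start = min(dpt.values(), default=len(log))
    for rec in log[start:]:
        if rec["type"] not in ("UPDATE", "CLR"):
            continue
        page = pages.setdefault(rec["page"], {"value": None, "pageLSN": -1})
        if (rec["page"] not in dpt or rec["lsn"] < dpt[rec["page"]]
                or page["pageLSN"] >= rec["lsn"]):
            continue
        page["value"], page["pageLSN"] = rec["after"], rec["lsn"]

    # Undo: roll back transactions still in UNDO status, newest record first.
    to_undo = {e["lastLSN"] for e in att.values() if e["status"] == "UNDO"}
    while to_undo:
        lsn = max(to_undo)
        to_undo.remove(lsn)
        rec = log[lsn]
        txn = rec["txn"]
        if rec["type"] == "CLR":
            nxt = rec["undoNext"]                      # never undo a CLR, just follow it
        else:
            if rec["type"] == "UPDATE":
                clr = {"lsn": len(log), "type": "CLR", "txn": txn, "page": rec["page"],
                       "after": rec["before"], "undoNext": rec["prevLSN"],
                       "prevLSN": att[txn]["lastLSN"]}
                log.append(clr)
                att[txn]["lastLSN"] = clr["lsn"]
                pages[rec["page"]] = {"value": rec["before"], "pageLSN": clr["lsn"]}
            nxt = rec["prevLSN"]
        if nxt is None:                                # nothing left to undo for this txn
            log.append({"lsn": len(log), "type": "TXN-END", "txn": txn,
                        "prevLSN": att[txn]["lastLSN"]})
            del att[txn]
        else:
            to_undo.add(nxt)
    return pages

log = [
    {"lsn": 0, "type": "BEGIN", "txn": "T1", "prevLSN": None},
    {"lsn": 1, "type": "UPDATE", "txn": "T1", "prevLSN": 0, "page": "A", "before": 1, "after": 2},
    {"lsn": 2, "type": "BEGIN", "txn": "T2", "prevLSN": None},
    {"lsn": 3, "type": "UPDATE", "txn": "T2", "prevLSN": 2, "page": "B", "before": 10, "after": 20},
    {"lsn": 4, "type": "COMMIT", "txn": "T1", "prevLSN": 1},
    {"lsn": 5, "type": "TXN-END", "txn": "T1", "prevLSN": 4},
]
# Disk at crash: page A never flushed (no-force), page B flushed early (steal).
pages = recover(log, {"A": {"value": 1, "pageLSN": -1}, "B": {"value": 20, "pageLSN": 3}})
```

In this run the committed update to page A is redone, the uncommitted update to page B is undone with a CLR, and a TXN-END closes T2. Committed transactions that lack a TXN-END would also need one written after analysis, which this sketch omits.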