DESIGN.md

1. Purpose and Scope

This document explains the design of the student kernel in kern/. It does not describe:

The goal is to give a new reader enough context to understand:

The document is intentionally conceptual. It explains the design and the invariants behind the code rather than restating individual functions.

2. System Overview

The kernel is a single-CPU, interrupt-driven, preemptive x86 kernel with per-task user address spaces and kernel-resident scheduling state.

At a high level, the kernel provides:

The design is intentionally small and centralized:

This kernel does not try to hide the hardware. Instead, it leans on a few simple invariants and keeps the assembly/C boundary narrow and explicit.

3. Mental Model

The easiest way to understand the kernel is to think of it as four connected layers.

3.1 Execution Layer

The CPU is either:

Every user thread has a kernel stack. When the CPU traps into the kernel, the kernel runs on that thread's kernel stack. The kernel never needs a separate "kernel thread" abstraction.

3.2 Scheduling Layer

The scheduler owns:

This is the most concurrency-sensitive layer, so it is protected by a single spinlock with interrupts disabled.

3.3 Memory Layer

The memory subsystem owns:

The kernel relies on a shared low-memory direct map so it can always access its own code, data, stacks, and paging structures regardless of the current task.

3.4 Lifecycle Layer

The task/thread subsystem owns:

This layer sits between the memory layer and the scheduler layer, so most of its design work is really about preserving their invariants at the same time.

4. Boot and Top-Level Control Flow

Kernel startup is intentionally linear.

4.1 Boot Sequence

The kernel entry path performs these steps in order:

  1. initialize global kernel state
  2. initialize scheduler data structures
  3. initialize the kernel TID allocator lock
  4. initialize keyboard buffering
  5. install the IDT and timer configuration
  6. initialize and clear the console
  7. initialize paging and load the kernel page directory into CR3
  8. compute the default user EFLAGS
  9. load the idle task
  10. load the init task
  11. switch into the scheduler

The first context switch never returns to kernel_main().

4.2 Why idle Exists

The kernel always wants at least one runnable thread available. Loading an idle task during boot guarantees that a timer interrupt or a blocking operation does not leave the scheduler without a legal target.

4.3 Normal Runtime Path

After boot, execution repeatedly follows this loop:

  1. user code runs
  2. a syscall, interrupt, or exception traps into the kernel
  3. assembly wrapper saves a register frame on the current thread's kernel stack
  4. C handler updates kernel state
  5. control either returns to the same thread or switches to a different one
  6. dispatcher restores registers and returns to user mode or kernel continuation

This "save frame, update state, restore frame" model is the core control-flow idea of the kernel.

5. Core Objects and Ownership

5.1 Task

A task represents a user process. It owns:

A task remains allocated after its last live thread exits. It becomes a zombie until its parent reaps it with wait().

5.2 Thread

A thread represents one schedulable execution context. It owns:

The intrusive queue links are important. A thread can simultaneously belong to:

Using separate link fields avoids heap allocation for queue nodes and makes ownership visible inside the thread object itself.

5.3 Scheduler-Owned Global State

The scheduler owns:

This is all protected by the scheduler lock.

6. Address Space Design

6.1 Virtual Layout

The design splits virtual memory into two large regions:

The user stack lives at the top of the 32-bit address space and grows downward.

6.2 Shared Low-Memory Direct Map

Every task page directory contains the same low-memory kernel mapping.

This is one of the most important design decisions in the kernel.

It means:

The direct map is limited and deliberate rather than a full mirrored copy of all physical memory.

6.3 Per-Task User Space

Above the shared kernel region, each task owns its own user mappings. Those mappings are created by:

The kernel tears these mappings down explicitly when the last thread in a task exits or when exec() replaces the address space.

6.4 Physical Frame Allocation

Physical user frames are managed by a simple free list embedded in the free frames themselves.

This allocator is intentionally minimal:

That design fits the actual kernel needs well because user memory, ZFOD promotion, and stack pages are all page-granularity objects.

6.5 Page Directories and Page Tables

Page directories and page tables are treated as explicit paging-structure objects, separate from ordinary user frames.

This separation matters because "free user memory" and "free paging metadata" are not the same operation.

The kernel therefore keeps two conceptual teardown steps:

This avoids ambiguous APIs and makes teardown order explicit.

7. Virtual Memory Policy

7.1 Program Loading

Executable loading works segment by segment.

For each ELF segment, the kernel:

The user stack is built separately after segments are loaded.

7.2 Stack Construction

The kernel constructs the initial user stack directly in the mapped top stack page. It copies argument strings, lays out the argument pointer array, and places the conventional startup values expected by user startup code.

The design goal is simple:

7.3 ZFOD

Zero-fill-on-demand is used to avoid allocating physical memory for every page up front.

The design uses:

When a write fault hits a present ZFOD mapping, the kernel allocates a real frame, updates the PTE, and resumes execution. This keeps the common case fast and makes BSS and explicit new-page allocation cheap until the pages are actually written.

7.4 new_pages() and remove_pages()

The kernel supports explicit user allocation with new_pages().

Instead of maintaining a separate side metadata structure for each allocation, the kernel stores region boundary markers directly in the first and last PTE of the allocated range. That makes remove_pages() simple:

This design avoids a second allocation table and keeps removal tied to the page tables themselves.

7.5 Pointer Validation

Syscall handlers validate user pointers before reading or writing through them.

The kernel distinguishes:

This keeps validation local to the syscall layer and prevents deeper kernel subsystems from having to reason about raw untrusted user pointers.

8. Dispatcher and Trap Handling

8.1 Register-Frame Design

All traps, interrupts, and syscall entries are normalized into a common saved register frame on the current thread's kernel stack.

The assembly wrappers perform the hardware-specific save/restore work. The C handlers only see a structured register snapshot.

This separation is deliberate:

8.2 Saved CR3 in the Trap Frame

The trap path repurposes one field in the saved register frame to remember the user CR3 value that was active when the trap occurred.

This is a key design choice because the kernel frequently needs to validate user addresses or reason about the interrupted task's address space even after switching back to the kernel page directory.

The one notable exception is swexn, where that field is reused to carry the software exception cause instead.

8.3 Two Return Paths

There are two conceptual return paths from the kernel:

The first switch into a fresh thread is treated as a special case through a sentinel value in the saved stack frame. That keeps initial launch and later resumption on the same machinery without needing a separate scheduler model.

8.4 TSS esp0

Before returning to user mode, the dispatcher updates the TSS kernel-stack pointer to the current thread's kernel stack top. This ensures that the next ring transition enters the kernel on the correct stack.

9. Scheduler Design

9.1 Policy

The scheduler is a single global round-robin scheduler over threads.

There is:

The design favors simplicity over per-task fairness or priority scheduling.

9.2 Locking Model

The scheduler uses a spinlock and disables interrupts while it is held.

This lock protects all scheduler-owned state, including:

The reason for this design is that scheduler state is touched from:

A sleeping lock would be too dangerous here because the scheduler is precisely the code that decides who may sleep and wake.

9.3 Blocking Classification

The kernel classifies blocked threads by why they are blocked.

That block reason is not just descriptive. It is part of the scheduler safety model. The kernel distinguishes:

This design solves a real problem: not every blocked thread may be woken by the same actor.

Examples:

Without this classification, wakeups would be overly permissive and could break invariants inside synchronization primitives.

9.4 Sleep Queue

Sleeping threads are kept in a separate sleep queue. The timer interrupt:

This keeps the sleep mechanism completely scheduler-owned.

10. Task and Thread Lifecycle

10.1 Creation

Task creation allocates:

At this point the new task exists, but it is not yet globally visible until the scheduler registers it and its first thread becomes runnable.

That publication order is intentional. Partially initialized tasks are never visible to unrelated kernel code.

10.2 Kernel Stacks

Each thread gets a multi-page kernel stack. The lowest page is left unmapped as a guard page so overflow becomes a page fault instead of silent corruption.

Kernel stacks are expensive page-aligned objects, so the kernel also keeps a small cache of recently freed stacks. This reduces churn in high-frequency thread creation workloads while preserving the guard-page design.

10.3 fork()

Task fork creates a new task with a copied user address space and a new thread whose saved register frame is derived from the parent's interrupted frame.

The child shares no user pages with the parent. The design chooses a deep copy instead of copy-on-write to keep fault handling and lifetime rules simpler.

10.4 thread_fork()

Thread fork creates a new thread in the same task, so:

This cleanly separates process creation from same-address-space thread creation.

10.5 exec()

exec() is allowed only when a task has a single thread.

This is a deliberate simplification. Replacing the address space of a multi-threaded task would require either killing sibling threads or reasoning about partially replaced shared state. Rejecting that case keeps the model coherent.

The exec path validates and copies user arguments first, builds a new address space, then commits atomically by switching the task's page directory pointer and freeing the old address space.

10.6 Normal Thread Exit

When a thread exits:

The thread does not free its own thread object or its own running stack. That would be unsafe because it is still executing on that stack. Reaping is always done by another thread later.

10.7 Task Exit and Zombies

The last exiting thread turns the task into a zombie. At that point:

The zombie task is then published to its parent's child-zombie queue.

This is the core task-lifecycle idea:

That split keeps wait/reap logic simpler and prevents parents from observing half-dead address spaces.

10.8 Reaping Dead Threads

A task may contain dead thread objects that have exited but not yet been freed.

The kernel reaps them opportunistically from a live thread in the same task or from the parent when reaping a zombie task. The implementation first removes reapable thread objects from the protected task thread list, then frees them outside the task lock.

This avoids doing allocator work while holding the per-task lock.

10.9 wait()

wait() is built around the parent-owned child-zombie queue.

The path is:

  1. reap any dead threads in the current task
  2. check whether a child zombie is already queued
  3. if yes, dequeue it and reap it
  4. if no children exist, return error
  5. otherwise mark the caller as blocked in the task-wait class and deschedule

There is no separate "wait queue object". Waiting uses the same deschedule path as other blocking code, but with a dedicated block reason.

This was an important design choice because it reuses the scheduler's existing context-switch machinery instead of inventing a second one just for wait().

10.10 Orphan Reparenting

When a task dies, its living and zombie children are reparented to init.

This ensures:

The scheduler performs reparenting while holding the scheduler lock because parent-child relationships are scheduler-owned state.

11. Why Task Death Is Hard

Task death is one of the trickiest parts of the kernel because it crosses:

The kernel solves this with three major rules.

11.1 Rule 1: Only Free What Is Safe From the Current Context

The current thread may safely:

It may not safely:

11.2 Rule 2: Only Force-Wake Safe Blocked States

When a task is marked dying, the scheduler wakes sibling threads only if they are in states that are safe to break out of:

It does not force-wake arbitrary kernel synchronization waiters. Those threads may currently be participating in a mutex, semaphore, or condition-variable protocol. Breaking that protocol from the side would risk corrupting the primitive's internal invariants.

11.3 Rule 3: Kill Threads Only at a Safe Boundary

If a sibling thread is not already safely blocked, the kernel does not destroy it in the middle of kernel execution. Instead, handler_post_wrap checks whether the task is dying when the thread is about to return to user mode. If so, the thread exits before re-entering user space.

This guarantees that the thread is not holding kernel-only control-flow state when it is killed.

12. Synchronization Primitives

12.1 Spinlocks

Spinlocks are used only where blocking is impossible or unsafe, especially in the scheduler and interrupt-adjacent paths.

The rule is simple:

12.2 Mutexes

Kernel mutexes are lightweight ownership locks built on atomic exchange plus cooperative yielding to the owner when contended.

This design is simpler than putting blocked mutex waiters on a separate queue. It works because the kernel is uniprocessor and the owner can make progress once scheduled.

The cost is lack of bounded fairness, which is acceptable for this project.

12.3 Condition Variables

Condition variables solve the race between:

The kernel handles this by:

  1. acquiring the condition-variable internal lock
  2. inserting the current thread into the wait queue
  3. setting a per-thread reject flag to zero
  4. releasing the user mutex
  5. releasing the condition-variable internal lock
  6. repeatedly descheduling until the reject flag becomes nonzero

Signal and broadcast do the inverse:

The loop in the waiter is intentional. It tolerates wakeups that are not the target condition becoming true.

12.4 Semaphores

Semaphores are built from:

This keeps the implementation small and makes semaphore behavior a thin policy layer over the already-understood mutex and condition-variable machinery.

13. Interrupts, Exceptions, and swexn

13.1 IDT Policy

The IDT uses:

This preserves the usual x86 expectations:

13.2 Page Fault Policy

Page faults are handled in this order:

  1. detect and reject obvious invalid accesses such as null faults
  2. if the fault is a write to a ZFOD mapping, promote the page and resume
  3. otherwise try swexn if the fault came from user mode
  4. otherwise kill the task with failure status
  5. kernel-mode faults panic

This policy gives user code one chance to recover through swexn without making kernel faults silently survivable.

13.3 General Protection Fault Policy

User-mode protection faults also flow through swexn first and otherwise cause task death. Kernel-mode protection faults are fatal and panic immediately.

13.4 swexn

swexn gives a thread a one-shot user-level exception handler plus an explicit exception stack.

The kernel validates:

On delivery, the kernel clears the registration first, then builds a user exception frame on the supplied exception stack and redirects execution to the handler.

The one-shot behavior is intentional. It avoids recursive accidental reuse of a stale handler registration.

14. Console, Keyboard, and Miscellaneous Syscalls

14.1 Console

The console maintains a software shadow buffer that mirrors the VGA text buffer.

This makes console logic easier because operations such as scrolling, cursor movement, and recoloring are expressed against normal memory first and then flushed to VGA memory.

Print is serialized by a semaphore so two threads do not interleave bytes on the screen.

14.2 Keyboard

Keyboard input is deliberately split into two stages:

This split has two benefits:

The two buffers use different drop policies:

That reflects their different roles. Raw input wants to keep recent hardware activity. Retained cooked input wants to preserve already-buffered characters.

14.3 readfile

The kernel serves executable data directly from the in-memory application table of contents. readfile(".") is treated as a synthetic directory listing of all available program names.

This keeps the project small by avoiding a real filesystem while still exposing file-like behavior to user programs.

14.4 Halt and Miscellaneous Calls

halt() passes through to the simulator halt path. The remaining miscellaneous calls are intentionally thin wrappers around kernel services.

15. Supporting Data Structures

15.1 Intrusive Queues

The intrusive queue macros are the default queue mechanism inside the kernel.

They are used because they:

This is especially useful for threads, which move between several scheduler and sync queues during their lifetime.

15.2 Hash Tables

The kernel's hash table is a simple open-addressed table with tombstones and rehashing. It is used for the global task and thread registries.

That design fits the workload well:

15.3 Auto Array

auto_arr is a generic growable pointer array utility. It is not central to the kernel's main control flow, but it exists as a simple student-owned helper for indexed pointer storage.

16. Locking and Invariant Summary

The most important invariants in the kernel are:

The main lock-partitioning idea is:

This partition keeps the kernel understandable and avoids folding all lifecycle state into one giant global lock.

17. Major Design Decisions

The most important design decisions in this kernel are:

17.1 Shared Kernel Mapping in Every Address Space

This is what makes the rest of the design practical. It lets interrupts, dispatch, and paging code run without constantly rebuilding a kernel mapping.

17.2 Reuse of the General Deschedule Path

wait() and synchronization waits use the same context-save and deschedule machinery as the rest of the kernel. The kernel does not invent a second control-flow path for a special class of blocking.

This keeps the assembly boundary small and makes all blocking operations obey the same scheduler invariants.

17.3 Scheduler-Owned Parent/Child Metadata

Parent-child relationships, zombie publication, and orphan adoption are owned by the scheduler layer rather than being split across ad hoc task-side queues.

That matters because publication and wakeup are scheduling events, not just container mutations.

17.4 Deferred Thread Reaping

Threads do not free their own execution resources. Another thread reaps them later. This avoids self-freeing stacks, which is one of the easiest ways to corrupt control flow in a thread kernel.

17.5 Guarded and Cached Kernel Stacks

Guard pages catch overflows. The stack cache avoids allocator churn. Keeping both was worth the extra complexity because thread-heavy workloads depend on fast stack turnover, but silent kernel-stack corruption is unacceptable.

18. Reading Guide

A new reader should approach the kernel in this order:

  1. kernel.c for startup
  2. task/, scheduler/, and vm/ together for the main execution model
  3. dispatcher/ and int_handlers/ for control transfer
  4. sync/ for blocking and wakeup semantics
  5. io/ and the smaller syscall handlers for device-facing behavior

The kernel is easiest to reason about when viewed as a small set of invariants shared across these modules rather than as many unrelated functions.

Table of Content