DESIGN.md

1. Purpose and Scope

This document explains the design of the student kernel in kern/. It does not describe:

the 410* support code
the user/ tree
the tests

The goal is to give a new reader enough context to understand:

what the kernel is trying to do
how control flows through it
which data structures own which state
how the major race and locking problems are solved

The document is intentionally conceptual. It explains the design and the invariants behind the code rather than restating individual functions.

2. System Overview

The kernel is a single-CPU, interrupt-driven, preemptive x86 kernel with per-task user address spaces and kernel-resident scheduling state.

At a high level, the kernel provides:

a boot path that initializes devices, paging, and the first runnable tasks
one task abstraction per user process
one thread abstraction per schedulable execution context
round-robin scheduling across runnable threads
kernel stacks and saved register frames for every thread
demand-zero user memory and explicit user page allocation
process creation, thread creation, exec, exit, and wait
timer, keyboard, console, and readfile support
software exception delivery through swexn
kernel synchronization primitives used internally by the kernel

The design is intentionally small and centralized:

one global scheduler lock protects scheduler-owned state
one runnable queue holds all runnable threads
one sleep queue holds all sleeping threads
one task registry and one thread registry provide global lookup
each task owns its own thread list and child-zombie queue

This kernel does not try to hide the hardware. Instead, it leans on a few simple invariants and keeps the assembly/C boundary narrow and explicit.

3. Mental Model

The easiest way to understand the kernel is to think of it as four connected layers.

3.1 Execution Layer

The CPU is either:

running user code
running kernel code on behalf of the current thread
switching from one saved thread context to another

Every user thread has a kernel stack. When the CPU traps into the kernel, the kernel runs on that thread's kernel stack. The kernel never needs a separate "kernel thread" abstraction.

3.2 Scheduling Layer

The scheduler owns:

which thread is current
which threads are runnable
which threads are sleeping
which tasks and threads are globally registered
parent/child task relationships
which blocked states may be force-woken safely

This is the most concurrency-sensitive layer, so it is protected by a single spinlock with interrupts disabled.

3.3 Memory Layer

The memory subsystem owns:

the kernel page directory
the free physical frame pool
per-task page directories and page tables
user mappings
ZFOD state

The kernel relies on a shared low-memory direct map so it can always access its own code, data, stacks, and paging structures regardless of the current task.

3.4 Lifecycle Layer

The task/thread subsystem owns:

task and thread creation
kernel stack allocation
loading executables
fork and thread fork
exec replacement of an address space
orderly exit
zombie publication and reaping

This layer sits between the memory layer and the scheduler layer, so most of its design work is really about preserving their invariants at the same time.

4. Boot and Top-Level Control Flow

Kernel startup is intentionally linear.

4.1 Boot Sequence

The kernel entry path performs these steps in order:

initialize global kernel state
initialize scheduler data structures
initialize the kernel TID allocator lock
initialize keyboard buffering
install the IDT and timer configuration
initialize and clear the console
initialize paging and load the kernel page directory into CR3
compute the default user EFLAGS
load the idle task
load the init task
switch into the scheduler

The first context switch never returns to kernel_main().

4.2 Why `idle` Exists

The kernel always wants at least one runnable thread available. Loading an idle task during boot guarantees that a timer interrupt or a blocking operation does not leave the scheduler without a legal target.

4.3 Normal Runtime Path

After boot, execution repeatedly follows this loop:

user code runs
a syscall, interrupt, or exception traps into the kernel
assembly wrapper saves a register frame on the current thread's kernel stack
C handler updates kernel state
control either returns to the same thread or switches to a different one
dispatcher restores registers and returns to user mode or kernel continuation

This "save frame, update state, restore frame" model is the core control-flow idea of the kernel.

5. Core Objects and Ownership

5.1 Task

A task represents a user process. It owns:

a user page directory
a process exit status
parent/child relationship metadata
a queue of child zombies that have exited but not yet been waited on
a list of its threads
a live thread count
a dying flag used during task-wide teardown

A task remains allocated after its last live thread exits. It becomes a zombie until its parent reaps it with wait().

5.2 Thread

A thread represents one schedulable execution context. It owns:

its kernel stack
its saved kernel stack pointer
its thread ID
a pointer to its owning task
queue links for every queue it may appear in
block-state metadata
sleep metadata
swexn registration state

The intrusive queue links are important. A thread can simultaneously belong to:

the task's thread list
a condition-variable wait queue
the scheduler sleep queue

Using separate link fields avoids heap allocation for queue nodes and makes ownership visible inside the thread object itself.

5.3 Scheduler-Owned Global State

The scheduler owns:

current_task
current_thread
the runnable queue
the sleep queue
the global task registry
the global thread registry
the special init task pointer used for orphan adoption

This is all protected by the scheduler lock.

6. Address Space Design

6.1 Virtual Layout

The design splits virtual memory into two large regions:

low memory is kernel-reserved and shared
high memory is user-managed and task-specific

The user stack lives at the top of the 32-bit address space and grows downward.

6.2 Shared Low-Memory Direct Map

Every task page directory contains the same low-memory kernel mapping.

This is one of the most important design decisions in the kernel.

It means:

kernel text and data are always reachable
every kernel stack is always reachable
interrupt entry does not need to rebuild a temporary kernel map
the kernel can inspect and manipulate low physical memory directly
page-table management code can switch CR3 without losing access to itself

The direct map is limited and deliberate rather than a full mirrored copy of all physical memory.

6.3 Per-Task User Space

Above the shared kernel region, each task owns its own user mappings. Those mappings are created by:

program loading
fork address-space copying
explicit new_pages()
stack setup during load or exec

The kernel tears these mappings down explicitly when the last thread in a task exits or when exec() replaces the address space.

6.4 Physical Frame Allocation

Physical user frames are managed by a simple free list embedded in the free frames themselves.

This allocator is intentionally minimal:

page-sized allocations only
push and pop on a singly linked free list
zero-fill on allocation

That design fits the actual kernel needs well because user memory, ZFOD promotion, and stack pages are all page-granularity objects.

6.5 Page Directories and Page Tables

Page directories and page tables are treated as explicit paging-structure objects, separate from ordinary user frames.

This separation matters because "free user memory" and "free paging metadata" are not the same operation.

The kernel therefore keeps two conceptual teardown steps:

free user mappings and user page tables
free the page directory object itself

This avoids ambiguous APIs and makes teardown order explicit.

7. Virtual Memory Policy

7.1 Program Loading

Executable loading works segment by segment.

For each ELF segment, the kernel:

validates the ELF metadata
allocates and maps enough user pages
copies file-backed bytes into those pages
uses zero-filled pages for the remainder of memory-backed but file-empty data

The user stack is built separately after segments are loaded.

7.2 Stack Construction

The kernel constructs the initial user stack directly in the mapped top stack page. It copies argument strings, lays out the argument pointer array, and places the conventional startup values expected by user startup code.

The design goal is simple:

keep the loader responsible for user-visible process entry state
keep the scheduler responsible only for switching to an already-valid thread

7.3 ZFOD

Zero-fill-on-demand is used to avoid allocating physical memory for every page up front.

The design uses:

one shared zero page
software PTE flagging to mark ZFOD mappings
page-fault-time promotion on first write

When a write fault hits a present ZFOD mapping, the kernel allocates a real frame, updates the PTE, and resumes execution. This keeps the common case fast and makes BSS and explicit new-page allocation cheap until the pages are actually written.

7.4 `new_pages()` and `remove_pages()`

The kernel supports explicit user allocation with new_pages().

Instead of maintaining a separate side metadata structure for each allocation, the kernel stores region boundary markers directly in the first and last PTE of the allocated range. That makes remove_pages() simple:

verify the starting page is a region start
walk forward page by page
stop at the region end marker

This design avoids a second allocation table and keeps removal tied to the page tables themselves.

7.5 Pointer Validation

Syscall handlers validate user pointers before reading or writing through them.

The kernel distinguishes:

readable single addresses
readable or writable ranges
readable strings with bounded length

This keeps validation local to the syscall layer and prevents deeper kernel subsystems from having to reason about raw untrusted user pointers.

8. Dispatcher and Trap Handling

8.1 Register-Frame Design

All traps, interrupts, and syscall entries are normalized into a common saved register frame on the current thread's kernel stack.

The assembly wrappers perform the hardware-specific save/restore work. The C handlers only see a structured register snapshot.

This separation is deliberate:

assembly does the minimum hardware-sensitive work
C code owns policy
context switching becomes "pick a different saved stack pointer"

8.2 Saved `CR3` in the Trap Frame

The trap path repurposes one field in the saved register frame to remember the user CR3 value that was active when the trap occurred.

This is a key design choice because the kernel frequently needs to validate user addresses or reason about the interrupted task's address space even after switching back to the kernel page directory.

The one notable exception is swexn, where that field is reused to carry the software exception cause instead.

8.3 Two Return Paths

There are two conceptual return paths from the kernel:

return to a user-mode thread
resume a saved kernel continuation

The first switch into a fresh thread is treated as a special case through a sentinel value in the saved stack frame. That keeps initial launch and later resumption on the same machinery without needing a separate scheduler model.

8.4 TSS `esp0`

Before returning to user mode, the dispatcher updates the TSS kernel-stack pointer to the current thread's kernel stack top. This ensures that the next ring transition enters the kernel on the correct stack.

9. Scheduler Design

9.1 Policy

The scheduler is a single global round-robin scheduler over threads.

There is:

one runnable queue
one current thread
one current task
timer-driven preemption

The design favors simplicity over per-task fairness or priority scheduling.

9.2 Locking Model

The scheduler uses a spinlock and disables interrupts while it is held.

This lock protects all scheduler-owned state, including:

runnable queue operations
sleep queue operations
current_task and current_thread
task/thread registry operations
task parent-child metadata
block-reason transitions

The reason for this design is that scheduler state is touched from:

syscalls
timer interrupts
keyboard interrupts indirectly through wakeups
exit paths

A sleeping lock would be too dangerous here because the scheduler is precisely the code that decides who may sleep and wake.

9.3 Blocking Classification

The kernel classifies blocked threads by why they are blocked.

That block reason is not just descriptive. It is part of the scheduler safety model. The kernel distinguishes:

user deschedule()
task wait()
timer sleep
kernel synchronization wait

This design solves a real problem: not every blocked thread may be woken by the same actor.

Examples:

user make_runnable() may only wake user-descheduled threads
condition variables and other kernel sync primitives wake only sync waiters
task death may safely wake task waiters, user-descheduled threads, and sleeping threads
task death must not blindly wake threads blocked inside arbitrary kernel sync protocols

Without this classification, wakeups would be overly permissive and could break invariants inside synchronization primitives.

9.4 Sleep Queue

Sleeping threads are kept in a separate sleep queue. The timer interrupt:

increments the global tick count
wakes every thread whose wake time has arrived
acknowledges the PIC
immediately invokes the scheduler

This keeps the sleep mechanism completely scheduler-owned.

10. Task and Thread Lifecycle

10.1 Creation

Task creation allocates:

the task structure
the initial thread structure
the thread's kernel stack
a page directory cloned from the shared kernel mapping

At this point the new task exists, but it is not yet globally visible until the scheduler registers it and its first thread becomes runnable.

That publication order is intentional. Partially initialized tasks are never visible to unrelated kernel code.

10.2 Kernel Stacks

Each thread gets a multi-page kernel stack. The lowest page is left unmapped as a guard page so overflow becomes a page fault instead of silent corruption.

Kernel stacks are expensive page-aligned objects, so the kernel also keeps a small cache of recently freed stacks. This reduces churn in high-frequency thread creation workloads while preserving the guard-page design.

10.3 `fork()`

Task fork creates a new task with a copied user address space and a new thread whose saved register frame is derived from the parent's interrupted frame.

The child shares no user pages with the parent. The design chooses a deep copy instead of copy-on-write to keep fault handling and lifetime rules simpler.

10.4 `thread_fork()`

Thread fork creates a new thread in the same task, so:

the address space is shared
only a new kernel stack and saved execution context are needed

This cleanly separates process creation from same-address-space thread creation.

10.5 `exec()`

exec() is allowed only when a task has a single thread.

This is a deliberate simplification. Replacing the address space of a multi-threaded task would require either killing sibling threads or reasoning about partially replaced shared state. Rejecting that case keeps the model coherent.

The exec path validates and copies user arguments first, builds a new address space, then commits atomically by switching the task's page directory pointer and freeing the old address space.

10.6 Normal Thread Exit

When a thread exits:

it marks itself exiting
it decrements the owning task's live thread count
if it was the last live thread, it frees the task's user address space
it unregisters itself from the global thread registry
it hands control to the scheduler, which marks it dead and never runs it again

The thread does not free its own thread object or its own running stack. That would be unsafe because it is still executing on that stack. Reaping is always done by another thread later.

10.7 Task Exit and Zombies

The last exiting thread turns the task into a zombie. At that point:

the address space is already gone
no live threads remain
the task object still exists
exited thread objects may still need to be freed

The zombie task is then published to its parent's child-zombie queue.

This is the core task-lifecycle idea:

address-space teardown happens at task death time
metadata teardown happens at parent reap time

That split keeps wait/reap logic simpler and prevents parents from observing half-dead address spaces.

10.8 Reaping Dead Threads

A task may contain dead thread objects that have exited but not yet been freed.

The kernel reaps them opportunistically from a live thread in the same task or from the parent when reaping a zombie task. The implementation first removes reapable thread objects from the protected task thread list, then frees them outside the task lock.

This avoids doing allocator work while holding the per-task lock.

10.9 `wait()`

wait() is built around the parent-owned child-zombie queue.

The path is:

reap any dead threads in the current task
check whether a child zombie is already queued
if yes, dequeue it and reap it
if no children exist, return error
otherwise mark the caller as blocked in the task-wait class and deschedule

There is no separate "wait queue object". Waiting uses the same deschedule path as other blocking code, but with a dedicated block reason.

This was an important design choice because it reuses the scheduler's existing context-switch machinery instead of inventing a second one just for wait().

10.10 Orphan Reparenting

When a task dies, its living and zombie children are reparented to init.

This ensures:

every zombie task still has a valid eventual reaper
the kernel never needs a global "unowned zombie task" structure
parent death does not strand children permanently

The scheduler performs reparenting while holding the scheduler lock because parent-child relationships are scheduler-owned state.

11. Why Task Death Is Hard

Task death is one of the trickiest parts of the kernel because it crosses:

task state
thread state
scheduler queues
synchronization wait states
page-table teardown
parent-child publication

The kernel solves this with three major rules.

11.1 Rule 1: Only Free What Is Safe From the Current Context

The current thread may safely:

mark itself exiting
free the task address space if it is the last live thread
unregister itself
stop being scheduled

It may not safely:

free its own running kernel stack
free its own thread object
assume blocked siblings are all safe to kill immediately

11.2 Rule 2: Only Force-Wake Safe Blocked States

When a task is marked dying, the scheduler wakes sibling threads only if they are in states that are safe to break out of:

task wait
user deschedule
sleep

It does not force-wake arbitrary kernel synchronization waiters. Those threads may currently be participating in a mutex, semaphore, or condition-variable protocol. Breaking that protocol from the side would risk corrupting the primitive's internal invariants.

11.3 Rule 3: Kill Threads Only at a Safe Boundary

If a sibling thread is not already safely blocked, the kernel does not destroy it in the middle of kernel execution. Instead, handler_post_wrap checks whether the task is dying when the thread is about to return to user mode. If so, the thread exits before re-entering user space.

This guarantees that the thread is not holding kernel-only control-flow state when it is killed.

12. Synchronization Primitives

12.1 Spinlocks

Spinlocks are used only where blocking is impossible or unsafe, especially in the scheduler and interrupt-adjacent paths.

The rule is simple:

if the code owns scheduler state, use a spinlock with interrupts disabled
do not sleep while holding it

12.2 Mutexes

Kernel mutexes are lightweight ownership locks built on atomic exchange plus cooperative yielding to the owner when contended.

This design is simpler than putting blocked mutex waiters on a separate queue. It works because the kernel is uniprocessor and the owner can make progress once scheduled.

The cost is lack of bounded fairness, which is acceptable for this project.

12.3 Condition Variables

Condition variables solve the race between:

releasing the associated mutex
going to sleep
another thread signalling in between

The kernel handles this by:

acquiring the condition-variable internal lock
inserting the current thread into the wait queue
setting a per-thread reject flag to zero
releasing the user mutex
releasing the condition-variable internal lock
repeatedly descheduling until the reject flag becomes nonzero

Signal and broadcast do the inverse:

dequeue waiters
set their reject flag
make them runnable

The loop in the waiter is intentional. It tolerates wakeups that are not the target condition becoming true.

12.4 Semaphores

Semaphores are built from:

an internal mutex
a condition variable
an integer count

This keeps the implementation small and makes semaphore behavior a thin policy layer over the already-understood mutex and condition-variable machinery.

13. Interrupts, Exceptions, and `swexn`

13.1 IDT Policy

The IDT uses:

interrupt gates for hardware interrupts and CPU exceptions
trap gates for user syscalls

This preserves the usual x86 expectations:

hardware and exception handling runs with interrupts masked by default
user syscalls may remain trap-like

13.2 Page Fault Policy

Page faults are handled in this order:

detect and reject obvious invalid accesses such as null faults
if the fault is a write to a ZFOD mapping, promote the page and resume
otherwise try swexn if the fault came from user mode
otherwise kill the task with failure status
kernel-mode faults panic

This policy gives user code one chance to recover through swexn without making kernel faults silently survivable.

13.3 General Protection Fault Policy

User-mode protection faults also flow through swexn first and otherwise cause task death. Kernel-mode protection faults are fatal and panic immediately.

13.4 `swexn`

swexn gives a thread a one-shot user-level exception handler plus an explicit exception stack.

The kernel validates:

the exception stack pointer
the handler address
an optional replacement register frame

On delivery, the kernel clears the registration first, then builds a user exception frame on the supplied exception stack and redirects execution to the handler.

The one-shot behavior is intentional. It avoids recursive accidental reuse of a stale handler registration.

14. Console, Keyboard, and Miscellaneous Syscalls

14.1 Console

The console maintains a software shadow buffer that mirrors the VGA text buffer.

This makes console logic easier because operations such as scrolling, cursor movement, and recoloring are expressed against normal memory first and then flushed to VGA memory.

Print is serialized by a semaphore so two threads do not interleave bytes on the screen.

14.2 Keyboard

Keyboard input is deliberately split into two stages:

a raw scancode ring buffer filled by the interrupt handler
a decoded character buffer used by getchar() and readline()

This split has two benefits:

the interrupt handler stays short and nonblocking
line editing policy stays in ordinary kernel code, not in interrupt context

The two buffers use different drop policies:

raw scancodes drop oldest entries when overrun
retained decoded characters drop newest entries

That reflects their different roles. Raw input wants to keep recent hardware activity. Retained cooked input wants to preserve already-buffered characters.

14.3 `readfile`

The kernel serves executable data directly from the in-memory application table of contents. readfile(".") is treated as a synthetic directory listing of all available program names.

This keeps the project small by avoiding a real filesystem while still exposing file-like behavior to user programs.

14.4 Halt and Miscellaneous Calls

halt() passes through to the simulator halt path. The remaining miscellaneous calls are intentionally thin wrappers around kernel services.

15. Supporting Data Structures

15.1 Intrusive Queues

The intrusive queue macros are the default queue mechanism inside the kernel.

They are used because they:

avoid heap allocation for queue nodes
make queue membership explicit in the owning object
support one object appearing in multiple independent queues through multiple link fields

This is especially useful for threads, which move between several scheduler and sync queues during their lifetime.

15.2 Hash Tables

The kernel's hash table is a simple open-addressed table with tombstones and rehashing. It is used for the global task and thread registries.

That design fits the workload well:

lookup by numeric ID
insert and remove during lifecycle transitions
occasional full-table scan when reparenting or waking task-owned threads

15.3 Auto Array

auto_arr is a generic growable pointer array utility. It is not central to the kernel's main control flow, but it exists as a simple student-owned helper for indexed pointer storage.

16. Locking and Invariant Summary

The most important invariants in the kernel are:

scheduler-owned state is modified only while holding the scheduler lock with interrupts disabled
per-task thread-list state is modified only while holding the task's threads_lock
a thread is freed only after it is unregistered from the global thread registry and removed from every queue
a zombie task published to a parent already has no live threads and no live address space
a dying task never lets a sibling thread return to user mode
user-space teardown frees only user mappings and user paging structures, not the shared kernel mapping

The main lock-partitioning idea is:

the scheduler lock owns cross-task execution state
task-local locks own per-task membership state
synchronization primitives own their internal queues

This partition keeps the kernel understandable and avoids folding all lifecycle state into one giant global lock.

17. Major Design Decisions

The most important design decisions in this kernel are:

17.1 Shared Kernel Mapping in Every Address Space

This is what makes the rest of the design practical. It lets interrupts, dispatch, and paging code run without constantly rebuilding a kernel mapping.

17.2 Reuse of the General Deschedule Path

wait() and synchronization waits use the same context-save and deschedule machinery as the rest of the kernel. The kernel does not invent a second control-flow path for a special class of blocking.

This keeps the assembly boundary small and makes all blocking operations obey the same scheduler invariants.

17.3 Scheduler-Owned Parent/Child Metadata

Parent-child relationships, zombie publication, and orphan adoption are owned by the scheduler layer rather than being split across ad hoc task-side queues.

That matters because publication and wakeup are scheduling events, not just container mutations.

17.4 Deferred Thread Reaping

Threads do not free their own execution resources. Another thread reaps them later. This avoids self-freeing stacks, which is one of the easiest ways to corrupt control flow in a thread kernel.

17.5 Guarded and Cached Kernel Stacks

Guard pages catch overflows. The stack cache avoids allocator churn. Keeping both was worth the extra complexity because thread-heavy workloads depend on fast stack turnover, but silent kernel-stack corruption is unacceptable.

18. Reading Guide

A new reader should approach the kernel in this order:

kernel.c for startup
task/, scheduler/, and vm/ together for the main execution model
dispatcher/ and int_handlers/ for control transfer
sync/ for blocking and wakeup semantics
io/ and the smaller syscall handlers for device-facing behavior

The kernel is easiest to reason about when viewed as a small set of invariants shared across these modules rather than as many unrelated functions.

Table of Content