This document explains the design of the student kernel in kern/.
It does not describe:
the 410* support code
the user/ tree
the tests
The goal is to give a new reader enough context to understand:
what the kernel is trying to do
how control flows through it
which data structures own which state
how the major race and locking problems are solved
The document is intentionally conceptual. It explains the design and the invariants behind the code rather than restating individual functions.
The kernel is a single-CPU, interrupt-driven, preemptive x86 kernel with per-task user address spaces and kernel-resident scheduling state.
At a high level, the kernel provides:
a boot path that initializes devices, paging, and the first runnable tasks
one task abstraction per user process
one thread abstraction per schedulable execution context
round-robin scheduling across runnable threads
kernel stacks and saved register frames for every thread
demand-zero user memory and explicit user page allocation
process creation, thread creation, exec, exit, and wait
timer, keyboard, console, and readfile support
software exception delivery through swexn
kernel synchronization primitives used internally by the kernel
The design is intentionally small and centralized:
one global scheduler lock protects scheduler-owned state
one runnable queue holds all runnable threads
one sleep queue holds all sleeping threads
one task registry and one thread registry provide global lookup
each task owns its own thread list and child-zombie queue
This kernel does not try to hide the hardware. Instead, it leans on a few simple invariants and keeps the assembly/C boundary narrow and explicit.
The easiest way to understand the kernel is to think of it as four connected layers.
The CPU is either:
running user code
running kernel code on behalf of the current thread
switching from one saved thread context to another
Every user thread has a kernel stack. When the CPU traps into the kernel, the kernel runs on that thread's kernel stack. The kernel never needs a separate "kernel thread" abstraction.
The scheduler owns:
which thread is current
which threads are runnable
which threads are sleeping
which tasks and threads are globally registered
parent/child task relationships
which blocked states may be force-woken safely
This is the most concurrency-sensitive layer, so it is protected by a single spinlock with interrupts disabled.
The memory subsystem owns:
the kernel page directory
the free physical frame pool
per-task page directories and page tables
user mappings
ZFOD state
The kernel relies on a shared low-memory direct map so it can always access its own code, data, stacks, and paging structures regardless of the current task.
The task/thread subsystem owns:
task and thread creation
kernel stack allocation
loading executables
fork and thread fork
exec replacement of an address space
orderly exit
zombie publication and reaping
This layer sits between the memory layer and the scheduler layer, so most of its design work is really about preserving their invariants at the same time.
Kernel startup is intentionally linear.
The kernel entry path performs these steps in order:
EFLAGSidle taskinit taskThe first context switch never returns to kernel_main().
idle ExistsThe kernel always wants at least one runnable thread available. Loading an idle task during boot guarantees that a timer interrupt or a blocking operation does not leave the scheduler without a legal target.
After boot, execution repeatedly follows this loop:
This "save frame, update state, restore frame" model is the core control-flow idea of the kernel.
A task represents a user process. It owns:
a user page directory
a process exit status
parent/child relationship metadata
a queue of child zombies that have exited but not yet been waited on
a list of its threads
a live thread count
a dying flag used during task-wide teardown
A task remains allocated after its last live thread exits. It becomes a zombie
until its parent reaps it with wait().
A thread represents one schedulable execution context. It owns:
its kernel stack
its saved kernel stack pointer
its thread ID
a pointer to its owning task
queue links for every queue it may appear in
block-state metadata
sleep metadata
swexn registration state
The intrusive queue links are important. A thread can simultaneously belong to:
the task's thread list
a condition-variable wait queue
the scheduler sleep queue
Using separate link fields avoids heap allocation for queue nodes and makes ownership visible inside the thread object itself.
The scheduler owns:
current_task
current_thread
the runnable queue
the sleep queue
the global task registry
the global thread registry
the special init task pointer used for orphan adoption
This is all protected by the scheduler lock.
The design splits virtual memory into two large regions:
low memory is kernel-reserved and shared
high memory is user-managed and task-specific
The user stack lives at the top of the 32-bit address space and grows downward.
Every task page directory contains the same low-memory kernel mapping.
This is one of the most important design decisions in the kernel.
It means:
kernel text and data are always reachable
every kernel stack is always reachable
interrupt entry does not need to rebuild a temporary kernel map
the kernel can inspect and manipulate low physical memory directly
page-table management code can switch CR3 without losing access to itself
The direct map is limited and deliberate rather than a full mirrored copy of all physical memory.
Above the shared kernel region, each task owns its own user mappings. Those mappings are created by:
program loading
fork address-space copying
explicit new_pages()
stack setup during load or exec
The kernel tears these mappings down explicitly when the last thread in a task
exits or when exec() replaces the address space.
Physical user frames are managed by a simple free list embedded in the free frames themselves.
This allocator is intentionally minimal:
page-sized allocations only
push and pop on a singly linked free list
zero-fill on allocation
That design fits the actual kernel needs well because user memory, ZFOD promotion, and stack pages are all page-granularity objects.
Page directories and page tables are treated as explicit paging-structure objects, separate from ordinary user frames.
This separation matters because "free user memory" and "free paging metadata" are not the same operation.
The kernel therefore keeps two conceptual teardown steps:
free user mappings and user page tables
free the page directory object itself
This avoids ambiguous APIs and makes teardown order explicit.
Executable loading works segment by segment.
For each ELF segment, the kernel:
validates the ELF metadata
allocates and maps enough user pages
copies file-backed bytes into those pages
uses zero-filled pages for the remainder of memory-backed but file-empty data
The user stack is built separately after segments are loaded.
The kernel constructs the initial user stack directly in the mapped top stack page. It copies argument strings, lays out the argument pointer array, and places the conventional startup values expected by user startup code.
The design goal is simple:
keep the loader responsible for user-visible process entry state
keep the scheduler responsible only for switching to an already-valid thread
Zero-fill-on-demand is used to avoid allocating physical memory for every page up front.
The design uses:
one shared zero page
software PTE flagging to mark ZFOD mappings
page-fault-time promotion on first write
When a write fault hits a present ZFOD mapping, the kernel allocates a real frame, updates the PTE, and resumes execution. This keeps the common case fast and makes BSS and explicit new-page allocation cheap until the pages are actually written.
new_pages() and remove_pages()The kernel supports explicit user allocation with new_pages().
Instead of maintaining a separate side metadata structure for each allocation,
the kernel stores region boundary markers directly in the first and last PTE of
the allocated range. That makes remove_pages() simple:
verify the starting page is a region start
walk forward page by page
stop at the region end marker
This design avoids a second allocation table and keeps removal tied to the page tables themselves.
Syscall handlers validate user pointers before reading or writing through them.
The kernel distinguishes:
readable single addresses
readable or writable ranges
readable strings with bounded length
This keeps validation local to the syscall layer and prevents deeper kernel subsystems from having to reason about raw untrusted user pointers.
All traps, interrupts, and syscall entries are normalized into a common saved register frame on the current thread's kernel stack.
The assembly wrappers perform the hardware-specific save/restore work. The C handlers only see a structured register snapshot.
This separation is deliberate:
assembly does the minimum hardware-sensitive work
C code owns policy
context switching becomes "pick a different saved stack pointer"
CR3 in the Trap FrameThe trap path repurposes one field in the saved register frame to remember the
user CR3 value that was active when the trap occurred.
This is a key design choice because the kernel frequently needs to validate user addresses or reason about the interrupted task's address space even after switching back to the kernel page directory.
The one notable exception is swexn, where that field is reused to carry the
software exception cause instead.
There are two conceptual return paths from the kernel:
return to a user-mode thread
resume a saved kernel continuation
The first switch into a fresh thread is treated as a special case through a sentinel value in the saved stack frame. That keeps initial launch and later resumption on the same machinery without needing a separate scheduler model.
esp0Before returning to user mode, the dispatcher updates the TSS kernel-stack pointer to the current thread's kernel stack top. This ensures that the next ring transition enters the kernel on the correct stack.
The scheduler is a single global round-robin scheduler over threads.
There is:
one runnable queue
one current thread
one current task
timer-driven preemption
The design favors simplicity over per-task fairness or priority scheduling.
The scheduler uses a spinlock and disables interrupts while it is held.
This lock protects all scheduler-owned state, including:
runnable queue operations
sleep queue operations
current_task and current_thread
task/thread registry operations
task parent-child metadata
block-reason transitions
The reason for this design is that scheduler state is touched from:
syscalls
timer interrupts
keyboard interrupts indirectly through wakeups
exit paths
A sleeping lock would be too dangerous here because the scheduler is precisely the code that decides who may sleep and wake.
The kernel classifies blocked threads by why they are blocked.
That block reason is not just descriptive. It is part of the scheduler safety model. The kernel distinguishes:
user deschedule()
task wait()
timer sleep
kernel synchronization wait
This design solves a real problem: not every blocked thread may be woken by the same actor.
Examples:
user make_runnable() may only wake user-descheduled threads
condition variables and other kernel sync primitives wake only sync waiters
task death may safely wake task waiters, user-descheduled threads, and sleeping threads
task death must not blindly wake threads blocked inside arbitrary kernel sync protocols
Without this classification, wakeups would be overly permissive and could break invariants inside synchronization primitives.
Sleeping threads are kept in a separate sleep queue. The timer interrupt:
increments the global tick count
wakes every thread whose wake time has arrived
acknowledges the PIC
immediately invokes the scheduler
This keeps the sleep mechanism completely scheduler-owned.
Task creation allocates:
the task structure
the initial thread structure
the thread's kernel stack
a page directory cloned from the shared kernel mapping
At this point the new task exists, but it is not yet globally visible until the scheduler registers it and its first thread becomes runnable.
That publication order is intentional. Partially initialized tasks are never visible to unrelated kernel code.
Each thread gets a multi-page kernel stack. The lowest page is left unmapped as a guard page so overflow becomes a page fault instead of silent corruption.
Kernel stacks are expensive page-aligned objects, so the kernel also keeps a small cache of recently freed stacks. This reduces churn in high-frequency thread creation workloads while preserving the guard-page design.
fork()Task fork creates a new task with a copied user address space and a new thread whose saved register frame is derived from the parent's interrupted frame.
The child shares no user pages with the parent. The design chooses a deep copy instead of copy-on-write to keep fault handling and lifetime rules simpler.
thread_fork()Thread fork creates a new thread in the same task, so:
the address space is shared
only a new kernel stack and saved execution context are needed
This cleanly separates process creation from same-address-space thread creation.
exec()exec() is allowed only when a task has a single thread.
This is a deliberate simplification. Replacing the address space of a multi-threaded task would require either killing sibling threads or reasoning about partially replaced shared state. Rejecting that case keeps the model coherent.
The exec path validates and copies user arguments first, builds a new address space, then commits atomically by switching the task's page directory pointer and freeing the old address space.
When a thread exits:
it marks itself exiting
it decrements the owning task's live thread count
if it was the last live thread, it frees the task's user address space
it unregisters itself from the global thread registry
it hands control to the scheduler, which marks it dead and never runs it again
The thread does not free its own thread object or its own running stack. That would be unsafe because it is still executing on that stack. Reaping is always done by another thread later.
The last exiting thread turns the task into a zombie. At that point:
the address space is already gone
no live threads remain
the task object still exists
exited thread objects may still need to be freed
The zombie task is then published to its parent's child-zombie queue.
This is the core task-lifecycle idea:
address-space teardown happens at task death time
metadata teardown happens at parent reap time
That split keeps wait/reap logic simpler and prevents parents from observing half-dead address spaces.
A task may contain dead thread objects that have exited but not yet been freed.
The kernel reaps them opportunistically from a live thread in the same task or from the parent when reaping a zombie task. The implementation first removes reapable thread objects from the protected task thread list, then frees them outside the task lock.
This avoids doing allocator work while holding the per-task lock.
wait()wait() is built around the parent-owned child-zombie queue.
The path is:
There is no separate "wait queue object". Waiting uses the same deschedule path as other blocking code, but with a dedicated block reason.
This was an important design choice because it reuses the scheduler's existing
context-switch machinery instead of inventing a second one just for wait().
When a task dies, its living and zombie children are reparented to init.
This ensures:
every zombie task still has a valid eventual reaper
the kernel never needs a global "unowned zombie task" structure
parent death does not strand children permanently
The scheduler performs reparenting while holding the scheduler lock because parent-child relationships are scheduler-owned state.
Task death is one of the trickiest parts of the kernel because it crosses:
task state
thread state
scheduler queues
synchronization wait states
page-table teardown
parent-child publication
The kernel solves this with three major rules.
The current thread may safely:
mark itself exiting
free the task address space if it is the last live thread
unregister itself
stop being scheduled
It may not safely:
free its own running kernel stack
free its own thread object
assume blocked siblings are all safe to kill immediately
When a task is marked dying, the scheduler wakes sibling threads only if they are in states that are safe to break out of:
task wait
user deschedule
sleep
It does not force-wake arbitrary kernel synchronization waiters. Those threads may currently be participating in a mutex, semaphore, or condition-variable protocol. Breaking that protocol from the side would risk corrupting the primitive's internal invariants.
If a sibling thread is not already safely blocked, the kernel does not destroy
it in the middle of kernel execution. Instead, handler_post_wrap checks
whether the task is dying when the thread is about to return to user mode. If
so, the thread exits before re-entering user space.
This guarantees that the thread is not holding kernel-only control-flow state when it is killed.
Spinlocks are used only where blocking is impossible or unsafe, especially in the scheduler and interrupt-adjacent paths.
The rule is simple:
if the code owns scheduler state, use a spinlock with interrupts disabled
do not sleep while holding it
Kernel mutexes are lightweight ownership locks built on atomic exchange plus cooperative yielding to the owner when contended.
This design is simpler than putting blocked mutex waiters on a separate queue. It works because the kernel is uniprocessor and the owner can make progress once scheduled.
The cost is lack of bounded fairness, which is acceptable for this project.
Condition variables solve the race between:
releasing the associated mutex
going to sleep
another thread signalling in between
The kernel handles this by:
Signal and broadcast do the inverse:
dequeue waiters
set their reject flag
make them runnable
The loop in the waiter is intentional. It tolerates wakeups that are not the target condition becoming true.
Semaphores are built from:
an internal mutex
a condition variable
an integer count
This keeps the implementation small and makes semaphore behavior a thin policy layer over the already-understood mutex and condition-variable machinery.
swexnThe IDT uses:
interrupt gates for hardware interrupts and CPU exceptions
trap gates for user syscalls
This preserves the usual x86 expectations:
hardware and exception handling runs with interrupts masked by default
user syscalls may remain trap-like
Page faults are handled in this order:
swexn if the fault came from user modeThis policy gives user code one chance to recover through swexn without
making kernel faults silently survivable.
User-mode protection faults also flow through swexn first and otherwise cause
task death. Kernel-mode protection faults are fatal and panic immediately.
swexnswexn gives a thread a one-shot user-level exception handler plus an explicit
exception stack.
The kernel validates:
the exception stack pointer
the handler address
an optional replacement register frame
On delivery, the kernel clears the registration first, then builds a user exception frame on the supplied exception stack and redirects execution to the handler.
The one-shot behavior is intentional. It avoids recursive accidental reuse of a stale handler registration.
The console maintains a software shadow buffer that mirrors the VGA text buffer.
This makes console logic easier because operations such as scrolling, cursor movement, and recoloring are expressed against normal memory first and then flushed to VGA memory.
Print is serialized by a semaphore so two threads do not interleave bytes on the screen.
Keyboard input is deliberately split into two stages:
a raw scancode ring buffer filled by the interrupt handler
a decoded character buffer used by getchar() and readline()
This split has two benefits:
the interrupt handler stays short and nonblocking
line editing policy stays in ordinary kernel code, not in interrupt context
The two buffers use different drop policies:
raw scancodes drop oldest entries when overrun
retained decoded characters drop newest entries
That reflects their different roles. Raw input wants to keep recent hardware activity. Retained cooked input wants to preserve already-buffered characters.
readfileThe kernel serves executable data directly from the in-memory application table
of contents. readfile(".") is treated as a synthetic directory listing of all
available program names.
This keeps the project small by avoiding a real filesystem while still exposing file-like behavior to user programs.
halt() passes through to the simulator halt path. The remaining miscellaneous
calls are intentionally thin wrappers around kernel services.
The intrusive queue macros are the default queue mechanism inside the kernel.
They are used because they:
avoid heap allocation for queue nodes
make queue membership explicit in the owning object
support one object appearing in multiple independent queues through multiple link fields
This is especially useful for threads, which move between several scheduler and sync queues during their lifetime.
The kernel's hash table is a simple open-addressed table with tombstones and rehashing. It is used for the global task and thread registries.
That design fits the workload well:
lookup by numeric ID
insert and remove during lifecycle transitions
occasional full-table scan when reparenting or waking task-owned threads
auto_arr is a generic growable pointer array utility. It is not central to
the kernel's main control flow, but it exists as a simple student-owned helper
for indexed pointer storage.
The most important invariants in the kernel are:
scheduler-owned state is modified only while holding the scheduler lock with interrupts disabled
per-task thread-list state is modified only while holding the task's
threads_lock
a thread is freed only after it is unregistered from the global thread registry and removed from every queue
a zombie task published to a parent already has no live threads and no live address space
a dying task never lets a sibling thread return to user mode
user-space teardown frees only user mappings and user paging structures, not the shared kernel mapping
The main lock-partitioning idea is:
the scheduler lock owns cross-task execution state
task-local locks own per-task membership state
synchronization primitives own their internal queues
This partition keeps the kernel understandable and avoids folding all lifecycle state into one giant global lock.
The most important design decisions in this kernel are:
This is what makes the rest of the design practical. It lets interrupts, dispatch, and paging code run without constantly rebuilding a kernel mapping.
wait() and synchronization waits use the same context-save and deschedule
machinery as the rest of the kernel. The kernel does not invent a second
control-flow path for a special class of blocking.
This keeps the assembly boundary small and makes all blocking operations obey the same scheduler invariants.
Parent-child relationships, zombie publication, and orphan adoption are owned by the scheduler layer rather than being split across ad hoc task-side queues.
That matters because publication and wakeup are scheduling events, not just container mutations.
Threads do not free their own execution resources. Another thread reaps them later. This avoids self-freeing stacks, which is one of the easiest ways to corrupt control flow in a thread kernel.
Guard pages catch overflows. The stack cache avoids allocator churn. Keeping both was worth the extra complexity because thread-heavy workloads depend on fast stack turnover, but silent kernel-stack corruption is unacceptable.
A new reader should approach the kernel in this order:
kernel.c for startuptask/, scheduler/, and vm/ together for the main execution modeldispatcher/ and int_handlers/ for control transfersync/ for blocking and wakeup semanticsio/ and the smaller syscall handlers for device-facing behaviorThe kernel is easiest to reason about when viewed as a small set of invariants shared across these modules rather than as many unrelated functions.
Table of Content