Practical notes for agents debugging this kernel with Simics.
This file does not replace SETUP.md. Read Option C or Option D there first.
This file explains what to watch out for after Simics is already working.
There are two separate consoles:
simics> / running> is the host simulator CLI
[410-shell]$ is the guest shell drawn on the VGA console
Do not mix them up.
run or c resumes the guest and usually changes the host prompt to
running>
system.console.con.input "foo\n" types into the guest shell
system.console.con.bp-break-console-string "..." -once watches the VGA
text screen, not the host CLI
If you accidentally type a test name into simics> instead of using
system.console.con.input, nothing useful happens.
For serious debugging, use a fresh Simics session for each test.
Use a fixed tmux session name:
tmux kill-session -t codex-simics 2>/dev/null || true
tmux new-session -d -s codex-simics \
'cd /path/to/p3 && SIMICS_TEXT_CONSOLE=yes ~/simics/pebsim/pebsim7'
Then run this sequence:
system.console.con.capture-start "/path/to/p3/tests/test.console.out" -overwrite
system.console.con.bp-break-console-string "[410-shell]$" -once
run
Wait for the shell breakpoint to fire. Only then send the test:
system.console.con.input "getpid_test1\n"
system.console.con.bp-break-console-string "shell: process 3 finished with exit status" -once
system.console.con.bp-break-console-string "panic" -once
system.console.con.bp-break-console-string "failed assertion" -once
run
If the exit-status breakpoint fires, do not stop there. Arm the shell prompt again and resume:
system.console.con.bp-break-console-string "[410-shell]$" -once
run
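The command sequence above can be generated rather than retyped, which keeps every run identical. This is a minimal sketch: the helper names (boot_commands, commands_for_test) are made up for illustration, the "process 3" text is copied from the sequence above, and the test-name check is an added safety assumption so that no quoting is ever needed inside the Simics string.

```python
import re

CON = "system.console.con"
PROMPT = "[410-shell]$"


def boot_commands(capture_path):
    """Commands to start the capture and run until the guest shell is ready."""
    return [
        f'{CON}.capture-start "{capture_path}" -overwrite',
        f'{CON}.bp-break-console-string "{PROMPT}" -once',
        "run",
    ]


def commands_for_test(test_name):
    """Commands to inject one test and arm the exit and failure breakpoints."""
    # Assumption: test names are plain identifiers, so no escaping is required.
    if not re.fullmatch(r"[A-Za-z0-9_]+", test_name):
        raise ValueError(f"unexpected test name: {test_name!r}")
    return [
        f'{CON}.input "{test_name}\\n"',
        f'{CON}.bp-break-console-string '
        '"shell: process 3 finished with exit status" -once',
        f'{CON}.bp-break-console-string "panic" -once',
        f'{CON}.bp-break-console-string "failed assertion" -once',
        "run",
    ]
```

Each returned string is meant to be typed at simics>, never at the guest shell. commands_for_test should only be sent after the shell breakpoint from boot_commands has fired.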
Only count the test as pass when you have both:
shell: process 3 finished with exit status 0
a later [410-shell]$
That rule matters for cho, cho2, and cho_variant.
Do not say "pass" just because you saw the success line once.
For long tests, this is not enough:
shell: process 3 finished with exit status 0
This is enough:
shell: process 3 finished with exit status 0
[410-shell]$
Why this matters:
bp-break-console-string fires mid-write
the process may have finished but the shell may not yet have regained control
some stress tests print success-looking text before they are actually done
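The pass rule can be written down as a small pure function so it is applied the same way every time. This is a sketch under assumptions: the input is guest text produced after the test name was injected, already linearised and de-duplicated (raw screen snapshots can repeat an old prompt and would fool it), and the function name verdict is hypothetical.

```python
import re

PROMPT = "[410-shell]$"
FAILURE_MARKERS = ("panic", "failed assertion")
STATUS_RE = re.compile(r"shell: process 3 finished with exit status (-?\d+)")


def verdict(text):
    """Return 'pass', 'fail', or 'pending' for the guest output seen so far."""
    if any(marker in text for marker in FAILURE_MARKERS):
        return "fail"
    match = STATUS_RE.search(text)
    if match is None:
        return "pending"          # no exit status line yet
    if int(match.group(1)) != 0:
        return "fail"             # the process exited with a nonzero status
    if PROMPT in text[match.end():]:
        return "pass"             # status 0 and the shell came back afterwards
    return "pending"              # success line seen, shell not yet back
```

The success line alone returns 'pending', which is the whole point: only the later prompt upgrades it to 'pass'.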
For cho, cho2, cho_variant, and similar tests:
monitor the VGA capture, not just the tmux pane
if nothing changes for about 60 seconds and the shell has not returned, treat it as a likely deadlock
stop, collect evidence, restart fresh
Host-side monitoring pattern:
strings /path/to/p3/tests/cho2.console.out | tail -n 120
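The 60-second rule can be tracked with a small helper instead of a stopwatch. This is a sketch, not a tested harness: StallDetector is a hypothetical name, and what counts as a "fingerprint" of progress (capture file size, or a hash of the last few kilobytes) is an assumption you need to check against how your capture file actually behaves.

```python
import time


class StallDetector:
    """Flag a likely deadlock when the observed fingerprint stops changing."""

    def __init__(self, limit_seconds=60.0, clock=time.monotonic):
        self.limit = limit_seconds
        self.clock = clock            # injectable so the logic can be tested
        self.last_value = None
        self.last_change = clock()

    def observe(self, fingerprint):
        """Record the latest fingerprint; return True once the run looks stalled."""
        now = self.clock()
        if fingerprint != self.last_value:
            self.last_value = fingerprint
            self.last_change = now    # progress was made, restart the timer
            return False
        return now - self.last_change >= self.limit
```

A caller would poll every second or so, for example with os.path.getsize on the capture file, and stop to collect evidence when observe returns True and the shell has not returned.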
The capture file is the real guest output log in this workflow.
system.console.con.capture-start records VGA screen snapshots, not a normal
line-buffered log.
Consequences:
the same text may appear many times
lines may be truncated in one frame and complete in a later frame
tail -n on the raw file is usually useless
Prefer:
strings tests/console.out | tail -n 80
If you need more control:
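python3 - <<'PY'
from pathlib import Path

# latin1 maps every byte to a character, so binary screen data never fails to decode.
data = Path('tests/console.out').read_text('latin1', errors='ignore')
needle = 'shell: process 3 finished with exit status'
idx = data.find(needle)
if idx == -1:
    print('needle not found')
else:
    # Print the match offset, then the text that follows it.
    print(idx)
    print(data[idx:idx + 400])
PY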
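Because the same text shows up in many snapshots, it can help to collapse repeats before reading. The sketch below imitates what strings does (runs of at least four printable ASCII bytes) and reports each distinct run once, in first-seen order. The function name is hypothetical, and the exact byte layout of the capture file is not assumed beyond "printable text separated by non-printable bytes".

```python
import re

# Same idea as strings(1): runs of 4 or more printable ASCII bytes.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def distinct_lines(raw):
    """Return printable runs from raw capture bytes, each reported once."""
    seen = set()
    ordered = []
    for match in PRINTABLE_RUN.finditer(raw):
        line = match.group().decode("ascii").rstrip()
        if line and line not in seen:
            seen.add(line)
            ordered.append(line)
    return ordered
```

Typical use is distinct_lines(Path('tests/console.out').read_bytes())[-80:]. Note that this also hides lines the guest legitimately printed more than once, so use it for orientation, not for counting.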
This is one of the biggest Simics debugging traps here.
If you break on:
system.console.con.bp-break-console-string "panic" -once
or:
system.console.con.bp-break-console-string "shell: process 3 finished with exit status" -once
Simics stops as soon as those bytes hit the VGA buffer. The full line may not be visible yet.
During cho, an early breakpoint on the string "malloc_lock corrupted" only showed:
sfree: malloc_lock corrupted
That was not enough. The useful data was still missing. Running again without the early breakpoint and stopping a little later exposed:
sfree: malloc_lock corrupted obj=0x00274000 caller=0x001020f5 ...
That address was the real clue.
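When the capture holds both the truncated and the complete version of a line, the longest one is usually the one you want. A minimal sketch, assuming the capture has already been turned into newline-separated text (for example via strings); fullest_line is a hypothetical helper name.

```python
def fullest_line(capture_text, prefix):
    """Return the longest captured line that starts at `prefix`, or None."""
    best = None
    start = capture_text.find(prefix)
    while start != -1:
        end = capture_text.find("\n", start)
        line = capture_text[start:] if end == -1 else capture_text[start:end]
        line = line.rstrip()
        if best is None or len(line) > len(best):
            best = line
        # Keep scanning: a later frame may hold a more complete copy.
        start = capture_text.find(prefix, start + 1)
    return best
```

Called with the prefix "sfree: malloc_lock corrupted", it would prefer the frame that includes the obj= and caller= fields over the bare prefix.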
If a breakpoint only gives you a prefix, rerun without that breakpoint and stop slightly later so the rest of the line reaches the screen.
When VGA output gives you an address, resolve it immediately.
Use:
addr2line -e kernel -f -C 0x001020f5
nm -n kernel | rg '001020f5|malloc_lock|page_free_pt'
Typical uses:
caller address from an assertion or panic
object address from a corrupted pointer or lock
checking whether an address is near a known global symbol
Map a caller address:
addr2line -e kernel -f -C 0x001020f5
This resolved to page_free_pt during one cho investigation.
Map a suspicious object address:
nm -n kernel | rg 'malloc_lock|001ec084'
This showed that 0x001ec084 sat inside malloc_lock.
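The "is this address inside a known symbol" check can be scripted against nm -n output instead of eyeballed. This is a sketch: the function names are hypothetical, and it only finds the nearest symbol at or below the address. Without symbol sizes that suggests containment but does not prove it.

```python
import bisect


def parse_nm(nm_output):
    """Parse `nm -n` text into a sorted list of (address, name) pairs."""
    symbols = []
    for line in nm_output.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue              # undefined symbols have no address column
        addr, _kind, name = parts
        try:
            symbols.append((int(addr, 16), name))
        except ValueError:
            continue              # not an address line
    symbols.sort()
    return symbols


def enclosing_symbol(symbols, address):
    """Return (name, offset) for the last symbol at or below `address`."""
    addresses = [addr for addr, _ in symbols]
    index = bisect.bisect_right(addresses, address) - 1
    if index < 0:
        return None
    base, name = symbols[index]
    return name, address - base
```

Feed it the text of nm -n kernel and an address from the VGA output. A small offset from a lock or global is a strong hint; a huge offset means the address probably belongs to something nm does not list, such as heap memory.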
Do not assume this works:
stop
stack-trace
In this setup, Simics often replies:
No current debug object
So:
a host-side stop only stops simulation
it does not automatically give you a guest call stack
If the guest already printed an EIP, caller, or fault site, use addr2line.
If the guest kernel debugger itself printed a stack trace, save that output
from the VGA capture.
If you add debugging checks, make them local and boring.
Good examples:
assert that a pointer being freed is in the expected allocation region
assert that a queue link is NULL before enqueue
assert that a page table page is page-aligned
add a pre-sfree() overlap check before calling into the allocator
Bad examples:
inferring global ownership from the raw internals of a contended semaphore
asserting that sem_t.count == 0 during a critical section
assuming a mutex's internal fields stay unchanged while other waiters queue
An early probe asserted that malloc_lock must have a fixed internal state
between sem_wait() and sem_signal(). That was wrong. Other waiters were
allowed to change the semaphore internals legitimately, so the assertion
created a false bug report.
The fix was not a kernel fix. The fix was to remove the bad probe.
If a bug appears after a long debugging session, compare the working tree against earlier commits before trusting the current code.
In this repo, this was often useful:
git status --short
git log --oneline --decorate -n 20
git show <commit>:kern/task/task.c | sed -n '1,220p'
git show <commit>:kern/vm/vm.c | sed -n '1,260p'
Do not assume the newest code is the best code. Regressions often come from instrumentation, defensive rewrites, or "temporary" design changes.
After a promising pass, rerun the target sequence with no code changes.
If needed, prove that nothing changed:
git status --short > /tmp/status_before.txt
# run the full sequence
git status --short > /tmp/status_after.txt
diff -u /tmp/status_before.txt /tmp/status_after.txt
This matters because kernel debugging is vulnerable to accidental "fixes" caused by:
stale breakpoints
stale capture files
a changed build image
an assertion you forgot to remove
a rerun that was not actually comparable to the first one
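git status does not see ignored build outputs, so a changed build image can slip past the diff above. One way to cover that is to hash the files you care about before and after the run. A sketch only: the function names are hypothetical, and which files to include (the kernel binary used with addr2line above, plus whatever image your setup boots) is an assumption for you to fill in.

```python
import hashlib
from pathlib import Path


def fingerprint(paths):
    """Map each path to the SHA-256 of its contents, or None if it is missing."""
    result = {}
    for p in paths:
        path = Path(p)
        if path.is_file():
            result[str(p)] = hashlib.sha256(path.read_bytes()).hexdigest()
        else:
            result[str(p)] = None
    return result


def changed(before, after):
    """Return the sorted paths whose fingerprint differs between two snapshots."""
    keys = before.keys() | after.keys()
    return sorted(k for k in keys if before.get(k) != after.get(k))
```

Take one fingerprint before the sequence and one after. An empty list from changed means the two runs at least used the same bytes.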
For repeated reruns, write a tiny host-side script that:
starts a fresh codex-simics session
arms the shell breakpoint
injects the test name
watches the capture file
declares pass only on exit status 0 plus shell return
declares likely deadlock after about 60 seconds of no progress
That avoids human error in long sequences like:
minclone_many
cho
cho2
cho_variant
The harness should still stay simple. It is only there to enforce the same decision rule every time.
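The outer loop of such a harness can stay very small. In this sketch, run_one is a hypothetical callable that you supply: it is assumed to start a fresh session, run one test, and return "pass", "fail", or "deadlock" using the decision rule above. The loop stops at the first non-pass so the evidence is not overwritten by later tests.

```python
def run_sequence(tests, run_one):
    """Run tests in order under one rule; stop at the first non-pass result."""
    results = []
    for name in tests:
        outcome = run_one(name)      # assumed to use a fresh session per test
        results.append((name, outcome))
        if outcome != "pass":
            break                    # stop, collect evidence, restart fresh
    return results
```

For example, run_sequence(["minclone_many", "cho", "cho2", "cho_variant"], run_one) returns the list of results up to and including the first test that did not pass.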
Before each run:
fresh codex-simics session
fresh capture file
shell-ready breakpoint armed
During the run:
inject test only after the shell breakpoint fires
watch the capture file, not just the host pane
assume failure breakpoints may stop mid-line
After the run:
require exit status 0 and later [410-shell]$
if failing, resolve addresses with addr2line
distrust fancy instrumentation before blaming the kernel
rerun with no edits before declaring victory