DEBUG_TIPS.md

Practical notes for agents debugging this kernel with Simics.

This file does not replace SETUP.md. Read Option C or Option D there first. This file explains what to watch out for after Simics is already working.

1. Core mental model

There are two separate consoles:

simics> / running> is the host simulator CLI
[410-shell]$ is the guest shell drawn on the VGA console

Do not mix them up.

run or c resumes the guest and usually changes the host prompt to running>
system.console.con.input "foo\n" types into the guest shell
system.console.con.bp-break-console-string "..." -once watches the VGA text screen, not the host CLI

If you accidentally type a test name into simics> instead of using system.console.con.input, nothing useful happens.

2. Recommended test protocol

For serious debugging, use a fresh Simics session for each test.

Use a fixed tmux session name:

tmux kill-session -t codex-simics 2>/dev/null || true
tmux new-session -d -s codex-simics \
  'cd /path/to/p3 && SIMICS_TEXT_CONSOLE=yes ~/simics/pebsim/pebsim7'

Then run this sequence:

system.console.con.capture-start "/path/to/p3/tests/test.console.out" -overwrite
system.console.con.bp-break-console-string "[410-shell]$" -once
run

Wait for the shell breakpoint to fire. Only then send the test:

system.console.con.input "getpid_test1\n"
system.console.con.bp-break-console-string "shell: process 3 finished with exit status" -once
system.console.con.bp-break-console-string "panic" -once
system.console.con.bp-break-console-string "failed assertion" -once
run

If the exit-status breakpoint fires, do not stop there. Arm the shell prompt again and resume:

system.console.con.bp-break-console-string "[410-shell]$" -once
run

Only count the test as pass when you have both:

shell: process 3 finished with exit status 0
a later [410-shell]$

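That rule matters most for long tests such as cho, cho2, and cho_variant, where the exit-status line can appear before the shell has regained control.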
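
The two command batches above can be generated instead of typed by hand. This is a minimal sketch, not part of the repo: ready_commands and launch_commands are hypothetical helper names, and the default pid of 3 only mirrors the example above. The Simics commands themselves are copied from this section.

```python
def ready_commands(capture_path):
    """Simics CLI lines that start the capture and stop at the first shell prompt."""
    return [
        f'system.console.con.capture-start "{capture_path}" -overwrite',
        'system.console.con.bp-break-console-string "[410-shell]$" -once',
        "run",
    ]


def launch_commands(test_name, pid=3):
    """Simics CLI lines that type the test name and arm the stop conditions."""
    # Hypothetical guard: refuse names that would break the quoted CLI string.
    if any(ch in test_name for ch in '"\\\n'):
        raise ValueError(f"unsafe test name: {test_name!r}")
    exit_line = f"shell: process {pid} finished with exit status"
    return [
        # "\\n" emits a literal backslash-n, the form passed to con.input above.
        f'system.console.con.input "{test_name}\\n"',
        f'system.console.con.bp-break-console-string "{exit_line}" -once',
        'system.console.con.bp-break-console-string "panic" -once',
        'system.console.con.bp-break-console-string "failed assertion" -once',
        "run",
    ]
```

Send the first batch, wait for the shell breakpoint to fire, and only then send the second. One way to deliver each line is tmux send-keys -t codex-simics '<line>' Enter, assuming the tmux session from this section.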

3. What a pass really means

Do not say "pass" just because you saw the success line once.

For long tests, this is not enough:

shell: process 3 finished with exit status 0

This is enough:

shell: process 3 finished with exit status 0
[410-shell]$

Why this matters:

bp-break-console-string fires as soon as the matched text reaches the screen, which can be mid-line
the process may have finished but the shell may not yet have regained control
some stress tests print success-looking text before they are actually done

4. Long test policy

For cho, cho2, cho_variant, and similar tests:

monitor the VGA capture, not just the tmux pane
if nothing changes for about 60 seconds and the shell has not returned, treat it as a likely deadlock
stop, collect evidence, restart fresh

Host-side monitoring pattern:

strings /path/to/p3/tests/cho2.console.out | tail -n 120

The capture file is the real guest output log in this workflow.
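
The 60-second rule can be made mechanical. The sketch below is an illustration under one assumption: that the capture file stops growing when the guest screen stops changing. StallDetector is a hypothetical name, and the clock value is passed in so the logic can be checked without waiting.

```python
class StallDetector:
    """Flags a likely deadlock when the capture file stops growing."""

    def __init__(self, limit_seconds=60.0):
        self.limit_seconds = limit_seconds
        self.last_size = None    # size seen at the last change
        self.last_change = None  # time of the last change

    def update(self, now, size):
        """Record one sample; return True once size has been flat for limit_seconds."""
        if self.last_size is None or size != self.last_size:
            # First sample or new output: restart the idle timer.
            self.last_size = size
            self.last_change = now
            return False
        return (now - self.last_change) >= self.limit_seconds
```

A polling loop would call update(time.monotonic(), os.path.getsize(path)) every few seconds. A True result only means "no progress"; still check that the shell prompt has not returned before calling it a deadlock.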

5. The capture file is weird on purpose

system.console.con.capture-start records VGA screen snapshots, not a normal line-buffered log.

Consequences:

the same text may appear many times
lines may be truncated in one frame and complete in a later frame
tail -n on the raw file is usually useless

Prefer:

strings tests/console.out | tail -n 80

If you need more control:

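python3 - <<'PY'
from pathlib import Path
# latin1 maps every byte to a character, so raw capture bytes never fail to decode.
data = Path('tests/console.out').read_text('latin1', errors='ignore')
needle = 'shell: process 3 finished with exit status'
idx = data.find(needle)  # first occurrence, or -1 if absent
if idx == -1:
    print('needle not found')
else:
    print(idx)
    print(data[idx:idx+400])
PY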
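
Because the same text shows up in many snapshots, it can help to collapse repeats after extracting printable runs. The following is a rough sketch: printable_lines only approximates what strings does, and both function names are hypothetical.

```python
import re


def printable_lines(raw, min_len=4):
    """Approximate `strings`: return printable ASCII runs of at least min_len bytes."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [run.decode("ascii") for run in re.findall(pattern, raw)]


def collapse_repeats(lines):
    """Drop lines that were already seen, keeping first-seen order."""
    seen = set()
    out = []
    for line in lines:
        key = line.rstrip()
        if key and key not in seen:
            seen.add(key)
            out.append(key)
    return out
```

Usage would be collapse_repeats(printable_lines(Path('tests/console.out').read_bytes())). The trade-off is that output the guest legitimately printed twice is also collapsed, so use this for reading, not for counting.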

6. String breakpoints fire too early

This is one of the biggest Simics debugging traps here.

If you break on:

system.console.con.bp-break-console-string "panic" -once

or:

system.console.con.bp-break-console-string "shell: process 3 finished with exit status" -once

Simics stops as soon as those bytes hit the VGA buffer. The full line may not be visible yet.

Concrete example

During cho, an early breakpoint on the string "malloc_lock corrupted" showed only:

sfree: malloc_lock corrupted

That was not enough. The useful data was still missing. Running again without the early breakpoint and stopping a little later exposed:

sfree: malloc_lock corrupted obj=0x00274000 caller=0x001020f5 ...

That address was the real clue.

What to do instead

If a breakpoint only gives you a prefix:

rerun without that early breakpoint
poll the capture file from the host
stop only after the full line has landed
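
When polling the capture from the host, one way to decide that "the full line has landed" is to look for a snapshot that shows more than the bare needle. This is a sketch with hypothetical helper names; it assumes the input is line-separated text such as the output of strings.

```python
def fullest_match(text, needle):
    """Return the longest line tail that starts at needle, or None if absent."""
    best = None
    for line in text.splitlines():
        pos = line.find(needle)
        if pos == -1:
            continue
        tail = line[pos:].rstrip()
        if best is None or len(tail) > len(best):
            best = tail
    return best


def has_detail(text, needle):
    """True once some snapshot shows text beyond the bare needle."""
    best = fullest_match(text, needle)
    return best is not None and len(best) > len(needle)
```

With the cho example, has_detail stays False while only "sfree: malloc_lock corrupted" is visible and turns True once the obj= and caller= fields appear. It cannot tell whether the line is finished, only that it is longer than the prefix, so give it a few more polls before stopping.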

7. Address to source workflow

When VGA output gives you an address, resolve it immediately.

Use:

addr2line -e kernel -f -C 0x001020f5
nm -n kernel | rg '001020f5|malloc_lock|page_free_pt'

Typical uses:

caller address from an assertion or panic
object address from a corrupted pointer or lock
checking whether an address is near a known global symbol

Concrete examples

Map a caller address:

addr2line -e kernel -f -C 0x001020f5

This resolved to page_free_pt during one cho investigation.

Map a suspicious object address:

nm -n kernel | rg 'malloc_lock|001ec084'

This showed that 0x001ec084 sat inside malloc_lock.
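
Grepping nm output works when you already suspect a symbol. To go from a raw address to the closest symbol at or below it, a small lookup over nm -n output is enough. The sketch below uses hypothetical function names, and it expects the plain three-column nm -n format.

```python
import bisect


def parse_nm(output):
    """Parse `nm -n` lines of the form '<hex address> <type> <name>'."""
    symbols = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # e.g. undefined symbols, which have no address
        try:
            symbols.append((int(parts[0], 16), parts[2]))
        except ValueError:
            continue
    symbols.sort()
    return symbols


def containing_symbol(symbols, addr):
    """Return (name, offset) for the closest symbol at or below addr, or None."""
    addrs = [a for a, _ in symbols]
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return None
    base, name = symbols[i]
    return name, addr - base
```

Feed it the text of nm -n kernel. The result is a hint, not proof: without symbol sizes, an address far past the start of the closest symbol may belong to something nm does not list.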

8. Simics stop is not a guest backtrace

Do not assume this works:

stop
stack-trace

In this setup, Simics often replies:

No current debug object

So:

a host-side stop only stops simulation
it does not automatically give you a guest call stack

If the guest already printed an EIP, caller, or fault site, use addr2line. If the guest kernel debugger itself printed a stack trace, save that output from the VGA capture.

9. Narrow instrumentation beats clever instrumentation

If you add debugging checks, make them local and boring.

Good examples:

assert that a pointer being freed is in the expected allocation region
assert that a queue link is NULL before enqueue
assert that a page table page is page-aligned
add a pre-sfree() overlap check before calling into the allocator

Bad examples:

inferring global ownership from the raw internals of a contended semaphore
asserting that sem_t.count == 0 during a critical section
assuming a mutex's internal fields stay unchanged while other waiters queue

Real lesson

An early probe asserted that malloc_lock must have a fixed internal state between sem_wait() and sem_signal(). That was wrong. Other waiters were allowed to change the semaphore internals legitimately, so the assertion created a false bug report.

The fix was not a kernel fix. The fix was to remove the bad probe.

10. Treat your own changes as suspects

If a bug appears after a long debugging session:

inspect staged and unstaged changes
inspect recent commits, especially your own
compare the current design against the last known-good checkpoint

In this repo, this was often useful:

git status --short
git log --oneline --decorate -n 20
git show <commit>:kern/task/task.c | sed -n '1,220p'
git show <commit>:kern/vm/vm.c | sed -n '1,260p'

Do not assume the newest code is the best code. Regressions often come from instrumentation, defensive rewrites, or "temporary" design changes.

11. Prove the fix is real

After a promising pass, rerun the target sequence with no code changes.

If needed, prove that nothing changed:

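# git status alone misses further edits to an already-modified file, so save the diff too
git status --short > /tmp/status_before.txt
git diff HEAD > /tmp/diff_before.txt
# run the full sequence
git status --short > /tmp/status_after.txt
git diff HEAD > /tmp/diff_after.txt
diff -u /tmp/status_before.txt /tmp/status_after.txt
diff -u /tmp/diff_before.txt /tmp/diff_after.txt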

This matters because kernel debugging is vulnerable to accidental "fixes" caused by:

stale breakpoints
stale capture files
a changed build image
an assertion you forgot to remove
a rerun that was not actually comparable to the first one
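
The "changed build image" case is easy to rule out by hashing the image before and after. This is a minimal sketch with hypothetical helper names. Which file counts as the image is an assumption; the kernel binary passed to addr2line is one candidate.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_image(path, expected_digest):
    """True if the file still hashes to the digest recorded earlier."""
    return file_sha256(path) == expected_digest
```

Record file_sha256('kernel') next to the first passing run, then call same_image before trusting a rerun as comparable.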

12. A small shell harness is worth it

For repeated reruns, write a tiny host-side script that:

starts a fresh codex-simics session
arms the shell breakpoint
injects the test name
watches the capture file
declares pass only on exit status 0 plus shell return
declares likely deadlock after about 60 seconds of no progress

That avoids human error in long sequences like:

minclone_many
cho
cho2
cho_variant

The harness should still stay simple. It is only there to enforce the same decision rule every time.
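
The decision rule itself fits in one function. The sketch below is an illustration, not the repo's harness: verdict and its constants are hypothetical names, pid 3 is taken from the examples above, and the input is assumed to be cleaned capture text covering only the output since the test was injected.

```python
EXIT_PREFIX = "shell: process 3 finished with exit status "
PROMPT = "[410-shell]$"
FAILURE_MARKERS = ("panic", "failed assertion")


def verdict(text, idle_seconds, stall_limit=60.0):
    """Classify one test run as 'pass', 'fail', 'likely deadlock', or 'running'."""
    if any(marker in text for marker in FAILURE_MARKERS):
        return "fail"
    # Use the last exit-status line so earlier, truncated snapshots are ignored.
    pos = text.rfind(EXIT_PREFIX)
    if pos != -1:
        rest = text[pos + len(EXIT_PREFIX):]
        tokens = rest.split()
        # Require both an exit status and a later shell prompt.
        if tokens and PROMPT in rest:
            return "pass" if tokens[0] == "0" else "fail"
    if idle_seconds >= stall_limit:
        return "likely deadlock"
    return "running"
```

Keeping the rule in one place means every rerun of minclone_many, cho, cho2, and cho_variant is judged the same way. Note that an exit status line without a later prompt never counts as a pass here, no matter how long the harness waits.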

13. Summary checklist

Before each run:

fresh codex-simics session
fresh capture file
shell-ready breakpoint armed

During the run:

inject test only after the shell breakpoint fires
watch the capture file, not just the host pane
assume failure breakpoints may stop mid-line

After the run:

require exit status 0 and later [410-shell]$
if failing, resolve addresses with addr2line
distrust fancy instrumentation before blaming the kernel
rerun with no edits before declaring victory
