DEBUG_TIPS.md

Practical notes for agents debugging this kernel with Simics.

This file does not replace SETUP.md. Read Option C or Option D there first. This file explains what to watch out for after Simics is already working.

1. Core mental model

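There are two separate consoles: the Simics command line (the simics> prompt) and the guest console, which receives input through system.console.con.input.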

Do not mix them up.

If you accidentally type a test name into simics> instead of using system.console.con.input, nothing useful happens.

2. Recommended test protocol

For serious debugging, use a fresh Simics session for each test.

Use a fixed tmux session name:

tmux kill-session -t codex-simics 2>/dev/null || true
tmux new-session -d -s codex-simics \
  'cd /path/to/p3 && SIMICS_TEXT_CONSOLE=yes ~/simics/pebsim/pebsim7'

Then run this sequence:

system.console.con.capture-start "/path/to/p3/tests/test.console.out" -overwrite
system.console.con.bp-break-console-string "[410-shell]$" -once
run

Wait for the shell breakpoint to fire. Only then send the test:

system.console.con.input "getpid_test1\n"
system.console.con.bp-break-console-string "shell: process 3 finished with exit status" -once
system.console.con.bp-break-console-string "panic" -once
system.console.con.bp-break-console-string "failed assertion" -once
run

If the exit-status breakpoint fires, do not stop there. Arm the shell prompt again and resume:

system.console.con.bp-break-console-string "[410-shell]$" -once
run

Only count the test as pass when you have both:

  1. shell: process 3 finished with exit status 0
  2. a later [410-shell]$

That rule matters for cho, cho2, and cho_variant.
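The two-part rule above can be sketched as a small Python check over the captured console text. This is only an illustration: the function name is_pass is hypothetical, and it assumes the capture text preserves the order in which the guest printed things.

```python
import re

# Success line from the shell; the lookahead rejects statuses such as 01 or 05.
SUCCESS_RE = re.compile(r"shell: process 3 finished with exit status 0(?!\d)")
PROMPT = "[410-shell]$"


def is_pass(capture_text):
    """Return True only if the success line is followed by a later shell prompt."""
    match = SUCCESS_RE.search(capture_text)
    if match is None:
        return False
    # A prompt that appeared before the success line does not count.
    return capture_text.find(PROMPT, match.end()) != -1
```

A prompt seen before the test started is ignored on purpose, because the search for the prompt begins where the success line ends.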

3. What a pass really means

Do not say "pass" just because you saw the success line once.

For long tests, this is not enough:

shell: process 3 finished with exit status 0

This is enough:

shell: process 3 finished with exit status 0
[410-shell]$

Why this matters:

4. Long test policy

For cho, cho2, cho_variant, and similar tests:

Host-side monitoring pattern:

strings /path/to/p3/tests/cho2.console.out | tail -n 120

The capture file is the real guest output log in this workflow.

5. The capture file is weird on purpose

system.console.con.capture-start records VGA screen snapshots, not a normal line-buffered log.

Consequences:

Prefer:

strings tests/console.out | tail -n 80

If you need more control:

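python3 - <<'PY'
from pathlib import Path
# latin1 maps every byte to a character, so raw screen data never fails to decode.
data = Path('tests/console.out').read_text('latin1', errors='ignore')
needle = 'shell: process 3 finished with exit status'
idx = data.find(needle)
if idx < 0:
    # Without this guard, idx == -1 would silently print an unrelated slice.
    raise SystemExit(f'{needle!r} not found in capture')
print(idx)
print(data[idx:idx+400])
PY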
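If strings is not available on the host, the strings | tail step can be approximated in Python. This is a rough sketch, not a replacement for strings: the helper names printable_runs and tail_runs are hypothetical, and it keeps only runs of printable ASCII of at least four characters.

```python
import re


def printable_runs(raw, min_len=4):
    """Return runs of at least min_len printable ASCII bytes found in raw."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [run.decode("ascii") for run in pattern.findall(raw)]


def tail_runs(raw, count=80):
    """Return the last count printable runs, similar in spirit to tail -n."""
    return printable_runs(raw)[-count:]
```

Read the capture with Path('tests/console.out').read_bytes() and pass the result to tail_runs.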

6. String breakpoints fire too early

This is one of the biggest Simics debugging traps here.

If you break on:

system.console.con.bp-break-console-string "panic" -once

or:

system.console.con.bp-break-console-string "shell: process 3 finished with exit status" -once

Simics stops as soon as those bytes hit the VGA buffer. The full line may not be visible yet.

Concrete example

During cho, an early breakpoint on the string "malloc_lock corrupted" showed only:

sfree: malloc_lock corrupted

That prefix was not enough, because the useful data had not been printed yet. Running again without the early breakpoint and stopping a little later exposed:

sfree: malloc_lock corrupted obj=0x00274000 caller=0x001020f5 ...

That address was the real clue.

What to do instead

If a breakpoint only gives you a prefix:

  1. rerun without that early breakpoint
  2. poll the capture file from the host
  3. stop only after the full line has landed
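Steps 2 and 3 can be sketched as a host-side poll. This is a minimal illustration under stated assumptions: the function names are hypothetical, the capture may contain no newlines, and "full line" is approximated as "the needle is followed within a short window by a token you expect", such as caller=0x from the example above.

```python
import time
from pathlib import Path


def find_complete(text, needle, must_contain, window=200):
    """Return the first window starting at needle that also holds must_contain."""
    start = text.find(needle)
    while start >= 0:
        chunk = text[start:start + window]
        if must_contain in chunk:
            return chunk
        start = text.find(needle, start + 1)
    return None


def wait_for_complete(path, needle, must_contain, timeout_s=60.0, poll_s=1.0):
    """Poll the capture file until the complete message lands or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        capture = Path(path)
        text = capture.read_text("latin1") if capture.exists() else ""
        hit = find_complete(text, needle, must_contain)
        if hit is not None or time.monotonic() >= deadline:
            return hit
        time.sleep(poll_s)
```

A result of None means the timeout expired with only a prefix, or nothing, in the capture.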

7. Address to source workflow

When VGA output gives you an address, resolve it immediately.

Use:

addr2line -e kernel -f -C 0x001020f5
nm -n kernel | rg '001020f5|malloc_lock|page_free_pt'

Typical uses:

Concrete examples

Map a caller address:

addr2line -e kernel -f -C 0x001020f5

This resolved to page_free_pt during one cho investigation.

Map a suspicious object address:

nm -n kernel | rg 'malloc_lock|001ec084'

This showed that 0x001ec084 sat inside malloc_lock.
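The nm lookup can also be scripted. The sketch below parses nm -n style output and reports the nearest symbol at or below an address, with the offset into it. The function names are hypothetical, and because this output format carries no symbol sizes, the result is only a candidate: confirm it against the next symbol's address or with addr2line.

```python
import bisect


def parse_nm(nm_output):
    """Parse 'address type name' lines into a sorted list of (address, name)."""
    symbols = []
    for line in nm_output.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # undefined symbols have no address column
        address, _kind, name = parts
        try:
            symbols.append((int(address, 16), name))
        except ValueError:
            continue  # not a hex address, skip the line
    symbols.sort()
    return symbols


def enclosing_symbol(symbols, address):
    """Return (name, offset) for the nearest symbol at or below address."""
    addresses = [addr for addr, _name in symbols]
    index = bisect.bisect_right(addresses, address) - 1
    if index < 0:
        return None
    base, name = symbols[index]
    return name, address - base
```

Feed it the text of nm -n kernel, then call enclosing_symbol(symbols, 0x001ec084).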

8. Simics stop is not a guest backtrace

Do not assume this works:

stop
stack-trace

In this setup, Simics often replies:

No current debug object

So:

If the guest already printed an EIP, caller, or fault site, use addr2line. If the guest kernel debugger itself printed a stack trace, save that output from the VGA capture.

9. Narrow instrumentation beats clever instrumentation

If you add debugging checks, make them local and boring.

Good examples:

Bad examples:

Real lesson

An early probe asserted that malloc_lock must have a fixed internal state between sem_wait() and sem_signal(). That was wrong. Other waiters were allowed to change the semaphore internals legitimately, so the assertion created a false bug report.

The fix was not a kernel fix. The fix was to remove the bad probe.

10. Treat your own changes as suspects

If a bug appears after a long debugging session:

  1. inspect staged and unstaged changes
  2. inspect recent commits, especially your own
  3. compare the current design against the last known-good checkpoint

In this repo, this was often useful:

git status --short
git log --oneline --decorate -n 20
git show <commit>:kern/task/task.c | sed -n '1,220p'
git show <commit>:kern/vm/vm.c | sed -n '1,260p'

Do not assume the newest code is the best code. Regressions often come from instrumentation, defensive rewrites, or "temporary" design changes.

11. Prove the fix is real

After a promising pass, rerun the target sequence with no code changes.

If needed, prove that nothing changed:

git status --short > /tmp/status_before.txt
# run the full sequence
git status --short > /tmp/status_after.txt
diff -u /tmp/status_before.txt /tmp/status_after.txt

This matters because kernel debugging is vulnerable to accidental "fixes" caused by:

12. A small shell harness is worth it

For repeated reruns, write a tiny host-side script that:

That avoids human error in long sequences like:

The harness should still stay simple. It is only there to enforce the same decision rule every time.
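One piece such a harness might contain is a function that emits the same Simics command sequence for every test, so the breakpoints are never typed by hand. This is a sketch only: the function name and the three phase names are hypothetical, and the commands are copied from the test protocol in section 2.

```python
PROMPT = "[410-shell]$"
BREAK = "system.console.con.bp-break-console-string"


def simics_commands(test_name, capture_path):
    """Return the per-phase Simics commands for one test run."""
    if not test_name.replace("_", "").isalnum():
        raise ValueError(f"unexpected test name: {test_name!r}")
    boot = [
        f'system.console.con.capture-start "{capture_path}" -overwrite',
        f'{BREAK} "{PROMPT}" -once',
        "run",
    ]
    launch = [
        # The \n is sent literally so that Simics interprets it as Enter.
        f'system.console.con.input "{test_name}\\n"',
        f'{BREAK} "shell: process 3 finished with exit status" -once',
        f'{BREAK} "panic" -once',
        f'{BREAK} "failed assertion" -once',
        "run",
    ]
    confirm = [f'{BREAK} "{PROMPT}" -once', "run"]
    return {"boot": boot, "launch": launch, "confirm": confirm}
```

Each phase is sent only after the previous breakpoint has fired. With the tmux session from section 2, tmux send-keys -t codex-simics is one way to deliver each line.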

13. Summary checklist

Before each run:

During the run:

After the run:
