Deterministically debugging concurrent GC bugs with rr
28 Apr 2019Multicore OCaml comes with a concurrent garbage collector, where the garbage collector and the mutator threads run concurrently. Debugging concurrent GC bugs has been the most frustrating / satisfying (when fixed) part of Multicore OCaml development. rr, a record and replay tool has made debugging concurrent GC bugs a sustainable exercise. In this short post, I’ll describe why.
A particularly tricky concurrent GC bug is one which occurs once every 10 to 100
runs due to non-determinism and any attempt to instrument the program to isolate
the bug (simplifying the program, adding print statements, etc.) makes it
disappear. The bug may only appear relatively late in the program run – after a
few major GC cycles, where the program might have allocated 10s of gigabytes of
memory by then. The bug usually manifests as a segfault due to illegal memory
access, but the source of the bug may lie in the previous GC cycle and perhaps
due to actions of a different thread than the one that is throwing up the error.
gdb
often doesn’t help since finding the illegal memory access may not give
any clue as to when the heap was corrupted.
rr
to the rescue. rr
is an enhancement over gdb
with support for recording
an execution and debugging in reverse. Once a failing execution is recorded,
the execution can be replayed multiple times deterministically. This removes the
non-determinism from debugging session. gdb
does support record and replay,
but not on multi-threaded targets.
The fact that the program can be run in reverse is the key for debugging heap
corruptions. An illegal access typically appears as a load or store to a illegal
memory address obtained from a heap object. When such an illegal access is
found, I set a hardware watchpoint on the heap address containing the illegal
address and continue the program in reverse. rr
runs the program in reverse until
the write that stored the illegal address in the heap object! Usually, several
transitive reverse runs are necessary to get to the source of the bug, but this
is just mechanics.
While rr
supports multi-threaded programs, it runs every thread on the same
core. This usually makes the bug disappear. Luckily, rr
comes with support for
forcing a context switch after a certain number of CPU ticks (measured in terms
of the number of retired conditional branches). Even with this option, you will
need many runs before rr
comes across a buggy execution. So I use the
following command:
for i in {1..10000}; do rr record -c 10000 <program> <args>; if (( $? == 0 )); then echo "done $i"; else break; fi; done
which runs <prog> <args>
under rr
where a thread is allowed to execute for a
maximum of 10,000 ticks before a context switch. rr
runs are repeated until a
crash is found or 10,000 rr
runs are successfully completed. Depending on the
program being debugged, I leave it running overnight. If rr
had in fact found
a crash, I can perform replay debugging with rr replay
the following morning
and have a deterministic and reversible recorded execution to work with.
rr
has save countless hours in the development of Multicore OCaml, and rr
should be a essential tool in every GC hacker’s toolbox.