KC Sivaramakrishnan CTO @ Tarides

Deterministically debugging concurrent GC bugs with rr

Multicore OCaml comes with a concurrent garbage collector, where the garbage collector and the mutator threads run concurrently. Debugging concurrent GC bugs has been the most frustrating / satisfying (when fixed) part of Multicore OCaml development. rr, a record and replay tool has made debugging concurrent GC bugs a sustainable exercise. In this short post, I’ll describe why.

A particularly tricky concurrent GC bug is one which occurs once every 10 to 100 runs due to non-determinism and any attempt to instrument the program to isolate the bug (simplifying the program, adding print statements, etc.) makes it disappear. The bug may only appear relatively late in the program run – after a few major GC cycles, where the program might have allocated 10s of gigabytes of memory by then. The bug usually manifests as a segfault due to illegal memory access, but the source of the bug may lie in the previous GC cycle and perhaps due to actions of a different thread than the one that is throwing up the error. gdb often doesn’t help since finding the illegal memory access may not give any clue as to when the heap was corrupted.

rr to the rescue. rr is an enhancement over gdb with support for recording an execution and debugging in reverse. Once a failing execution is recorded, the execution can be replayed multiple times deterministically. This removes the non-determinism from debugging session. gdb does support record and replay, but not on multi-threaded targets.

The fact that the program can be run in reverse is the key for debugging heap corruptions. An illegal access typically appears as a load or store to a illegal memory address obtained from a heap object. When such an illegal access is found, I set a hardware watchpoint on the heap address containing the illegal address and continue the program in reverse. rr runs the program in reverse until the write that stored the illegal address in the heap object! Usually, several transitive reverse runs are necessary to get to the source of the bug, but this is just mechanics.

While rr supports multi-threaded programs, it runs every thread on the same core. This usually makes the bug disappear. Luckily, rr comes with support for forcing a context switch after a certain number of CPU ticks (measured in terms of the number of retired conditional branches). Even with this option, you will need many runs before rr comes across a buggy execution. So I use the following command:

for i in {1..10000}; do rr record -c 10000 <program> <args>; if (( $? == 0 )); then echo "done $i"; else break; fi; done

which runs <prog> <args> under rr where a thread is allowed to execute for a maximum of 10,000 ticks before a context switch. rr runs are repeated until a crash is found or 10,000 rr runs are successfully completed. Depending on the program being debugged, I leave it running overnight. If rr had in fact found a crash, I can perform replay debugging with rr replay the following morning and have a deterministic and reversible recorded execution to work with.

rr has save countless hours in the development of Multicore OCaml, and rr should be a essential tool in every GC hacker’s toolbox.

Creative Commons License kcsrk dot info by KC Sivaramakrishnan is licensed under a Creative Commons Attribution 4.0 International License.