We recently discovered that we would very sporadically get a segmentation fault when running the whole test suite of the VAST Platform (VA Smalltalk) on Linux.
During our initial investigation, we realized this was not something new, but that the problem had existed for many years. It happens only on Linux (not Windows), and with all program variants: with and without JIT compiler, with and without memory sharing, 32 and 64-bit, etc. At least we knew it was not related to those broader differences.
In the end, we were able to find and fix the bug. In this post, I’ll outline some of the experiments we performed and conclude with the final GDB approach that ultimately did the trick.
Attempts at running the whole test suite
For the internal development of the VAST Platform, we use a huge test suite of about 16,000 total tests. When diagnosing an issue like this, the first obvious step was to just run the suite and see if we would get the “segmentation fault”. Obviously we did not. Otherwise, we would have fixed this long ago. We then tried randomizing the order of the tests, running the suite multiple times, etc. Still no luck.
One of the peculiarities about this issue is that there are very few scenarios that could cause a segmentation fault in VAST. But when it does occur, it normally dumps a vmtrap.log
which provides you a lot of useful information for debugging. Unfortunately, in this case, the log was not being produced. In addition, when getting this segmentation fault, it seemed that VAST was doing a regular successful exit (quitting the image, and shutting down the VM).
That made us think that the problem could be related to the “exit code” in the Virtual Machine. The next step was to write a bash script that would do a lot of image starts & exits so we could observe the exit codes:
#!/bin/bash
# "Settings"
vaRoot="/usr/local/VASmalltalk/10.0.0x64"
vaVMPath="$vaRoot/bin/es"
workingDirectoryPath="$PWD"
log="$workingDirectoryPath/ctest.log"
# copied them from environments directory.
# The icx needs to have SUnitOneClick map loaded
imagePath="$workingDirectoryPath/abt.icx"
iniPath="$workingDirectoryPath/abt.ini"
cnfPath="$vaRoot/abt.cnf"
# The -no_break is strictly necessary because else it looks like
# the VM randomly freezes and we get defunct processes ..
# so the only workaround for now is to disable the break process.
vaVMArguments=" -no_break -mcd -msd -i$imagePath -ini:$iniPath"
# This testcases.txt file can be autogenerated with the Smalltalk code:
# CtestSunitRunner basicNew createTestCaseList
testCaseFile="$workingDirectoryPath/testcases.txt"
echo "Testcase file: $testCaseFile"
# Doing a cd is important here because else the ICs won't be found in the
# defined IC_ROOT in the .ini
cd $vaRoot
# If we don't specify this LANG or any other compatible one
# we would get quite some test failures
export LANG=en_US.iso88591
# Necessary, else the esvm40.so for example is not found
export LD_LIBRARY_PATH="$vaRoot/bin"
# Just to always start with a fresh log
[-e "$log"] && rm "$log"
echo " ======== Starting process ======="
set -e
while IFS='' read -r line || [[-n "$line"]]; do
# Print some status on the console...the rest..directly into the log.
echo "Line: $line"
# Below two lines create forces a new abt.cnf files and appends
# the necessary code to run the test with Ctest
echo "PostStartUp ! " > $cnfPath
# This line just extracts the class name from the line and appends the
# code to abt.cnf
echo "SunitRunner run: #`echo "$line" | cut -d '#' -f2- | cut -d '-' -f1` !" >> $cnfPath
# Start VA VM with all the necessary arguments and redirect stdout/stderr
# to the specified log file
$vaVMPath $vaVMArguments < /dev/null >> "$log" 2>&1
echo "Line finished"
done < "$testCaseFile"
echo " ======== Process finished ======="
echo `cat "$log" | grep "CTestFailure"`
I don’t want to get into the gory details as that’s not the focus of this post, but basically, the script does the following:
- Reads the file
testcases.txt
which has allTestCase
subclasses (one test class name per line) of the whole test suite. - For each test class, it auto-generates a file called
abt.cnf
that is automatically read by VAST at startup, and that file includes the code to run that test (SunitRunner run: #MyTestCaseSubclassExample
) . - For each test class, it launches VAST. At startup, it runs the test. When done, it will print the test results and exit.
- It will keep repeating steps 2 and 3 until the end of
testcases.txt
Finally, when running this bash script, we were able to sporadically reproduce that segmentation fault. (Sometimes we needed to run this script a few times to get to the crash.)
Trying to run with GDB
Ok…we were able to somehow, every once in a while, reproduce the crash with the “release” VM. The next step was to compile the VAST VM in “debug” mode and cross our fingers that we’d be able to still reproduce the bug. (Sometimes life isn’t easy and bugs don’t show up with the debug flavor).
I normally compile and run the VAST VM under GDB with a command like this:
export VM_COMPILED_BIN=/home/mpeck/Instantiations/git/vm-devel/devel/build/bin; \
export VA_ROOT=/usr/local/VASmalltalk/10.0.0x64; \
cd $VA_ROOT ; \
export LANG=en_US.iso88591 ; \
export LD_LIBRARY_PATH=$VM_COMPILED_BIN:$VA_ROOT/bin; \
gdb -ex run --args \
$VM_COMPILED_BIN/es -no_break -ini:/home/mpeck/Instantiations/SUnitOneClick/abt.ini -i/home/mpeck/Instantiations/SUnitOneClick/abt.icx |& tee gdb.log
Therefore, I altered above bash script and tried to replace the line:
$vaVMPath $vaVMArguments < /dev/null >> "$log" 2>&1
with something like:
gdb -ex run --args $vaVMPath $vaVMArguments >> "$log" 2>&1
Now, after running that new version of the bash script we received both good and bad news. The good news was that the bug still shows up with the debug VM. The bad news was that gdb
ends the session if it’s run in a bash script. Therefore, I wasn’t able to get into the GDB console once the segmentation fault was being triggered.
Narrowing down the reproducible case
Reproducing the crash with the bash script fueled even more of our suspicions relating to the “exit code” in the VM. We thought that maybe it wasn’t necessary to run the whole suite of tests, but rather, just running one single test class hundreds of times could still reproduce it.
In other words, altering testcases.txt
with a single line with a single test (TestProcessPrims
in our case) and adding an outer loop in bash to repeat the process 1000 times.
Our hunch was right and it still reproduced the problem. Now this was going to be much easier.
Run a program 1000 times with GDB
Finally…we are getting close to the most helpful piece of this post.
We knew that all we needed to do was to run VAST 1000 times with gdb
with an abt.cnf
like below, but without ending the GDB session when the segmentation fault was hit.
PostStartUp !
SunitRunner run: #TestProcessPrims !
Researching online how to do this, we arrived at this useful post that shows how you can write a GDB script for that.
We created the file /tmp/script_file.txt
as follow:
set pagination off
handle SIGUSR1 noprint nostop
handle SIGUSR2 noprint nostop
set $n = 1000
while $n-- > 0
printf "\n\n=========================\nstarting program: %d\n", 1000-$n
run
if $_siginfo
printf "Received signal %d, stopping\n", $_siginfo.si_signo
loop_break
else
printf "program exited\n"
end
end
And then ran GDB and VAST like this:
export VM_COMPILED_BIN=/home/mpeck/Instantiations/git/vm-devel/devel/build/bin; \
export VA_ROOT=/usr/local/VASmalltalk/10.0.0x64; \
cd $VA_ROOT ; \
export LANG=en_US.iso88591 ; \
export LD_LIBRARY_PATH=$VM_COMPILED_BIN:$VA_ROOT/bin; \
gdb -x /tmp/script_file.txt --args \
$VM_COMPILED_BIN/es -no_break -ini:/home/mpeck/Instantiations/SUnitOneClick/abt.ini -i/home/mpeck/Instantiations/SUnitOneClick/abt.icx
Note the gdb -x /tmp/script_file.txt
.
Finally!!!! When running that, we were able to hit the segmentation fault, get into the GDB console, and debug.
The line 96 of estimer.c
was what we needed to understand the problem and fix it.
This was an ancient bug. VAST’s main thread sets the ‘done’ flag to be later read by the ‘delay’ thread (estimer.c
) and then continues to execute. Sometimes, a rare condition appears when the main thread is done shutting down and cleans the ‘platform globals’ struct where the ‘done’ flag is located. Thus, by the time the ‘delay’ thread attempts to read the ‘done’ flag, it’s actually reading already freed memory causing a segfault. The easy fix is to wait for the ‘delay’ thread to finish after ‘done’ flag is set, but there are of course other possible fixes too. In any case, the bug and fix itself was not the purpose of this blog post, but rather to show how we approached the problem and some tips with GDB.
Conclusion
Reproducing bugs and debugging them is not usually as easy as one would like. In this particular case, it was a learning experience for me how to run a program multiple times from within GDB.
Top comments (0)