Paul Cochrane

Fixing a fifteen-year-old curve fit bug

Originally published on peateasea.de.

The backwards compatibility of Perl software is wonderful. That's why it's all the more jarring when you find a package that doesn't work. This is the story of a 15-year-old bug that I managed to track down and, fortunately, resolve.

Y'know, I hate it when software doesn't "just work"; both as a user and as a developer. Usually, something like this distracts me and I want to work out what the problem is before I can continue with whatever I was doing beforehand. Sometimes, diving down the rabbit hole doesn't lead anywhere and I have to find a workaround and move on. Other times I get lucky and I work out what the bug was and can fix it. This is one of those lucky times.

There's something magical about diving into solving a bug. I'm able to find real focus, time flies and I come out the other end tired yet invigorated. I really enjoy these kinds of bug-hunting puzzles and I seem to be fairly good at it.1 But maybe I'm just overly stubborn (in a good way!).

I like to understand why something doesn't work and then see if I can find a solution. And if I do find a solution, I like to understand why it works. After all, sometimes while poking around, I stumble on something that suddenly works. In such cases it's always good to understand why: that way I can be more sure that the solution really does solve the underlying problem.

As I mentioned above, one of the great things about Perl is that often code from a decade or so ago "just works". The Perl people put a lot of effort into ensuring backwards compatibility. That's why it's so surprising when something doesn't work, as with the situation I recently found myself in.

The backstory

So how did I get into this situation? Well, I wanted to run some old code of mine to process Arctic sea-ice extent data and see what the results look like now. To do that, I needed to install the project's dependencies. Unfortunately, running

$ cpanm --installdeps .

failed. Specifically, the distribution2 Algorithm::CurveFit (which I use to fit a curve to the sea-ice extent data) failed its tests and hence wouldn't install. Weird. That used to work without a problem. Hrm.

I checked the dist's info on meta::cpan, but it hadn't been updated since 2010. In other words, there hadn't been any recent changes to cause the failure. Since I'd written my program in 2014 and hadn't changed it since 2018, I'd expected everything to still run as normal. Also weird.

I browsed through the code on meta::cpan and found that it was plain Perl: it didn't use any XS or anything which looked special. In other words, this should still "just work".3 More weird.

Then I found out that this has been a known problem since 2017 and that there was a ticket for the issue on RT. The gist of the ticket was that the dist had stopped working on Debian stretch, whereas it used to work on Debian jessie.

Hrm, interesting. I wonder why?

I was starting to get pulled into the bug's curiosity vortex. I thought, "I'll have a quick look at this, maybe I can spot something". Famous last words. Down the rabbit hole I plunged...

Rebuilding the past

So where to start? My first instinct was to try and recreate the past. After all, the dist used to pass its test suite on Debian jessie with Perl 5.18 (the last Perl version I'd used). Thus, if I could see it working in the old environment, maybe I could spot where things had changed. This would give a working reference system and a baseline so I could see what "correct" looked like.

To recreate this past state, I spun up a Vagrant virtual machine with Debian jessie. Here's the configuration I used in my Vagrantfile:

  config.vm.define "bare-test-jessie" do |bare_test_jessie|
    bare_test_jessie.vm.box = "debian/jessie64"
    bare_test_jessie.vm.network "private_network", ip: "192.168.23.107"
    bare_test_jessie.vm.network "forwarded_port", guest: 22, host: 2207, id: "ssh"
  end

Why did I use 107 at the end of the private network address and 2207 for the forwarded port? Well, this is simply the seventh entry in my Vagrantfile. There's nothing special about the number other than to keep my Vagrant VMs separate. Note that it's not necessary to specify the forwarded port; I do so to avoid any port conflicts when running multiple VMs.

Spinning up the new VM was a simple matter of:

$ vagrant up bare-test-jessie

Upon startup, I got this error:

The following SSH command responded with a non-zero exit status.
Vagrant assumes that this means the command failed!

            apt-get -yqq update
            apt-get -yqq install rsync

Stdout from the command:

WARNING: The following packages cannot be authenticated!
  rsync

Stderr from the command:

W: Failed to fetch http://security.debian.org/dists/jessie/updates/main/source/Sources 404 Not Found [IP: 151.101.194.132 80]

W: Failed to fetch http://security.debian.org/dists/jessie/updates/main/binary-amd64/Packages 404 Not Found [IP: 151.101.194.132 80]

W: Failed to fetch http://httpredir.debian.org/debian/dists/jessie/main/source/Sources 404 Not Found [IP: 199.232.190.132 80]

W: Failed to fetch http://httpredir.debian.org/debian/dists/jessie/main/binary-amd64/Packages 404 Not Found [IP: 199.232.190.132 80]

E: Some index files failed to download. They have been ignored, or old ones used instead.
E: There are problems and -y was used without --force-yes

This is often the first stumbling block when working with an old system: the APT repository is no longer available from the original address as listed in the VM. I.e. in this case, there's no entry for jessie at the URLs http://httpredir.debian.org/debian or http://security.debian.org/. There's also the related problem that APT repositories no longer use plain HTTP (they have to use HTTPS these days), but that's not the underlying issue here.

Fortunately, the APT repository is still available as part of the Debian distribution archives. This means I only needed to update the URLs in /etc/apt/sources.list to point to the archive.4

Although the error looked bad (with a wall of red text) the VM was up; rsync had failed to install, nothing more. Other than that, everything was running. Thus, I could ssh into the VM and set things up further:

$ vagrant ssh bare-test-jessie

Opening up vi5 on /etc/apt/sources.list

$ sudo vi /etc/apt/sources.list

I changed the content to look like this:

deb http://archive.debian.org/debian/ jessie main non-free contrib
deb-src http://archive.debian.org/debian/ jessie main non-free contrib
deb http://archive.debian.org/debian-security/ jessie/updates main non-free contrib
deb-src http://archive.debian.org/debian-security/ jessie/updates main non-free contrib

Now it was possible to update the package list and install some software:

$ sudo apt update
$ sudo apt -y upgrade
$ sudo apt -y --force-yes install vim curl build-essential

Note that the --force-yes option was necessary because some of the packages can't be authenticated due to their age. This is another one of those things one has to deal with when working with legacy systems.

I installed the vim package because, although vim was installed, it only ran in vi compatibility mode. I needed the vim command because that's what my muscle memory expects. Old habits die hard.

I also needed curl so that I could do the curl <url> | bash dance to install perlbrew. The build-essential package was, erm, essential so that I could build Perl from source.

Installing perlbrew and setting it up looked like this:

$ curl -L https://install.perlbrew.pl | bash
$ source ~/perl5/perlbrew/etc/bashrc
$ echo 'source ~/perl5/perlbrew/etc/bashrc' >> ~/.bashrc

Installing Perl looked like this:

$ perlbrew --notest install perl-5.18.4
$ perlbrew switch 5.18.4
$ perlbrew install-cpanm

I chose to install Perl 5.18.4 because that was the most recent version I'd used in my program way back in 2018. My reasoning was that if Algorithm::CurveFit had worked with that version back then, it should work with that version in a Debian jessie VM now.

Unfortunately, the cpan/Time-HiRes/t/nanosleep.t test fails (the only one to fail in the entire Perl test suite); hence it's necessary to use the --notest option for the installation to complete successfully.

After switching to the newly-installed Perl version, and installing cpanm in that environment, I was ready to try installing Algorithm::CurveFit to see if it works in my rebuild of the past:

$ cpanm Algorithm::CurveFit

Which worked! I.e. it built, tested and installed as expected. This was my baseline test: now I had a system configuration in which Algorithm::CurveFit worked (meaning, its test suite passed).

Now I needed to be able to dig around in the code and change things. To do that I needed to get the source tarball and unpack it:

$ wget https://cpan.metacpan.org/authors/id/S/SM/SMUELLER/Algorithm-CurveFit-1.05.tar.gz
$ tar -xvzf Algorithm-CurveFit-1.05.tar.gz

Changing into the directory created by tar, building the Makefile, and running the test suite from the source tarball, I was able to make sure that this code also worked as I expected:

$ cd Algorithm-CurveFit-1.05
$ perl Makefile.PL
Checking if your kit is complete...
Looks good
Generating a Unix-style Makefile
Writing Makefile for Algorithm::CurveFit
Writing MYMETA.yml and MYMETA.json
$ make test
cp lib/Algorithm/CurveFit.pm blib/lib/Algorithm/CurveFit.pm
PERL_DL_NONLAZY=1 "/home/vagrant/perl5/perlbrew/perls/perl-5.18.4/bin/perl" "-MExtUtils::Command::MM" "-MTest::Harness" "-e" "undef *Test::Harness::Switches; test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
t/00pod.t ........ skipped: Test::Pod 1.00 required for testing POD
t/00podcover.t ... skipped: Test::Pod::Coverage 1.00 required for testing POD coverage
t/01basic.t ...... ok
t/02bad_deriv.t .. ok
All tests successful.
Files=4, Tests=20, 2 wallclock secs ( 0.04 usr 0.03 sys + 1.30 cusr 0.09 csys = 1.46 CPU)
Result: PASS

It wasn't necessary to install the dependencies in this step, because the cpanm Algorithm::CurveFit call from earlier had already installed them.

To make absolutely sure that the test which fails "in the future" still worked on Debian jessie (and to ensure that I was using the local library, and not the one I'd installed globally via cpanm earlier),6 I ran the test via perl directly:

$ perl -Ilib t/02bad_deriv.t
1..13
ok 1
ok 2
ok 3
ok 4
ok 5
ok 6
ok 7
ok 8
ok 9
ok 10
ok 11
ok 12
ok 13

This was all the detail I could get at this point. I.e. the test suite could only tell me that the tests had passed.

Even so, now with a solid, working baseline reference system I was able to move forward and try to work out why the code didnā€™t work in the present day.

A quick peek before digging in

So, with a working reference system, what to do now? Simply staring at the code and mulling over what it does wasn't a good strategy at this stage because I'd not yet interacted with the code and seen how information flows through the system. Still, having a quick skim read of the code was a good idea, just to get a feel for the code's structure and generally to get a bird's eye view before looking for places to dig deeper.

Looking through the code, I found that the dist consists of a single Perl module with about 250 lines of code, excluding docs and comments. The module calculates a best fit of a given function to an input set of data, based upon an initial guess of the function's coefficients. The code seemed quite intricate, yet it was well commented and 250 lines of code seemed manageable to understand.
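
To get a feel for how the module is driven, here's a minimal usage sketch based on the curve_fit() interface described in the module's POD. The formula and initial guesses are the ones from t/02bad_deriv.t; the @xdata/@ydata values are made up for illustration:

use Algorithm::CurveFit;

# Formula and initial parameter guesses taken from t/02bad_deriv.t
my $formula    = 'sqrt( s*(x-x0)^2 + c) + y0';
my @parameters = (
    # name   guess    accuracy
    [ 's',   4.0,     0.01  ],
    [ 'c',   0.01,    0.001 ],
    [ 'y0',  0.1,     0.001 ],
    [ 'x0',  1.0535,  0.01  ],
);

# Made-up data generated from the parameter values the test expects
my @xdata = map { $_ / 10 } -20 .. 20;
my @ydata = map { sqrt( 5.15 * ( $_ - 1.045 )**2 + 0.01 ) + 0.1 } @xdata;

my $square_residual = Algorithm::CurveFit->curve_fit(
    formula            => $formula,
    params             => \@parameters,   # updated in place with the fitted values
    variable           => 'x',
    xdata              => \@xdata,
    ydata              => \@ydata,
    maximum_iterations => 100,
);

# After the fit, the second element of each sub-array holds the fitted value
print "$_->[0] = $_->[1]\n" for @parameters;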

Poking the system to gain a feel for the software

Ok, so what does the test failure look like on "later-than-jessie" systems?

$ perl -Ilib t/02bad_deriv.t
1..13
ok 1
ok 2
not ok 3
# Failed test at t/02bad_deriv.t line 50.
ok 4
ok 5
ok 6
ok 7
ok 8
ok 9
ok 10
ok 11
ok 12
ok 13
# Looks like you failed 1 test of 13.

Unfortunately, that's not very informative. Looking in the RT ticket, SREZIC mentions that it's possible to get a bit more information about why the tests in 02bad_deriv.t fail by using cmp_ok. Taking his advice, I appended these tests:

  cmp_ok($v + $eps[$par], ">", $val[$par]);
  cmp_ok($v - $eps[$par], "<", $val[$par]);

and ran the test file again (perl -Ilib t/02bad_deriv.t), getting this output:

not ok 5
# Failed test at t/02bad_deriv.t line 52.
# '5'
# >
# '5.15'

As my Master's supervisor used to ask me quite a lot: "Ya good, but what does it mean?". Good question. Dunno. Yet.

Looking at the test code more closely, we have this snippet:

my @val = (5.15, 0.01, 0.100, 1.045);
my @eps = (1.00, 0.10, 1.000, 0.100);

foreach my $par (0..$#parameters) {
  my $v = $parameters[$par][1];
  ok(defined $v);
  ok($v + $eps[$par] > $val[$par]);
  ok($v - $eps[$par] < $val[$par]);
  cmp_ok($v + $eps[$par], ">", $val[$par]);
  cmp_ok($v - $eps[$par], "<", $val[$par]);
}

which includes the newly added cmp_ok calls.

This loops over a set of parameters and compares their calculated values (in the @parameters array defined earlier in the test) with their expected values (in the @val array) to within a given tolerance for each (specified in the @eps array). What the test failure output is telling us is that the calculated value plus its tolerance (the first element of the @eps array) should be greater than the expected value of 5.15, yet it's only 5. Since the first value of the @eps array is 1.00, the calculated value must be 4. Checking the initial guess of what the parameters should be (from earlier in the test)

my @parameters = (
                ['s', 4.0, 0.01],
                ['c', 0.01, 0.001],
                ['y0', 0.1, 0.001],
                ['x0', 1.0535, 0.01],
              );

we see that the s parameter is initially set to 4.0, which suggests that this is the parameter not achieving its expected value. Was I 100% sure? Well, no. I like to be very sure that I've understood everything properly before moving onwards, so it's best to have a look at the parameters. Fortunately, the author had left some debugging code in the test:

#use Data::Dumper;
#print Dumper \@parameters;

Reactivating this (i.e. uncommenting it) and running the tests again, I got:

$ perl -Ilib t/02bad_deriv.t
1..13
$VAR1 = [
          [
            's',
            4,
            '0.01'
          ],
          [
            'c',
            '0.0471889540880574',
            '0.001'
          ],
          [
            'y0',
            '0.1',
            '0.001'
          ],
          [
            'x0',
            '1.0535',
            '0.01'
          ]
        ];

which shows the parameter values after having calculated the fit.

So there ya go: the first parameter is called s, its final value is 4 (as I calculated above) and its accuracy is 0.01.7

Staring at the test code, I realised that the @parameters array must be updated in place: setting @parameters to an initial guess and then asserting (as part of the test) that its values had changed after fitting the curve only makes sense if the fitting routine modifies the array itself. That was a good thing to know and a piece of the puzzle to set aside for later. I might not need it, but ya never know.
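
A tiny illustration (mine, not the module's code) of why that in-place update works: the fitting routine receives a reference to @parameters, so writing through that reference changes the caller's array:

# Hypothetical stand-in for the fitting routine: it gets a reference to
# @parameters, so assigning through it updates the caller's array in place.
my @parameters = ( [ 's', 4.0, 0.01 ] );

sub pretend_fit {
    my ($params) = @_;
    $params->[0][1] = 5.15;    # overwrite the initial guess with a "fitted" value
}

pretend_fit( \@parameters );
print $parameters[0][1], "\n";    # prints 5.15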

So what was this output telling me? The main point was that s wasn't changing at all: it had the value of 4 both before and after running the curve fitting algorithm.8 That's weird because in the Debian jessie code, we know that the test passes, which means that its final value should be close to 5.15. Why doesn't this value change anymore?

A numerical stability problem?

I'd like to say that I used a very focused and analytical approach to how I continued, but I can't: I turned to plain old print debugging.

Somehow I feel ashamed to admit that I'd simply used print. The thing is, it's easy to use, it's flexible, and it allows me to see how the values I'm interested in evolve at a single glance. What can I say? Data::Dumper is your friend!

Staring at the code again, I realised that it iteratively updates the parameters of the function to fit within a main while loop. This seemed to be the best place to look for anything funny going on. Thus, I started sprinkling print Dumper $<variable_name>; statements in various places, while comparing the output from the jessie system and from my bullseye system (where the tests were failing).

I got lucky: one thing stuck out. The @cols variable used as input to Math::MatrixReal->new_from_cols(), which creates a "matrix of datapoints X parameters", had a very large number turn up in its output. Like, HUGE. Where other numbers in the array were sitting between roughly -2.0 and 2.0, one number was 9.22229939811846e+20.9 Having done a fair bit of numerical computational work as part of my PhD and later while working on simulation systems and scientific computing, I smelled a numerical precision issue floating10 around somewhere. I noticed that part of the code used derivatives, hence I guessed that two numbers in a denominator somewhere were getting too close to one another, causing the value in the @cols array to explode.
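
For what it's worth, the debugging statements were nothing fancy; something along these lines, dropped in just before the new_from_cols() call (the placement is my own choice, only the variable name comes from the module):

use Data::Dumper;

# Inside the module's main while loop, just before new_from_cols() is called,
# dump the columns so the jessie and bullseye runs can be compared side by side.
print STDERR Dumper \@cols;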

Ok, so if we're calculating derivatives, what are we calculating derivatives of? The test tries to fit the following formula to the input data:

formula => 'sqrt( s*(x-x0)^2 + c) + y0'

Formatting this a bit more nicely, we have:

f(x) = \sqrt{s(x - x_0)^2 + c} + y_0

What does this function look like? Seeing a function's form when plotted can give a feel for how it behaves. I defined the function in the Sage mathematics software system and plotted it

# initial guess parameters
s = 4
y0 = 0.1
x0 = 1.0535
c = 0.01

# the fit function
f = sqrt(s*(x-x0)^2 + c) + y0

# plot the function and save it to file
p = plot(f, (x,-2, 2), axes_labels=['$x$', '$y$'])
p.save('algorithm-curvefit-bug-fit-function.png')

which produced this image:

Algorithm::CurveFit bug fit function with initial parameters

Hrm, this function is "interesting" just past x = 1. As the test filename suggests, this could be a function with a "bad derivative".

Of course, now the question is: what's the derivative? Again, using Sage, we can work this out:

# calculate the derivative of the fit function
f.derivative(x)
=> s*(x - x0)/sqrt(s*(x - x0)^2 + c)

which, written more nicely, looks like this

f'(x) = \frac{s(x - x_0)}{\sqrt{s(x - x_0)^2 + c}}

Plotting this function over the same range, we have

Algorithm::CurveFit bug derivative of fit function with initial parameters

Yup, this definitely has a funky derivative, especially just above x = 1. Now I was more sure that this is why the test file has the name it does. This indicates that around this value, we might be getting numerical instabilities. I decided to look closer.

Letting x = x0 in the derivative f'(x), we get

f'(x_0) = \frac{s(x_0 - x_0)}{\sqrt{s(x_0 - x_0)^2 + c}} = 0

In other words, this is where the derivative passes through zero. The only numerically interesting thing is happening in the denominator, which is

\sqrt{s(x - x_0)^2 + c}

but setting x = x0 here only gives

\sqrt{c}

which is nonzero (and not particularly small), so I didn't think this would be causing any nasty numerical problems. Sure, the derivative changes a lot around x = x0, but it doesn't seem like anything which could be causing a numerical instability.

Not being satisfied with the numerical instability explanation, I decided to look at the code more closely. There is a chunk of code close to the beginning of the main while loop which finds the value of the derivative for each x point of the function to fit and eventually adds these values to the @cols array. This chunk of code is split into two blocks: the first calculates the derivative if the $deriv variable is a CODE reference, and the second otherwise. Both blocks have a section where the code falls back to a numerical method if the derivative isn't a defined value.

When running the test t/02bad_deriv.t, only the second block is executed. I found that there's one case where the fallback is triggered: when the x value from the input data is equal to x0 (i.e. where the derivative changes the most). To calculate a value when the derivative is undefined, the code uses a closure called $formula_sub which (as far as I could tell) returns the value of the function to fit at this x value,11 thus providing a substitute for the undefined derivative value.

When the $formula_sub closure calculates the function to fit at x0, we can simplify the function f(x) to

f(x_0) = \sqrt{c} + y_0

which is non-zero and seems well-behaved. Again, this doesn't look like a numerical issue.

I then investigated the values that $formula_sub returns, as well as its input parameter values. I found that the enormous value appears exactly when x is equal to x0. Given the input parameter values and the known value of x, I calculated the value $formula_sub should return. Using the above expression for f(x0) and the values of c and y0, I got:

f(x_0) = \sqrt{0.01} + 0.1 = 0.1 + 0.1 = 0.2

This shouldn't be hard for a computer to calculate. Hence I think I can stomp on the numerical instability hypothesis and will have to find a different cause for the anomalous large numbers.

Structure in data and code points the way

Wild goose chases aren't exactly uncommon when debugging code. But then, serendipity also plays a role. So it was in this case for me.

While "print debugging" the parameter values passed into $formula_sub, I noticed a difference in structure between separate calls. Sometimes the structure looked like this:

$VAR1 = 'x';
$VAR2 = '1.0535';
$VAR3 = 's';
$VAR4 = [
          's',
          4,
          '0.01'
        ];
$VAR5 = 'c';
$VAR6 = [
          'c',
          '0.0467904042605691',
          '0.001'
        ];
$VAR7 = 'y0';
$VAR8 = [
          'y0',
          '0.1',
          '0.001'
        ];
$VAR9 = 'x0';
$VAR10 = [
           'x0',
           '1.0535',
           '0.01'
         ];

and sometimes like this:

$VAR1 = 'x';
$VAR2 = '0.7199';
$VAR3 = 's';
$VAR4 = 4;
$VAR5 = 'c';
$VAR6 = '0.0467904042605691';
$VAR7 = 'y0';
$VAR8 = '0.1';
$VAR9 = 'x0';
$VAR10 = '1.0535';

Looking back, this should have made some warning bells go off. Perhaps I was tired. All I thought at the time was, "hrm, that's odd". However, the thought was not lost. It got stored somewhere in the back of my mind, ready to make pennies drop later.

Digging through the rest of the code, I happened to notice that $formula_sub was called in other locations, this time not involved in derivative calculations. In this other situation, it assisted in the calculation of the residuals.

Here's the code for comparison, once embedded in the derivative calculation:

$value = $formula_sub->($xv, @parameters)

and again in the residuals calculation:

my @beta =
  map {
    $ydata[$_] - $formula_sub->(
        $xdata[$_],
        map { $_->[1] } @parameters
      )
  } 0 .. $#xdata;

The second instance isn't that clear, so removing some of the surrounding code and reformatting it to be on one line, we get

$formula_sub->($xdata[$_], map { $_->[1] } @parameters)

This passes the current value of x and the function's parameter values as a list of arguments into $formula_sub. The parameter values are stored in the second element of the sub-array within each element of the @parameters array, hence the $_->[1] bit.
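
To make that concrete, here's what that map pulls out of an array with the same structure (and initial guesses) as the test's @parameters:

# map extracts just the current value (the second element) of each parameter
my @parameters = ( [ 's', 4.0, 0.01 ], [ 'c', 0.01, 0.001 ] );
my @values = map { $_->[1] } @parameters;    # (4.0, 0.01)
print "@values\n";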

I noticed that this had the same rough form as the derivative calculation code

$value = $formula_sub->($xv, @parameters)

Hang on! What was that map doing in there?

And then something clicked. The difference in how $formula_sub was called was the reason for the difference in parameter structure within $formula_sub that I'd noticed earlier.

I then looked again at the parameter values appearing in $formula_sub. Instead of each parameter being associated with a plain value, it was associated with an array containing the parameter's full metadata. Not only is this doubled up, it's fairly obviously wrong.

With this new perspective, I realised that this was the underlying problem: the parameters weren't being passed to $formula_sub correctly when calculating the fallback derivative. What was happening was that the @parameters array (an array of array references) was being flattened onto the current x value to form the argument list. That's not correct: $formula_sub expects only a list of plain argument values, not a scalar followed by a list of array references.

What I guessed was probably happening (and now I'm getting into uncharted territory) is that a memory reference value was being passed to $formula_sub for one of the parameter values instead of the actual parameter value. A memory reference value could be any weird value and it wouldn't surprise me if this was the source of the huge number I was seeing.
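
One piece of supporting evidence for this guess: in Perl, using an array reference in numeric context yields its memory address, which is a large and essentially arbitrary integer. A small demonstration (the exact value differs on every run):

# An array reference used where a number is expected gives its address
my @parameters = ( [ 's', 4.0, 0.01 ] );
my $oops = 0 + $parameters[0];    # e.g. 94331337044856 -- huge and meaningless
print "$oops\n";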

Striving for a deeper explanation

Wait a minute. If this is the cause of the approximately 1e+20 number, and the code is the same now as it was on the jessie system, that means this issue should also exist on the jessie system. Hrm.

Looking at the $formula_sub value at x == x0 on my rebuilt jessie system, there it was: a huge number. This time it was 216930334118.067,9 which is of the order of 1e+11. This means the issue has been lurking since (at least) 2010.

So what's going on here? Why does the code work on the older system but not on the newer one?

My attempt at an explanation is this: on the jessie system, 1e+11 wasn't large enough to stop the fitting algorithm from converging to a sensible solution. However, when this number is as large as 1e+20, it dominates so much (as one would expect: it's 9 orders of magnitude larger than that on the jessie system) that the fitted solution never deviates from the initial guess. I also found that increasing the maximum number of iterations doesn't help convergence to the correct value, because the solution "converges" after only 3-4 iterations.

Of course, now the question becomes: why is the outlier function value on jessie approximately 1e+11 but on later systems it has jumped to approximately 1e+20?

My best guess is that it might be due to a change of integer precision in something, somewhere between jessie and stretch. For instance, consider that the maximum integer values for 32-bit and 64-bit are:

MAX_INT 32 bit: 2,147,483,647 => 2e+09
MAX_INT 64 bit: 9,223,372,036,854,775,807 => 9e+18

Note that both of these numbers are two orders of magnitude smaller than the large numbers I was seeing on jessie and later systems, respectively. Could we just be seeing MAX_INT*100 being converted into a floating point value? I honestly don't know, and this is where the trail becomes hard to follow.
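
If you want to check which integer size a given Perl build uses, perl itself can tell you (this only reports the build's native integer size; it doesn't prove anything about the bug):

use Config;

# Native integer size of this perl build and the corresponding maximum
# signed integer value.
print "ivsize (bytes): $Config{ivsize}\n";    # 4 on 32-bit builds, 8 on 64-bit builds
print "max signed IV:  ", ~0 >> 1, "\n";      # e.g. 9223372036854775807 on 64-bit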

There are now more questions than I think I can definitively answer:

  • does this point to a libc change? The main C library libc changed from 2.19 to 2.24 between jessie and stretch.
  • is this due to the GCC version change? This went from 4.9.x in jessie to 6.x in stretch.
  • in the change from GCC 4 to GCC 5 (which was on the way to GCC 6 in stretch), the default mode for C was changed from -std=gnu89 to -std=gnu11, which is a pretty large jump! Maybe this is a cause?

In each case, I couldn't find any evidence for something like an integer precision change. Of course, that doesn't mean it doesn't exist; it merely means I didn't find anything. At the end of the day, I couldn't find a reason and can only speculate that where once a 32-bit integer was used, now a 64-bit integer is the default, hence causing the large change between the outlier values I observed.

What is interesting, I find, is that the outlier numbers are very similar to the respective MAX_INT value on each system. By this, I mean that on my jessie test system, the huge outlier number (as a string of numbers) looked like approximately MAX_INT32*100 each time I ran it. The same occurred on stretch and later systems: the outlier consistently looked like MAX_INT64*100. It's a hint that I might be on the right track with my guess about an integer precision change, but it's only that: a hint.

Avoiding further speculation on the exact cause of this change, I decided to put it into the "too hard basket", accept it, and move on.

Bug fix

What's great about the earlier analysis is that I'd also stumbled upon the correct way to pass parameters into $formula_sub, and thus the fix is rather simple. Changing the calls to $formula_sub where the derivatives are calculated from

$value = $formula_sub->($xv, @parameters)

to

$value = $formula_sub->($xv, map { $_->[1] } @parameters)

resolves the problem, and the test suite now passes. Yay! 🎉

Also, $formula_sub now returns the value 0.2 when x == x0, as we calculated at the end of the "A numerical stability problem?" section. Also nice.

That's it. That's the bug fix. Quite an anti-climax in the end...

First contact

Since the dist's author hadn't been active on this project since roughly 2010, I'd guessed that he'd moved on and the email address probably wasn't active anymore. I mean, this is understandable: it's open-source software and maintainers have lives and end up doing other things. On the off chance that emails do still get through, I sent him a message ... and he replied! I was totally stoked!

After having a thorough look around, unfortunately he was unable to find any old source code repositories on his computers at home. The original source code repository used to be http://svn.ali.as/cpan/trunk/Algorithm-CurveFit/, which no longer exists. Thus we thought that the only code available now would be the tarball on CPAN. Consequently, he created a GitHub repo for Algorithm::CurveFit, with one commit corresponding to each released version. This was the best available way to reconstruct a repository for what looked like lost history.

Later, he found out that the original repository was buried within the historical archive of Adam Kennedy's Open Repository, which is hosted on GitHub. Thus, we've been able to find the original code with its commits. This is brilliant, as it means I can do a bit of repository archaeology and try to work out what the author was thinking when he implemented various things.

As yet, he's not had time to extract the commits from the Algorithm-CurveFit subdirectory of the TheOpenRepository repository into the new p5-Algorithm-CurveFit repo. Doing so would be nice as it would recreate the state of the original repository. Interestingly enough, I blogged about this process a few years ago. The short answer about how to do this is to use git subtree split.

Also, he's very kindly given me COMAINT on the project. So, as soon as the bug fix is reviewed, I'll roll out a bug-fix release.

Then I can get my old project back up and running again. 😄

Open question: the five-point stencil

Although I'm pretty sure I've stomped on the bug that sent me on this journey, I now have more questions than when I started. I'd hoped the original repo's commit messages would answer some, but it looks like the information has been lost to the winds of time.

One thing in particular that interests me is the five-point stencil calculation mentioned in the code. Was this ever implemented? From the code, it looks like it was started but never finished. The module was released as if it had been implemented, because this concept is mentioned in the CHANGELOG. When I asked the author about this, he too thought that, from looking at the code, it wasn't implemented. Of course, details are scarce because it's been a long time since he wrote the module.

The author mentioned a five-point stencil fix for Algorithm::CurveFit in a bug report in the Math::Symbolic dist, so it seems that the implementation was intended. Given that the $h and $t variables are assigned but never used around where one would expect a five-point stencil calculation, it could be that this was implemented locally but somehow never made it into the release that mentions it.

It seems that the function value returned from $formula_sub when calculating the derivative should actually be turned into a derivative via the five-point stencil method. I guess my worry is that I've solved one bug only to have uncovered another, potentially deeper issue. In other words, in certain corner cases, the module calculates incorrect values. That would be bad.
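
For reference, here's a minimal sketch (mine, not the module's code) of what a five-point stencil first derivative looks like; it estimates f'(x) from four nearby function evaluations:

# Five-point stencil approximation of the first derivative:
# f'(x) ~ ( -f(x+2h) + 8*f(x+h) - 8*f(x-h) + f(x-2h) ) / (12*h)
sub five_point_derivative {
    my ( $f, $x, $h ) = @_;
    return ( -$f->( $x + 2 * $h )
             + 8 * $f->( $x + $h )
             - 8 * $f->( $x - $h )
             + $f->( $x - 2 * $h ) ) / ( 12 * $h );
}

# Example: the derivative of x**3 at x = 2 is 12
my $cube = sub { $_[0]**3 };
printf "%.6f\n", five_point_derivative( $cube, 2, 0.01 );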

It's also entirely possible that this was merely a simple oversight and now I've got the opportunity to fix it. That would be nice 🙂

W(h)ither software?

While working on this I had all kinds of philosophical thoughts. For instance: how to maintain such projects? Should we just let them wither on the vine and die? Should we fork them? At the end of the day, who takes care of our ageing software?

I don't have answers to these questions, but I think that software that people are using should at least be able to be maintained and kept up to date. Fortunately, open-source software allows anyone to help out, meaning that it's at least possible to keep small projects like the one discussed here alive and kicking.

Wrapping up

So that's it! Another bug found, another bug squashed. Thanks for coming along for the ride!

  1. I'm available for freelance Python/Perl backend development and maintenance work. Contact me at paul@peateasea.de and let's discuss how I can help solve your business' hairiest problems.

  2. Often, the words "module", "distribution" and "dist" (short for "distribution") are used interchangeably. Yet, a "distribution" is the term for one or more modules bundled together. Often, a distribution contains a single module, hence why the terms get used interchangeably.

  3. I'm beginning to sound like a cracked record, now aren't I?

  4. I originally spotted this tip in the Debian forums.

  5. Although the vi command exists and actually runs vim, this is only vim-tiny, which runs vim in vi compatibility mode. Thus the vim command doesn't exist and to be able to use it one has to install the vim package after first fixing the sources.list and doing the update/upgrade dance.

  6. I did mention that I'm stubborn, didn't I? 😉

  7. The meaning of these elements I gleaned from the module's POD.

  8. I realised much later that almost none of the parameters had changed from their initial guesses: only c had been updated slightly.

  9. The exact number changes with each run of the test suite.

  10. Pun not intended.

  11. Note that this is the value of the function itself, rather than the value of its derivative. This is odd because, in the context where it's called, the code expects a derivative.
