Advent of Code, the second half

#rstats #r #adventofcode

So Advent of Code 2019 ended last week, and I got all 50 stars. The challenges became considerably more challenging compared to the first half, but base R did allow acceptably efficient solutions in almost all cases. My code is still on GitHub - here’s what I learned about R by writing it! 🎄

Tail recursion grows the stack

I kind of knew this already, but while R’s functional style invites using recursive functions, you will run into trouble if you recurse too deep. The simple reason is that each recursion will add to the call stack even when the recursion is the last expression of the function body (tail position). Many programming languages optimise for this situation to prevent stack growth, but R does not.

In particular, naive recursive solutions to the maze puzzles of Day 18 and Day 20 often led me into the dreaded Error: C stack usage XXXXXXX is too close to the limit message. There were hacky ways around this which allowed me to get the 🌟🌟 anyway, but the real solution I eventually found in this excellent blogpost by Alain Dipert. It’s called the trampoline and it implements recursion as iteration. I cannot possibly do a better job than Alain explaining it, so go and read his article!

Related to this, using deep recursion on several of the Advent of Code days made it apparent that creating unnecessary temporary variables in the function body will grow the size of each execution environment, and thereby the total memory usage of R.

Consider:

countdown <- function(n) {
    if (n > 0) {
        tmp <- runif(10000)
        countdown(n - 1)
    } else {
        gc()
    }
}

countdown(100)

## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 543979 29.1 1233580 65.9 641291 34.3
## Vcells 2017084 15.4 8388608 64.0 2536825 19.4

countdown(1000)

## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 550718 29.5 1233580 65.9 641291 34.3
## Vcells 11018937 84.1 15315970 116.9 11103124 84.8

Comment out the tmp assignment to see that it is indeed tmp which causes the difference.

So why use recursion at all in R, if it requires all these extra steps and considerations? For me, it is simply a better fit to the way my brain works when parsing a problem, and I am less prone to making mistakes as compared to using while/for flow control with constantly updated variables. As a bonus, recursive functions are also easier to unit test!

<<- can be very convenient

In my professional coding life I have rarely used the <<- operator, which allows you to assign a variable in the enclosing environment of a function. Because of R’s lexical scoping, this is always the environment in which the function was defined, not the one in which it was called. But unfortunately <<- causes functions to have side-effects, making code using <<- hard to read and to debug.

Yet in Advent of Code, I used it liberally. Often as a timesaver, so I didn’t have to pass variables around as function arguments. But on Day 14, it became also a lifesaver for keeping a common state between recursive function calls.

Consider this example:

countdown_with_offshoot <- function(n) {
    if (n > 0) {
        if (sum(vals) > 100) {
            vals <<- sample(vals, floor(length(vals) / 2))
            countdown_with_offshoot(sum(vals > 1))
        } else {
            vals <<- c(vals, sample(10, 1))
        }
        countdown_with_offshoot(n - 1)
    }
}

vals <- numeric(0)
countdown_with_offshoot(100)

This will sometimes launch an additional countdown based on numbers generated inside the recursive function calls. Because of lexical scoping, the enclosing environment is the same no matter how deep the recursions run, and <<- will therefore always assign to the same variable.

Can this be solved without <<-? Of course. But you will need to both pass vals along with the countdown itself, and return it when the offshoot countdown ends. And every function execution environment will keep its own state of vals in memory. <<- simplifies the code greatly and reduces memory use. For a one-off coding puzzle, that is a more than acceptable trade-off!

R does not support large integers

Day 22 was a mathematical puzzle requiring modular exponentiation of large integers. To my surprise, I was unable to solve it in base R by implementing modular exponentiation myself, because there is no support for very large integers! The integer type is 32-bit, and the double type has imperfect precision above 2**53.

## Explicitly casting to integer will yield NA
as.integer(2**53)

## Warning: NAs introduced by coercion to integer range

## [1] NA

## Below 2**53 a double represents integers accurately
(2**52 + 1:10) %% 2

## [1] 1 0 1 0 1 0 1 0 1 0

## Above 2**53 computations become innaccurate
## There is no warning for this behaviour
(2**53 + 1:10) %% 2

## [1] 0 0 0 0 0 0 0 0 0 0

I had to resort to the {gmp} package here, a wrapper around the GNU Multiple Precision Arithmetic Library library which does allow accurate computations with large integers. To install on Debian derivatives, run sudo apt-get install libgmp-dev prior to installing the R package.

(gmp::as.bigz(2**53) + 1:10) %% 2

## Big Integer ('bigz') object of length 10:
## [1] 1 0 1 0 1 0 1 0 1 0

Fortunately, {gmp} also has modular exponentiation built in, as gmp::powm(), so I could skip the manual implementation altogether!

More for the wishlist

Last time I complained about R not allowing key-value lookups using keys of any type, like Python’s dictionaries do allow (well, for immutable types). I ran into this problem multiple times again, although in the case of (x,y) coordinates I sometimes used complex numbers instead in a single data frame column, on which I could then filter to retrieve the associated values. But that is a rather limited use case.

Given that most of the Advent of Code challenges were general computing and not data science puzzles, R’s limitations in this regard were exposed as the problems became more complex. One-based indexing was sometimes slightly annoying when relying on modulus operations to access a vector, but this is unlikely to ever change in R. Immediate unpacking of multiple return values, however, would be a nice feature. The {zeallot} package implements this but it would great to see it in base R, so we don’t always need to construct and pick apart list objects.

For instance, from the documentation of {zeallot}:

library(zeallot)

c(lat, lon) %<-% list(38.061944, -122.643889)
print(lat)

## [1] 38.06194

Likewise I was often bothered by how heavy-handed it feels to define a class in R, or at least some other type of structural template for pieces of non-tabular data which belong together. When using R for data science this is rarely a problem, for general computing some syntactic elegance would be welcomed - perhaps something similar to defrecord in Clojure. In practice I usually just relied on a loosely defined list instead, but this becomes hard to manage for complex code.

Oh, and give me graphs

Around 11 out of 14 minutes of the total running time for all 25 days of my Advent of Code solutions are spent on Day 18 and Day 20, both maze pathfinding puzzles. They were solved through a combination of random walks where visited paths are tagged as such, and recursive functions. Other people, using other languages, appear to have relied on graphs a lot, where the shortest path in a maze is the shortest path across the nodes of a graph.

I could have tried to implement the main algorithms used like Dijkstra and BFS in base R, but I suspected that it would be too inefficient, without access to {Rcpp}. For implementing such algorithms natively a language like Julia would shine - not R. Now that Advent of Code is over and my self-imposed restriction to base is over though, I’d love to try and see how well {igraph} would perform in R on these particular challenges!

But that’ll be something for the new decade. HNY! 🎆