I think tail call optimizations are pretty neat, particularly how they work to solve a fundamental issue with how recursive function calls execute. Functional languages like Haskell and those of the Lisp family, as well as logic languages (of which Prolog is probably the most well-known exemplar) emphasize recursive ways of thinking about problems. These languages have much to gain performance-wise by taking advantage of tail call optimizations.
Note: I don't go into excruciating details on how tail calls work in this post. If you feel like you need/want some more background on them, here are a few awesome resources:
- The YouTube channel Computerphile has a video where they walk through examples of tail-recursive functions in painstaking detail.
- A detailed explanation on Stack Overflow on the concept of tail recursion.
With the recent trend over the last few years of emphasizing functional paradigms and idioms in the programming community, you would think that tail call optimizations show up in many compiler/interpreter implementations. And yet, it turns out that many of these popular languages don’t implement tail call optimization. JavaScript had it up till a few years ago, when it removed support for it 1. Python doesn’t support it 2. Neither does Rust.
Before we dig into the story of why that is the case, let’s briefly summarize the idea behind tail call optimizations.
How Tail Call Optimizations Work (In Theory)
Tail-recursive functions, if run in an environment that doesn’t support TCO, exhibit linear memory growth relative to the function’s input size. This is because each recursive call allocates an additional stack frame to the call stack. The goal of TCO is to eliminate this linear memory usage by running tail-recursive functions in such a way that a new stack frame doesn’t need to be allocated for each call.
One way to achieve this is to have the compiler, once it realizes it needs to perform TCO, transform the tail-recursive function execution to instead use something more akin to an iterative loop in the background. The effect of this is that the result of the tail-recursive function is calculated using just a single stack frame. Ta-da! Constant memory usage.
With that, let’s get back to the question of why Rust doesn’t exhibit TCO.
Going Through the Rust Wayback Machine
The earliest references to tail call optimizations in Rust I could dig up go all the way back to the Rust project’s inception. I found this mailing list thread from 2013, where Graydon Hoare enumerates his points for why he didn’t think tail call optimizations belonged in Rust:
That mailing list thread refers to this GitHub issue, circa 2011, when the initial authors of the project were grappling with how to implement TCO in the then-budding compiler. The heart of the problem seemed to be due to incompatibilities with LLVM at the time; to be fair, a lot of what they’re talking about in the issue goes over my head.
What I find so interesting though is that, despite this initial grim prognosis that TCO wouldn’t be implemented in Rust (from the original authors too, no doubt), people to this day still haven’t stopped trying to inplement TCO in the Rust compiler.
Subsequent Proposals for Adding TCO into the Rust Compiler
In May of 2014, this PR was opened, citing that LLVM was now able to support TCO in response to the earlier mailing list thread. More specifically, this PR sought to enable on-demand TCO by introducing a new keyword become
, which would prompt the compiler to perform TCO on the specified tail recursive function execution.
Over the course of the PR’s lifetime, it was pointed out that the Rust compiler could, in certain situations, infer when TCO was appropriate and perform it 3. The proposed become
keyword would thus be similar in spirit to the unsafe
keyword, but specifically for TCO. Needless to say, this PR didn't end up getting accepted and merged.
A subsequent RFC was opened in February of 2017, very much in the same vein as the previous proposal. Interestingly, the author notes that some of the biggest hurdles to getting tail call optimizations (what are referred to as “proper tail calls”) merged were:
- Portability issues; LLVM at the time didn’t support proper tail calls when targeting certain architectures, notably MIPS and WebAssembly.
- The fact that proper tail calls in LLVM were actually likely to cause a performance penalty due to how they were implemented at the time.
- TCO makes debugging more difficult since it overwrites stack values.
Indeed, the author of the RFC admits that Rust has gotten on perfectly fine thus far without TCO, and that it will certainly continue on just fine without it.
Thus far, explicit user-controlled TCO hasn’t made it into the Rust compiler.
Enabling TCO via a Library
However, many of the issues that bog down TCO RFCs and proposals can be sidestepped to an extent. Several homebrew solutions for adding explicit TCO to Rust exist.
The general idea with these is to implement what is called a “trampoline”. This refers to the abstraction that actually takes a tail-recursive function and transforms it to use an iterative loop instead.
How about we first implement this with a trampoline as a slow cross-platform fallback implementation, and then successively implement faster methods for each architecture/platform?
This way the feature can be ready quite quickly, so people can use it for elegant programming. In a future version of rustc such code will magically become fast.
@ConnyOnny, 4
Bruno Corrêa Zimmermann’s tramp.rs library is probably the most high-profile of these library solutions. Let’s take a peek under the hood and see how it works.
Diving into tramp.rs
The tramp.rs library exports two macros, rec_call!
and rec_ret!
, that facilitate the same behavior as what the proposed become
keyword would do: it allows the programmer to prompt the Rust runtime to execute the specified tail-recursive function via an iterative loop, thereby decreasing the memory cost of the function to a constant.
The rec_call!
macro is what kicks this process off, and is most analogous to what the become
keyword would have done if it had been introduced into the Rust compiler back in 2014:
macro_rules! rec_call {
($call:expr) => {
return BorrowRec::Call(Thunk::new(move || $call));
};
}
rec_call!
makes use of two additional important constructs, BorrowRec
and Thunk
.
enum BorrowRec<'a, T> {
Ret(T),
Call(Thunk<'a, BorrowRec<'a, T>>),
}
The BorrowRec
enum represents two possible states a tail-recursive function call can be in at any one time: either it hasn’t reached its base case yet, in which case we’re still in the BorrowRec::Call
state, or it has reached a base case and has produced its final value(s), in which case we’ve arrived at the BorrowRec::Ret
state.
The Call
variant of the BorrowRec
enum contains the following definition for a Thunk
:
struct Thunk<'a, T> {
fun: Box<FnThunk<Out = T> + 'a>,
}
The Thunk
struct holds on to a reference to the tail-recursive function, which is represented by the FnThunk
trait.
Lastly, this is all tied together with the tramp
function:
fn tramp<'a, T>(mut res: BorrowRec<'a, T>) -> T {
loop {
match res {
BorrowRec::Ret(x) => break x,
BorrowRec::Call(thunk) => res = thunk.compute(),
}
}
}
This receives as input a tail-recursive function contained in a BorrowRec
instance, and continually calls the function so long as the BorrowRec
remains in the Call
state. Otherwise, when the recursive function arrives at the Ret
state with its final computed value, that final value is returned via the rec_ret!
macro.
Are We TCO Yet?
So that’s it right? tramp.rs
is the hero we all needed to enable on-demand TCO in our Rust programs, right?
I’m afraid not.
While I really like how the idea of trampolining as a way to incrementally introduce TCO is presented in this implementation, benchmarks that @timthelion has graciously already run indicate that using tramp.rs leads to a slight regression in performance compared to manually converting the tail-recursive function to an iterative loop.
Part of what contributes to the slowdown of tramp.rs’s performance is likely, as @jonhoo points out, the fact that each rec_call!
call allocates memory on the heap due to it calling Thunk::new
:
So it turns that tramp.rs
’s trampolining implementation doesn’t even actually achieve the constant memory usage that TCO promises!
Ah well. Perhaps on-demand TCO will be added to the Rust compiler in the future. Or maybe not; the Rust ecosystem has gotten by just fine without it thus far.
Links and Citations
1: https://stackoverflow.com/questions/42788139/es6-tail-recursion-optimisation-stack-overflow
2: http://neopythonic.blogspot.com/2009/04/final-words-on-tail-calls.html
3: https://github.com/rust-lang/rfcs/issues/271#issuecomment-271161622
4: https://github.com/rust-lang/rfcs/issues/271#issuecomment-269255176
Top comments (5)
Thanks for the great post. I found the tailcall package actually does offer thunk-free trampolining, for non-mutual recursion at least. Here are @timthelion's benchmarks ported over for tailcall, and the results: github.com/alecdotninja/tailcall/p...
Also, aside from this option, it seems that Iterators also offer a form of TCO, or could be used as a vehicle for such.
🔥🔥🔥
Even if the library would be free of additional runtime costs, there would still be compile-time costs.
Is TCO so important to pay this overhead?
That's a good point that you raise: is TCO actually important to support in Rust?
I think to answer that question, we'd need data on the performance of recursive Rust code, and perhaps also how often Rust code is written recursively.
In my mind, Rust does emphasize functional patterns quite a bit, especially with the prevalence of the iterator pattern. Despite that, I don't feel like Rust emphasizes recursion all that much, no more than Python does from my experience.
So perhaps there's an argument to be made that introducing TCO into rustc just isn't worth the work/complexity.