I got a chance to talk with CppCast* and one of the questions that came up was self-hosting. I thought I'd write something quickly about it, and leave it open to discussion as to the value in it.
*Sorry, no embed as they don't appear to be listed on the podcasts here. 😟
I was working on the Leaf compiler, which we talked about on the show because it was written in C++. I was asked whether my goal was to self-host it.
Self-hosting means the compiler is written in the language itself. In this case, it'd mean the Leaf compiler is written in Leaf. A lot of languages have this as a goal, such as Go or C#, whereas languages like Python and JavaScript are not self-hosting.
I said that self-hosting was never a goal of Leaf. I think that answer was a bit unexpected. Sure, ultimately it would be self-hosted, but I couldn't think of any short-term value in it. It brings no new features to the language, yet creates a bootstrapping hassle. And having another language written in C++ doesn't hurt either community, just as Python being written in C doesn't detract from either Python or C.
Do you think that having a self-hosting compiler represents a kind of maturity for the language? Is it a case of eating your own dog food? Or is it unnecessary and distracting from a language's main goals?
What do you think?
Top comments (11)
Primarily, self-hosting was a way to gain platform independence. If your language was self-hosted, then when you needed to target a new platform, you added support for that platform to the compiler, then cross-compiled the compiler for the platform with the support you just added. And voilà, you have ported it. If you have a language for systems programming, this is essential. Otherwise, how do you get the language onto a platform that just came into being?
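That porting loop can be sketched with a toy, entirely hypothetical compiler whose code generators are pluggable backends (all names here are made up): supporting a new platform means registering one more backend, after which the compiler running on the old host can cross-compile for the new one.

```python
# Toy sketch (hypothetical names): a compiler whose code generators are
# pluggable backends. Porting = add a backend, then cross-compile.

def gen_x86(a, b):
    # emit x86-flavored pseudo-assembly for "a + b"
    return [f"mov eax, {a}", f"add eax, {b}"]

BACKENDS = {"x86": gen_x86}

def compile_add(target, a, b):
    """Compile 'a + b' for the named target platform."""
    return BACKENDS[target](a, b)

# A new platform appears: register a code generator for it...
def gen_newchip(a, b):
    return [f"LOAD R1, {a}", f"ADD R1, {b}"]

BACKENDS["newchip"] = gen_newchip

# ...and the compiler running on the old host can now cross-compile.
print(compile_add("newchip", 2, 3))  # ['LOAD R1, 2', 'ADD R1, 3']
```

The real work, of course, is in the backend itself; the point is only that once the compiler can target the new machine, compiling the compiler with that target gets you a native toolchain.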
Plus, for much of the history of computing, the compiler was likely to be the most sophisticated program written in a language, so it gave a language an immediate, sophisticated testbed. It also acted as a brake on language complexity. If your language is self hosted, any feature you add you have to balance against the complexity it adds to the compiler. This biases the language towards being good for writing compilers, rather than some other class of program.
If you are working on languages that aren't for systems programming, it matters less. MATLAB, for example, started as an interpreter to access BLAS libraries, so it assumed the existence of a FORTRAN compiler and toolchain on the system (or you wouldn't have BLAS in the first place). If you have a FORTRAN compiler and you build on that language and toolchain, then cross compiling to a bare platform is irrelevant.
I wonder if it's different now that we have platforms like LLVM. By using the LLVM backend, it's easy to target a wide range of platforms.
I wonder how much of a language has to be written in itself to be considered self-hosting. I don't think one would expect that LLVM is replaced, nor the regex library, nor any complex libraries that might be used (I used libgmp in Leaf).
I never thought about how it biases a language towards compilers. Maybe that's why it feels so natural to write compilers in C++. :)
Imagine that aliens land and start selling their bare metal microcontrollers for a price we can't refuse. If my compiler actually emits machine instructions directly, then I can add the new microcontroller to it and produce a compiler for the microcontroller. That's where that benefit of self hosting comes in.
That kind of scenario is simply rare today. Our processor families are kind of entrenched. And LLVM, like FORTRAN for MATLAB, is an assumed environment. If you start assuming LLVM, there's no reason to be self hosting. That being said, self hosting languages can quickly develop LLVM backends, and many have, by treating it as a new machine to port to.
If you think it feels natural in C++, you should try the ML family (Standard ML, Haskell, OCaml). Those languages are deeply optimized for that kind of data manipulation.
Why would emitting machine instructions be better than emitting LLVM IR instructions? Is there some reason to believe that a shared IR would be harder to migrate to a new platform than an exclusive one?
Note: on Leaf I had a Leaf IR, which was already quite low-level. Unlike LLVM IR, Leaf IR still has a tree scope structure.
If you are targeting a new architecture, you probably don't have a way to translate LLVM IR instructions into machine instructions, so you'll have to do it yourself. Again, if the language assumes that you always have a mature environment where LLVM has been ported, it's irrelevant.
I think this is a good point you make. LLVM is good for targeting a family of related systems -- basically Linux, Windows, and macOS. A truly new architecture will either be folded into that family, in which case LLVM will apply, or LLVM won't really help that much.
Though it does target some unusual architectures. I think mainly you'd want to keep it for the shared manpower of optimization. But in Leaf, my IR was low enough that it wouldn't take too much effort to lower it to a target machine code (albeit, it'd be an inefficient machine code compared to LLVM).
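A naive lowering of a tree-shaped IR like that can be little more than a post-order walk that flattens expressions into stack-machine instructions -- illustrative only, this is not Leaf's actual IR, and it's exactly the kind of unoptimized output mentioned above.

```python
# Illustrative only -- not Leaf's real IR. A post-order walk flattens a
# tree-shaped IR into naive (unoptimized) stack-machine instructions.

def lower(node):
    if isinstance(node, int):       # leaf: integer constant
        return [f"push {node}"]
    op, lhs, rhs = node             # interior node: (op, lhs, rhs)
    return lower(lhs) + lower(rhs) + [op]

# (1 + 2) * 4
print(lower(("mul", ("add", 1, 2), 4)))
# ['push 1', 'push 2', 'add', 'push 4', 'mul']
```

Register allocation, instruction selection, and optimization are where LLVM earns its keep; a walk like this just gets you something that runs.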
As long as a language is Turing-complete, a self-hosting compiler is possible; after the development of mid-level languages like C and tools like yacc, putting in the effort to actually do it mostly means the language has reached the point where it can show off. It's an important milestone!
I believe it needs to be more than Turing-complete, since self-hosting would imply the ability to interact with the environment (the OS) to produce files.
I believe CSS3 is Turing-complete, though I don't imagine it'd be fun to write any kind of compiler in it. :)
Hehe, yeah, I think you'd literally have to act as the CPU to compute things in CSS3: you'd need to click for each step in the computation manually! 😅 And indeed, having fancy I/O options is not part of the definition of Turing completeness.
One cool aspect of self-hosting is that any performance improvements to the language make the compiler faster, which creates a nice positive feedback cycle. It also discourages excessive breaking changes to the language since the language developers have to migrate the compiler itself to every new version and deal with any breakages. That also gives me a degree of confidence in the stability of a language. Python, for example, has had major issues with breaking changes to the language where the upgrade path was unclear. On the other hand, Go is an example of a self-hosted language that makes a point of introducing no breaking changes and only performance improvements to language updates.
On the flip side, by not using a shared backend you lose out on all the optimizations that other people are making in that shared backend, such as LLVM.
Python's big break is one of the things that motivated my desire to build a language that would have versioned syntaxes. You can slowly improve syntax, without breaking old projects.
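Python itself has a narrow form of this idea: a module can opt in to newer syntax or semantics per file via `__future__` imports, which is roughly the per-file seed of what fully versioned syntaxes would generalize.

```python
# Python's __future__ mechanism: per-module opt-in to newer semantics.
# This module opts in to postponed evaluation of annotations (PEP 563),
# so the annotations below are stored as strings and never evaluated --
# without the import, the undefined names would raise a NameError.
from __future__ import annotations

def f(x: UndefinedType) -> AlsoUndefined:
    return x * 2

print(f(21))  # 42
```

Each file declares which behavior it wants, so old modules keep working next to new ones -- the same property versioned syntaxes would give a whole language.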