Marianne

Posted on Apr 14, 2021

Programs Split Over Multiple Files (featuring Troels Henriksen)

#programming #module #compilers

Summary

When thinking about how to create a language where little models can be combined into bigger, more complex system models, Marianne struggles to understand why not to take the completely straightforward approach of importing files. While searching for a good explanation she comes across the official blog of Futhark and decides to interview its lead on their design decisions.

Links

Design Decisions I Do Not Regret

r/ProgrammingLanguages

Fault-lang WIP Compiler

Bonus Content

Become a patron to support this project.

Transcript

MB: Back when I started this series, Thorsten Ball made a comment that stuck with me:

MB: Is there something that you would really like to see people like add to Monkey that hasn't happened yet?

TB: I would be interested to see, like, a really well thought out module system.

MB: And I thought to myself …. what, is that hard? Oh crap…. that’s probably really hard isn’t it?

TB: And I feel like that is…. I don't even know what that means, you know?

MB: I added this to my research list (figure out module systems) and surfed over to my favorite community for programming language design advice: r/programminglanguages. This subreddit is full of great stories and people will give detailed explanations and encouragement, which is rare on the internet these days.

MB: While reading through conversations on r/programminglanguages I came across a blog post on the official site for this niche functional programming language called Futhark. Futhark is an experimental programming language for high performance computing developed at the University of Copenhagen. Because of that, their blog is full of interesting design discussions, including a post entitled Design Decisions I Do Not Regret which, among other things, goes into the design of Futhark’s module system.

MB: So I asked the author of this post, Troels Henriksen, to chat with me about why he created a programming language and how he thinks about these decisions.

TH: I think one of the problems was that when you are a bank, every day you must figure out how much money you have. And in a modern bank, it's not just about counting how much money you have in your vault. It's about counting how much your various contracts and options and derivatives are worth. And often there is no easy answer to that, because that's actually a probability distribution: if you have an option that allows you in two weeks to buy some stock at some price, then the value of that option depends on what that stock may be worth in two weeks.

MB: Yeah

TH: And you have to figure that out. You basically have to do a Monte Carlo simulation: look at the various ways the stock market can go and then take the average or something. I don't fully understand all the mathematics behind it.

MB: (laughs)

TH: Yeah, and imagine if you had to do that when you're looking at a banknote…. But that means that when you want to price options, you need to be able to do a lot of computation very quickly. You have one day to figure out how much it's worth, and then it's kind of obsolete information. So there was interest in doing high-performance functional programming. And that was kind of as much of a plan as there was. And then I got involved, as I was a TA in a compilers course in my department. One of the teachers was involved in this project, and he recruited students to try and work on a language that was supposed to be for GPU programming in a functional style. I’d been a TA in the course, and I didn't like the way he had designed the compiler. So I wanted to show him a better way of doing it. I took his very early draft of a syntax tree and rewrote it and made it better. And then I just kind of went on from there in my free time and created a very small compiler. It generated sequential code and wasn't very fancy. And then he hired me as a student programmer and eventually as a Ph.D. student.

MB: But like, what does the average user use it for? Did it end up being used for the use case you thought it was going to be used for, like banks…?

TH: I don't think there are any actual banks that use it, because we can't really expect that: banks are conservative and it's a research project. It's not supposed to be a product. But Futhark was, kind of by accident, designed to be really good at Monte Carlo simulations. And it is still useful for that to some people. But I would say the average user is either a student who has been forced to use it for a project…

MB: (laughs)

TH: …or an independent researcher who is trying to investigate some new kind of model where, sure, they could write it up in Haskell, but then they would not be able to validate their model because it would run too slowly. And the model is too unusual to easily express in TensorFlow or in raw primitives like GPU linear algebra functions. So Futhark is a nice compromise where you have a lot of flexibility to just change things and try things out, but it still runs pretty fast on GPUs or multicore CPUs and so on.

MB: So how do you approach making design decisions for the language that you're working on now, or for languages you worked on in the past? You mentioned that the original spec had a compiler design that you didn't think was all that efficient, right? So you chose a different design. How do you approach that decision-making process?

TH: So Futhark has actually changed quite a lot in superficial terms since it started. It started out being a very crude first-order language that kind of resembled Standard ML a little bit, but it had parentheses for application and so on. It was kind of nasty. It was designed by my former advisor, who is an excellent compiler engineer, but his previous life was writing Fortran compilers in C++.

MB: Nice!!!

TH: We disagree on certain points. So gradually the language became what it is now, which is more Haskell-like, more classically lambda calculus style. And usually the way it went, since we don't try to innovate that much in languages, was that we would just use the language a little bit. And then we said, well, this sucks, this looks really ugly, can we do better? And we just looked at mainstream languages, and of course Haskell is mainstream in that sense, and we figured out what we could do better, inspired by these languages.

TH: So we didn't actually innovate a lot; we just copied from other languages. And I think that's actually a good way of designing a language, because the design space is enormous and the vast majority of design decisions are terrible, or have terrible interactions with other design decisions. So innovating, actually doing something completely novel, is really dangerous unless the novelty is exactly what you're going for.

MB: Yeah

TH: Something like the syntax and the type system and so on, that wasn't really what we were trying to innovate on, at least not initially. So we just looked around for things that people had already tried and used for decades, where we knew what worked and what didn't. And then we would tweak them maybe a little bit, because existing languages have things with flaws that weren't realized initially, and they can't change them because of backwards compatibility, but we can tweak them a little bit. But we intentionally tried not to innovate too much in areas outside of our core competency.

MB: I’ve heard some variation on this advice from several people so far: pick something you like in another language and copy the implementation to start because figuring out all the edge cases from scratch is really hard. And for the most part it has been advice that I have followed.

MB: But the problem with module systems is that even after reading a bunch of arguments about it … it just …. seems easy? Right? Like I’m sure if you want to do something with object inheritance and complex namespaces it’s harder, but at the end of the day isn’t it just importing code from another file into your program? I know the complications must be there, but I can’t really see them for myself.

MB: The debate comes down to this: you can do imports using file path strings, maybe just the filename if the main file and the code being imported are in the same directory, or a relative path using dots and slashes to define where in the file system tree the code to import lives.

MB: Or you could do something more sophisticated where the module location is defined relative to something other than the file where the import is taking place, perhaps the application root directory.

MB: Obviously most mature general purpose programming languages use the second approach, but I would prefer the simplicity of the first approach to be honest…
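To make the two options concrete, here is a minimal Python sketch of how each kind of import might be resolved to a file. The function names, the `.src` extension, and the example paths are all hypothetical and are not taken from Fault, Futhark, or any other language discussed here.

```python
import posixpath


def resolve_relative(importing_file: str, import_string: str) -> str:
    """Approach 1: the import string is a path relative to the importing file."""
    base_dir = posixpath.dirname(importing_file)
    return posixpath.normpath(posixpath.join(base_dir, import_string))


def resolve_from_root(project_root: str, module_name: str, ext: str = ".src") -> str:
    """Approach 2: a dotted module name is mapped to a path under a fixed root."""
    return posixpath.join(project_root, *module_name.split(".")) + ext


# Both calls name the same (hypothetical) file.
print(resolve_relative("models/net/router.src", "../shared/clock.src"))
print(resolve_from_root("models", "shared.clock"))
```

In the first approach the answer depends on where the importing file lives; in the second it depends on a root that has to be configured or discovered somehow, which is where much of the confusion around import paths tends to come from.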

MB: From my vantage point, being able to split a system specification into smaller parts means you get to reuse those parts and build progressively more complex systems in easily digestible chunks. So I've been trying to figure out exactly what the lay of the land is on this: where are the complications? And that's how I came across the stuff that you had been writing about your own journey through this, which I found really easy to read and compelling, which is not necessarily the case for writing about programming language design. So I'm interested in hearing more about how you came to the decision you did, and why people look at that simpler file import pattern and go, oh, no, when you get to larger programs, that's a disaster, so don't do that.

TH: Yes, so, perhaps we should clarify what you mean by module system, because there are actually two distinct things here. One is modules as a thing that allows us to separate a program into multiple files, which is, of course, both desirable and probably also necessary. And then there is an orthogonal concept, which is about how you can structure your program in terms of namespaces and also more advanced modular features, but namespaces are probably what most people associate with module systems. These are orthogonal.

MB: Yeah so… Quick question. Is there an actual specific term for one versus the other or is it literally that is the same term? Because I've had this conversation multiple times where I'm like, “How do I build a module system?” And everyone is like: “Well, what do you mean by module system?”

TH: Yes, no, I have actually been thinking about that ever since you invited me to this interview. And unfortunately, there is no term for splitting things into files, at least not academically.

MB: OK, good.

TH: I think it's because academics consider it a boring problem or haven't discovered why it might be interesting. In ML they have something called the basis system which does this, but I think that's a very ML-specific term. And I would say it's not one of the most successful parts of the design, so let's ignore that. But I think it's because you can always take a program written across multiple files and then just concatenate them and pretend that it was one program all along.

MB: True.

TH: Of course, it's not practical, but that's a separate thing.

TH: OK, so there are these two things. And they are, of course, related in most languages, and in many languages you only have the file division, which is also the unit of modules. Java is complicated because you have the split into files, which I think is called packages in Java. But then classes in Java also behave a lot like modules, because each encapsulates a namespace that prevents names from clashing, and they do access control on names. So that's a bit like modules. In Futhark, what we have is an ML-style module system. It's taken almost verbatim from Standard ML, but with a few extensions in some places, in particular a syntax improvement. And that is a very odd module system. ML doesn't talk about files at all, so the module system is built with the assumption that your program is just one big file.

MB: Mm hmm.

TH: And then the module system can do the usual stuff: you can create a module and you can have definitions inside of that module, and then whenever you want to access one of those definitions, you have to prefix it with the module name, just the usual namespacing thing that you have in any other language. But then ML also adds signatures, which allow you to say, I have this module and this module implements this signature, and then you can only access those parts of the module that are defined in the signature. That allows you to hide names and make types abstract. So that's how it does information hiding.

MB: Like Public, private. And you have to export it essentially.

TH: Yeah, kind of. Except that it's kind of separated from the module. Modules and signatures are separate, and you can have multiple modules that implement the same signature. So it's kind of like interfaces and classes, except that you can't instantiate a module; that doesn't make any sense. It's also a little bit like header files and .c files in C, of course, but C is a mess for other reasons.
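The idea of a signature restricting what is visible from a module can be imitated, very loosely, in Python. This is only a sketch: ML checks signatures statically at compile time, whereas the hypothetical `seal` helper below does it at run time, and every name in it is made up for illustration.

```python
from types import SimpleNamespace


def seal(module, signature):
    """Return a view of `module` exposing only the names listed in `signature`."""
    missing = [name for name in signature if not hasattr(module, name)]
    if missing:
        raise TypeError(f"module does not implement: {missing}")
    return SimpleNamespace(**{name: getattr(module, name) for name in signature})


# A hypothetical "module": a namespace of definitions, one of them internal.
counter_impl = SimpleNamespace(
    start=0,
    step=lambda n: n + 1,
    debug_dump=lambda n: f"counter={n}",  # meant to stay hidden
)

# The "signature" lists the names that users of the module may see.
COUNTER_SIG = ("start", "step")
Counter = seal(counter_impl, COUNTER_SIG)

print(Counter.step(Counter.start))
print(hasattr(Counter, "debug_dump"))
```

Because the signature is separate from the implementation, several different implementations could be sealed with the same `COUNTER_SIG`, which mirrors the point that multiple modules can implement one signature.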

TH: But then ML goes even further and adds this notion of functors, or parametric modules, which is kind of like a function at the module level, where you can say: if you give me a module that implements this signature, then I will give you back a module that implements another signature. So, for example, if you give me a module that implements a signature for numbers, then I will give you back a module that implements various linear algebra operations on arrays that contain that kind of number. So that's a way of doing generic programming at the module level that ML pioneered in the 70s and 80s, and it is an aspect of ML that hasn't really been imitated in many other places, because it is kind of a complicated system, and many other languages do this with type classes or traits instead.
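The linear algebra example can be sketched in Python by treating a namespace of definitions as a "module" and an ordinary function as the "functor". This is an approximation only: real ML functors are checked against signatures at compile time, and the names `LinearAlgebra`, `IntNum`, and `BoolNum` are invented for this illustration.

```python
from types import SimpleNamespace


def LinearAlgebra(num):
    """A 'functor': takes a number module and returns a linear algebra module."""

    def dot(xs, ys):
        total = num.zero
        for x, y in zip(xs, ys):
            total = num.add(total, num.mul(x, y))
        return total

    def scale(k, xs):
        return [num.mul(k, x) for x in xs]

    return SimpleNamespace(dot=dot, scale=scale)


# Two modules implementing the same informal "number" signature: zero, add, mul.
IntNum = SimpleNamespace(zero=0, add=lambda a, b: a + b, mul=lambda a, b: a * b)
# Booleans as numbers: "add" is logical or, "mul" is logical and.
BoolNum = SimpleNamespace(zero=False, add=lambda a, b: a or b, mul=lambda a, b: a and b)

IntLA = LinearAlgebra(IntNum)
BoolLA = LinearAlgebra(BoolNum)

print(IntLA.dot([1, 2, 3], [4, 5, 6]))
print(BoolLA.dot([True, False], [True, False]))
```

The same `dot` and `scale` code is written once and reused for every number module, which is the kind of generic programming that type classes or traits provide in other languages.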

MB: So what do these kinds of constructs prevent? Because if we have a module system that just concatenates multiple files together and treats them as one program, it's very easy for me to understand how that could go wrong. You have namespace issues right off the bat, right? But you could create a map that breaks namespaces down into scopes really easily. That seems to me to be not a super hard thing to do. So why does it seem to have more complexity than that? Or do we just refer to it with fancy terms that make it sound like it has more complexity than that, when ultimately that's all that's going on?

TH: No, I mean, the main reason why we like this idea of splitting things into files is to enable separate compilation. The namespace thing, I think, is easier to solve technically: you can just tag everything with a unique identifier and then make symbol tables, and you can handle that easily enough. But separate compilation is, of course, critical for any large program. You don't actually want to recompile everything all the time. And files are a very natural unit to split things up into. So every source file becomes an object file or a class file or whatever. And then only when you change the source file do you have to regenerate the object file, and you combine them all at the end to get a program that works. And I think almost all languages work that way, even if they are trying to hide it.
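The "only regenerate what changed" idea can be sketched as a small build cache in Python. This is a deliberately simplified illustration: `compile_unit` is a stand-in rather than a real compiler, the file names are hypothetical, and a real build system would also have to recompile files whose dependencies changed, which this sketch ignores.

```python
import hashlib


def compile_unit(source: str) -> str:
    """Stand-in for compiling one source file into one object file."""
    return f"OBJ({source.upper()})"


def build(sources: dict, cache: dict) -> tuple:
    """Recompile only files whose contents changed, then 'link' everything.

    `cache` maps filename -> (content hash, object) and is updated in place.
    Returns (linked program, list of files that were recompiled).
    """
    recompiled = []
    for name, text in sources.items():
        digest = hashlib.sha256(text.encode()).hexdigest()
        if name not in cache or cache[name][0] != digest:
            cache[name] = (digest, compile_unit(text))
            recompiled.append(name)
    program = "+".join(cache[name][1] for name in sorted(sources))
    return program, recompiled


cache = {}
sources = {"a.src": "let x = 1", "b.src": "let y = 2"}
_, first_build = build(sources, cache)

sources["b.src"] = "let y = 3"  # edit a single file
_, second_build = build(sources, cache)

print(first_build)
print(second_build)
```

On the second build only the edited file is recompiled; the other object is reused from the cache and both are combined at the end.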

TH: So this idea of just concatenating all the files: there are two ways of doing it, the C style and the ML style. In the ML style, sure, conceptually you are concatenating all of the files, but they are still all individually type checked and parsed on their own before the compiler starts putting them together and then doing code generation or whatnot. Whereas in C, this concatenation of files really is just inserting the contents of the header file, or whatever else, since you can include anything you want; it doesn't have to be a header file. It just gets put in, and if you include the same file multiple times you get multiple copies of the file, unless you take precautions to avoid that with pragmas or the ifndef trick that is ever present in C code. And only after that inclusion has been done does the C parser get to run. So you can have files that look syntactically incorrect on their own, but after all the inclusion is done, they are suddenly well typed and syntactically correct.
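The C-style behaviour described here, purely textual splicing that happens before any parsing, can be mimicked with a toy preprocessor in Python. This is a sketch under simplifying assumptions: it only understands lines of the exact form `#include "name"`, the `once` flag stands in for include guards or `#pragma once`, and the file contents are hypothetical.

```python
import re

INCLUDE = re.compile(r'^#include "([^"]+)"$')


def expand(name, files, once=False, seen=None):
    """Textually splice included files in place, before any parsing happens."""
    seen = set() if seen is None else seen
    if once and name in seen:
        return []  # include-once behaviour: skip repeated inclusions
    seen.add(name)
    out = []
    for line in files[name].splitlines():
        match = INCLUDE.match(line)
        if match:
            out.extend(expand(match.group(1), files, once, seen))
        else:
            out.append(line)
    return out


files = {
    "point.h": "struct point { int x; int y; };",
    "shape.h": '#include "point.h"\nstruct shape { struct point origin; };',
    "main.c": '#include "point.h"\n#include "shape.h"\nint main(void) { return 0; }',
}

naive = expand("main.c", files)
guarded = expand("main.c", files, once=True)

print(naive.count("struct point { int x; int y; };"))
print(guarded.count("struct point { int x; int y; };"))
```

Without the guard, `point.h` is pasted in twice because it is reached both directly and through `shape.h`; with the guard it appears once. Nothing in `expand` checks whether an individual file makes sense on its own, which is the contrast with the ML style, where each file is parsed and type checked separately.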

MB: Okay that made sense: separate compilation, and figuring out when the parsing and type checking happens. I’m beginning to feel more comfortable with this now.

MB: I hate working in languages that have confusing import path schemes. This is my number one, biggest, I-am-annoyed-by-the-way-you-do-this-in-this-language thing when first learning a new language: knowing where the file is on the computer and not being able to figure out how to get it to import correctly.

MB: So my gut instinct here is to just make it so that people can go ../ ../ here's where my file is, please import this file. That to me is the easiest and most intuitive thing. And the feedback I keep getting, in this very cryptic fashion, is: “But if you do that, people can't write large programs easily.” And for the life of me, I do not understand why people believe that to be the case. Like, why would it make a difference? I feel like it's just an interface versus like… I don't know why I get that feedback. So I was interested in your perspective, having also made a decision, from my understanding, to go with dot-dot-slash rather than more abstract import paths.

TH: Well, I agree with your initial analysis. I think no one enjoys learning about import behavior. So when we initially designed the file inclusion mechanism, that was again one of those things where we really had no ambition to improve on the state of the art. This was not our research problem. We just needed some way to have two files in one program. And so we looked at various ways of doing it. Initially we did something Python-like, where you write the name as kind of a variable name with dots, and then that gets translated into a file name by some built-in rules. And eventually we just looked at it and said, why isn't this just a string literal giving the filename? It is a file in the end, so why are we trying to hide it? And there are languages, or have been languages, that run in more advanced settings where you don't have files, i.e. image-based systems like Smalltalk or Lisp machines, or Unison for a much newer example, that are trying to innovate and say, well, what if we didn't have files in our language? That's cool, and I think it's very worthwhile research. But unless that's what you are doing, don't hide the files. I mean, they are files, and people are going to be frustrated that they can't just access them. And then, of course, maybe in 10 years when we are all running on the Unison VM, maybe I'll come up with another syntax so you can import from some object storage based on content-based hashing or whatever. But for now, files are fine.

MB: Yeah.

TH: So the reason why I think some people suspect it doesn't scale is that, if you are an extremist about it like I am, then as in Futhark the file name is relative to the importing file.

TH: So that means when you move files around, you may have to change the import strings to reflect where the files are now located relative to the importing file. And I guess people think that will be too much of a bother. Well, first, I think it's pretty easy to automate: you can write quick scripts to fix this just based on your understanding of hierarchical file systems, which are pretty universally understood by every programmer. And second, if you really have tons of imports in every file, like dozens, which you might in a real-world system (Futhark is for small programs, so we don't have that, but I can see that it might happen), then I think you can address that as long as you have a nice mechanism for re-exporting multiple files from one file, what we call open imports. I guess other languages might have other names for them. If you can do that, then you can just have one file in some single known location that includes a bunch of other files based on that location. And that's the single file that is included by all the others. That means you have less to update when you move stuff around.

TH: But I don't know if it's really that common to constantly move files around and fix include statements all the time. I've never worked on truly enormous systems, so I don't know, maybe I'm just talking out of ignorance. But I can't believe it's that big of a problem.
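The kind of quick script mentioned above, one that fixes relative import strings after a file moves, could look roughly like the following Python sketch. The `retarget` function and the example paths are hypothetical; a complete tool would also need to find the import statements in each file and update files that import the one being moved.

```python
import posixpath


def retarget(import_string: str, old_location: str, new_location: str) -> str:
    """Recompute a relative import string after the importing file has moved."""
    old_dir = posixpath.dirname(old_location) or "."
    new_dir = posixpath.dirname(new_location) or "."
    # Work out which file the old import string pointed at...
    target = posixpath.normpath(posixpath.join(old_dir, import_string))
    # ...then express that same file relative to the new directory.
    return posixpath.relpath(target, new_dir)


# The importing file moves up one directory.
print(retarget("../lib/stats", "src/app/main.fut", "src/main.fut"))
# The importing file moves two directories deeper.
print(retarget("lib/stats", "src/main.fut", "src/app/deep/main.fut"))
```

If most files import one re-exporting hub file instead of many individual files, each moved file has only a single import string to run through a function like this.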

MB: I mean, I also don't know how often people just move stuff around. I spend a lot of time professionally working on the maintenance of old systems, and generally speaking, getting anybody to do any kind of re-architecting or redoing of code that's already in a file somewhere is like pulling teeth. So I have a hard time believing that people arbitrarily rearrange their files on a regular basis. But I'm also not sure that you don't end up with the same problem when you're using a more complicated import path syntax and also moving files. It's not clear to me that the problem goes away when you have a more complicated file import scheme.

TH: Another reason that people think it might not scale is because of what it looks like. When you design a language, you are probably designing a language that is better at some specific problem. And probably you are to some extent motivated to design a language that looks a lot like the languages that are already solving roughly that problem. Of course, you will try to do it better. And if you think about languages that use filenames to import other files, those are messier languages like PHP and shell scripts, and to some extent C, where it is also a little bit messy. Whereas languages that we acknowledge are better designed, like Java or Haskell, use some more abstract notion. They use package names or module names which are translated to files eventually by some mechanism, but they don't just put a string there. So you end up thinking that well-designed languages don't refer to files. But I don't think the problem with PHP is that.

MB: (laughs) It's certainly not the largest problem with PHP.

TH: The problem with PHP and shell script is that when you include other files, sure, you can put a string literal there, but you can also put a piece of code that computes which file to include, and that's, of course, madness. Then you cannot analyze it statically. But in Futhark, you put a string literal in the import statement, and that is syntactically a literal; it cannot be an expression. So it might look like it's the same thing as in PHP, but it isn't. That’s full static knowledge, nothing fancy going on. You cannot do a dynamic import or anything like that.
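One consequence of imports being plain string literals is that a tool can work out the dependency graph without running any code. The Python sketch below shows the idea on a made-up mini-syntax in which an import is a line of the form `import "path"`; the regular expression and the sample files are assumptions for illustration and do not implement Futhark's actual grammar.

```python
import re

# Matches a whole line of the form: import "some/path"
IMPORT = re.compile(r'^[ \t]*import[ \t]+"([^"]+)"[ \t]*$', re.MULTILINE)


def imports_of(source: str) -> list:
    """List the import strings in a source text without executing anything."""
    return IMPORT.findall(source)


def dependency_graph(files: dict) -> dict:
    """Map each file name to the import strings it contains."""
    return {name: imports_of(text) for name, text in files.items()}


files = {
    "main": 'import "lib/stats"\nimport "lib/linalg"\nlet answer = 42',
    "lib/stats": 'import "linalg"\nlet mean = 0',
    "lib/linalg": "let dot = 0",
}

print(dependency_graph(files))
```

If the import target could be an arbitrary expression, as with a computed include path, no scan like this could know which files a program depends on without evaluating that expression first.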

MB: So…. this is going to be the last episode for a while. I’ve made some design decisions I feel really good about, but it’s clear that the only way to validate them is to write code and try things out. I’ll be back in a couple of months to let you know how that went and do some research into optimizing the compiler. But I really just need to just be heads down hands on a keyboard for a while on this.

MB: If you want to track my progress, or maybe send me a pull request to correct all my stupid mistakes, I’m pushing code to a repo under the GitHub org named Fault-lang. If you’d prefer a less intimidating way to participate in the development of my fledgling language, I got one additional great piece of advice from Professor Henriksen:

TH: Oh, and another piece of advice, find a cute animal mascot for your language.

MB: I know! I've been really thinking about that, thinking very hard about that.

TH: I have my Futhark cup here that one of the PhD students did.

MB: Is it….?

TH: With a hedgehog

MB: Oh, it's a hedgehog! I first thought it was a monkey. And then I was like, no, it looks like a hedgehog… It is a hedgehog?

TH: Yes.

MB: Ah cute! Yeah, I know. I have to figure out what our cute animal mascot is going to be because that's really… that's really critical to a language’s success or failure is the ability to have an attractive logo and a cute mascot.
