Among my favorite subjects are the ways in which human language differs from the code we write for computers and how to avoid the evils of excessive complexity, both of which I've written about extensively in previous articles. This one is in part a retrospective to explain where these passions came from.
Background
If you Google "custom programming language" you will find a wealth of information about how to write your own programming language, as well as long lists of languages that have already been written. Most of these are completely unknown to all but their creators, so why do people write them and what are they used for? In my case it started some 30 years ago, but before I head off down Memory Lane, first a couple of paragraphs about computer languages in general.
Languages can be categorized in different ways, and one is to class them as general purpose or domain-specific. All the mainstream programming languages, e.g. C/C++, Java, Python, JavaScript and PHP, are general purpose: designed to address any computing problem. Custom general-purpose languages exist, but they tend to do things in much the same way as the mainstream ones.
Most custom languages are domain-specific languages (DSLs); that is, they are designed to address a specific application area, or domain. A few of these are well-known; HTML and SQL are probably the best examples. Some custom languages have been written to perform general-purpose duties but using a non-English vocabulary; others are heavily symbolic and are designed for use by mathematicians. Some are simple, others incredibly complex and one or two are just designed to mess with your mind; the programming equivalent of Klingon.
Many programmers regard writing a custom language as either unnecessary or in some way "outside" normal programming (or both), but a well designed DSL can separate high-level concerns from implementation details more cleanly than any other software technique. Syntaxes that comprise mostly space-delimited words (think of SQL in its most basic form) are quite easy to handle if they are sufficiently well constrained. They can either be embedded in other code (like Excel macros) or act as a wrapper around it and they tend to map well to the requirements expressed by domain owners.
Origins
I was first drawn into the world of domain-specific languages about 30 years ago, when I was required to write software for 8-bit microprocessors to control audio equipment such as mixers and routers in a radio studio. The coding for each piece of equipment had many similar features but the details varied wildly and were subject to frequent change. At the time there were no decent C compilers for such modest hardware and assembly code was very hard to write, understand or modify. The solution was to take a path I've followed ever since: to devise a cut-down, unambiguous form of English that expressed precisely what the program should do. This would be readable not just by myself and other programmers but also by the studio engineers, who were not coders but understood better than anyone the domain they worked in and knew what they wanted.
The next step was to write a compiler that would take these instructions and turn them into intermediate code - an efficient, compact representation of their meaning - and a runtime interpreter that would execute the intermediate code on the target hardware.
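As a rough illustration of that compile-then-interpret split, here is a minimal Python sketch. It is not the original compiler or runtime; the opcode numbers, the tuple format and the function names `compile_script` and `run` are all assumptions made for the example, and only a handful of commands in the style of the studio language are handled.

```python
import shlex

# Hypothetical opcode numbers; the original intermediate format is not documented here.
OP_SET, OP_CLEAR, OP_PROMPT, OP_STOP = range(4)


def compile_script(source):
    """Translate script text into a list of (opcode, operand) tuples."""
    program = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        words = shlex.split(line)  # keeps "quoted text" together as one word
        if not words:
            continue  # skip blank lines
        if words[0] in ("set", "clear") and len(words) == 3 and words[1] == "output":
            opcode = OP_SET if words[0] == "set" else OP_CLEAR
            program.append((opcode, int(words[2])))
        elif words[0] == "prompt" and len(words) == 2:
            program.append((OP_PROMPT, words[1]))
        elif words == ["stop"]:
            program.append((OP_STOP, None))
        else:
            raise SyntaxError(f"line {line_no}: cannot parse {line!r}")
    return program


def run(program):
    """Execute compiled code; return the active outputs and the prompts shown."""
    outputs, prompts = set(), []
    for opcode, operand in program:
        if opcode == OP_SET:
            outputs.add(operand)
        elif opcode == OP_CLEAR:
            outputs.discard(operand)
        elif opcode == OP_PROMPT:
            prompts.append(operand)
        elif opcode == OP_STOP:
            break  # nothing after "stop" is executed
    return outputs, prompts


script = '''
set output 0
set output 288
clear output 0
prompt "System ready"
stop
'''
code = compile_script(script)
print(code)
print(run(code))
```

The point of the split is that the compiler can do all the text handling and error reporting on a comfortable machine, leaving the runtime with nothing to do but step through compact tuples, which is the kind of job a small target processor can manage.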
The results were successful in that they solved the problem at hand. The language itself is long forgotten but it gave me insight into what programming is all about; insight that is often lost in our technology-led rush to adopt and apply the latest techniques. We use hydraulic presses to crack nuts rather than think about what simpler tools might do the job equally well.
Future-proofing
A second division of programming languages is between those that are readable by non-programmers and those that aren't. To me this is far from a trivial point. It is very bad practice to write code that cannot be read by anyone but yourself, unless you can be sure the code will never have to be read by anyone except yourself. And "anyone" also includes your future self; a person who has long forgotten the details of the project you are working on now and who will not thank you for requiring excessive effort to figure out what it all means.
Just about the only thing that can be guaranteed about the future is that they'll all still speak English. And that brings me back to my old radio studio code, where we wrote control programs in something as close to English as we could get.
Deep in a dusty archive I found some of this old code. Here's a fragment of a test script for one of the items; an audio switching unit the size of a small domestic refrigerator:
input 0 action ALT call Test
input 288 call Prompt
enable inputs
set alpha 1 to "Ready"
prompt "System ready"
stop

Test:
    if ON
    begin
        set output 0
        set output 288
        set output 289
        set alpha 1 to "Testing"
        set alpha 40 to "Chan 40"
        set alpha 41 to "Chan 41"
    end
    else
    begin
        clear output 0
        clear output 288
        clear output 289
        set alpha 1 to ""
    end
    stop

Prompt:
    if ON set output 1
    else clear output 1
    prompt "288"
    stop
The date on this code is some time in 1999, though I'm sure it was initially written several years earlier; by whom I don't remember. I left the company 17 years ago but I do recall that the audio switch implemented an X-Y matrix of audio inputs and outputs with the ability to connect any input to any output via mercury relays. It had an array of small alphanumeric displays (at least 41, it seems from the code) and a large number (hundreds) of logical inputs, each of which had rules governing its behavior. Some connected to front panel buttons; others reacted to events elsewhere around the studio.
The script defines action handlers for 2 of the inputs, giving each the name of a program label to execute code from when their state changes. The default action was MOM (momentary) but some buttons had an alternate action (press to set ON, press again to set OFF) so the first input here defines ALT as the action. Without diving into the source of the runtime I assume the language sets the value of ON to be the state of the input whose change of state is being handled, but I have no idea why the label Prompt has that name when its action is to set or clear an output.
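The difference between the two actions can be sketched in a few lines of Python. This is a guess at the behavior rather than a reading of the old runtime: the class name `Input` and the methods `press` and `release` are hypothetical, and the assumption that a MOM input goes OFF on release comes from the usual meaning of "momentary", not from the source.

```python
class Input:
    """One logical input with a momentary (MOM) or alternate (ALT) action."""

    def __init__(self, action="MOM"):
        if action not in ("MOM", "ALT"):
            raise ValueError(f"unknown action {action!r}")
        self.action = action
        self.on = False  # the value a handler would see as ON

    def press(self):
        if self.action == "ALT":
            self.on = not self.on  # ALT: each press flips the state
        else:
            self.on = True         # MOM: on for as long as the button is held
        return self.on

    def release(self):
        if self.action == "MOM":
            self.on = False        # MOM: off as soon as the button is released
        return self.on             # ALT: releasing changes nothing


alt = Input("ALT")
print([alt.press(), alt.release(), alt.press(), alt.release()])  # [True, True, False, False]

mom = Input()
print([mom.press(), mom.release()])  # [True, False]
```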
The scripts got quite large. I found another that defines actions for 288 inputs but I'll spare you that one.
The point is that anyone with even a passing knowledge of the system can read the code and have some idea of what it does. I've only shown a fraction of the command set available; it included numeric and string variables, 4-function arithmetic, string handling, control structures like if, while and switch as well as keywords like input and output that related directly to the hardware of the system. The compiler was written in C++ on Windows and the runtime was written in assembler for an 8-bit TMS370 microprocessor. The compiled scripts arrived via a simple loader on a serial bus, to avoid having to reprogram EEPROMs.
So?
I called this piece "Programming without a programming language" because that's really what we were doing back then. If you were to describe the actions of the audio switch you'd write pseudocode in English that bears a close resemblance to the code above. And that's what usually was written by the studio engineers; all I did was remove ambiguity and syntactic noise to leave a terse but easily understood script that can still be read and understood by almost anyone, decades later. I don't think we ever even dignified the language with a proper name, but after ten or so years working in this way the habit became ingrained and now my approach to any programming problem is to first ask myself what the solution would look like in English, then if it seems viable consider how to build a compiler that can handle that syntax.
Not every problem is directly amenable to this approach. Many entities are complex and result in cumbersome code when scripted. However, the power and ingenuity of human language is such that in these situations we invent new vocabulary and syntax to represent and deal with new entities, giving them names that go into the language seamlessly. In English, new entities arrive continuously; laser, smartphone, barbecue, hypermarket, boycott, tomfoolery, millennial... the list goes on and on. In each case, the new addition neatly replaces a much longer set of words that previously had to be used each time to convey the same meaning.
Domain-specific scripting languages also have the ability to grow seamlessly as new features need to be added, as long as the original syntax is designed well enough to encompass the new without breaking. When a new entity is added, all the previously-needed cumbersome script is replaced by the new, simpler syntax; the code that makes it work is now written in whatever language the system itself is built with, hidden away from the people who have to write and maintain the scripts. This is a good example of 'separation of concerns'.
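One way to picture this kind of growth is a table of keywords, each backed by a function in the host language. The Python sketch below is only an illustration under that assumption: the `HANDLERS` table, the `keyword` decorator and the `mute` command are all invented for the example and are not part of any of the languages described in this article.

```python
HANDLERS = {}  # keyword -> function implementing it in the host language


def keyword(name):
    """Decorator that adds a new word to the script vocabulary."""
    def register(func):
        HANDLERS[name] = func
        return func
    return register


@keyword("set")
def set_(state, args):
    # Script form: set output N
    state["outputs"].add(int(args[1]))


@keyword("clear")
def clear(state, args):
    # Script form: clear output N
    state["outputs"].discard(int(args[1]))


# A later, hypothetical addition: one word replaces a run of "clear output N" lines.
@keyword("mute")
def mute(state, args):
    # Script form: mute FIRST to LAST
    first, last = int(args[0]), int(args[2])
    for n in range(first, last + 1):
        state["outputs"].discard(n)


def run(script, state):
    """Dispatch each script line to the handler named by its first word."""
    for line in script.splitlines():
        words = line.split()
        if words:
            HANDLERS[words[0]](state, words[1:])
    return state


state = {"outputs": set()}
run("set output 1\nset output 2\nset output 9\nmute 1 to 2", state)
print(state["outputs"])  # {9}
```

Scripts written before `mute` existed keep working unchanged, and script authors who use the new word never see the loop that implements it.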
General-purpose programming languages don't have this facility to grow and expand on a piecemeal basis. Instead, we supplement them with a growing and ever-changing array of highly complex libraries and frameworks, most of which are anything but intuitive. Instead of enhancing the language, which the human brain handles well, we add structure, which our brains are less well equipped to deal with. (In everyday life we handle new vocabulary with ease but most of us struggle with grammar and have to consult weighty textbooks to gain understanding.)
For people who work with the tools on a daily basis this is fine, but it's not good for long-term maintainability, partly because the level of skill required is more commonly found in developers than in maintenance engineers, and partly because the tools themselves have a habit of rapidly becoming obsolete when something newer and shinier comes along. That's the thing about most development; it's all about the cutting edge. A more cautious approach would be to suggest that complex tools should only be used for code that is guaranteed high-level ongoing support by people with skills and experience equivalent to those of the original programmer(s). If this guarantee cannot be made there is a strong likelihood that the product will become unmaintainable after only a few years. Unfortunately, I see little sign that this problem is recognized by the software industry, which starts from the flawed assumption that anyone can 'read the code'.
None of this removes the need for complex tools. We use operating systems, word processors, spreadsheets and Java, JavaScript or Python compilers without having to know how they were built or how they work. A component that will never require maintenance except by its builders can use whatever technology is most appropriate to make it work. It can be relied upon to do its job day in, day out with no understanding needed about what goes on inside it. But when you're building something that others will maintain, spare a thought for who those 'others' are and don't make your product so complex they won't be able to understand it when changes are needed. Wherever possible, the simplest, most effective approach is to use the language of the user rather than that of the system.
And today...
When a couple of years ago I wanted to improve my rather basic JavaScript skills I set myself the challenge of recreating a variant of that old DSL, with both the compiler and runtime written in JavaScript so they can be run in any browser. The vocabulary and syntax of the language would be tilted towards the kind of things people use JavaScript for in web pages. This was a substantial project but it turned out far better than I expected and I now have a system good enough to build entire websites by describing them in multiple linked scripts that can be embedded in the HTML or downloaded on demand. It was a double win; in the space of those 2 years I also gained far more practical experience of JavaScript than I would have picked up from any number of programming tutorials.
More recently I wanted to learn some Python; it's just about the most popular programming language today and I didn't want to be left behind. It seemed to me the obvious thing was to repeat the exercise, this time writing my compiler and runtime in Python. This version is only a couple of months old so it has some way to go; I'm planning to provide it with features for writing shell scripts as an alternative to Bash and Perl (or indeed Python). My enthusiasm for high-level scripting DSLs remains undiminished.
Anyone curious about either of these projects can find all the current code - with some documentation - in my GitHub repository. I also welcome questions and comments, either here or via the contact details on GitHub.
Top comments (6)
Hi,
"JavaScript and PHP are limited to the web domain of programming, and so not normally classified as general purpose."
I have to correct you there.
Though both JS and PHP were designed to work in a web environment (on the client and on the server respectively), that has changed over the years.
Nowadays you can use JS to build CLI, mobile or even desktop apps...
PHP did not cross over to the client side (not directly anyway), but it is used as a CLI scripting language. It is possible to build many other kinds of apps with it, but there are far better tools.
What stops a language being a DSL and makes it a GPL is its (potential) usefulness in other areas, and its evolution. Almost every GPL started as someone's pet project or DSL... people made it a GPL.
Good points, Ivan. When I cited HTML and SQL I was aware that although both are domain-specific they are also de facto standards the whole world uses. The conclusion being that although most DSLs start off as "pet projects" - or are born out of necessity as in the case of the one in my article - some have the good fortune to go on and become established. PHP and Python could probably be classed as former pet projects, whereas JavaScript - along with VBScript - was conceived to serve a particular need at the time. However, the distinction is rather arbitrary, as most projects come from someone's perceived need, rather than "I think I'll write a new language just for the hell of it".
A question implicitly raised by Paddy3118 is "what defines a web language?" Although PHP and JavaScript are both supported by huge amounts of code that targets web applications, neither of the languages themselves is heavily slanted to web use and both are quite capable of doing many of the jobs you might use Python or Java for. The two DSLs I'm currently developing are internally very similar, but the JavaScript one is definitely web-oriented and the Python one is definitely not. (The Python one is somewhat better written; it benefits from the experience I gained doing it in JavaScript, but both are in fact derived from an original I wrote in Java 20 years ago. This old dog has been chewing the same bone all that time.)
It's not a matter of what you can do; it's what most people actually do with the languages. Read the online articles and blogs, and the overwhelming majority of JavaScript and PHP texts have been, and remain, web related. The web is a large domain, and they are often used within it, but very few people write about using them in areas unrelated to the web.
If GP is what a language could be used for then you need to work harder to distinguish the domain-specific: if they are Turing complete then there is nothing stopping them from being used outside of their domain.
Although I think it's incorrect to state that PHP and JavaScript are limited to the web domain (Qt bindings are available for both, for example) I accept that's where both are most commonly found. Overwhelmingly so, in fact. However, it seems the more I look at the term general purpose the more ephemeral it becomes. "Turing complete" - that is, the language has conditional execution and the ability to read and write data - covers any of the languages listed, including my own, so as you say, something else is needed to make a distinction.
One of the biggest challenges in designing a language (as the authors of JS have discovered) is in ensuring it remains unambiguous as it develops. Human languages are notoriously prone to ambiguity, which limits how close we can get to that "gold standard" when designing machine languages.
Even the term "Turing complete" might be less than exact. For example, how about a language that has no read/write commands of its own but which supports network sockets that allow a local REST server to do I/O jobs for it? JavaScript in its browser sandbox seems to fit in that category.
Perhaps any attempts to classify languages should start by restricting themselves to the core language and ignoring all the associated libraries and frameworks.
Nicely put