As a programmer, compilers have always seemed to me like a million-line black box, out-daunted only by writing an operating system. But hard challenges are the best challenges, so a while ago I set out to try and make one myself.
The process of lexing, or Lexical Analysis, is actually very straightforward relative to the rest of this process. Consider the following code:
const hello = "Hello, " + "World!"; const sum = 4 + 5;
When lexing a piece of code, you go through the entire source and convert the string into a collection of Tokens. Tokens are simple structures that store information about a small sliver of the source code. For the lexer that I wrote, I use four main Token types: Keyword, Word, String, and Symbol. So the code above might look something like this after lexing:
Keyword<"const"> Word<"hello"> Symbol<"="> String<"Hello, "> Symbol<"+"> String<"World!"> Symbol<";"> Keyword<"const"> Word<"sum"> Symbol<"="> Word<"4"> Symbol<"+"> Word<"5"> Symbol<";">
If you've made it this far, then awesome!
My project, Mantle, makes this easy to do through an abstract class you can extend called
mantle.lexer.Lexer. You simply define a list of keywords, symbols, and string delimiters, tell it whether or not to allow comments, and pass a function that decides whether a character can be used in a word. After that, creating the list above becomes as easy as calling
Lexer.parse(), though in practice you will almost never call it directly.
More on mantle can be found at https://github.com/Nektro/mantle.js
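To make the idea concrete, here is a minimal standalone lexer sketch. This is not Mantle's actual API; the function name, options object, and token shapes are all made up for illustration. It produces the four token kinds described above from the same source line:

```javascript
// Minimal standalone lexer sketch -- not Mantle's actual API.
// Produces the four token kinds described above:
// Keyword, Word, String, and Symbol.
function lex(source, { keywords, symbols, isWordChar }) {
  const tokens = [];
  let i = 0;
  while (i < source.length) {
    const c = source[i];
    if (/\s/.test(c)) { i += 1; continue; }
    if (c === '"') {
      // String: consume until the closing delimiter
      let j = i + 1;
      while (j < source.length && source[j] !== '"') j += 1;
      tokens.push({ type: 'String', value: source.slice(i + 1, j) });
      i = j + 1;
    } else if (symbols.includes(c)) {
      tokens.push({ type: 'Symbol', value: c });
      i += 1;
    } else if (isWordChar(c)) {
      // Word: take the longest run of word characters,
      // then promote it to Keyword if it matches one
      let j = i;
      while (j < source.length && isWordChar(source[j])) j += 1;
      const text = source.slice(i, j);
      tokens.push({ type: keywords.includes(text) ? 'Keyword' : 'Word', value: text });
      i = j;
    } else {
      throw new Error(`Unexpected character '${c}' at index ${i}`);
    }
  }
  return tokens;
}

const tokens = lex('const hello = "Hello, " + "World!";', {
  keywords: ['const'],
  symbols: ['=', '+', ';'],
  isWordChar: (c) => /[A-Za-z0-9_]/.test(c),
});
// tokens[0] -> { type: 'Keyword', value: 'const' }
```

A real lexer also tracks line and column numbers on each token so later stages can report useful errors, but the scanning loop itself stays this simple.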
Parsing is the hard part. It requires you to figure out patterns of tokens that can be compressed into a single node. This took a lot of trial and error to get right, and it is the main reason this project took so long.
For instance, for the code above, we might define the following rules:

Add <= String + String
Add <= Integer + Integer
AssignmentConst <= const Word = Add
StatementList <= Add Add
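The core loop behind rules like these can be sketched as a toy rule-based reducer. To be clear, this is not Mantle's actual parser; the rule format and node shapes here are invented for illustration. Each rule names a new node type and the sequence of child types (or literal keyword/symbol text) that it collapses:

```javascript
// Toy rule-based reducer sketch -- not Mantle's actual parser.
// A rule names a new node type and the sequence it replaces.
const rules = [
  { name: 'Add', pattern: ['String', '+', 'String'] },
  { name: 'AssignmentConst', pattern: ['const', 'Word', '=', 'Add'] },
];

// A node matches a pattern entry by its type, or by its exact
// text for keywords and symbols like 'const', '=', and '+'.
function matches(node, entry) {
  return node.type === entry || node.value === entry;
}

// Repeatedly scan the node list and collapse the first rule match,
// until no rule applies any more.
function reduce(nodes) {
  let changed = true;
  while (changed) {
    changed = false;
    for (const rule of rules) {
      for (let i = 0; i + rule.pattern.length <= nodes.length; i++) {
        const slice = nodes.slice(i, i + rule.pattern.length);
        if (slice.every((n, k) => matches(n, rule.pattern[k]))) {
          nodes.splice(i, rule.pattern.length, { type: rule.name, children: slice });
          changed = true;
        }
      }
    }
  }
  return nodes;
}

const ast = reduce([
  { type: 'Keyword', value: 'const' },
  { type: 'Word', value: 'hello' },
  { type: 'Symbol', value: '=' },
  { type: 'String', value: 'Hello, ' },
  { type: 'Symbol', value: '+' },
  { type: 'String', value: 'World!' },
]);
// ast -> [ { type: 'AssignmentConst', children: [...] } ]
```

The trial and error comes from rule ordering: two rules can both match the same span of tokens, and a naive first-match loop like this one will pick whichever rule happens to come first in the list.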
The more complex the language, the more complex the rules become, as I discovered very soon.
The JSON example for mantle.parser.Parser can be found at https://github.com/Nektro/mantle.js/blob/master/langs/mantle-json.js
This is the process of going through your final condensed node, also called an Abstract Syntax Tree, and calling toString() on each node until you get your new output.
Optimizing the output of higher-level languages requires a lot more work than calling toString(), but that is well beyond the scope of this project.
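The toString() walk can be sketched like this. The emit() function and the node shapes are hypothetical (they follow the toy parser sketch above, not Mantle's real node classes), but the structure is the same: each node type knows how to print itself and recurses into its children:

```javascript
// Codegen sketch: each AST node type knows how to print itself.
// emit() and these node shapes are illustrative, not Mantle's API.
function emit(node) {
  switch (node.type) {
    case 'String':
      // Re-wrap the raw string value in quotes, escaping as needed
      return JSON.stringify(node.value);
    case 'Add':
      // children: [left, '+', right]
      return `${emit(node.children[0])} + ${emit(node.children[2])}`;
    case 'AssignmentConst':
      // children: ['const', name, '=', value]
      return `const ${node.children[1].value} = ${emit(node.children[3])};`;
    default:
      return String(node.value);
  }
}

const ast = {
  type: 'AssignmentConst',
  children: [
    { type: 'Keyword', value: 'const' },
    { type: 'Word', value: 'hello' },
    { type: 'Symbol', value: '=' },
    { type: 'Add', children: [
      { type: 'String', value: 'Hello, ' },
      { type: 'Symbol', value: '+' },
      { type: 'String', value: 'World!' },
    ] },
  ],
};
const out = emit(ast);
// out === 'const hello = "Hello, " + "World!";'
```

For a transpiler this round trip (source in, equivalent source out) is the whole job; a compiler to something lower-level would swap the string templates for instruction emission.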
At this point I was ecstatic: I had successfully made a JSON parser. But I wanted to make something a little more complicated, so I moved on to HTML. The thing is, though, HTML isn't very well formed, so I thought I'd make a version that's a little easier for Mantle to parse. And that's how I came up with Corgi.
Corgi syntax is inspired by Pug, but it isn't tab-based, so you can theoretically compress a file onto one line. I loved this because the forced tab structure made using cosmetic HTML tags in Pug really awkward. So Corgi makes HTML great for both structure and style.
An example Corgi document would look like:
doctype html
html(
  head(
    title("Corgi Example")
    meta[charset="UTF-8"]
    meta[name="viewport",content="width=device-width,initial-scale=1"]
  )
  body(
    h1("Corgi Example")
    p("This is an example HTML document written in "a[href="https://github.com/corgi-lang/corgi"]("Corgi")".")
    p("Follow Nektro on Twitter @Nektro")
  )
)
Making compilers is hard, but it has definitely been fun, and I hope this helps demystify them a bit.
And now I also have an HTML preprocessor that I'm going to use in as many projects as makes sense.