Meghan (she/her)

Posted on Oct 8, 2017 • Edited on Oct 10, 2017

How I accidentally wrote an awesome HTML Preprocessor

#javascript #compilers #corgi #webdev

As a programmer, compilers have always seemed to me like a million-line black box, out-daunted only by writing an operating system. But hard challenges are the best challenges, so a while ago I set out to try and make one myself.

OK.

If you want to write a compiler, there are three main parts: the lexer, the parser, and the code generator. I've started this project in a variety of languages, including Java and C#, but my successful implementation is currently in JavaScript.

1) Lexing

Lexing, or lexical analysis, is actually very straightforward relative to the rest of the process. Consider the following code:

const hello = "Hello, " + "World!";
const sum = 4 + 5;

When lexing a piece of code, you go through the entire source and convert the string into a collection of Tokens. Tokens are simple structures that store information about a small sliver of the source code. The lexer that I wrote uses four main Token types: Keyword, Word, String, and Symbol. So the code above might look something like this after lexing:

Keyword<"const">
Word<"hello">
Symbol<"=">
String<"Hello, ">
Symbol<"+">
String<"World!">
Symbol<";">
Keyword<"const">
Word<"sum">
Symbol<"=">
Word<"4">
Symbol<"+">
Word<"5">
Symbol<";">
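
To make the token list above concrete, here is a minimal hand-rolled lexer sketch. It is not Mantle's implementation or API: the names KEYWORDS, SYMBOLS, STRING_DELIMITER, isWordChar, and lex are all hypothetical, and it assumes single-character symbols and strings without escape sequences.

```javascript
// Hypothetical configuration for this sketch (not Mantle's actual API).
const KEYWORDS = new Set(["const"]);
const SYMBOLS = new Set(["=", "+", ";"]);
const STRING_DELIMITER = '"';
const isWordChar = (c) => /[A-Za-z0-9_]/.test(c);

// Walk the source once and emit { type, value } tokens.
function lex(source) {
    const tokens = [];
    let i = 0;
    while (i < source.length) {
        const c = source[i];
        if (/\s/.test(c)) {
            i++; // whitespace only separates tokens
            continue;
        }
        if (c === STRING_DELIMITER) {
            // Assumption: no escape sequences, so the next delimiter ends the string.
            const end = source.indexOf(STRING_DELIMITER, i + 1);
            if (end === -1) throw new Error(`Unterminated string at index ${i}`);
            tokens.push({ type: "String", value: source.slice(i + 1, end) });
            i = end + 1;
            continue;
        }
        if (SYMBOLS.has(c)) {
            tokens.push({ type: "Symbol", value: c });
            i++;
            continue;
        }
        if (isWordChar(c)) {
            let j = i;
            while (j < source.length && isWordChar(source[j])) j++;
            const text = source.slice(i, j);
            tokens.push({ type: KEYWORDS.has(text) ? "Keyword" : "Word", value: text });
            i = j;
            continue;
        }
        throw new Error(`Unexpected character "${c}" at index ${i}`);
    }
    return tokens;
}

const tokens = lex('const hello = "Hello, " + "World!";\nconst sum = 4 + 5;');
// Prints one Type<"value"> line per token, 14 in total.
console.log(tokens.map((t) => `${t.type}<"${t.value}">`).join("\n"));
```

Note that numbers come out as Word tokens here, because the sketch only knows the four token types mentioned above.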

If you've made it this far, then Awesome!

My project, Mantle, makes this super easy to do through an abstract class you can extend called mantle.lexer.Lexer. You simply define a list of keywords, symbols, and string delimiters, tell it whether to allow comments or not, and pass a function that defines whether a character can be used in a word. After that, creating the list above becomes as easy as calling Lexer.parse(). Moving on, though, you will almost never call parse() yourself.

More on mantle can be found at https://github.com/Nektro/mantle.js

2) Parsing

This is the hard part.

Parsing requires you to figure out patterns of tokens that can compress the token list into a single node. This took a lot of trial and error to get right, and is the main reason why this project took so long.

For instance, for the code we had above, we might define the following rules:

Add <= String + String
Add <= Integer + Integer
AssignmentConst <= const Word = Add
StatementList <= AssignmentConst AssignmentConst
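
As a rough illustration of how rules like these can collapse a token list into a single node, here is a toy shift-reduce loop. It is only a sketch and not how mantle.parser.Parser works. The helpers token, node, integer, RULES, and parse are hypothetical names, and the rules are adapted so that the example input actually reduces to one node: AssignmentConst also consumes the trailing ";", an Integer is taken to be a Word made of digits, and StatementList combines two assignments.

```javascript
// Matchers for one stack entry, which is either a raw token or a reduced node.
const token = (type, value) => (e) =>
    e.type === type && (value === undefined || e.value === value);
const node = (type) => (e) => e.type === type && Array.isArray(e.children);
const integer = (e) => e.type === "Word" && /^\d+$/.test(e.value);

// Hypothetical rule table, adapted from the rules in the text.
const RULES = [
    { name: "Add", pattern: [token("String"), token("Symbol", "+"), token("String")] },
    { name: "Add", pattern: [integer, token("Symbol", "+"), integer] },
    {
        name: "AssignmentConst",
        pattern: [token("Keyword", "const"), token("Word"), token("Symbol", "="), node("Add"), token("Symbol", ";")],
    },
    { name: "StatementList", pattern: [node("AssignmentConst"), node("AssignmentConst")] },
];

function parse(tokenList) {
    const stack = [];
    for (const t of tokenList) {
        stack.push(t); // shift the next token
        let reduced = true;
        while (reduced) {
            reduced = false;
            for (const rule of RULES) {
                const start = stack.length - rule.pattern.length;
                if (start < 0) continue;
                if (rule.pattern.every((matches, k) => matches(stack[start + k]))) {
                    // reduce: replace the matched entries with a single node
                    const children = stack.splice(start);
                    stack.push({ type: rule.name, children });
                    reduced = true;
                    break;
                }
            }
        }
    }
    if (stack.length !== 1) throw new Error("Could not reduce the tokens to a single node");
    return stack[0];
}

const T = (type, value) => ({ type, value });
const input = [
    T("Keyword", "const"), T("Word", "hello"), T("Symbol", "="),
    T("String", "Hello, "), T("Symbol", "+"), T("String", "World!"), T("Symbol", ";"),
    T("Keyword", "const"), T("Word", "sum"), T("Symbol", "="),
    T("Word", "4"), T("Symbol", "+"), T("Word", "5"), T("Symbol", ";"),
];
const tree = parse(input);
console.log(tree.type); // StatementList
```

A greedy loop like this only works because the toy rules never conflict with each other. A real grammar needs lookahead or rule priorities to decide when to reduce.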

The more complex the language, the more complex the rules become, which I discovered very quickly.

The JSON example for mantle.parser.Parser can be found at https://github.com/Nektro/mantle.js/blob/master/langs/mantle-json.js

3) Code generation

This is the process of going through your final condensed node, also called an Abstract Syntax Tree, and calling toString() on every node in it until you get your new output.

Note:
Optimization of higher-level languages requires a lot more work than calling toString(), but that is way beyond my scope.
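
A minimal sketch of the toString() approach, assuming a small set of hypothetical node classes (StringLiteral, IntegerLiteral, Add, AssignmentConst, StatementList) that are not taken from Mantle. Each node knows how to print itself, and printing the root prints the whole tree.

```javascript
// Hypothetical AST node classes; each one renders itself via toString().
class StringLiteral {
    constructor(value) { this.value = value; }
    toString() { return JSON.stringify(this.value); } // re-adds the quotes
}

class IntegerLiteral {
    constructor(value) { this.value = value; }
    toString() { return String(this.value); }
}

class Add {
    constructor(left, right) { this.left = left; this.right = right; }
    // Template literals call toString() on the child nodes.
    toString() { return `${this.left} + ${this.right}`; }
}

class AssignmentConst {
    constructor(name, value) { this.name = name; this.value = value; }
    toString() { return `const ${this.name} = ${this.value};`; }
}

class StatementList {
    constructor(statements) { this.statements = statements; }
    // Array.prototype.join converts each statement with toString().
    toString() { return this.statements.join("\n"); }
}

const ast = new StatementList([
    new AssignmentConst("hello", new Add(new StringLiteral("Hello, "), new StringLiteral("World!"))),
    new AssignmentConst("sum", new Add(new IntegerLiteral(4), new IntegerLiteral(5))),
]);

const output = String(ast);
console.log(output);
// const hello = "Hello, " + "World!";
// const sum = 4 + 5;
```

Here the output language happens to be the same as the input language. Targeting a different output only means changing what each toString() returns.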

4) Corgi - my new HTML Preprocessor

At this point I was ecstatic. I had successfully made a JSON parser. But I wanted to make something a little more complicated, so I moved on to HTML. The thing is, though, HTML isn't very well formed, so I thought I'd make a version that's a little easier for Mantle to parse. And that's how I came up with Corgi.

Corgi syntax is inspired by Pug but isn't tab based, so you can theoretically compress a file onto one line. I loved this because the forced tab structure made using cosmetic HTML tags in Pug really awkward. So Corgi makes HTML great for both structure and style.

An example Corgi document would look like:

doctype html
html(
    head(
        title("Corgi Example")
        meta[charset="UTF-8"]
        meta[name="viewport",content="width=device-width,initial-scale=1"]
    )
    body(
        h1("Corgi Example")
        p("This is an example HTML document written in "a[href="https://github.com/corgi-lang/corgi"]("Corgi")".")
        p("Follow Nektro on Twitter @Nektro")
    )
)

Closing

Making compilers is hard, but it has definitely been fun, and I hope this helps demystify them some.

And now I also have an HTML Preprocessor that I'm going to use in as many projects as makes sense.


Top comments (11)

Riaan Pietersen • Oct 10 '17

This is cool. A great extension would be one where you could mix-in standard Bootstrap markup through use of a keyword, eg.

body(
h1("Corgi Example")
p("This is an example HTML document written in "a[href="https://github.com/corgi-lang/corgi"]("Corgi")".")
p("Follow Nektro on Twitter @nektro")
boot_accordian[{js/my_accordian.json}]
)

now THAT would save a ton of time methinks. Wonder if the pre-processor could load up that json while compiling and substitute it in (for readability and all.)

Meghan (she/her) • Oct 10 '17 • Edited

Totally. Sometime very soon I was going to add syntax for an import statement that I could work into a (gulp, etc) plugin that could reference other corgi documents.

As it stands right now tags and attributes are allowed a ([a-z0-9-]+) range for the name so custom elements and attributes are already possible. :)

What did you have in mind for the contents of my_accordian.json?

Riaan Pietersen • Oct 10 '17

So if you look at the standard Bootstrap 3 accordion setup: getbootstrap.com/docs/3.3/javascri... that could be turned into a JSON object, I'm sure. Perhaps something like this:

{
    "content": {
        "First tab": "This is potentially the content of my first tab",
        "Second tab": "This is potentially the content of my second tab",
        "Third tab": "This is potentially the content of my third tab"
    },
    "settings": {
        "setting1": true,
        "setting2": "a value"
    }
}

Your parser could interpret a shortcode with a JSON parameter to build the entire structure of the accordion quickly and cleverly. Perhaps you could allow for passing in variables/settings like in the JSON above. It could potentially work for any of the preset Bootstrap components, I think.

Meghan (she/her) • Oct 11 '17

Something like this?

gist.github.com/Nektro/b9f499afba0...

rawgit.com/Nektro/b9f499afba0bb1e2...

Riaan Pietersen • Oct 11 '17

That's it! Could it be invoked inline, like the other replacements, through some sort of a tag? This seems quite specific, with a lot of JavaScript triggering it?

Meghan (she/her) • Oct 11 '17

Importing the code from bs4-accordion.html just once, <bs4-accordion> becomes a tag available anywhere in the document. The slightly monotonous init JS is because of the specificity of Bootstrap syntax but with the Custom Element it sets all that up for you.

Riaan Pietersen • Oct 11 '17 • Edited

Sounds ideal then :)

Have you looked at riot.js? Perhaps some knowledge to glean from there too.

Are you happy with it?

Mihovil Ilakovac • Oct 10 '17

I like the way this gives a high level overview of the problem. It was helpful for me as I'm a CS student and have a compilers course atm.

Mt • Oct 10 '17

I think for that you already have .pug files, which do the same thing. Do we need HTML pre-compilers?

Meghan (she/her) • Oct 11 '17 • Edited

We don't need preprocessors, especially because my files can't be natively interpreted by a browser, so using this does add a step to your development process, depending on how all-in you go on a preprocessor. Mileage may vary.

This project had been on my back burner for a while, and I decided to write this post because I got it to the point where I had finally made something potentially useful. As I mentioned, I was using Pug for another project of mine, but the other day I converted all of those pages to Corgi instead, both for the advantages I mentioned above and for the added satisfaction that the code I publish for my next project is that little bit more mine.