For a personal project, I needed to create a Table of Contents from given Markdown content; however, the Markdown was supposed to produce different HTML output than what the CommonMark specification prescribes. The rules for getting from Markdown to HTML were similar to how Rust's mdBook handles its Table of Contents.
While it's possible to do an ad-hoc conversion without paying too much attention to "how it's supposed to be done", I'm learning more about compilers, lexers, parsers, and transpilers, so I thought this was a good way to dip my toes into the field.
This tool is a transpiler (code to code) rather than a compiler (higher-level to lower-level code), because the two languages we're converting between, Markdown (per the CommonMark specification) and HTML, are both markup languages.
Goal
We want to go from this:
# Title
[About](index.md)
- [Getting Start]()
- [CLI Usage](usage/usage.md)
  - [init](usage/init.md)
to this:
<ol class="toc">
<li class="part-title">Title</li>
<li><a href="/about">About</a></li>
<li class="chapter-item">
<strong>1.</strong>
<a class="active" href="/index.html">About</a>
</li>
<li>
<ol class="section">
<li class="chapter-item draft">
<strong>5.1.</strong>
Table of Contents Format
</li>
</ol>
</li>
<li class="part-title">Development</li>
<li class="chapter-item">
<strong>6.</strong>
<a class="" href="/contributing.html">Contributing</a>
</li>
</ol>
How do we go about doing it? Intuitively, we can map the character sets between the two formats. For instance,
# Title
maps to
<li class="part-title">Title</li>
and
[About](index.md)
maps to
<li class="chapter-item">
<strong>1.</strong>
<a class="active" href="/index.html">About</a>
</li>
and
- [init](usage/init.md)
maps to
<li>
<ol class="section">
<li class="chapter-item draft">
<strong>5.1.</strong>
Table of Contents Format
</li>
</ol>
</li>
To go further into mapping between the two sets, we can look at specific characters:
- A hash/number `#` in the Markdown format denotes that we are starting a list tag `<li></li>`; perhaps we can call this `title`
- A left-side bracket `[`, followed by letters, spaces, and/or other characters, then closed by a `]`, which is then followed by a pair of parentheses `()`, denotes a link tag `<a></a>`
- A dash sign `-` in the Markdown format maps to the beginning of a list item tag `<li></li>` in the HTML format. When combined with a link tag, it also adds numbering, so we can call this `list item`
Finally, we wrap all of the character sets in an ordered list `<ol></ol>` tag.
As you've probably noticed, we haven't introduced any formal terminology or talked about how we're going to be mapping between the two formats, and so that's what we'll look at next.
Terminology and Mapping
Before we start delving into the code, let's introduce some useful terminology for the components that we'll need.
First, we'll need a method for reading the characters from the text stream.
This component is called the Reader: given a text, we iterate through each of the characters.
Next, we'll combine characters into meaningful tokens. This is called lexing and is the job of the Lexer.
Afterward, we'll combine the different Tokens to produce a new Syntax Tree. This is called parsing and is done by the Parser.
In the final step, we'll go through the generated Syntax Tree and write a Renderer function which will generate the HTML.
Furthermore, you can divide the components into two stacks: a frontend and a backend. The frontend is usually responsible for lexing and parsing the code, whereas the backend transforms the code into something else.
In our case the frontend stack consists of the following components:
- Reader
- Lexer
- Parser
And the backend stack consists of only one component:
- Renderer
Reader
The Reader is the simplest of the components; its only purpose is to retrieve characters from different sources, be it a string, a file, etc. The Reader keeps internal state about which character it's currently reading, and exposes a few public helper methods for retrieving a single character from the stream.
'hello + world' -> ['h','e','l','l','o',' ','+',' ','w','o','r','l','d']
Lexer
The Lexer is the next simplest part of the frontend stack. Its main purpose is to create meaningful tokens from the stream of characters it gets from the Reader (this is also called lexical analysis). It discards unimportant characters like spaces (unless they are part of the lexical grammar, as in Python). Similar to the Reader, it exposes some public methods that let you extract the newly minted tokens.
['h','e','l','l','o',' ','+',' ','w','o','r','l','d'] -> ['hello', '+', 'world']
Parser
In our program, the Parser is the most advanced component and is responsible for generating a Syntax or Abstract Syntax Tree, which is a tree structure that represents meaningful grammar. The difference between a Syntax Tree and an Abstract Syntax Tree is that the Syntax Tree includes all information of the source code, including whitespaces which may be superfluous. The AST on the other hand is an abstract representation of the source code that does not include superfluous info.
['hello', '+', 'world']
->
[
{ token: 'string', lexeme: 'hello' },
{ token: 'string', lexeme: '+' },
{ token: 'string', lexeme: 'world' },
]
Renderer
The renderer does the final transformation, from an AST/Syntax Tree to the desired output language, in our case, HTML.
[
{ token: 'string', lexeme: 'hello' },
{ token: 'string', lexeme: '+' },
{ token: 'string', lexeme: 'world' },
]
->
<div>hello</div>
<div>+</div>
<div>world</div>
The Road to HyperText Markup Language
Next up we'll implement the Reader, Lexer, Parser and Renderer.
1: Build the Reader
Let's start with the simplest component, the Reader. In our case, it's simply a function that takes a string as input and provides some methods to interact with that string. Like almost all of the components, it's a stateful function: it keeps track of which character we are currently reading.
The main methods it exposes are:
- peek: lookup the nth character that is ahead of where we currently are
- consume: return the current character and move to the next character
- isEOF: Check if we have reached the end of the string
function TocReader(chars: string) {
let i = 0;
function peek(nth: number = 0) {
return chars[i + nth];
}
function consume(nth: number = 0) {
const c = chars[i + nth];
i = i + nth + 1;
return c;
}
function isEOF() {
return chars.length - 1 < i;
}
return Object.freeze({
peek,
consume,
isEOF,
});
}
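As a quick sanity check, here's how the Reader behaves on a tiny input (a minimal usage sketch of the function above):
const reader = TocReader("# T");
reader.peek();    // '#'
reader.peek(1);   // ' '
reader.consume(); // '#' (and advances the cursor)
reader.consume(); // ' '
reader.consume(); // 'T'
reader.isEOF();   // true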
2: Define the grammar
Next up, we'll have to identify the tokens we're going to allow in our DSL (Domain-Specific Language).
We want to take this text:
# Title
[About](index.md)
---
- [Getting Start]()
- [CLI Usage](usage/usage.md)
  - [init](usage/init.md)
and find the tokens that represent it. One interpretation is:
| Text | Tokens |
|---------------------|-----------------------------------------------------------------------------------|
| # Title | [HASH, SPACE, STRING] |
| [About](index.md) | [LEFT_BRACE, STRING, RIGHT_BRACE, LEFT_PAREN, STRING, RIGHT_PAREN] |
| --- | [HORIZONTAL_RULE] |
| - [Getting Start]() | [DASH, SPACE, LEFT_BRACE, STRING, RIGHT_BRACE, LEFT_PAREN, RIGHT_PAREN] |
| - [Usage](usage.md) | [DASH, SPACE, LEFT_BRACE, STRING, RIGHT_BRACE, LEFT_PAREN, STRING, RIGHT_PAREN] |
| - [Init](init.md) | [INDENT, DASH, SPACE, LEFT_BRACE, STRING, RIGHT_BRACE, LEFT_PAREN, STRING, RIGHT_PAREN] |
| | [EOF] |
Then the tokens are:
enum TokenType {
LEFT_PAREN,
RIGHT_PAREN,
LEFT_BRACE,
RIGHT_BRACE,
SPACE,
INDENT,
DASH,
STRING,
HASH,
HORIZONTAL_RULE,
EOF,
}
interface Token {
type: TokenType;
lexeme: string | null;
}
The interface `Token`, which represents our token, consists of two properties: `type` and `lexeme`. The `type` is set to one of the `TokenType`s, and the `lexeme` is the value of the token: `null` if the token has no value, and a string otherwise.
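For example, a string token carries its text as the lexeme, while a structural token such as a dash carries none (illustrative values):
const title: Token = { type: TokenType.STRING, lexeme: 'About' };
const dash: Token = { type: TokenType.DASH, lexeme: null };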
The rules that determine how a particular language groups characters into lexemes are called its Lexical Grammar. Next up, we'll build our Lexer.
3: Start Lexing
Similar to the Reader component, the Lexer will have three public methods, which the Parser will use to construct the AST:
- peek
- consume
- isEOF
A lexer is a stateful object: it contains a pointer/index to the current element being read and an array of tokens. The core of the lexer is the function responsible for actually lexing, i.e. creating tokens from a stream of characters.
The private method `scanToken` is responsible for the lexing. It steps through the input stream (the output of the Reader) and is driven by the `run` method, a while loop that keeps calling `scanToken` until we hit the `EOF` character.
There are several ways to implement `scanToken`. I chose a simple switch statement, although you could also use something like a recursive state function that calls itself until it hits `EOF`.
The switch statement looks at the current character being read from the Reader and tries to match it against a token in the lexical grammar. For simple tokens, such as `LEFT_BRACE`, it can match one-to-one and add the token directly, but for others, such as `HORIZONTAL_RULE`, it uses `reader.peek` to look ahead. For more complicated symbols, such as strings, we define a dedicated function that looks several characters ahead.
function TocLexer(reader: any) {
let i = 0;
const tokens: Token[] = [];
run();
function run() {
while (!reader.isEOF()) {
scanToken();
}
addToken(TokenType.EOF);
}
function scanToken() {
const c = reader.consume();
switch (c) {
case '\n':
break;
case ' ':
if (reader.peek() === ' ') {
addToken(TokenType.INDENT);
reader.consume();
}
break;
case '[':
addToken(TokenType.LEFT_BRACE);
break;
case ']':
addToken(TokenType.RIGHT_BRACE);
break;
case '(':
addToken(TokenType.LEFT_PAREN);
break;
case ')':
addToken(TokenType.RIGHT_PAREN);
break;
case '#':
header();
break;
case '-':
if (reader.peek() === '-' && reader.peek(1) === '-') {
reader.consume();
reader.consume();
addToken(TokenType.HORIZONTAL_RULE);
} else {
addToken(TokenType.DASH);
}
break;
default:
stringLiteral(c);
break;
}
}
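// A '#' starts a heading: read the rest of the line and store it, trimmed, as a single HASH token.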
function header() {
let text = '';
while (reader.peek() !== '\n' && !reader.isEOF()) {
text += reader.consume();
}
text = text.trim();
addToken(TokenType.HASH, text);
}
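// Read characters into a STRING token until we hit a newline, EOF, or an unmatched closing ']' or ')'.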
function stringLiteral(c: string) {
let text = c;
let unClosedParens = 0;
let unClosedBraces = 0;
while (reader.peek() !== '\n' && !reader.isEOF()) {
const nth = reader.peek();
if (nth === '(') {
unClosedParens += 1;
} else if (nth === ')' && unClosedParens === 0) {
break;
} else if (nth === ')') {
unClosedParens -= 1;
} else if (nth === '[') {
unClosedBraces += 1;
} else if (nth === ']' && unClosedBraces === 0) {
break;
} else if (nth === ']') {
unClosedBraces -= 1;
}
// All BRACE and PARENS must be closed
text += reader.consume();
}
addToken(TokenType.STRING, text);
}
function addToken(type: TokenType, lexeme: string | null = null) {
tokens.push({ type, lexeme });
}
function peek(nth: number = 0) {
return tokens[i + nth].type;
}
function consume(nth: number = 0) {
const t = tokens[i + nth];
i = i + nth + 1;
return t;
}
function isEOF() {
return tokens.length - 1 < i;
}
function printTokens() {
tokens.map(k => {
log.info(`type: ${TokenType[k.type]} lexeme: ${k.lexeme}`);
});
}
function printTokenType() {
tokens.map(k => {
log.info(`${TokenType[k.type]}`);
});
}
return Object.freeze({ peek, consume, isEOF, printTokens, printTokenType });
}
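To try it out, we can feed the Lexer a Reader over a single line (a small usage sketch of the two components defined above):
const lexer = TocLexer(TocReader('[About](index.md)'));
lexer.printTokens(); // logs each token's type and lexeme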
Finally, after we have lexed the input, we end up with an array containing the tokens:
const tokens: Token[] = [
  { type: TokenType.LEFT_BRACE, lexeme: null },
  { type: TokenType.STRING, lexeme: 'About' },
  { type: TokenType.RIGHT_BRACE, lexeme: null },
  { type: TokenType.LEFT_PAREN, lexeme: null },
  { type: TokenType.STRING, lexeme: 'index.md' },
  { type: TokenType.RIGHT_PAREN, lexeme: null },
];
4: Generate an Abstract Syntax Tree
Now that we have our tokens, we can start assembling them to create an AST. To do so, we have to formalize how the AST is built, which we will do by creating a formal(ish) grammar.
A formal grammar is a pseudo-code(ish) description of our DSL. There are academic terms for this: one is parsing expression grammar (PEG), another is top-down parsing language (TDPL), and yet another is context-free grammar (CFG). They're all very formal, and since this is a less formal project, we won't be strict with our language.
Next up, we'll create a set of production rules: rewrite rules used to substitute symbols with other values, which can be applied recursively to generate new symbol sequences.
The production rules are of the form:
A -> a
where `A` is a nonterminal symbol and `a` is a string of terminals and/or nonterminals.
expression -> hr | header | link | list | expression
hr -> ---
header -> # STRING
link -> "[" STRING "]" "(" STRING? ")" | list
list -> "INDENT"+ "-" link
Terminal and nonterminal symbols are defined as follows:
- terminal: when we encounter terminal symbols, we don't do any substitutions. For instance, `---` and `STRING` are terminal symbols.
- nonterminal: nonterminal symbols can contain other symbols, so we have to keep expanding them until we reach a terminal symbol. In our case, `expression`, `hr`, `header`, `link`, and `list` are nonterminal symbols.
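To see how the production rules expand, here's an informal derivation for an indented entry like `  - [init](usage/init.md)`, where the two `STRING` lexemes end up being `init` and `usage/init.md` (each step substitutes one nonterminal):
expression
-> list
-> "INDENT"+ "-" link
-> "INDENT"+ "-" "[" STRING "]" "(" STRING? ")"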
Now let's implement this formal grammar. We begin with the `parse` function, which runs through the tokens we got from the Lexer until it reaches the `EOF` token. For each token, it parses an expression (with a switch statement), using the `peek` and `consume` methods from the Lexer.
function TocParser(lexer: any) {
function hr() {
return { type: 'hr', indent: 0 };
}
function header(text: string) {
return { type: 'header', text, indent: 0 };
}
function link(title: string, ref: string = '') {
return { type: 'link', title, ref, indent: 0 };
}
function list(children: any[] = []) {
return { type: 'list', children };
}
function listItem(title: string, ref: string = '', indent: number): any {
return {
type: 'listItem',
title,
ref,
indent,
};
}
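// Nest the flat list of parsed statements into lists based on their indent level.
// listRefs keeps a reference to the currently open list at each indent depth.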
function createAST(statements: any): any {
const listRefs: any = { 0: list() };
statements.forEach((v: any, i: number, arr: any) => {
if (i > 0 && arr[i - 1].indent > v.indent) {
delete listRefs[arr[i - 1].indent];
}
if (listRefs.hasOwnProperty(v.indent)) {
listRefs[v.indent].children.push(v);
} else {
listRefs[v.indent] = {
type: 'list',
children: [v],
};
listRefs[v.indent - 1].children.push(listRefs[v.indent]);
}
});
return listRefs[0];
}
function parse() {
const statements = [];
while (!lexer.isEOF()) {
const expr = expression();
if (expr) {
statements.push(expr);
}
}
return createAST(statements);
}
function expression(numIndent = 0): any {
const token = lexer.consume();
switch (token.type) {
case TokenType.HORIZONTAL_RULE:
return hr();
case TokenType.HASH:
return header(token.lexeme);
case TokenType.LEFT_BRACE:
return parseLink();
case TokenType.INDENT:
return expression(numIndent + 1);
case TokenType.DASH:
return parseListItem(numIndent);
default:
}
}
function parseListItem(numIndent = 0): any {
if (
lexer.peek(0) === TokenType.LEFT_BRACE &&
lexer.peek(1) === TokenType.STRING &&
lexer.peek(2) === TokenType.RIGHT_BRACE
) {
const title = lexer.consume(1);
let ref = '';
if (lexer.peek(2) === TokenType.STRING) {
ref = lexer.consume(2);
}
lexer.consume(); // Consume right parens
return listItem(title.lexeme, ref.lexeme, numIndent);
}
}
function parseLink(): any {
if (
lexer.peek() === TokenType.STRING &&
lexer.peek(1) === TokenType.RIGHT_BRACE
) {
const title = lexer.consume();
const ref = lexer.consume(2);
lexer.consume();
return link(title.lexeme, ref.lexeme);
}
}
return parse();
}
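Note that `TocParser` calls `parse()` at the end and returns the resulting AST, so chaining the frontend components together is a one-liner (a usage sketch; `markdownSource` is assumed to hold the Markdown text):
const ast = TocParser(TocLexer(TocReader(markdownSource)));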
For instance, if we were to parse the following markdown snippet:
# markBook
---
[About](index.md)
- [Getting Start](getting-started.md)
  - [Getting Start](getting-started.md)
    - [Getting Start](getting-started.md)
We would end up with this AST:
{
"type": "list",
"children": [
{
"type": "header",
"text": "markBook",
"indent": 0
},
{
"type": "hr",
"indent": 0
},
{
"type": "link",
"title": "About",
"ref": "index.md",
"indent": 0
},
{
"type": "listItem",
"title": "Getting Start",
"ref": "getting-started.md",
"indent": 0
},
{
"type": "list",
"children": [
{
"type": "listItem",
"title": "Getting Start",
"ref": "getting-started.md",
"indent": 1
},
{
"type": "list",
"children": [
{
"type": "listItem",
"title": "Getting Start",
"ref": "getting-started.md",
"indent": 2
}
]
}
]
}
]
}
5: Generate HTML
Now that we have our AST, we can implement the Renderer, which will generate the desired HTML. The function responsible for this transformation is `TocRender`, a function that takes the AST object as input and loops through it to generate our HTML.
We start by writing our main loop, which handles all of the AST node types and wraps the result in an `ol` tag:
let order = 0;
return `<ol class="toc">${ast.children.map((e: any) => {
switch (e.type) {
case "hr":
return hr();
case "header":
return header(e);
case "link":
return link(e);
case "listItem":
order += 1;
return listItem(e, [order]);
case "list":
return list(e, [order]);
default:
}
})
.join("")
}
</ol>`;
Next, we write functions to handle the different HTML tags:
function formatUrl(currentFileName) {
let link = stripExtension(currentFileName);
link = link.replace(site.paths.content, "");
return link;
}
function stripExtension(url) {
let link = path.relative(site.paths.content, url);
link = path.join("/", path.dirname(url), path.basename(url, ".md"));
if (site.uglyURLs) {
link += ".html";
}
return link;
}
function hr() {
return '<li class="spacer"></li>';
}
function header(e: any) {
return `<li class="part-title">${e.text}</li>`;
}
function link(e: any): any {
let ref = e.ref !== "" ? stripExtension(e.ref) : "";
const linkClass = ref === activePage ? "active" : "";
// We treat index.md in root file differently
if (ref === "/index") {
ref = "/";
}
if (site.rootUrl) {
ref = `${site.rootUrl}${ref}`
}
return ref
? `<li><a class="${linkClass}" href="${ref}">${e.title}</a></li>`
: `<li class="draft">${e.title}</li>`;
}
function listItem(e: any, order: number[]) {
let ref = e.ref !== "" ? stripExtension(e.ref) : "";
const linkClass = ref === activePage ? "active" : "";
// We treat index.md in root file differently
if (ref === "/index") {
ref = "/";
}
if (site.rootUrl) {
ref = `${site.rootUrl}${ref}`
}
return ref
? `
<li class="chapter-item">
<strong>${[...order, ""].join(".")}</strong>
<a class="${linkClass}"
href="${ref}">${e.title}</a>
</li>
`
: `
<li class="chapter-item draft">
<strong>${[...order, ""].join(".")}</strong>
${e.title}
</li>
`;
}
function list(e: any, order: number[]): any {
return `
<li>
<ol class="section">
${
e.children
.map((node: any, i: number) =>
node.type === "list"
? list(node, [...order, i + 1])
: listItem(node, [...order, i + 1])
)
.join("")
}
</ol>
</li>
`;
}
There are some ad-hoc manipulations since we need to handle some specific functionality, such as adding an `active` HTML class when a page is active, or when we're dealing with an `index.html` page.
Finally, here is the `TocRender` function in its entirety:
function TocRender(site: Site, ast: any, currentFileName: string) {
const activePage = formatUrl(currentFileName);
function formatUrl(currentFileName) {
let link = stripExtension(currentFileName);
link = link.replace(site.paths.content, "");
return link;
}
function stripExtension(url) {
let link = path.relative(site.paths.content, url);
link = path.join("/", path.dirname(url), path.basename(url, ".md"));
if (site.uglyURLs) {
link += ".html";
}
return link;
}
function hr() {
return '<li class="spacer"></li>';
}
function header(e: any) {
return `<li class="part-title">${e.text}</li>`;
}
function link(e: any): any {
let ref = e.ref !== "" ? stripExtension(e.ref) : "";
const linkClass = ref === activePage ? "active" : "";
// We treat index.md in root file differently
if (ref === "/index") {
ref = "/";
}
if (site.rootUrl) {
ref = `${site.rootUrl}${ref}`
}
return ref
? `<li><a class="${linkClass}" href="${ref}">${e.title}</a></li>`
: `<li class="draft">${e.title}</li>`;
}
function listItem(e: any, order: number[]) {
let ref = e.ref !== "" ? stripExtension(e.ref) : "";
const linkClass = ref === activePage ? "active" : "";
// We treat index.md in root file differently
if (ref === "/index") {
ref = "/";
}
if (site.rootUrl) {
ref = `${site.rootUrl}${ref}`
}
return ref
? `
<li class="chapter-item">
<strong>${[...order, ""].join(".")}</strong>
<a class="${linkClass}"
href="${ref}">${e.title}</a>
</li>
`
: `
<li class="chapter-item draft">
<strong>${[...order, ""].join(".")}</strong>
${e.title}
</li>
`;
}
function list(e: any, order: number[]): any {
return `
<li>
<ol class="section">
${
e.children
.map((node: any, i: number) =>
node.type === "list"
? list(node, [...order, i + 1])
: listItem(node, [...order, i + 1])
)
.join("")
}
</ol>
</li>
`;
}
let order = 0;
return `<ol class="toc">${
ast.children
.map((e: any) => {
switch (e.type) {
case "hr":
return hr();
case "header":
return header(e);
case "link":
return link(e);
case "listItem":
order += 1;
return listItem(e, [order]);
case "list":
return list(e, [order]);
default:
}
})
.join("")
}</ol>`;
}
6: Results
Finally, we have our HTML. Here is an overview of the steps we took and what the data structure looks like for each stage.
Fig.1 - Flowchart
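Putting it all together, wiring up the whole pipeline might look roughly like this (a sketch; the `site` object, its fields, and the file paths are made-up assumptions for illustration, and the components above — plus Node's `path` module where `TocRender` is defined — are assumed to be in scope):
// Hypothetical site configuration; the real Site type lives elsewhere in the project.
const site: any = {
  paths: { content: "content" },
  uglyURLs: false,
  rootUrl: "",
};
const markdown = [
  "# Title",
  "[About](index.md)",
  "- [Getting Start]()",
].join("\n");
const ast = TocParser(TocLexer(TocReader(markdown)));
const html = TocRender(site, ast, "content/index.md");
console.log(html);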
Final Thoughts
There are a lot of important parts of writing transpilers that I've omitted:
- Proper error handling (how users writing the DSL will get information about syntax and grammar errors)
- Writing tests
- Improving performance
- Being a bit more idiomatic in how the parser generates the AST
- Switching to a recursive state function for the Lexer and Parser
But for a pet project, this will suffice. There are also a bunch of lessons I've learned along the way while writing this transpiler:
- Start with the lexical grammar and the AST definition, not with programming!
- A transpiler is an excellent candidate for unit testing
- When debugging the parser, make sure your lexer behaves! Don't just assume the lexer is producing the tokens you expect
Resources
Some useful resources if you want to learn more about compilers, transpilers, and parsers in general.