✏️Preface
I have always been interested in how programming languages work. To be specific, how what we write in a human-readable way is interpreted by the machine. I did have a bird's-eye view of what happens, but it was, and still kind of is, like black magic to me. So, now I have decided to learn these magic verses, and this is my first step towards it.
ℹ️First Steps
To not overwhelm myself, I wanted to start with something easy that is used almost every day. This way, I would be able to understand it, solve it, and actually have the motivation to move forward, rather than learning amidst the hype going on in my mind and then eventually dropping it once the hype simmers down.
So, I decided to write a JSON parser.
The goal was simple: input a string; the output should be JSON or an Error (depending on the input).
I wanted to start with Golang but eventually chose JavaScript, the reason being simply that I did not want to divide my focus across learning many things at once. Hence, JavaScript it is.
⚠️I had no idea
Well, at first I tried doing it myself without having any idea of how to do it. Pretty quickly, I was deep into for-loop and regex territory. My intention was never to solve this entirely on my own but to learn where and on what things I get stuck. This way, when I actually refer to something, I can build a bridge between my thoughts and the actual logic.
After struggling for a while I started to notice what problems needed to be solved.
I needed some way to properly differentiate between different kinds of words/symbols/values.
After differentiating them, I needed to validate each of them and put them back together.
👾Wow, this is interesting as hell!
Ok, now I had a basic understanding of what I needed to do, and my mind was warmed up enough. Alright, let's start digging through the internet. After a while, I stumbled upon an article by Oz named Write Your Own JSON Parser with Node and Typescript.
This article does a fantastic job of decoding the logic in a very step-by-step manner. I encourage you to give it a read.
So basically, we need a tokenizer and a parser. The tokenizer will identify the different types of tokens. Hey, this solves one of my problems. Ok, so rather than using different terms like words, symbols and characters, we call them tokens. Looking at an example JSON, there are many different kinds of tokens which we need to identify.
{
"id": "647ceaf3657eade56f8224eb",
"index": 0,
"something": [],
"boolean": true,
"nullValue": null
}
There is { (Open Brace), } (Close Brace), [ (Open Bracket), ] (Close Bracket), strings like "id", the colon :, commas, and much more.
Alright, now if we were to tokenize this, we should have information about basically two things:
Type of Token
Value of Token
We would store this array of tokens and then feed it to the parser, which will take it forward. So our end result should look something like this:
[
{ type: "BraceOpen", value: "{" },
{ type: "String", value: "id" },
{ type: "Colon", value: ":" },
{ type: "String", value: "647ceaf3657eade56f8224eb" },
{ type: "Comma", value: "," },
{ type: "String", value: "index" },
{ type: "Colon", value: ":" },
{ type: "Number", value: "0" },
{ type: "Comma", value: "," },
{ type: "String", value: "something" },
{ type: "Colon", value: ":" },
{ type: "BracketOpen", value: "[" },
{ type: "BracketClose", value: "]" },
{ type: "Comma", value: "," },
{ type: "String", value: "boolean" },
{ type: "Colon", value: ":" },
{ type: "True", value: "true" },
{ type: "Comma", value: "," },
{ type: "String", value: "nullValue" },
{ type: "Colon", value: ":" },
{ type: "Null", value: "null" },
{ type: "BraceClose", value: "}" },
];
🏷️Tokenizing
We will write a function named tokenizer which will take a string as input and give us an array of token objects. We will iterate through the string character by character and classify each character based on some logic.
Oz's article does a very nice job of explaining the logic for token classification. However, it still skips over some things. The main fun I had in making this was tackling the issues which the article didn't handle.
But let's start with the basics here. First, we need a variable to track our position in the string and an array to store the different token objects.
/**
 * JSON Tokenizer
 *
 * @param { String } string
 * @returns { any[] }
 */
function tokenizer(string) {
  let current = 0;
  const tokens = [];
  while (current < string.length) {
    let char = string[current];
    current++;
  }
  return tokens;
}
Now we will add some basic conditions to extract single-character tokens like {, }, etc.
/**
 * JSON Tokenizer
 *
 * @param { String } string
 * @returns { any[] }
 */
function tokenizer(string) {
  let current = 0;
  const tokens = [];
  while (current < string.length) {
    let char = string[current];
    if (char === "{") {
      tokens.push({ type: "BraceOpen", value: char });
      current++;
      continue;
    }
    if (char === "}") {
      tokens.push({ type: "BraceClose", value: char });
      current++;
      continue;
    }
    if (char === "[") {
      tokens.push({ type: "BracketOpen", value: char });
      current++;
      continue;
    }
    if (char === "]") {
      tokens.push({ type: "BracketClose", value: char });
      current++;
      continue;
    }
    if (char === ":") {
      tokens.push({ type: "Colon", value: char });
      current++;
      continue;
    }
    if (char === ",") {
      tokens.push({ type: "Comma", value: char });
      current++;
      continue;
    }
    // If it is whitespace, ignore it.
    if (/\s/.test(char)) {
      current++;
      continue;
    }
    // For all other characters, throw an error.
    throw new Error("Unexpected character: " + char);
  }
  return tokens;
}
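As a side note, those six single-character cases can also be collapsed into a lookup table. Here is a minimal, self-contained sketch of that idea (tokenizeStructural and SINGLE_CHAR_TOKENS are hypothetical names, not part of the article's code):

```javascript
// Sketch: map each structural character to its token type,
// instead of writing six separate if-blocks.
const SINGLE_CHAR_TOKENS = {
  "{": "BraceOpen",
  "}": "BraceClose",
  "[": "BracketOpen",
  "]": "BracketClose",
  ":": "Colon",
  ",": "Comma",
};

function tokenizeStructural(input) {
  const tokens = [];
  for (const char of input) {
    if (SINGLE_CHAR_TOKENS[char]) {
      tokens.push({ type: SINGLE_CHAR_TOKENS[char], value: char });
    } else if (!/\s/.test(char)) {
      throw new Error("Unexpected character: " + char);
    }
  }
  return tokens;
}

console.log(tokenizeStructural("{ } [ ] : ,").map((t) => t.type).join(" "));
// BraceOpen BraceClose BracketOpen BracketClose Colon Comma
```

Either style works; the article sticks with explicit if-blocks, which are easier to extend with the multi-character cases below.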
All right, let's now tackle the trickier things: Strings, Numbers, Booleans and Null values.
Let's start with Strings. In JSON, each string starts and ends with a double quote ("). This is the key part of our logic. When we encounter a ", we start iterating and storing all the characters after it until we encounter another ". Then we store that as the value and increment our current position in the string.
if (char === '"') {
  let value = "";
  char = string[++current];
  // Note: escape sequences like \" are not handled here.
  while (char !== '"') {
    // Guard against an unterminated string running past the end of the input.
    if (char === undefined) {
      throw new Error("Unterminated string");
    }
    value += char;
    char = string[++current];
  }
  current++;
  tokens.push({ type: "String", value: value });
  continue;
}
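To see the scan in isolation, here is a small self-contained sketch of the same idea (readString is a hypothetical helper, not part of the article's tokenizer):

```javascript
// Hypothetical helper: scan a string literal that starts at the
// opening quote, returning its contents and the index just past
// the closing quote. Escape sequences like \" are NOT handled.
function readString(input, start) {
  let current = start + 1; // skip the opening '"'
  let value = "";
  while (input[current] !== '"') {
    // Bail out instead of looping forever on an unterminated string.
    if (current >= input.length) {
      throw new Error("Unterminated string");
    }
    value += input[current];
    current++;
  }
  return { value, next: current + 1 }; // +1 skips the closing '"'
}

console.log(readString('"id": 1', 0)); // { value: 'id', next: 4 }
```

Note the bounds check: without it, a string that is missing its closing quote would make the loop spin past the end of the input.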
For Boolean and Null values the logic is almost the same; just the conditions for entering and breaking the loop are different.
// Test if it starts with a letter
if (/[a-zA-Z]/.test(char)) {
  let value = "";
  // Loop while it is a letter
  while (/[a-zA-Z]/.test(char)) {
    value += char;
    char = string[++current];
  }
  if (value === "true") {
    tokens.push({ type: "True", value });
  } else if (value === "false") {
    tokens.push({ type: "False", value });
  } else if (value === "null") {
    tokens.push({ type: "Null", value });
  } else {
    throw new Error("Unexpected value: " + value);
  }
  continue;
}
Now, for the last type: Number. This was the most fun part for me because the article I was referring to only dealt with integer numbers; if I tried a float or a negative number, it didn't work. For recognizing something as a number there are some rules in JSON.
Basically, 69, -69, 6.9, -6.9, 6e9 and -6e9 are all valid numbers. (Strictly speaking, 6.e9 is not valid JSON, because the spec requires a digit after the decimal point, even though JavaScript's Number() will happily parse it.) We will first check whether the current character is a - or a digit. If it is, we will repeat the same thing we did for strings, booleans and null: we keep capturing characters as long as they are digits or one of e, E, +, - or the dot. But I hear you say: wouldn't this be wrong? This could accept 1.1.1 as a number. You are right! So we will then add a check for whether it truly is a Number or not.
if (char === "-" || /\d/.test(char)) {
  let value = "";
  while (
    char === "+" ||
    char === "-" ||
    char === "e" ||
    char === "E" ||
    char === "." ||
    /\d/.test(char)
  ) {
    value += char;
    char = string[++current];
  }
  if (isNumber(value)) {
    tokens.push({ type: "Number", value });
  } else {
    throw new Error("Unexpected value: " + value);
  }
  continue;
}
/**
 * Checks for Number
 *
 * @param { String } value
 * @returns { Boolean }
 */
function isNumber(value) {
  return !isNaN(Number(value));
}
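A few spot checks show why this extra validation matters (this is the same isNumber helper as above, repeated so the snippet is self-contained):

```javascript
// isNumber leans on JavaScript's Number() coercion: anything that
// coerces to NaN is rejected.
function isNumber(value) {
  return !isNaN(Number(value));
}

console.log(isNumber("6.9"));   // true
console.log(isNumber("-6e9"));  // true
console.log(isNumber("1.1.1")); // false -- Number("1.1.1") is NaN
console.log(isNumber("1e"));    // false -- incomplete exponent
```

One caveat: Number() is more permissive than the JSON grammar, so strings like "+1" or ".5" pass this check even though they are not valid JSON numbers; a stricter check would use a regex based on the JSON number grammar.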
The final version of the tokenizer function looks like this.
/**
 * Checks for Number
 *
 * @param { String } value
 * @returns { Boolean }
 */
function isNumber(value) {
  return !isNaN(Number(value));
}
/**
 * JSON Tokenizer
 *
 * @param { String } string
 * @returns { any[] }
 */
function tokenizer(string) {
  let current = 0;
  const tokens = [];
  while (current < string.length) {
    let char = string[current];
    if (char === "{") {
      tokens.push({ type: "BraceOpen", value: char });
      current++;
      continue;
    }
    if (char === "}") {
      tokens.push({ type: "BraceClose", value: char });
      current++;
      continue;
    }
    if (char === "[") {
      tokens.push({ type: "BracketOpen", value: char });
      current++;
      continue;
    }
    if (char === "]") {
      tokens.push({ type: "BracketClose", value: char });
      current++;
      continue;
    }
    if (char === ":") {
      tokens.push({ type: "Colon", value: char });
      current++;
      continue;
    }
    if (char === ",") {
      tokens.push({ type: "Comma", value: char });
      current++;
      continue;
    }
    if (char === '"') {
      let value = "";
      char = string[++current];
      // Note: escape sequences like \" are not handled here.
      while (char !== '"') {
        if (char === undefined) {
          throw new Error("Unterminated string");
        }
        value += char;
        char = string[++current];
      }
      current++;
      tokens.push({ type: "String", value: value });
      continue;
    }
    if (char === "-" || /\d/.test(char)) {
      let value = "";
      while (
        char === "+" ||
        char === "-" ||
        char === "e" ||
        char === "E" ||
        char === "." ||
        /\d/.test(char)
      ) {
        value += char;
        char = string[++current];
      }
      if (isNumber(value)) {
        tokens.push({ type: "Number", value });
      } else {
        throw new Error("Unexpected value: " + value);
      }
      continue;
    }
    if (/[a-zA-Z]/.test(char)) {
      let value = "";
      while (/[a-zA-Z]/.test(char)) {
        value += char;
        char = string[++current];
      }
      if (value === "true") {
        tokens.push({ type: "True", value });
      } else if (value === "false") {
        tokens.push({ type: "False", value });
      } else if (value === "null") {
        tokens.push({ type: "Null", value });
      } else {
        throw new Error("Unexpected value: " + value);
      }
      continue;
    }
    if (/\s/.test(char)) {
      current++;
      continue;
    }
    throw new Error("Unexpected character: " + char);
  }
  return tokens;
}
🖨️Parsing
The parser is where we make sense out of our tokens. Now we have to build our Abstract Syntax Tree (AST). The AST represents the structure and meaning of the code in a hierarchical tree-like structure. It captures the relationships between different elements of the code, such as statements, expressions, and declarations.
Almost every language or format you can think of uses some form of AST, built according to the grammar rules of the programming language or data format being parsed.
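To make that concrete for our case, here is roughly what an AST node for the tiny document {"id": 1} would look like, along with a hypothetical toValue walker (illustrative only; toValue is not part of the article's parser):

```javascript
// Illustrative AST for the JSON document {"id": 1}: an Object node
// whose value maps each key to a child AST node.
const ast = {
  type: "Object",
  value: {
    id: { type: "Number", value: 1 },
  },
};

// Hypothetical walker: recursing back down the tree recovers the
// plain JavaScript value.
function toValue(node) {
  switch (node.type) {
    case "Object": {
      const out = {};
      for (const key in node.value) out[key] = toValue(node.value[key]);
      return out;
    }
    case "Array":
      return node.value.map(toValue);
    case "Null":
      return null;
    default:
      return node.value; // String, Number and Boolean nodes carry their value
  }
}

console.log(JSON.stringify(toValue(ast))); // {"id":1}
```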
Alright, so we will write code for both things: outputting the AST representation from the tokens, and also an actual JSON object.
The logic is again kind of similar: we will iterate through the tokens and map each one based on its type.
/**
 * Parser + AST for tokens
 *
 * @param { any[] } tokens
 * @returns { any }
 */
function parser(tokens) {
  if (!tokens.length) {
    throw new Error("Nothing to parse. Exiting!");
  }
  let current = 0;
  function advance() {
    return tokens[++current];
  }
  function parseValue() {
    const token = tokens[current];
    // Gives ASTNode
    switch (token.type) {
      case "String":
        return { type: "String", value: token.value };
      case "Number":
        return { type: "Number", value: Number(token.value) };
      case "True":
        return { type: "Boolean", value: true };
      case "False":
        return { type: "Boolean", value: false };
      case "Null":
        return { type: "Null" };
      case "BraceOpen":
        return parseObject(); // To be implemented
      case "BracketOpen":
        return parseArray(); // To be implemented
      default:
        throw new Error(`Unexpected token type: ${token.type}`);
    }
  }
  return parseValue();
}
Alright, but whenever we see { or [ we need to handle nested objects and arrays recursively, which means calling parseValue() again from parseObject() and parseArray().
function parseObject() {
  const node = { type: "Object", value: {} };
  let token = advance();
  // Iterate through the tokens until we reach a BraceClose '}'
  while (token.type !== "BraceClose") {
    if (token.type === "String") {
      const key = token.value;
      token = advance();
      if (token.type !== "Colon") {
        throw new Error("Expected : in key-value pair");
      }
      token = advance(); // Skip ':'
      const value = parseValue(); // Recursively parse the value
      node.value[key] = value;
    } else {
      throw new Error(`Unexpected key. Token type: ${token.type}`);
    }
    token = advance(); // Increment one step
    if (token.type === "Comma") token = advance(); // Skip ','
  }
  // Gives ASTNode
  return node;
}
function parseArray() {
  const node = { type: "Array", value: [] };
  let token = advance(); // Skip '['
  while (token.type !== "BracketClose") {
    const value = parseValue();
    node.value.push(value);
    token = advance(); // Increment one step
    if (token.type === "Comma") token = advance(); // Skip ','
  }
  // Gives ASTNode
  return node;
}
Now our parser is ready. Here is what the final implementation looks like.
/**
 * Checks for Number
 *
 * @param { String } value
 * @returns { Boolean }
 */
function isNumber(value) {
  return !isNaN(Number(value));
}
/**
 * JSON Tokenizer
 *
 * @param { String } string
 * @returns { any[] }
 */
function tokenizer(string) {
  let current = 0;
  const tokens = [];
  while (current < string.length) {
    let char = string[current];
    if (char === "{") {
      tokens.push({ type: "BraceOpen", value: char });
      current++;
      continue;
    }
    if (char === "}") {
      tokens.push({ type: "BraceClose", value: char });
      current++;
      continue;
    }
    if (char === "[") {
      tokens.push({ type: "BracketOpen", value: char });
      current++;
      continue;
    }
    if (char === "]") {
      tokens.push({ type: "BracketClose", value: char });
      current++;
      continue;
    }
    if (char === ":") {
      tokens.push({ type: "Colon", value: char });
      current++;
      continue;
    }
    if (char === ",") {
      tokens.push({ type: "Comma", value: char });
      current++;
      continue;
    }
    if (char === '"') {
      let value = "";
      char = string[++current];
      // Note: escape sequences like \" are not handled here.
      while (char !== '"') {
        if (char === undefined) {
          throw new Error("Unterminated string");
        }
        value += char;
        char = string[++current];
      }
      current++;
      tokens.push({ type: "String", value: value });
      continue;
    }
    if (char === "-" || /\d/.test(char)) {
      let value = "";
      while (
        char === "+" ||
        char === "-" ||
        char === "e" ||
        char === "E" ||
        char === "." ||
        /\d/.test(char)
      ) {
        value += char;
        char = string[++current];
      }
      if (isNumber(value)) {
        tokens.push({ type: "Number", value });
      } else {
        throw new Error("Unexpected value: " + value);
      }
      continue;
    }
    if (/[a-zA-Z]/.test(char)) {
      let value = "";
      while (/[a-zA-Z]/.test(char)) {
        value += char;
        char = string[++current];
      }
      if (value === "true") {
        tokens.push({ type: "True", value });
      } else if (value === "false") {
        tokens.push({ type: "False", value });
      } else if (value === "null") {
        tokens.push({ type: "Null", value });
      } else {
        throw new Error("Unexpected value: " + value);
      }
      continue;
    }
    if (/\s/.test(char)) {
      current++;
      continue;
    }
    throw new Error("Unexpected character: " + char);
  }
  return tokens;
}
/**
 * Parser + AST for tokens
 *
 * @param { any[] } tokens
 * @returns { any }
 */
function parser(tokens) {
  if (!tokens.length) {
    throw new Error("Nothing to parse. Exiting!");
  }
  let current = 0;
  function advance() {
    return tokens[++current];
  }
  function parseValue() {
    const token = tokens[current];
    // Gives ASTNode
    // switch (token.type) {
    //   case "String":
    //     return { type: "String", value: token.value };
    //   case "Number":
    //     return { type: "Number", value: Number(token.value) };
    //   case "True":
    //     return { type: "Boolean", value: true };
    //   case "False":
    //     return { type: "Boolean", value: false };
    //   case "Null":
    //     return { type: "Null" };
    //   case "BraceOpen":
    //     return parseObject();
    //   case "BracketOpen":
    //     return parseArray();
    //   default:
    //     throw new Error(`Unexpected token type: ${token.type}`);
    // }
    // Gives value
    switch (token.type) {
      case "String":
        return token.value;
      case "Number":
        return Number(token.value);
      case "True":
        return true;
      case "False":
        return false;
      case "Null":
        return null;
      case "BraceOpen":
        return parseObject();
      case "BracketOpen":
        return parseArray();
      default:
        throw new Error(`Unexpected token type: ${token.type}`);
    }
  }
  function parseObject() {
    const node = { type: "Object", value: {} };
    let token = advance();
    while (token.type !== "BraceClose") {
      if (token.type === "String") {
        const key = token.value;
        token = advance();
        if (token.type !== "Colon") {
          throw new Error("Expected : in key-value pair");
        }
        token = advance();
        const value = parseValue();
        node.value[key] = value;
      } else {
        throw new Error(`Expected String key in object. Token type: ${token.type}`);
      }
      token = advance();
      if (token.type === "Comma") token = advance();
    }
    // Gives ASTNode
    // return node;
    // Gives Value
    return node.value;
  }
  function parseArray() {
    const node = { type: "Array", value: [] };
    let token = advance();
    while (token.type !== "BracketClose") {
      const value = parseValue();
      node.value.push(value);
      token = advance();
      if (token.type === "Comma") token = advance();
    }
    // Gives ASTNode
    // return node;
    // Gives Value
    return node.value;
  }
  return parseValue();
}
let output = parser(
  tokenizer(
    JSON.stringify({
      products: [
        {
          id: 1,
          title: "Essence Mascara Lash Princess",
          category: "beauty",
          price: 9.99,
          discountPercentage: 7.17,
          rating: 4.94,
          stock: 5,
          tags: ["beauty", "mascara"],
          brand: "Essence",
          sku: "RCH45Q1A",
          weight: 2,
          dimensions: {
            width: 23.17,
            height: 14.43,
            depth: 28.01,
          },
          warrantyInformation: "1 month warranty",
          shippingInformation: "Ships in 1 month",
          availabilityStatus: "Low Stock",
          reviews: [
            {
              rating: 5,
              comment: "Very satisfied!",
              date: "2024-05-23T08:56:21.618Z",
              reviewerName: "Scarlett Wright",
              reviewerEmail: "scarlett.wright@x.dummyjson.com",
            },
          ],
          returnPolicy: "30 days return policy",
          minimumOrderQuantity: 24,
          meta: {
            createdAt: "2024-05-23T08:56:21.618Z",
            updatedAt: "2024-05-23T08:56:21.618Z",
            barcode: "9164035109868",
            qrCode: "https://dummyjson.com/public/qr-code.png",
          },
          images: [
            "https://cdn.dummyjson.com/products/images/beauty/Essence%20Mascara%20Lash%20Princess/1.png",
          ],
          thumbnail:
            "https://cdn.dummyjson.com/products/images/beauty/Essence%20Mascara%20Lash%20Princess/thumbnail.png",
        },
      ],
      total: 194,
      skip: 0,
      limit: 30,
    })
  )
);
console.log(JSON.stringify(output, null, 2));
🔭Conclusion
This wasn't even that time-consuming, and I learned some really useful concepts along the way. Now that I have an understanding of how this works, I can take up any new language I want to learn and implement this as a proof of concept in it.
If you’d like to know more about my journey, please feel free to reach out or follow me!
Github: Hetarth02
LinkedIn: Hetarth Shah
Website: Portfolio
💖Thank you for joining me on this journey, and look forward to more interesting articles!
Credits:
Cover Image from Image Source.
Article Images from json.org.