My name is Will, alias Hexydec, and I am a developer. I have been developing websites professionally for 23 years, and I have always been obsessed with performance: how you can optimise your web application to give users the best possible experience.
In this article I want to talk about minification, the process of removing extraneous bytes from the code you deliver to the client, to make the payload as small as possible. I want to talk about all the open source PHP minifiers there are out there, and try to show how they stack up against each other.
First, let me fess up: I have written a number of PHP minification programs in my career, and 3 years ago I started my own GitHub project to write an HTML minifier in PHP. Partly this article is to present that software to you, but it is also to show you the research I did to see how my project stacked up against the competition and to improve my software. Hopefully you will find the results as interesting as I did.
TL;DR
If you are not interested in how all this came about and just want to skip straight to the results, read part 2.
How does a minifier work?
I just went back and looked at the first commit for this project. It looks like I started with the previous minifiers I wrote for HTML and CSS, and I was planning to use a program called JSMin to minify any inline JavaScript.
The HTML minifier I committed used some loops of regular expressions to find and replace patterns within the code to remove the unwanted whitespace etc., but the CSS minifier was a much more interesting beast: it broke the CSS down into an internal array that represented the document, changed some properties in the array, and then compiled it back to CSS. In other words, it was a compiler.
So why did I write a pattern based replacement mechanism for HTML and a compiler for CSS (I wrote it years ago)? I am guessing this is because the structure of CSS is much flatter, and much more predictable. You will see later on how this plays out with the other minifiers I tested.
So how does a minifier work? Well it either:
- takes the input code, uses regular expressions to find patterns within the code, and replaces them with the same code minus all the bits you don’t need, or
- splits the input code into tokens, loops through them, consuming each one to generate an internal representation of the document, performs optimisations on that representation, and then compiles it back to code
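To make the first approach concrete, here is a minimal sketch of a pattern based HTML minifier. The function name is made up for illustration and is not taken from any of the projects discussed in this article. It also assumes that whitespace between tags is never significant, which is not true for inline elements in real pages.

```php
<?php
// Hypothetical helper for illustration only, not code from any real minifier.
function regexMinifyHtml(string $html): string
{
    // Remove HTML comments (naively: conditional comments would go too).
    $html = preg_replace('/<!--.*?-->/s', '', $html);
    // Collapse any run of whitespace into a single space.
    $html = preg_replace('/\s+/', ' ', $html);
    // Drop the whitespace left between adjacent tags (an assumption, see above).
    $html = preg_replace('/>\s+</', '><', $html);
    return trim($html);
}

$input = "<ul>\n    <li>One</li>\n    <!-- note -->\n    <li>Two</li>\n</ul>\n";
echo regexMinifyHtml($input); // <ul><li>One</li><li>Two</li></ul>
```

Three replacements get you a long way, which is exactly why this approach is so tempting.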
Regexp or Compiler, which is better?
It really depends on what you are trying to do. Regular expressions alone are faster, but they have their limits: they just look for patterns, and what is matched doesn’t have any context.
A language like CSS is quite predictably structured, so it is easier to minify with regular expressions, but even so you will need some callback code to process the values and perform individual optimisations.
Compare this to a language like JavaScript, where the code can be infinitely nested and is full of pitfalls, such as determining whether a line ending is the end of a statement, or recognising regular expression literals in the JavaScript code. Suddenly you will find a lot of edge cases which need more and more code to handle, and it ends up feeling like whack-a-mole.
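A small example of the kind of mole you end up whacking: a context-free rule for stripping line comments from JavaScript. The function below is a hypothetical sketch, not code from JSMin or any other minifier.

```php
<?php
// Naive, context-free rule: delete everything from "//" to the end of the line.
// Hypothetical function, written only to show the failure mode.
function stripLineComments(string $js): string
{
    return preg_replace('~//[^\n]*~', '', $js);
}

$js = "var url = 'https://example.com'; // homepage\n";
echo stripLineComments($js);
// The "//" inside the string literal is matched first, so the output is
// "var url = 'https:" followed by a newline: the code is now broken.
```

The pattern has no idea it is inside a string. You can patch the expression to skip quoted strings, but then template literals, regular expression literals and division operators each need their own patch.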
A compiler on the other hand has a number of stages to parse the code into a more understandable structure, therefore it is slower, but the result is that each bit of code has a relationship to the other bits of code, and once you have it in this format it is then much easier to work with.
Edge cases are now much easier to handle because your code has an understanding of what each piece of code is, so it is simpler to work out what you can do to it. You can also handle errors in the input code: the context lets you know what to expect next, and if the token that appears is not what is expected, you can discard it or throw an error, whatever is appropriate for what you are trying to achieve.
Because of the context, you can also perform more optimisations than you would be able to with regular expressions. Take HTML minification for example: an optimisation such as removing closing tags is much easier when you know which tags are next to, or ancestors of, the current tag.
So with regular expressions you get speed and ease of development, with compilers you get reliability, the ability to handle more complex input, handle errors, and better optimisation.
There is also a third option, a sort of halfway house. I am not sure what you would call it, so I am going to call it a “Linear Consumer”. It processes the input in a linear fashion using a tokeniser, optimising the code and spitting out the result as the tokens are generated.
By doing all the processing as you go, you don’t need to create an intermediate representation, thus improving performance, but the tradeoff is that you can only perform optimisations within close range of the current tokeniser position.
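Here is a rough sketch of what I mean by a linear consumer. Everything in it (the function name, the token pattern, the handling of pre tags) is an assumption made for illustration, not how any particular minifier does it.

```php
<?php
// Hypothetical "linear consumer": tokens are optimised and emitted as they are
// read, carrying only one bit of state (are we inside a <pre> element?).
function linearMinifyHtml(string $html): string
{
    // Generator, so tokens are produced lazily, one at a time.
    $tokens = (function () use ($html) {
        $offset = 0;
        // \G anchors each match at the current offset; the final "<"
        // alternative consumes a stray "<" so the loop always advances.
        while (preg_match('/\G(?:<!--.*?-->|<[^>]+>|[^<]+|<)/s', $html, $m, 0, $offset)) {
            $offset += strlen($m[0]);
            yield $m[0];
        }
    })();

    $output = '';
    $inPre = false;
    foreach ($tokens as $token) {
        if (strncmp($token, '<!--', 4) === 0) {
            continue; // comments are simply dropped
        }
        if ($token[0] === '<') {
            // Track <pre> so that its text is left untouched.
            if (stripos($token, '<pre') === 0) {
                $inPre = true;
            } elseif (stripos($token, '</pre') === 0) {
                $inPre = false;
            }
            $output .= $token;
        } else {
            // Text: collapse whitespace unless it is pre-formatted.
            $output .= $inPre ? $token : preg_replace('/\s+/', ' ', $token);
        }
    }
    return $output;
}

echo linearMinifyHtml("<p>a   b</p>\n<!-- x -->\n<pre>c\n  d</pre>");
```

The output of that call still has two spaces between the closing p tag and the pre tag, one from each side of the removed comment. By the time the second whitespace token arrives, the first has already been emitted, which is the close range limitation in action.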
How does my HTML minifier work?
My GitHub project is called HTMLdoc, and as you might have guessed it is a compiler. It does use regular expressions, but only to tokenise the input. This keeps the regular expressions fairly simple; it just splits the input into bits ready for consumption by the parser.
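To give a flavour of regular expression based tokenising, here is a minimal sketch. The token names and the pattern are my own simplifications for this article and should not be read as HTMLdoc's actual tokeniser, which handles far more cases (attributes, quoted values, script content and so on).

```php
<?php
// Hypothetical tokeniser: one regular expression with a named group per token
// type, applied repeatedly from the current offset.
function tokenise(string $input): array
{
    $pattern = '/\G(?:
        (?<comment><!--.*?-->)
      | (?<tagclose><\/[a-zA-Z][^>]*>)
      | (?<tagopen><[a-zA-Z][^>]*>)
      | (?<text>[^<]+|<)
    )/sx';

    $tokens = [];
    $offset = 0;
    while (preg_match($pattern, $input, $match, 0, $offset)) {
        // Exactly one named group is non-empty for each match.
        foreach (['comment', 'tagclose', 'tagopen', 'text'] as $type) {
            if (isset($match[$type]) && $match[$type] !== '') {
                $tokens[] = ['type' => $type, 'value' => $match[$type]];
                break;
            }
        }
        $offset += strlen($match[0]);
    }
    return $tokens;
}

foreach (tokenise('<p>Hi</p><!-- c -->') as $token) {
    echo $token['type'], ': ', $token['value'], "\n";
}
```

The expression only decides where one token ends and the next begins. Deciding what the tokens mean is left to the parser.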
Note that I have tried writing a string based tokeniser, but regular expressions are about 3 times faster. This is probably because of the amount of userland code required to chop the input up, whereas with RegExp all that work is done in optimised C code.
The output tokens are then parsed into objects, each representing a part of the code, such as a tag, textnode, or comment. A minify() method then distributes the minify command to each object in the tree, which performs any optimisation on each bit of the code.
Finally the compile() command orders each object to render itself and its children as HTML. There is a fair amount of complexity in doing all of this, hence the reason for using objects, which split the functionality into smaller, more modular bits. I have also tried to make it all as configurable as possible, so you can control things like which tags are pre-formatted, and what minification optimisations to perform.
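The shape of that design can be sketched in a few lines. The class names below are hypothetical and hugely simplified (PHP 8+ syntax, no attributes, no configuration), so treat this as an illustration of the pattern rather than HTMLdoc's real classes or method signatures.

```php
<?php
// Hypothetical node classes showing the minify()/compile() pattern.
interface DocNode
{
    public function minify(): void;
    public function compile(): string;
}

final class TextNode implements DocNode
{
    public function __construct(private string $text) {}

    public function minify(): void
    {
        // Collapse runs of whitespace to a single space.
        $this->text = preg_replace('/\s+/', ' ', $this->text);
    }

    public function compile(): string
    {
        return $this->text;
    }
}

final class CommentNode implements DocNode
{
    private bool $removed = false;

    public function __construct(private string $comment) {}

    public function minify(): void
    {
        $this->removed = true; // comments are dropped when minifying
    }

    public function compile(): string
    {
        return $this->removed ? '' : '<!--' . $this->comment . '-->';
    }
}

final class TagNode implements DocNode
{
    /** @param DocNode[] $children */
    public function __construct(private string $name, private array $children = []) {}

    public function minify(): void
    {
        // Distribute the command to every child in the tree.
        foreach ($this->children as $child) {
            $child->minify();
        }
    }

    public function compile(): string
    {
        $html = '<' . $this->name . '>';
        foreach ($this->children as $child) {
            $html .= $child->compile();
        }
        return $html . '</' . $this->name . '>';
    }
}

$doc = new TagNode('p', [new TextNode("Hello \n  world"), new CommentNode(' note ')]);
$doc->minify();
echo $doc->compile(); // <p>Hello world</p>
```

Each node type only needs to know how to optimise and render itself, which is what keeps the complexity manageable as more optimisations are added.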
You can also query the object using CSS expressions to extract text and attributes. If you have ever used a program called simple_html_dom then this functionality is akin to that, but with a built in minifier (and actually maintained by someone!).
CSSdoc: Project Number 2
So I already had an old CSS compiler, and in my pursuit of the highest compression, it was clear that I needed to rewrite it to compile any inline CSS in the HTML. Enter another GitHub project: CSSdoc.
It uses the same sort of code layout and tokeniser as HTMLdoc, but due to CSS’s predictable structure, I elected not to retain the input whitespace, which reduces memory use and improves performance. Instead it has two compiler modes, minify and beautify.
JSlite: Project Number 3
I mentioned earlier that I was planning on using JSMin to minify any inline JavaScript. If you have used any other minifier in PHP, you have probably used JSMin. It is a port of Douglas Crockford’s JavaScript minifier, and was (and kinda still is) the de facto PHP minifier for JavaScript. Most WordPress plugins use this behind the scenes.
But the port is not actively maintained (the last commit was in Dec 2018) and was created before ES6, so I started another project called JSlite. This time I didn’t want to be too heavy handed with the compression: the goal was to minify inline JavaScript in a webpage on the fly, and if I could just remove the unneeded whitespace, I figured that would be enough.
I started out with a big regular expression to do this, as I wanted the code to be as fast as possible. But it soon became clear that this was not going to cut it, so I redeveloped it as a compiler.
Tokenise: Project Number 4
All 3 projects now used similar code layouts and a similar tokeniser, so I unified the tokeniser across the projects by moving it into its own GitHub project. I also had to put all the projects on Packagist and bring them together using Composer.
Unit Tests, Bugs, and Production
I always wanted to make sure my code was reliable, so along the way I made sure to write unit tests for all these projects. It is not something I normally do on a regular basis, so it was a good learning curve, and it has proved incredibly useful as the codebases got more complex and features were added.
I had written a load of code, and now I wanted to make sure it was production ready. My plan was to download the code from a load of popular websites, minify them all, and log the processing time and compression achieved. So I wrote a little script to achieve this, which is included in the tests folder.
Running all these sites through my code threw up loads of issues, which took me a while to work through, but each issue added more unit tests and uncovered more bugs, until I could minify the sites reliably and see what the output compression and minify time were.
Torque: Project Number 5
Now that I had some good code, I wanted it to be used. My first thought was WordPress: it is the most popular CMS on the Internet, so writing a WordPress plugin seemed like a good way to help my projects gain traction.
So I wrote a WordPress plugin called Torque, which combines my minification software with a load of other security and performance options. The WordPress sites I have tested it on show a good speed improvement in Lighthouse.
There are of course already lots of plugins that minify your WordPress website, and during development I was intrigued to find out how they were minifying their code. I discovered that under the hood of most of them, the same projects kept appearing. It made me want to test my software against them to see how it stacked up, and to improve my code.
Minify-Compare: Project Number 6
So I started another project to compare minifiers. The idea was to expand on the code that downloaded all the website code to throw through my minifier, and to compress that code with each of the minifiers I found, logging the time each one took and the compression ratio it achieved.
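The core of such a comparison is simple enough to sketch. The function below is a hypothetical, stripped down illustration of the idea (it is not the Minify-Compare code): each minifier is just a callable that takes a string and returns a string, and the two stand-in minifiers are placeholders rather than real projects.

```php
<?php
// Hypothetical harness: time each minifier and record how much of the
// original size is left afterwards.
function compareMinifiers(array $minifiers, string $input): array
{
    $results = [];
    foreach ($minifiers as $name => $minify) {
        $start = hrtime(true);                    // nanoseconds (PHP 7.3+)
        $output = $minify($input);
        $elapsed = (hrtime(true) - $start) / 1e6; // milliseconds

        $results[$name] = [
            'time_ms' => $elapsed,
            'input_bytes' => strlen($input),
            'output_bytes' => strlen($output),
            // Share of the original size that remains, as a percentage.
            'ratio' => strlen($input) > 0 ? 100 * strlen($output) / strlen($input) : 100.0,
        ];
    }
    return $results;
}

// Stand-in minifiers for demonstration only.
$minifiers = [
    'none' => fn (string $html): string => $html,
    'collapse-whitespace' => fn (string $html): string => preg_replace('/\s+/', ' ', $html),
];

$html = "<p>\n    Hello     world\n</p>\n";
foreach (compareMinifiers($minifiers, $html) as $name => $row) {
    printf("%-20s %6.2f%% of original, %.3f ms\n", $name, $row['ratio'], $row['time_ms']);
}
```

A single run per minifier is noisy, so a fairer comparison would repeat each measurement several times and would probably also compare the gzipped sizes, since that is what actually goes over the wire.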
It is this software that I will be using to pit all the different minifiers against each other.
Read The State of Minification in PHP: How 1 Project Grew into 6 (Part 2) to find out how they stacked up.