My name is Will, alias Hexydec, and I am a developer. I have been developing websites professionally for 23 years, and I have always been obsessed with performance: how you can optimise your web application to give users the best possible experience.
In this article I want to talk about minification, the process of removing extraneous bytes from the code you deliver to the client to make the payload as small as possible. I want to look at all the open source PHP minifiers out there and show how they stack up against each other.
First, let me fess up: I have written a number of PHP minification programs in my career, and 3 years ago I started my own GitHub project to write an HTML minifier in PHP. Partly this article is to present that software to you, but it is also to show you the research I did to see how my project stacked up against the competition and to improve my software. Hopefully you will find the results as interesting as I did.
If you are not interested in how all this came about and just want to skip straight to the results, read part 2.
The HTML minifier I initially committed used loops of regular expressions to find and replace patterns within the code, removing unwanted whitespace and other extraneous characters. The CSS minifier was a much more interesting beast – it broke the CSS down into an internal array representing the document, changed some properties in the array, and then compiled it back to CSS – it was a compiler.
So why did I write a pattern-based replacement mechanism for HTML and a compiler for CSS all those years ago? I am guessing it is because the structure of CSS is much flatter and much more predictable. You will see later on how this plays out with the other minifiers I tested.
So how does a minifier work? Well it either:
- takes the input code, uses regular expressions to find patterns within it, and replaces them with the same code minus all the bits you don’t need; or
- splits the input code into tokens, loops through them, consuming each one to generate an internal representation of the document, performs optimisations on that representation, and then compiles it back to code
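The first approach can be sketched in a few lines. This is a deliberately naive example of my own (not taken from any of the libraries discussed in this article), and it also shows the weakness: with no context, it will happily mangle the contents of a `<pre>` or `<textarea>` tag.

```php
<?php
// Naive pattern-based minifier: each regex matches a pattern and
// replaces it with a smaller equivalent. No context is tracked, so
// this cannot tell a comment in a script from a real HTML comment,
// or protect whitespace-sensitive tags such as <pre>.
function regexMinify(string $html) : string {
    $patterns = [
        '/<!--.*?-->/s' => '', // strip comments
        '/\s+/' => ' ',        // collapse runs of whitespace
        '/>\s+</' => '><'      // remove whitespace between tags
    ];
    return trim(preg_replace(array_keys($patterns), array_values($patterns), $html));
}

echo regexMinify("<p>\n    Hello <!-- note -->   world\n</p>\n<p>Again</p>");
// <p> Hello world </p><p>Again</p>
```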
It really depends on what you are trying to do. Regular expressions alone are faster, but they have their limits: they just look for patterns, so whatever is matched doesn’t have any context.
A language like CSS is quite predictably structured, so it is easier to minify with regular expressions, but even so you will need some callback code to process the values and perform individual optimisations.
A compiler, on the other hand, has a number of stages to parse the code into a more understandable structure, so it is slower; but the result is that each bit of code has a relationship to the other bits, and once you have it in this format it is much easier to work with.
Edge cases are now much easier to handle, because your code understands what each piece of input is, making it simpler to work out what you can do to it. You can also handle errors in the input code: the context tells you what to expect next, and if the token that appears is not what is expected, you can discard it or throw an error, whichever is appropriate for what you are trying to achieve.
Because of the context, you can also perform more optimisations than you would be able to with regular expressions. Take HTML minification, for example: an optimisation such as removing closing tags is much easier when you know which tags are next to, or ancestors of, the current tag.
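To make the closing-tag example concrete, here is a small sketch of my own. The decision needs the next sibling, which is exactly the tree context a regex does not have. The rules below are a simplified subset of the HTML spec's tag-omission rules, not the full list.

```php
<?php
// Deciding whether a closing tag can be removed requires knowing the
// next sibling -- tree context a bare regex match does not have.
// The rule table is a small, simplified subset of the HTML spec.
function canOmitClose(string $tag, ?string $nextSibling) : bool {
    $rules = [
        'li' => ['li'],                        // </li> before another <li>, or at the end
        'p'  => ['p', 'div', 'ul', 'ol', 'h1'] // subset of the spec's much longer list
    ];
    if (!isset($rules[$tag])) {
        return false; // most closing tags are never omittable
    }
    // omittable when followed by a listed sibling, or by nothing at all
    return $nextSibling === null || in_array($nextSibling, $rules[$tag], true);
}

var_dump(canOmitClose('li', 'li'));  // bool(true)
var_dump(canOmitClose('li', null));  // bool(true)
var_dump(canOmitClose('p', 'span')); // bool(false)
```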
So with regular expressions you get speed and ease of development; with compilers you get reliability, the ability to handle more complex input, error handling, and better optimisation.
There is also a third option, a sort of halfway house. I am not sure what you would call it, so I am going to call it a “Linear Consumer”: it tokenises the input and processes it in a linear fashion, spitting out the optimised code as the tokens are generated.
By doing all the processing as you go, you don’t need to create an intermediate representation, which improves performance; the tradeoff is that you can only perform optimisations within close range of the current tokeniser position.
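A toy sketch of what I mean, again my own illustration rather than code from any real minifier. It strips whitespace and redundant semicolons from CSS-like input, and the only "context" available is the last character already emitted (note that a real CSS minifier must keep whitespace inside selectors like `a b`, which this toy grammar ignores):

```php
<?php
// "Linear consumer": tokenise the input and emit minified output as
// each token is produced -- no intermediate document tree is built.
// Optimisations are limited to the current token and what has already
// been written out.
function linearMinifyCss(string $css) : string {
    // toy tokeniser: strings, symbols, whitespace, everything else
    preg_match_all('/"[^"]*"|[{};:,]|\s+|[^\s{};:,"]+/', $css, $matches);
    $output = '';
    foreach ($matches[0] as $token) {
        if (ctype_space($token)) {
            continue; // whitespace never needed in this toy grammar
        }
        $last = substr($output, -1); // the only context we have
        if ($token === ';' && $last === ';') {
            continue; // drop empty declarations
        }
        if ($token === '}' && $last === ';') {
            $output = substr($output, 0, -1); // last ; before } is redundant
        }
        $output .= $token;
    }
    return $output;
}

echo linearMinifyCss("a {\n  color : red ;\n}"); // a{color:red}
```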
My GitHub project is called HTMLdoc, and as you might have guessed it is a compiler. It does use regular expressions, but only to tokenise the input. This keeps the regular expressions fairly simple; they just split the input into bits ready for consumption by the parser.
Note that I tried writing a string-based tokeniser, but regular expressions are about 3 times faster. This is probably because of the amount of userland code required to chop the input up, whereas with regular expressions all that work is done in optimised C code.
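The idea of a regex-driven tokeniser can be sketched like this. This is a simplified example of my own, not HTMLdoc's actual pattern (the real thing has to handle attributes containing `>`, CDATA, script contents and so on):

```php
<?php
// Regex-driven tokeniser: one pattern with named alternatives chops the
// input into typed tokens in a single pass through PCRE's optimised C code.
// Simplified: the tag branch would break on attribute values containing '>'.
function tokenise(string $html) : array {
    $pattern = '/(?<comment><!--.*?-->)|(?<tag><[^>]+>)|(?<text>[^<]+)/s';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);
    $tokens = [];
    foreach ($matches as $match) {
        foreach (['comment', 'tag', 'text'] as $type) {
            // the first named group that captured tells us the token type
            if (isset($match[$type]) && $match[$type] !== '') {
                $tokens[] = ['type' => $type, 'value' => $match[$type]];
                break;
            }
        }
    }
    return $tokens;
}

print_r(tokenise('<p>Hi<!-- x --></p>'));
```

The parser then consumes this flat token stream to build the document tree.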
The output tokens are then parsed into objects, each representing a part of the code, such as a tag, text node, or comment. A minify() method then distributes the minify command to each object in the tree, which performs any optimisations on its bit of the code. A compile() command orders each object to render itself and its children as HTML. There is a fair amount of complexity in doing all of this, hence the use of objects, which split the functionality into smaller, more modular bits. I have also tried to make it all as configurable as possible, so you can control things like which tags are pre-formatted and which minification optimisations to perform.
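The shape of that design can be illustrated like this. To be clear, these are not HTMLdoc's real classes, just my simplified sketch of the node-object pattern where each node minifies itself and renders itself:

```php
<?php
// Simplified illustration of the node-object architecture: each node
// knows how to minify itself and how to compile itself back to HTML.
// (Not HTMLdoc's actual classes -- just the shape of the design.)
interface Node {
    public function minify() : void;
    public function compile() : string;
}

class TextNode implements Node {
    public function __construct(private string $text) {}
    public function minify() : void {
        $this->text = preg_replace('/\s+/', ' ', $this->text);
    }
    public function compile() : string {
        return $this->text;
    }
}

class Tag implements Node {
    /** @param Node[] $children */
    public function __construct(private string $name, private array $children = []) {}
    public function minify() : void {
        foreach ($this->children as $child) {
            $child->minify(); // distribute the command down the tree
        }
    }
    public function compile() : string {
        $html = '<'.$this->name.'>';
        foreach ($this->children as $child) {
            $html .= $child->compile();
        }
        return $html.'</'.$this->name.'>';
    }
}

$doc = new Tag('p', [new TextNode("Hello\n\n   world")]);
$doc->minify();
echo $doc->compile(); // <p>Hello world</p>
```

Splitting the behaviour per node type is what keeps the complexity manageable as more optimisations are added.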
You can also query the object using CSS expressions to extract text and attributes. If you have ever used a program called simple_dom_html, this functionality is akin to that, but with a built-in minifier (and actually maintained by someone!).
So I already had an old CSS compiler, and in my pursuit of the highest compression it was clear that I needed to rewrite it to compile any inline CSS in the HTML. Enter another GitHub project: CSSdoc.
It uses the same sort of code layout and tokeniser as HTMLdoc, but due to CSS’s predictable structure I elected not to retain the input whitespace, for better memory usage and performance. Instead it has two compiler modes: minify and beautify.
I started out with a big regular expression – I wanted the code to be as fast as possible – but it soon became clear that this was not going to cut it, so I redeveloped it as a compiler.
All 3 projects now used similar code layouts and similar tokenisers, so I unified them by moving the tokeniser into its own GitHub project. I also put all the projects on Packagist and brought them together using Composer.
I always wanted to make sure my code was reliable, so along the way I wrote unit tests for all these projects. It is not something I normally do on a regular basis, so it was a good learning curve, and it has proved incredibly useful as the codebases got more complex and features were added.
I had written a load of code, and now I wanted to make sure it was production ready. My plan was to download the code from a load of popular websites, minify them all, and log the processing time and compression achieved. So I wrote a little script to do this, which is included in the tests folder.
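The core of such a benchmark loop is simple. This is a sketch under stated assumptions, not the actual test script: `minify()` here is a placeholder stand-in for the real library, and the URL list (popular websites in the real script) is reduced to a single inline data: URI so the example is self-contained.

```php
<?php
// Benchmark sketch: fetch each page, minify it, and record the time
// taken and the compression achieved. minify() is a placeholder for
// the real minifier; the real script used a list of popular websites.
function minify(string $html) : string {
    return preg_replace('/>\s+</', '><', trim($html)); // stand-in only
}

function benchmark(array $urls) : array {
    $results = [];
    foreach ($urls as $url) {
        $html = file_get_contents($url);
        if ($html === false) {
            continue; // skip unreachable sites
        }
        $start = microtime(true);
        $min = minify($html);
        $results[$url] = [
            'time' => microtime(true) - $start, // seconds spent minifying
            'ratio' => round(100 * (1 - strlen($min) / strlen($html)), 2) // % saved
        ];
    }
    return $results;
}

// data: URI keeps the demo self-contained (no network needed)
print_r(benchmark(['data:text/html,<p>a</p>   <p>b</p>']));
```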
Running all these sites through my code threw up loads of issues, which took me a while to work through, but each issue added more unit tests and uncovered more bugs, until I could minify the sites reliably and see what the output compression and minification time were.
Now that I had some good code, I wanted it to be used. My first thought was WordPress: it is the most popular CMS on the Internet, and writing a WordPress plugin seemed like a good way to help my projects gain traction.
So I wrote a WordPress plugin called Torque, combining my minification software with a load of other security and performance options. The WordPress sites I have tested it on show a good speed improvement in Lighthouse.
There are of course already lots of plugins that minify your WordPress website, and during development I was intrigued to find out how they minified their code. I discovered that under the hood of most of them, the same projects kept appearing. It made me want to test my software against them, both to see how it stacked up and to improve my code.
So I started another project to compare minifiers. The idea was to expand on the code that downloaded all the website code for my minifier, compressing that code with each of the minifiers I found and logging the time each one took and the compression ratio it achieved.
It is this software that I will be using to pit all the different minifiers against each other.
Read The State of Minification in PHP: How 1 Project Grew into 6 (Part 2) to find out how they stacked up.