This week I was tasked with building a web application for work in PHP that would take in multiple data inputs, connect them all into one multi-dimensional associative array, do some mathematical calculations and then output to the user a CSV file with the results.
The data in those different data sources, as is often the case, did not come in perfect shape. Before the pieces could even be connected together, the data needed to be checked for errors and cleaned up. Some data needed to be trimmed of extraneous characters, some needed to be filtered for non-numeric values, and some other data needed to be checked for duplicates.
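As a rough sketch of that kind of clean-up pass (the field names here are hypothetical, and each data source is assumed to arrive as an array of associative rows):

$clean = [];
foreach ($rows as $row) {
    // Trim extraneous whitespace from the identifier
    $id = trim($row['id']);

    // Skip records whose amount is not numeric
    if (!is_numeric($row['amount'])) {
        continue;
    }

    // Keying the result by ID drops duplicates as a side effect
    $clean[$id] = [
        'id'     => $id,
        'amount' => (float) $row['amount'],
    ];
}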
Then, once the data was cleaned, the data sources needed to be compared to each other to find records to join together. One data source held part of the information for a record, another held more, a third held more still and, finally, the fourth data source held the last bit of information needed to make a complete record.
As you can imagine, this sort of application involves a lot of iterating: iterating over the data to clean it up, iterating over the data to compare it, and iterating within those comparisons to check for further matches and apply transformations.
Nowadays, space is cheap but time is still expensive, and these sorts of iterations can be very expensive in time. Anyone who has ever worked with PHP knows the feeling of testing your application and coming across the dreaded Fatal Error: Maximum execution time exceeded error.
Until you actually encounter a problem involving time complexity, the notion of Big O can seem abstract. However, just spending a few minutes trying to optimize your code to make it faster can make you appreciate very quickly how foundational it is.
For example, take a look at the following:
// Compare every element of $foo against every element of $foo_two
for ($i = 0; $i < count($foo); $i++) {
    for ($x = 0; $x < count($foo_two); $x++) {
        // do something...
    }
}
This is an incredibly common pattern when comparing two arrays of data: you loop over one while also looping over the other. What is the cost of this? If a single for loop is O(n), then a for loop nested inside another for loop is O(n²). The n represents the time it would take to reach the value at the very end of the data set, so nesting one loop inside another multiplies the work rather than merely adding to it: with 1,000 items in each array, the inner body runs 1,000,000 times. That can add up really quickly and come at quite a cost to your application's performance.
What do you do when looping over the data is a necessity? How can you optimize it?
Let's look at a few tips, especially suited for PHP, to increase our efficiency.
Preset your total
We all know the classic for loop expression seen in every language: for ($i = 0; $i < count($array); $i++). The second part of that expression, where we evaluate whether $i is still less than the length of the array on every pass, is costly. Each time we loop through, our code calls count() again just to re-check how big the array is. What if we defined the length in advance?
$length = count($array);
for ($i = 0; $i < $length; $i++) {
// do something
}
Now, with our $length variable holding the size of the array, we no longer make our application check the length on every iteration. According to one benchmark, using a pre-calculated variable instead of forcing the count on each loop can be 94% faster. That can quickly add up.
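If you want to verify the difference on your own data, here is a minimal microbenchmark sketch (the exact numbers will vary by PHP version and array size):

$array = range(1, 100000);

// count() evaluated on every iteration
$start = microtime(true);
for ($i = 0; $i < count($array); $i++);
$recount = microtime(true) - $start;

// count() evaluated once, before the loop
$start = microtime(true);
$length = count($array);
for ($i = 0; $i < $length; $i++);
$preset = microtime(true) - $start;

printf("recount: %.4fs, preset: %.4fs\n", $recount, $preset);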
Single Quote vs Double Quote
The time difference between using single quotes and double quotes is less noticeable than it used to be. Nonetheless, it is still good practice, in my opinion, to use single quotes for plain strings of text. Why is that? Every time you use double quotes, PHP checks to see if there is any code that needs to be evaluated inside the quotation. So, if you are referencing an array value by its key, using $array['key'] tells the code that key is just a string of text, while using $array["key"] makes it check whether key refers to anything other than just a string of text.
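A quick illustration of the difference:

$name = 'world';

echo 'Hello $name'; // single quotes: no interpolation, prints: Hello $name
echo "Hello $name"; // double quotes: PHP scans for variables, prints: Hello world

// The same idea applies to array keys: a single-quoted key is a plain literal
$array = ['key' => 'value'];
echo $array['key']; // nothing for PHP to evaluate inside the quotes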
What Are You Doing Inside That For Loop?
What are you asking your program to do every time it loops through? Traversing through your data costs time just by itself. Now, factor in the cost of whatever you want to do to that data, and you can just keep on adding to the time complexity of your application. Some operations are more expensive than others. Of course, sometimes, you have no choice but to perform a costly operation inside the for loop, but if you can avoid it, your code will be all the more efficient.
An example of a common operation that can increase the time complexity quickly is string concatenation. This is an often overlooked part of your program. After all, what you are doing is just putting some text together. But multiply that concatenation by the number of times you are looping, and it can easily worsen your performance.
What might be a more performant alternative to concatenating inside the for loop? How about adding your text to an array and, when you need to output the text, imploding the array? Adding plain text values to an array can be faster than concatenating text while traversing your data. Another possibility, if it meets your requirements, is simply to echo the text immediately without concatenation. Of course, that is not always an option, depending on what you need your application to do.
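As a minimal sketch of the two approaches, assuming $records is an array of associative rows (the field names are hypothetical):

// Concatenating inside the loop
$csv = '';
foreach ($records as $record) {
    $csv .= $record['id'] . ',' . $record['total'] . "\n";
}

// Collecting the lines in an array and imploding once at the end
$lines = [];
foreach ($records as $record) {
    $lines[] = $record['id'] . ',' . $record['total'];
}
$csv = implode("\n", $lines);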
In the end, applying these principles to the application I was tasked with creating, along with some other principles, helped achieve a significant performance boost. Giving thought to time complexity as an integral part of the development process can save a lot of pain later on when needing to refactor existing code that is just not making the time cut.
Big O is not just for whiteboard interviews. An appreciation for it out in the field makes for more efficient code, which makes for happier users and, by extension, happier developers.
Top comments (8)
I've had a performance problem that I solved by crossing the boundaries of PHP with other technologies.
I needed to loop over a CSV with thousands of rows, sanitize and validate its values, and then update a MySQL table if the "check in" column was after the one already present in the DB, or insert the row if it wasn't already present.
Checking each date in a loop, even after preloading all table records in advance and stacking the update/insert queries so that only 1 SELECT, 1 INSERT and 1 UPDATE query with multiple rows attached were issued, resulted in a script that sometimes ran for 10+ minutes, which was obviously not acceptable for an HTTP request.
What I did was update the INSERT statement with the ON DUPLICATE KEY UPDATE ... clause and create an update trigger on the MySQL table that discarded the update if the new check-in date was before the old date. This resulted in only 1 INSERT query, and my script finished in a bunch of milliseconds.
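For anyone curious, a rough sketch of what that kind of batched statement can look like from PHP via PDO (the table and column names here are made up, and $pdo is assumed to be an existing connection):

// Build one multi-row INSERT so the script issues a single query;
// the MySQL update trigger is what rejects rows with an older check-in date.
$placeholders = [];
$params = [];
foreach ($rows as $row) {
    $placeholders[] = '(?, ?)';
    $params[] = $row['guest_id'];
    $params[] = $row['checkin_date'];
}

$sql = 'INSERT INTO checkins (guest_id, checkin_date)
        VALUES ' . implode(', ', $placeholders) . '
        ON DUPLICATE KEY UPDATE checkin_date = VALUES(checkin_date)';

$pdo->prepare($sql)->execute($params);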
As @vasilvestre said, sometimes you just need to use better tools to improve execution time, and that does not mean drastically changing technology, but rather checking what's at your disposal and finding the perfect combination.
100% agree with this comment. Fully taking advantage of what you have available within your organization or business specifications can create some drastic improvements. Oftentimes it's about striking the balance of what load we put on the client side, what load we put on the server side, what we do at execution time, and what we can do post-execution (like your update trigger in your MySQL DB). A creative approach to that can do wonders as well.
I think you may not be using the right tool. To manipulate multiple sources of data and output the results, I use Talend Open Studio for Big Data.
It's free, works great and is easy to learn.
(I don't think you're really using the wrong tool, I just think there are better tools out there.)
Anyway, talking about your loop: I use foreach() whenever I can because it's kinda the same as a for() loop but easier to use.
Yes, I agree there are some great tools out there for managing multiple sources of data, but sometimes we are limited by the business specifications. I also agree that foreach() looks a lot nicer than a for() loop, but I'm not 100% sure that it is faster than a for() loop with the length calculated in advance. I've seen a lot written about that subject with various benchmarks going both ways. Of course, in most situations, the difference will be negligible either way.
As you can see here: 3v4l.org/7Kvf5, PHP 7 made foreach() 50% as fast as for(). It depends on the version, though.
To be honest, using for() instead of foreach() is premature optimisation at this point. If you really need high performance, you shouldn't be using PHP, imo.
Unfortunately, sometimes those decisions are made for you by others...
Sometimes it's worth mentioning that with low effort you can greatly improve the process and (because that's the only thing that interests people) the result.
Hi Ben, I love that performance improvement exercise: any chance you could publish the code with a sample so we can dig into it and try to help with performance improvements? :)