You may have heard of "The One Billion Row Challenge" (1brc), and in case you haven't, go check out Gunnar Morling's 1brc repo.
I got sucked in because two of my colleagues entered the competition and are on the leaderboard.
PHP is not exactly known for its speed, but as I am working on the PHP profiler I thought I'd give it a shot and see how fast it can get.
A first naive approach
I cloned the repository and created the billion-row dataset in measurements.txt. After that, I started building my first naive implementation of something that could solve the challenge:
<?php

$stations = [];
$fp = fopen('measurements.txt', 'r');
while ($data = fgetcsv($fp, null, ';')) {
    if (!isset($stations[$data[0]])) {
        $stations[$data[0]] = [
            $data[1],
            $data[1],
            $data[1],
            1
        ];
    } else {
        $stations[$data[0]][3]++;
        $stations[$data[0]][2] += $data[1];
        if ($data[1] < $stations[$data[0]][0]) {
            $stations[$data[0]][0] = $data[1];
        }
        if ($data[1] > $stations[$data[0]][1]) {
            $stations[$data[0]][1] = $data[1];
        }
    }
}
ksort($stations);

echo '{';
foreach ($stations as $k => &$station) {
    $station[2] = $station[2] / $station[3];
    echo $k, '=', $station[0], '/', $station[2], '/', $station[1], ', ';
}
echo '}';
There is nothing wild in here: just open the file and use fgetcsv() to read the data. If the station is not yet known, create it; otherwise increment the counter, add to the temperature sum, check whether the current temperature is lower than the minimum or higher than the maximum, and update accordingly.
Once I have everything together, I use ksort() to bring the $stations array into order and then echo out the list, calculating the average temperature (sum / count) while doing so.
Running this simple code on my laptop takes 25 minutes 🤯
Time to optimise and have a look at the profiler:
The timeline visualisation helps me see that this is clearly CPU-bound, that file compilation at the beginning of the script is negligible, and that there are no garbage-collection events.
The flame graph view is helpful as well, showing that I am spending 46% of CPU time in fgetcsv().
fgets() instead of fgetcsv()
The first optimisation was to use fgets() to get a line and split on the ; character manually instead of relying on fgetcsv(), because fgetcsv() does a whole lot more than what I need.
// ...
while ($data = fgets($fp, 999)) {
    $pos = strpos($data, ';');
    $city = substr($data, 0, $pos);
    $temp = substr($data, $pos + 1, -1);
    // ...
Additionally, I refactored $data[0] to $city and $data[1] to $temp everywhere.
Running the script again with just this one change already brought the runtime down to 19m 49s. In absolute numbers this is still a lot, but also: it's 21% down!
The flame graph reflects the change, switching to showing the CPU time by line also reveals what is going on in the root frame:
I am spending ~38% of CPU time in lines 18 and 23, which are:
18 | $stations[$city][3] ++;
| // ...
23 | if ($temp > $stations[$city][1]) {
Line 18 is the first access to the $stations array in the loop (otherwise it is only an increment), and line 23 is a comparison; nothing that seems expensive at first glance, but let's do a few more optimisations and you'll see what is taking the time here.
Using a reference where possible
$station = &$stations[$city];
$station[3]++;
$station[2] += $temp;

// instead of

$stations[$city][3]++;
$stations[$city][2] += $temp;
This should help PHP avoid searching for the key in the $stations array on every single array access; think of it as a cache for accessing the "current" station in the array.
And it actually helps: running this takes only 17m 48s, another 10% down!
Only one comparison
While looking at the code, I stumbled over this piece:
if ($temp < $station[0]) {
    $station[0] = $temp;
}
if ($temp > $station[1]) {
    $station[1] = $temp;
}
In case the temperature is lower than the minimum, it cannot be higher than the maximum anymore, so I made this an elseif to maybe spare some CPU cycles.
BTW: I don't know anything about the order of temperatures in measurements.txt, but depending on that order it could make a difference which of the two I check first.
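As a small standalone sketch of the change (the sample values below are made up by me, not taken from the dataset), the combined check looks like this:

```php
<?php
// [min, max, sum, count] for one station, seeded with made-up values
$station = [10.0, 20.0, 45.0, 3];

foreach ([5.0, 25.0, 15.0] as $temp) {
    if ($temp < $station[0]) {
        // a new minimum can never also be a new maximum,
        // so the elseif below is skipped entirely in this branch
        $station[0] = $temp;
    } elseif ($temp > $station[1]) {
        $station[1] = $temp;
    }
}

echo $station[0], '/', $station[1], PHP_EOL; // prints 5/25
```

In the common case (a value between min and max) both branches still run, so the saving only materialises when new minima show up.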
The new version takes 17m 30s, which is another ~2%. Better than just jitter, but not by a lot.
Adding type casts
PHP is known as a dynamic language, and that is something I valued a lot when I first started writing software: one less problem to care about. On the other hand, knowing the types helps the engine make better decisions when running your code.
$temp = (float)substr($data, $pos+1, -1);
Guess what? This simple cast makes the script run in just 13m 32s, which is a whopping performance increase of 21%!
18 | $station = &$stations[$city];
| // ...
23 | } elseif ($temp > $station[1]) {
Line 18 still shows up with 11% of CPU time spent, which is the access to the array (finding the key in the hash map, the underlying data structure used for associative arrays in PHP).
Line 23's CPU time dropped from ~32% to ~15%. This is due to PHP no longer doing type juggling. Before the type cast, $temp, $station[0], and $station[1] were strings, so PHP had to convert them to float on every single comparison.
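To make that hidden cost visible in isolation, here is a hypothetical micro-benchmark (mine, not from the article): it compares a numeric string against a float a million times, then does the same with a value that was cast once up front.

```php
<?php
// Hypothetical micro-benchmark: a numeric string must be converted to
// float on every comparison (type juggling), while a value cast once
// up front compares directly.
$iterations = 1_000_000;
$asString = '12.3';
$asFloat  = (float) $asString; // cast once, like in the optimised script

$start = hrtime(true);
for ($i = 0; $i < $iterations; $i++) {
    $r = $asString > 4.5; // juggled: string converted on each iteration
}
$stringNs = hrtime(true) - $start;

$start = hrtime(true);
for ($i = 0; $i < $iterations; $i++) {
    $r = $asFloat > 4.5; // no conversion needed
}
$floatNs = hrtime(true) - $start;

// Both variants compute the same result; only the hidden conversion differs.
printf("string: %.1f ms, float: %.1f ms\n", $stringNs / 1e6, $floatNs / 1e6);
```

I would expect the string variant to come out measurably slower; the exact ratio depends on the PHP version and JIT settings.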
What about JIT?
OPcache in PHP is disabled by default on the CLI and needs the opcache.enable_cli setting set to on. JIT (as part of OPcache) is enabled by default but effectively disabled, because the buffer size is set to 0, so I set opcache.jit_buffer_size to something; I just went with 10M. After applying these changes, I re-ran the script with JIT and saw it finish in:
7m 19s 🎉
which is 45.9% less time spent!!
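For reference, these settings can also be passed as -d flags on the CLI instead of editing php.ini; the script name 1brc.php below is my placeholder, not from the article.

```shell
# Enable OPcache on the CLI and give the JIT a 10M code buffer
# (JIT stays effectively off while opcache.jit_buffer_size is 0)
php -d opcache.enable_cli=1 \
    -d opcache.jit_buffer_size=10M \
    1brc.php
```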
What more?
I already brought the runtime down from 25 minutes at the beginning to just ~7 minutes. One thing I found absolutely astonishing is that fgets() allocates ~56 GiB/m of RAM for reading a 13 GB file. Something seemed off, so I checked the implementation of fgets(), and it looks like I can spare a lot of these allocations by just omitting the len argument to fgets():
while ($data = fgets($fp)) {
// instead of
while ($data = fgets($fp, 999)) {
Comparing the profile before and after the change gives this:
You might think that this gives a big performance improvement, but it is just ~1%. This is because these are small allocations that the ZendMM (Zend Memory Manager) can handle in bins, and those are blazingly fast.
Can we make it even faster?
Yes, we can! So far my approach was single-threaded, which is the nature of most PHP software, but PHP does support userland threads through the parallel extension.
Reading the data in PHP is a bottleneck, as the profiler clearly shows. Switching from fgetcsv() to fgets() with a manual split helps, but it still takes a lot of time, so let's use threads to parallelise reading and processing the data and afterwards combine the intermediate results from the worker threads.
<?php

$file = 'measurements.txt';
$threads_cnt = 16;

/**
 * Get the chunks that each thread needs to process with start and end position.
 * These positions are aligned to \n chars because we use `fgets()` to read,
 * which itself reads till a \n character.
 *
 * @return array<int, array{0: int, 1: int}>
 */
function get_file_chunks(string $file, int $cpu_count): array {
    $size = filesize($file);
    if ($cpu_count == 1) {
        $chunk_size = $size;
    } else {
        $chunk_size = (int) ($size / $cpu_count);
    }

    $fp = fopen($file, 'rb');

    $chunks = [];
    $chunk_start = 0;
    while ($chunk_start < $size) {
        $chunk_end = min($size, $chunk_start + $chunk_size);

        if ($chunk_end < $size) {
            fseek($fp, $chunk_end);
            fgets($fp); // moves fp to next \n char
            $chunk_end = ftell($fp);
        }

        $chunks[] = [
            $chunk_start,
            $chunk_end
        ];
        $chunk_start = $chunk_end;
    }

    fclose($fp);

    return $chunks;
}

/**
 * This function will open the file passed in `$file` and read and process the
 * data from `$chunk_start` to `$chunk_end`.
 *
 * The returned array has the name of the city as the key and an array as the
 * value, containing the min temp in key 0, the max temp in key 1, the sum of
 * all temperatures in key 2 and the count of temperatures in key 3.
 *
 * @return array<string, array{0: float, 1: float, 2: float, 3: int}>
 */
$process_chunk = function (string $file, int $chunk_start, int $chunk_end): array {
    $stations = [];
    $fp = fopen($file, 'rb');
    fseek($fp, $chunk_start);
    while ($data = fgets($fp)) {
        $chunk_start += strlen($data);
        if ($chunk_start > $chunk_end) {
            break;
        }
        $pos2 = strpos($data, ';');
        $city = substr($data, 0, $pos2);
        $temp = (float)substr($data, $pos2 + 1, -1);
        if (isset($stations[$city])) {
            $station = &$stations[$city];
            $station[3]++;
            $station[2] += $temp;
            if ($temp < $station[0]) {
                $station[0] = $temp;
            } elseif ($temp > $station[1]) {
                $station[1] = $temp;
            }
        } else {
            $stations[$city] = [
                $temp,
                $temp,
                $temp,
                1
            ];
        }
    }
    return $stations;
};

$chunks = get_file_chunks($file, $threads_cnt);

$futures = [];
for ($i = 0; $i < $threads_cnt; $i++) {
    $runtime = new \parallel\Runtime();
    $futures[$i] = $runtime->run(
        $process_chunk,
        [
            $file,
            $chunks[$i][0],
            $chunks[$i][1]
        ]
    );
}

$results = [];
for ($i = 0; $i < $threads_cnt; $i++) {
    // `value()` blocks until a result is available, so the main thread waits
    // for the worker thread to finish
    $chunk_result = $futures[$i]->value();
    foreach ($chunk_result as $city => $measurement) {
        if (isset($results[$city])) {
            $result = &$results[$city];
            $result[2] += $measurement[2];
            $result[3] += $measurement[3];
            if ($measurement[0] < $result[0]) {
                $result[0] = $measurement[0];
            }
            if ($measurement[1] > $result[1]) {
                $result[1] = $measurement[1];
            }
        } else {
            $results[$city] = $measurement;
        }
    }
}

ksort($results);

echo '{', PHP_EOL;
foreach ($results as $k => &$station) {
    echo "\t", $k, '=', $station[0], '/', ($station[2] / $station[3]), '/', $station[1], ',', PHP_EOL;
}
echo '}', PHP_EOL;
This code does a few things: first I scan the file and split it into chunks that are \n-aligned (since I am using fgets() later). When the chunks are ready, I start $threads_cnt worker threads that all open the same file, seek to their assigned chunk start, and read and process the data until the chunk end, returning an intermediate result that afterwards gets combined, sorted, and printed in the main thread.
This multithreaded approach finishes in just:
1m 35s 🎉
Is this the end?
Nope, certainly not. There are at least two more things to say about this solution:
- I am running this code on macOS on Apple Silicon hardware, which crashes when using the JIT in a ZTS build of PHP, so the 1m 35s result is without JIT; it might be even faster if I could use it.
- I realised that I was running a PHP version that was compiled using CFLAGS="-g -O0 ..." due to my needs in my day-to-day work 🤦
I should have checked this in the beginning, so I re-compiled PHP 8.3 using CFLAGS="-Os ..." and my final number (with 16 threads) is:
🚀 27.7 seconds 🚀
This number is by no means comparable to what you can see in the leaderboard of the original challenge, because I ran this code on totally different hardware.
This is a timeline view with 10 threads:
The thread at the bottom is the main thread, waiting for the results from the worker threads. Once those workers have returned their intermediate results, the main thread can be seen combining and sorting everything. We can also clearly see that the main thread is by no means the bottleneck; in case you want to optimise this even further, concentrate on the worker threads.
What did I learn on the way?
Each abstraction layer trades usability/integration for CPU cycles or memory. fgetcsv() is super easy to use and hides a lot of stuff, but that comes at a cost. Even fgets() hides some stuff from us, yet makes it super convenient to read the data.
Adding types to your code helps the language either optimise execution or stop type juggling, which is something you do not see but still pay for with CPU cycles.
JIT is awesome, especially when dealing with CPU bound problems!
It is definitely not the nature of most PHP software, but thanks to parallelisation (using ext-parallel) we could push the numbers down significantly.
The END?
I hope you had as much fun reading this article as I had. In case you want to optimise the code further, feel free to grab it and leave a comment on how far you got.
[Edit to add the following on March 18, 2024]
There is MOAR!! performance to gain
After this blog post was released, some ideas on how to make it faster arose in the comments, and there were three suggestions from @xjakub.
Remove the isset()
We might not need the isset() check; we can "just" create the reference to the station, which will be NULL when the station does not exist yet. This means one array access spared in case the city already exists (which is the majority of cases).
# before
if (isset($stations[$city])) {
    $station = &$stations[$city];
    // ..

# after
$station = &$stations[$city];
if ($station !== NULL) {
    // ..
This shaves off around 2.5% of wall time!
Don't check the fgets() return value
The main loop that reads the file currently looks like this:
while ($data = fgets($fp)) {
    $chunk_start += strlen($data);
    if ($chunk_start > $chunk_end) {
        break;
    }
    // ..
The added check for $chunk_start > $chunk_end came with the move to parallelisation, as every thread only works on a part of the file. Now @xjakub mentioned that there is no need to check the fgets() return value anymore, as it will always return a string as long as we are between $chunk_start and $chunk_end, meaning we can make that range check the loop condition and just read unchecked.
while ($chunk_start < $chunk_end) {
    $data = fgets($fp);
    $chunk_start += strlen($data);
    // ..
This change removes a branch from the loop and results in another ~2.7% drop in wall time!
fgets() vs. stream_get_line()
The actual reading and storing into $city and $temp looks as follows:
$data = fgets($fp);
$chunk_start += strlen($data);
$pos2 = strpos($data, ';');
$city = substr($data, 0, $pos2);
$temp = (float)substr($data, $pos2+1, -1);
I had never heard of stream_get_line(), which behaves nearly identically to fgets() except that it allows you to specify the end-of-line delimiter!
$city = stream_get_line($fp, 99, ';');
$temp = stream_get_line($fp, 99, "\n");
$chunk_start += strlen($city) + strlen($temp) + 2;
$temp = (float)$temp;
This change shaved off another ~15% of wall time!
Why is that? The implementations of fgets() and stream_get_line() are really close; both use the PHP stream layer. The major difference is that we no longer need to split the string returned by fgets() into substrings using substr(). The additional strlen() calls are negligible, as variables in PHP are zvals under the hood, and a zval holds the length of its string.
So where are we with the wall time for this PHP script?
@codemunky showed up in the comments and ran the benchmark on the same AX161 from Hetzner that the Java folks used to run their implementations on.
The final result (so far): 12.76 seconds 🎉
The END again?
I don't know, maybe there is still something to optimise here, but after managing to spend ~83% of the wall time in the stream_get_line() function, it looks like we have reached what the PHP stream layer allows. Either we find another function that bypasses the PHP stream layer and gives more direct access to the filesystem, or we try to optimise the layer itself.
Happy hacking!
Top comments (70)
I love the strategy: write the simple code first, refine performance and then introduce multi-threading. This also emphasizes the point that it's sometimes necessary to know your compiler and the inner working underneath a framework or library, to determine how to gain that performance boost.
A C# example of this would be knowing that if a request is I/O-bound and the function you call uses the CreateIoCompletionPort Win32 function, then it will use an I/O completion port: the CPU releases the thread and the completion port signals when the work is finished, as opposed to tying up the thread until the work is done.
There's a lot more going on underneath the hood to do that, but even with just a little bit of the knowledge of the inner working, we can save resources and improve performance.
I tried a standard PHP script execution on a WAMP environment with 8 GB RAM, and it safely processed approximately 10 million records.
First, it is really beautiful how you improved the code step by step and eventually got a 60x boost: from half an hour to half a minute. WOW!!
Second, you note that your final best result is very different from what is seen in the leaderboard because of very different hardware. It would be interesting to run this test on similar hardware; that way we would get an indication of whether, and by how much, PHP is slower than Java.
I got you. I have the exact same AX161 from Hetzner. The final version of his code executes in 15.4 seconds (with JIT & parallel).
May I ask you to re-run the script with the changes that @xjakub suggested? I created a file with those changes at github.com/realFlowControl/1brc.
I am very curious about the time you get on that machine.
With pleasure.
Ran it five times as my server is a bit busier today: 13.4/12.8/12.9/12.6/12.5
And re-ran the old version five times, for completeness: 16.3/16.0/15.9/15.9/15.9
Interestingly, it doesn't seem to really scale past 24 threads. Between 16 and 24 it gets faster the more threads I chuck at it; over 24 it starts to slow. nice/ionice make no noticeable difference.
Whoop whoop, that's awesome
I tried doing this in PHP too, and the end result was very similar to yours, but with some changes which should amount to ~5% better performance (which just shows how hard it is to further optimise it):
- Instead of isset($stations[$city]), you can check whether $station = &$stations[$city]; assigns null! This way you only need to access the hashmap once, and then change $station as necessary.
- You check both fgets() and $chunk_start > $chunk_end to decide whether to end the loop, but it is possible to skip the first check, although I'm not sure if this will bring a noticeable difference.
As for the ZTS+Mac issue, you can try to either run the code in a Docker container (php:zts plus pecl install parallel was able to run your solution), or use NTS + multiprocessing by spawning separate processes with either amphp/parallel or just proc_open! (I used the latter in my case)
Edit: I also see that I skip a char at strpos($line, ';', 1);. I don't think it has any performance impact though 😄
Removing the isset() shaved off 2.5% of wall time; that is an awesome idea and was easy to check. Making $chunk_start < $chunk_end the condition of the while loop and moving the fgets() into the loop body shaved off another 2.7% of wall time on my machine.
I'll update the blog post once I am back from DutchPHP. That's awesome! Thank you!
Oh, I'm glad both changes helped! Compared to the fread ones they were much simpler to implement, but I just realized we can simply use stream_get_line to fetch the station name, which is faster and simpler than the other approaches!
Replied on Twitter already, but for visibility: switching to stream_get_line() shaved off another 15%.
This is awesome!
I just tried reading the file with fread instead of fgets and I got a 15% performance improvement! The approach makes the code uglier, but it is noticeably faster at least.
I updated the blog post with your suggestions, thank you!
Incredible! So many cool insights learned from this!
Good stuff!
Type casting made a huge change, neat!!
Have you checked how generators and async would do?
Also, about the sorting: the end result is much faster indeed, but you haven't kept the sorting. It isn't exactly the same output, is it?
In any case; thanks for sharing and inciting people to experiment with performance tracing!
My God. Thank you for the article. Finally I optimize my code!
What a writing, I learn a lot from your post. Huge thanks
Bravo. I really appreciated the effort. PHP deserves some love and caring; it can pay back 10 fold :-)
Is there a test file to download? As far as I understood, the source data can be created from a script, but for a first test a simple file would be easier.
Hey there,
I created a repo with the source code for PHP at github.com/realFlowControl/1brc. It also has a PHP script to generate the measurements.txt.
Hope this helps!
Hi Florian,
works great! The timing is reported in ms, but it seems to be seconds.
It took about 160s to write a 132 MB file on my local drive, but the time to read the file will greatly depend on the file system, caching, and the reading strategy. Depending on your system, reading will be much faster the second time you open the file.
Notepad++ opens the file without any delay, as it only reads the part that is displayed. The standard editor takes about a minute to do the same, as it reads the whole file at once. We can do the same by reading the file in chunks; since file operations are handled in the background by the hard drive, they usually do not put much strain on the processor.
So if we do that, we get a nice contest on file operations, but as you don't have much control over the details in most high-level languages, what does the contest really measure?