Processing One Billion Rows in PHP!

Florian Engelhardt on March 08, 2024

You may have heard of "The One Billion Row Challenge" (1brc) and in case you haven't, go check out Gunnar Morling's 1brc repo. I got sucked in ...
Mykezero

I love the strategy: write the simple code first, refine performance, and then introduce multi-threading. This also emphasizes the point that it's sometimes necessary to know your compiler and the inner workings of a framework or library to figure out where a performance boost can come from.

A C# example of this would be knowing that if a request is I/O bound, and the function you called uses the CreateIoCompletionPort Win32 function, then it will use an I/O completion port: the thread is released and the completion port signals when the work is finished, as opposed to tying up the thread until the work is done.

There's a lot more going on under the hood to make that happen, but even with just a little knowledge of the inner workings, we can save resources and improve performance.

Gunabalan.S

I tried a standard PHP script on a WAMP environment with 8 GB RAM, and it safely processed approximately 10 million records.

<?php
/**
 * The test.csv file contains a limited data set to verify the accuracy of the calculations.
 * 
 * karaikal;1
 * karaikal;2
 * karaikal;2
 * karaikal;3
 * Mumbai;1
 * Mumbai;2
 * Mumbai;3
 * Mumbai;4
 * Mumbai;4
 * 
 * Then switch to the actual dataset: weather_stations.csv
 */
$seconds = time(); //time stamp

//load file
//$fileHandle = fopen('./weather_stations.csv', 'r');
$fileHandle = fopen('./test.csv', 'r');

$data = [];

$index = 0;
while (($line = fgets($fileHandle)) !== false) {
    $values = explode(";", $line);
    $key = trim($values[0]);
    $value = trim($values[1]);
    $data[$key]['m1'] = isset($data[$key]['m1']) ? ($data[$key]['m1'] . ',' . $value) : $value;
    $data[$key]['m'][$value] = ($data[$key]['m'][$value] ?? 0) + 1;
    $index++;
}

krsort($data);

foreach ($data as $key => $val) {
    $dataPerCity = explode(',', $val['m1']);
    sort($dataPerCity, SORT_NUMERIC); // the median requires the values in sorted order
    $count = count($dataPerCity);
    $sum = array_sum($dataPerCity);
    $middle = floor(($count - 1) / 2);

    // Mean (Average)
    $mean = $sum / $count;

    // Median
    $median = ($count % 2 == 0) ? ($dataPerCity[$middle] + $dataPerCity[$middle + 1]) / 2 : $dataPerCity[$middle];

    // Mode
    $mode = array_search(max($val['m']), $val['m']);

    // Output results
    echo "$key / Mean: $mean Median: $median Mode: $mode\n";
}

fclose($fileHandle);

$seconds2 = time();
$elapsedTime = $seconds2 - $seconds;
echo "Elapsed time in seconds: $elapsedTime\n";
echo "Records processed : $index\n";

Moshe Dolev

First, really beautiful how you improved the code step by step and eventually got a 60x boost - from half an hour to half a minute. WOW!!
Second, you note that your final best result is very different from what is seen on the leaderboard because of very different hardware. It would be interesting to run this test on similar hardware. That way we would get an indication of whether, and by how much, PHP is slower than Java.

codemunky

I got you. I have the exact same AX161 from Hetzner. The final version of his code executes in 15.4 seconds (with JIT & parallel).

Florian Engelhardt

May I ask you if you could re-run the script with the changes that @xjakub suggested? I created a file with those changes at github.com/realFlowControl/1brc
I am very curious about the time you get on that machine

codemunky

With pleasure.

Ran it five times as my server is a bit busier today: 13.4/12.8/12.9/12.6/12.5

And re-ran the old version five times, for completeness: 16.3/16.0/15.9/15.9/15.9

codemunky

Interestingly it doesn't seem to really scale past 24 threads. Between 16 and 24 it gets faster the more threads I chuck at it; over 24 it starts to slow down. nice/ionice make no noticeable difference.

Florian Engelhardt

Whoop whoop, that's awesome

Jakub a. Ritchie • Edited

I tried doing this in PHP too, and the end result was very similar to yours but with some changes which should amount to ~5% better performance (which just shows how hard it is to further optimise it):

  • Instead of isset($stations[$city]), you can check whether $station = &$stations[$city]; assigns null! This way you only need to access the hashmap once, and then change $station as necessary (see the sketch after this list).
  • You are checking both fgets and $chunk_start > $chunk_end to decide whether to end the loop, but it is possible to skip the first check. Although I'm not sure if this will bring a noticeable difference.
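
A minimal sketch of the by-reference lookup from the first point; $stations and $city follow the post, but the min/max/sum/count layout and the sample data here are only assumptions for illustration:

<?php
// One hash lookup instead of isset() + read + write-back:
// assigning by reference creates the entry as null if it does not exist yet.
$stations = [];
$rows = [['Hamburg', 12.0], ['Hamburg', -3.4], ['Oslo', 5.1]]; // stand-in for parsed lines

foreach ($rows as [$city, $temp]) {
    $station = &$stations[$city];
    if ($station === null) {
        $station = [$temp, $temp, $temp, 1];              // min, max, sum, count
    } else {
        if ($temp < $station[0]) { $station[0] = $temp; }
        if ($temp > $station[1]) { $station[1] = $temp; }
        $station[2] += $temp;
        $station[3]++;
    }
}
unset($station); // break the reference so later writes can't alias the last entry
print_r($stations);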

As for the ZTS+Mac issue, you can try to either run the code in a Docker container (php:zts plus pecl install parallel was able to run your solution), or use NTS+multiprocessing by spawning separate processes with either amphp/parallel or just proc_open! (I used the latter in my case)

Edit: I also see that I skip a char at strpos($line, ';', 1);. I don't think it has any performance impact though 😄
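
For anyone curious what the proc_open route mentioned above can look like, here is a bare-bones fan-out sketch; worker.php, the chunking by byte ranges and the JSON hand-off are assumptions, not code from the post:

<?php
// Spawn one PHP worker process per chunk and collect their partial results.
$workers   = 8;
$file      = 'measurements.txt';
$fileSize  = filesize($file);
$chunkSize = (int) ceil($fileSize / $workers);

$procs = [];
for ($i = 0; $i < $workers; $i++) {
    $start = $i * $chunkSize;
    $end   = min($start + $chunkSize, $fileSize);
    $pipes = [];
    $proc  = proc_open(['php', 'worker.php', $file, (string) $start, (string) $end], [1 => ['pipe', 'w']], $pipes);
    $procs[] = ['proc' => $proc, 'stdout' => $pipes[1]];
}

$partials = [];
foreach ($procs as $p) {
    // each worker is assumed to print its partial aggregates as JSON on stdout
    $partials[] = json_decode(stream_get_contents($p['stdout']), true);
    fclose($p['stdout']);
    proc_close($p['proc']);
}
// ...merge $partials into the final min/mean/max per station...

The array form of the command (PHP 7.4+) avoids shell escaping, and reading the children one after another is fine here since every worker has to finish before the merge anyway.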

Florian Engelhardt

Removing the isset() shaved off 2.5% of wall time; that is an awesome idea and was easy to check. Making $chunk_start < $chunk_end the condition of the while loop and moving the fgets() into the loop body shaved off another 2.7% of wall time on my machine.
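
Roughly, the restructured loop looks like the sketch below; $chunk_start/$chunk_end mirror the names above, but the file name and the per-line work (reduced to a counter) are placeholders:

<?php
// The byte-range check drives the loop; fgets() moves into the body.
$file        = 'measurements.txt';
$fp          = fopen($file, 'rb');
$chunk_start = 0;                   // a worker thread would get a sub-range here
$chunk_end   = filesize($file);
fseek($fp, $chunk_start);

$lines = 0;
while ($chunk_start < $chunk_end) {
    $line = fgets($fp);
    $chunk_start += strlen($line);
    $lines++; // ...the real code splits $line on ';' and updates the aggregates...
}
fclose($fp);
echo "read $lines lines\n";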

I'll update the blog post once I am back from DutchPHP. That's awesome! Thank you!

Jakub a. Ritchie

Oh, I'm glad both changes helped! Compared to the fread ones they were much simpler to implement, but I just realized we can simply use stream_get_line to fetch the station name, which is faster and simpler than the other approaches!
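
For reference, a minimal sketch of the stream_get_line() idea; the file name and the simple sum/count aggregation are just placeholders here:

<?php
// Read up to ';' for the station name and up to "\n" for the value:
// no strpos()/substr() needed, the delimiter is consumed and stripped.
$fp = fopen('measurements.txt', 'rb');
$sums = [];
$counts = [];
while (($city = stream_get_line($fp, 128, ';')) !== false) {
    $temp = (float) stream_get_line($fp, 64, "\n");
    $sums[$city]   = ($sums[$city] ?? 0) + $temp;
    $counts[$city] = ($counts[$city] ?? 0) + 1;
}
fclose($fp);
foreach ($sums as $city => $sum) {
    printf("%s: mean %.1f\n", $city, $sum / $counts[$city]);
}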

Florian Engelhardt

Replied on Twitter already, but for visibility: switching to stream_get_line() shaved off another 15%.
This is awesome!

Jakub a. Ritchie

I just tried reading the file with fread instead of fgets and I got a 15% performance improvement! The approach makes the code uglier, but it is noticeably faster at least
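
Roughly, the fread() approach looks like the sketch below: read large blocks, split them on "\n", and carry the trailing partial line over to the next block. The block size, the file name and the reduced per-line work are assumptions:

<?php
$fp    = fopen('measurements.txt', 'rb');
$carry = '';
$lines = 0;

while (($block = fread($fp, 1 << 20)) !== false && $block !== '') {
    $block = $carry . $block;
    $lastNewline = strrpos($block, "\n");
    if ($lastNewline === false) {   // no complete line in this block yet
        $carry = $block;
        continue;
    }
    $carry = substr($block, $lastNewline + 1);
    foreach (explode("\n", substr($block, 0, $lastNewline)) as $line) {
        $lines++; // ...split $line on ';' and update the aggregates here...
    }
}
// assumes the file ends with a newline, as the 1brc input does
fclose($fp);
echo "processed $lines lines\n";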

Florian Engelhardt

I updated the blog post with your suggestions, thank you!

CodeWithCaen

Incredible! So many cool insights learned from this!

Токмаков Александр

My God. Thank you for the article. Finally I can optimize my code!

Renoir • Edited

Good stuff!

Type casting made a huge difference, neat!!

Have you checked how generators and async would do?

Also, about the sorting: the end result is much faster indeed, but you haven't kept the sorting, so it isn't exactly the same output, is it?

In any case; thanks for sharing and inciting people to experiment with performance tracing!

Risal Hidayat

What a write-up, I learned a lot from your post. Huge thanks!

Valentin Iljaž

Bravo. I really appreciated the effort. PHP deserves some love and caring; it can pay back tenfold :-)

Chief Technical Officer

Wow you gave me a great idea for my own application lol. Use threads!

Vinicius Dias

Did you consider using SPL or even ext-ds instead of arrays? Maybe it would help a little bit as well.

Florian Engelhardt

That is a good idea, thanks for letting me know. I'll see if I find time. If you'd like to give it a shot, you can find a GitHub repo holding the source at github.com/realFlowControl/1brc

Although profiling shows me that most of the time (>80% of wall time) is spent reading the data from the file via fgets() and then converting it (splitting on ; and type casting). Nevertheless, shaving off some time with a better data structure would be nice, and it's also a nice opportunity to dive into ext-ds.
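
An untested sketch of what the ext-ds variant could look like (requires pecl install ds); Ds\Map stands in for the PHP array keyed by station name, everything else here is an assumption:

<?php
$stations = new \Ds\Map();

$fp = fopen('measurements.txt', 'rb');
while (($line = fgets($fp)) !== false) {
    [$city, $temp] = explode(';', trim($line));
    $temp = (float) $temp;
    if ($stations->hasKey($city)) {
        $agg = $stations->get($city);
        $agg[0] = min($agg[0], $temp);
        $agg[1] = max($agg[1], $temp);
        $agg[2] += $temp;
        $agg[3]++;
        $stations->put($city, $agg);
    } else {
        $stations->put($city, [$temp, $temp, $temp, 1]); // min, max, sum, count
    }
}
fclose($fp);

Whether the extra hasKey()/get()/put() calls can beat PHP's native hash table is exactly the open question here.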

Emmanuel Eneche

Great insights. This approach to optimization will definitely come in handy, particularly in data-driven PHP applications that implement big data processing and ETL operations.

A classic example is receiving seller data from Amazon Seller Central, where one seller/user could have millions of records that Amazon will forward to the PHP application.

It becomes a wild goose chase when the server has to process these millions of records for ~25 minutes. The result is often server timeouts or crashed executions.

With this approach, I can see how performance can be refined to serve results at improved speed.

Eckehard

Is there any test file to download? As far as I understand, the source data can be created from a script, but for a first test a simple file would be easier.

Florian Engelhardt

Hey there,
I created a repo with the source code for PHP at github.com/realFlowControl/1brc
This also has a PHP script to generate the measurements.txt:

php createMeasurements.php 10000 # to create a file with 10k entries

Hope this helps!

Eckehard

Hi Florian,

Works great! The timing is reported in ms, but it seems to be seconds.

It took about 160s to write a 132 MB file on my local drive, but the time to read the file will greatly depend on the file system, caching and the reading strategy. Depending on your system, the reading will be much faster the second time you open the file.

Notepad++ opens the file without any delay, as it only reads the part that is displayed. The standard editor takes about one minute to do the same, as it reads the whole file at once. We can do the same by reading the file in chunks; since file operations are handled in the background by the drive, they usually do not put much strain on the processor.

So if we do that, we get a nice contest of file operations, but as you do not have much control over the details in most high-level languages, what does the contest really measure?

Ashish Pawaskar

Beautiful!! Your method refined my thought-process significantly... thanks for this!!

Marllon Gomes

Great article! It's always good to learn new performance techniques.

Matteo Mazzoli

@realflowcontrol Thanks for the fantastic article! If I may, there is a typo:
CFLAGS and CLFAGS

Florian Engelhardt

Fixed it, thank you and awesome you liked it!

Michael Kimsal

Could you post the code to a GitHub repo? It would be nice to get a link to this on 1brc.dev - however, the convention seems to be to link to a GitHub repo, not a blog post.

Excellent work!

Florian Engelhardt

Hey there,
thanks for the hint, I created a GitHub repository and added the link to the issue you opened already! Thanks for that!

Michael Kimsal

@realflowcontrol The repo is now linked from 1brc.dev. I was about to submit a PR but I think you beat me to it.

Once again, great work!

Dat Vo

Great post. Thanks for your insight and approach.

sina

Thank you so much for this post, it was really great!

Albert Angmor

This was incredible. Your work has given some credibility to my favorite language. Hats off to you.

Eckehard

Interesting task. It's just that the dataset is not very realistic: usually you would have a timestamp for each reading, so you would need to process the timestamps too.

In a real-life case, you would try to avoid this kind of "brute force" reading:

  • Put your data in a time series database like InfluxDB
  • Use separate tables for separate weather stations to reduce the amount of data to be processed
  • Use some "pre-aggregation": if you know the average for each hour, you can calculate a yearly average from 8760 values instead of from every reading.
Jack • Edited

This is the right answer, and even simple business metrics get segmented like this. The actual "solutions" you end up needing in the real world come down to efficient queries and table structures working together with caching. Intelligent database design is what you need to process billions of rows... lowly SQL. So it also confuses me when people compare how fast a language is at performing tasks which, in real life, are going to be handled by the database. I appreciate the thought exercise, but we end up arguing about which orange soda tastes best at the apple pie fair.

enrique frances

Hey! It's a great attempt from you!

Md. Hazzaz Bin Faiz

Which tool are you using for profiling and visualization of profiling output?

Florian Engelhardt • Edited

In my day job I am working on the PHP Profiler at Datadog. The screenshots you can see in the blog post are from the Datadog backend.

Frantisek Heca

If, after the PHP optimizations, the speed is capped mostly by the line reading, I do wonder whether it is allowed to install any extension for this test? The idea would be to "hack" a customized version of stream_get_line just for this scenario.

Florian Engelhardt

I mean, I had to install ext-parallel in order to run threads in user land. I guess something specifically targeted at the 1brc challenge might count as cheating, but extensions that already exist, like ext-ds for data structures, why not? I am currently unaware of an extension that bypasses the PHP stream layer, but something like that could help.
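
For readers who haven't seen ext-parallel, here is a minimal sketch of the fan-out (requires a ZTS build of PHP and pecl install parallel). The chunking and the reduced per-line work are assumptions, and the byte ranges are not aligned to line boundaries here, which the real solution has to take care of:

<?php
use parallel\Runtime;

// Each thread counts the lines in its byte range of the file.
$worker = static function (string $file, int $start, int $end): int {
    $fp = fopen($file, 'rb');
    fseek($fp, $start);
    $lines = 0;
    while ($start < $end && ($line = fgets($fp)) !== false) {
        $start += strlen($line);
        $lines++; // ...the real code aggregates min/mean/max here...
    }
    fclose($fp);
    return $lines;
};

$file    = 'measurements.txt';
$size    = filesize($file);
$threads = 8;
$chunk   = (int) ceil($size / $threads);

$futures = [];
for ($i = 0; $i < $threads; $i++) {
    $runtime   = new Runtime();
    $futures[] = $runtime->run($worker, [$file, $i * $chunk, min(($i + 1) * $chunk, $size)]);
}

$total = 0;
foreach ($futures as $future) {
    $total += $future->value(); // blocks until that thread is done
}
echo "lines: $total\n";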

sina

Hi, thank you so much for this test. I wanted to try it, but creating a dataset with 1,000,000,000 rows takes so much time, maybe a week, and the final size would be about 20 GB. I have created a bash script that generates the data; is there a more optimized way?

Florian Engelhardt

You are welcome!
The resulting file should be around 13 GB and there is a faster version of the generator available in the original repo at github.com/gunnarmorling/1brc in create_measurements_fast.sh

Ivan Zanev

Great step-by-step optimization. Two things come to mind: using explode or a simple for loop to iterate through the characters might cut off some cycles. Also, I'm wondering about the ksort(): since it sorts by city, maybe there can be an array and then you just iterate through the sub-arrays?

Ravavyr

This was a pretty cool write-up, giving people an idea of how to approach optimizing their code. Really enjoyed that.
A couple of questions:

  • What is your hardware? [RAM? CPU?]
  • Have you tried outputting the processed result? [I'm pretty sure rendering a billion rows would crash just about any browser.]
Florian Engelhardt • Edited

Hey Ravavyr,
glad that you liked it!
I ran the code on an M1 MacBook Pro with 64 GB of RAM.
The code just writes to stdout, which I redirect to a file on disk when invoking it:

php run.php > out.json

Hope this helps 🖖

rujorgensen

This was interesting in spite of being on the topic of PHP 😄

R. Rsai - RD

I’ve never used a profiler in PHP before, so maybe could we get a link to the profiler?

Florian Engelhardt

Hey there,
the profiler I used is the one I am working on in my day job. You can find it at github.com/DataDog/dd-trace-php/tr...
This profiler sends the gathered samples to the Datadog backend, which is where I took the screenshots from.

Collapse
 
daredzik profile image
Przemysław Lazarek

Hi!
Maybe you could use Swoole for coroutines and async::read for asynchronous reads?

sina

Thank you, what PHP profiler did you use?

Florian Engelhardt

In my day job I am working on the PHP Profiler at Datadog. The screenshots you can see in the blog post are from the Datadog backend.

chx

I wonder, would it be possible to use fibers here (which are core since 8.1) instead of parallel?

Florian Engelhardt

Fibers are cooperatively scheduled, meaning they run in your current thread and you have to schedule them manually. We can't use fibers to do parallel processing the way it is possible with threads.
There is a nice writeup about it at php.watch/versions/8.1/fibers
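
To make the "cooperative" part concrete, here is a tiny, self-contained illustration; it is not from the post, just a sketch of the Fiber API:

<?php
// A Fiber only makes progress when the caller starts or resumes it,
// and it runs on the same thread as the caller.
$fiber = new Fiber(function (): void {
    echo "fiber: started\n";
    Fiber::suspend();             // hand control back to the caller
    echo "fiber: resumed\n";
});

$fiber->start();                  // runs until the suspend() above
echo "main: the fiber is paused while the main code keeps running on this thread\n";
$fiber->resume();                 // runs the rest of the fiber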

jaspersmitNL

Which tool did you use to create the timeline view of the performance of the PHP scripts?

Florian Engelhardt

In my day job I'm building a PHP profiler for Datadog, where this timeline visualisation is relatively new.