Processing One Billion Rows in PHP!

Florian Engelhardt on March 08, 2024

You may have heard of "The One Billion Row Challenge" (1brc) and in case you haven't, go check out Gunnar Morling's 1brc repo. I got sucked in ...
Mykezero

I love the strategy: write the simple code first, refine performance, and then introduce multi-threading. This also emphasizes the point that it's sometimes necessary to know your compiler and the inner workings of a framework or library to figure out where a performance boost can come from.

A C# example of this would be knowing that if a request is I/O bound, and the function you called uses the CreateIoCompletionPort Win32 function, then it will use an I/O completion port: the thread is released and the completion port signals when the work is finished, as opposed to tying up the thread until the work is done.

There's a lot more going on under the hood to make that happen, but even with just a little knowledge of the inner workings, we can save resources and improve performance.

Gunabalan.S

I tried a standard PHP script on a WAMP environment with 8 GB RAM, and it safely processed approximately 10 million records.

<?php
/**
 * The test.csv file contains a limited data set to verify the accuracy of the calculations.
 * 
 * karaikal;1
 * karaikal;2
 * karaikal;2
 * karaikal;3
 * Mumbai;1
 * Mumbai;2
 * Mumbai;3
 * Mumbai;4
 * Mumbai;4
 * 
 * Then switch to the actual dataset: weather_stations.csv
 */
$seconds = time(); //time stamp

//load file
//$fileHandle = fopen('./weather_stations.csv', 'r');
$fileHandle = fopen('./test.csv', 'r');

$data = [];

$index = 0;
while (($line = fgets($fileHandle)) !== false) {
    $values = explode(";", $line);
    $key = trim($values[0]);
    $value = trim($values[1]);
    $data[$key]['m1'] = isset($data[$key]['m1']) ? ($data[$key]['m1'] . ',' . $value) : $value;
    $data[$key]['m'][$value] = ($data[$key]['m'][$value] ?? 0) + 1;
    $index++;
}

krsort($data);

foreach ($data as $key => $val) {
    $dataPerCity = explode(',', $val['m1']);
    sort($dataPerCity, SORT_NUMERIC); // the median requires the values in sorted order
    $count = count($dataPerCity);
    $sum = array_sum($dataPerCity);
    $middle = floor(($count - 1) / 2);

    // Mean (Average)
    $mean = $sum / $count;

    // Median
    $median = ($count % 2 == 0) ? ($dataPerCity[$middle] + $dataPerCity[$middle + 1]) / 2 : $dataPerCity[$middle];

    // Mode
    $mode = array_search(max($val['m']), $val['m']);

    // Output results
    echo "$key / Mean: $mean Median: $median Mode: $mode\n";
}

fclose($fileHandle);

$seconds2 = time();
$elapsedTime = $seconds2 - $seconds;
echo "Elapsed time in seconds: $elapsedTime\n";
echo "Records processed : $index\n";

Moshe Dolev

First, really beautiful how you improved the code step by step and eventually got a 60x boost - from half an hour to half a minute. WOW!!
Second, you note that your final best result is very different from what is seen on the leaderboard because of very different hardware. It would be interesting to run this test on similar hardware. That way we would get an indication of whether, and by how much, PHP is slower than Java.

codemunky

I got you. I have the exact same AX161 from Hetzner. The final version of his code executes in 15.4 seconds (with JIT & parallel).

Florian Engelhardt

May I ask you if you could re-run the script with the changes that @xjakub suggested? I created a file with those changes at github.com/realFlowControl/1brc
I am very curious about the time you get on that machine

codemunky

With pleasure.

Ran it five times as my server is a bit busier today: 13.4/12.8/12.9/12.6/12.5

And re-ran the old version five times, for completeness: 16.3/16.0/15.9/15.9/15.9

codemunky

Interestingly it doesn't seem to really scale past 24 threads. Between 16 and 24 it gets faster the more threads I chuck at it; over 24 it starts to slow down. nice/ionice make no noticeable difference.

Florian Engelhardt

Whoop whoop, that's awesome

Jakub a. Ritchie • Edited

I tried doing this in PHP too, and the end result was very similar to yours but with some changes which should amount to ~5% better performance (which just shows how hard it is to further optimise it):

  • Instead of isset($stations[$city]), you can check whether $station = &$stations[$city]; assigns null! This way you only need to access the hashmap once, and then change $station as necessary (see the sketch after this list).
  • You are checking both fgets and $chunk_start > $chunk_end to decide whether to end the loop, but it is possible to skip the first check. Although I'm not sure if this will bring a noticeable difference.
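
A minimal sketch of the by-reference lookup from the first point; $stations and $city follow the post, but the min/max/sum/count layout and the sample data here are only assumptions for illustration:

<?php
// One hash lookup instead of isset() + read + write-back:
// assigning by reference creates the entry as null if it does not exist yet.
$stations = [];
$rows = [['Hamburg', 12.0], ['Hamburg', -3.4], ['Oslo', 5.1]]; // stand-in for parsed lines

foreach ($rows as [$city, $temp]) {
    $station = &$stations[$city];
    if ($station === null) {
        $station = [$temp, $temp, $temp, 1];              // min, max, sum, count
    } else {
        if ($temp < $station[0]) { $station[0] = $temp; }
        if ($temp > $station[1]) { $station[1] = $temp; }
        $station[2] += $temp;
        $station[3]++;
    }
}
unset($station); // break the reference so later writes can't alias the last entry
print_r($stations);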

As for the ZTS+Mac issue, you can try to either run the code in a Docker container (php:zts plus pecl install parallel was able to run your solution), or use NTS+multiprocessing by spawning separate processes with either amphp/parallel or just proc_open! (I used the latter in my case)

Edit: I also see that I skip a char at strpos($line, ';', 1);. I don't think it has any performance impact though 😄
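
For anyone curious what the proc_open route mentioned above can look like, here is a bare-bones fan-out sketch; worker.php, the chunking by byte ranges and the JSON hand-off are assumptions, not code from the post:

<?php
// Spawn one PHP worker process per chunk and collect their partial results.
$workers   = 8;
$file      = 'measurements.txt';
$fileSize  = filesize($file);
$chunkSize = (int) ceil($fileSize / $workers);

$procs = [];
for ($i = 0; $i < $workers; $i++) {
    $start = $i * $chunkSize;
    $end   = min($start + $chunkSize, $fileSize);
    $pipes = [];
    $proc  = proc_open(['php', 'worker.php', $file, (string) $start, (string) $end], [1 => ['pipe', 'w']], $pipes);
    $procs[] = ['proc' => $proc, 'stdout' => $pipes[1]];
}

$partials = [];
foreach ($procs as $p) {
    // each worker is assumed to print its partial aggregates as JSON on stdout
    $partials[] = json_decode(stream_get_contents($p['stdout']), true);
    fclose($p['stdout']);
    proc_close($p['proc']);
}
// ...merge $partials into the final min/mean/max per station...

The array form of the command (PHP 7.4+) avoids shell escaping, and reading the children one after another is fine here since every worker has to finish before the merge anyway.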

Florian Engelhardt

Removing the isset() shaved off 2.5% of wall time; that is an awesome idea and was easy to check. Making $chunk_start < $chunk_end the condition of the while loop and moving the fgets() into the loop body shaved off another 2.7% of wall time on my machine.
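
Roughly, the restructured loop looks like the sketch below; $chunk_start/$chunk_end mirror the names above, but the file name and the per-line work (reduced to a counter) are placeholders:

<?php
// The byte-range check drives the loop; fgets() moves into the body.
$file        = 'measurements.txt';
$fp          = fopen($file, 'rb');
$chunk_start = 0;                   // a worker thread would get a sub-range here
$chunk_end   = filesize($file);
fseek($fp, $chunk_start);

$lines = 0;
while ($chunk_start < $chunk_end) {
    $line = fgets($fp);
    $chunk_start += strlen($line);
    $lines++; // ...the real code splits $line on ';' and updates the aggregates...
}
fclose($fp);
echo "read $lines lines\n";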

I'll update the blog post once I am back from DutchPHP. That's awesome! Thank you!

Jakub a. Ritchie

Oh, I'm glad both changes helped! Compared to the fread ones they were much simpler to implement, but I just realized we can simply use stream_get_line to fetch the station name, which is faster and simpler than the other approaches!
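
For reference, a minimal sketch of the stream_get_line() idea; the file name and the simple sum/count aggregation are just placeholders here:

<?php
// Read up to ';' for the station name and up to "\n" for the value:
// no strpos()/substr() needed, the delimiter is consumed and stripped.
$fp = fopen('measurements.txt', 'rb');
$sums = [];
$counts = [];
while (($city = stream_get_line($fp, 128, ';')) !== false) {
    $temp = (float) stream_get_line($fp, 64, "\n");
    $sums[$city]   = ($sums[$city] ?? 0) + $temp;
    $counts[$city] = ($counts[$city] ?? 0) + 1;
}
fclose($fp);
foreach ($sums as $city => $sum) {
    printf("%s: mean %.1f\n", $city, $sum / $counts[$city]);
}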

Florian Engelhardt

Replied on Twitter already, but for visibility: switching to stream_get_line() shaved off another 15%.
This is awesome!

Jakub a. Ritchie

I just tried reading the file with fread instead of fgets and I got a 15% performance improvement! The approach makes the code uglier, but it is noticeably faster at least
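
Roughly, the fread() approach looks like the sketch below: read large blocks, split them on "\n", and carry the trailing partial line over to the next block. The block size, the file name and the reduced per-line work are assumptions:

<?php
$fp    = fopen('measurements.txt', 'rb');
$carry = '';
$lines = 0;

while (($block = fread($fp, 1 << 20)) !== false && $block !== '') {
    $block = $carry . $block;
    $lastNewline = strrpos($block, "\n");
    if ($lastNewline === false) {   // no complete line in this block yet
        $carry = $block;
        continue;
    }
    $carry = substr($block, $lastNewline + 1);
    foreach (explode("\n", substr($block, 0, $lastNewline)) as $line) {
        $lines++; // ...split $line on ';' and update the aggregates here...
    }
}
// assumes the file ends with a newline, as the 1brc input does
fclose($fp);
echo "processed $lines lines\n";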

Florian Engelhardt

I updated the blog post with your suggestions, thank you!

CodeWithCaen

Incredible! So many cool insights learned from this!

Токмаков Александр

My God. Thank you for the article. Finally I can optimize my code!

Renoir • Edited

Good stuff!

Type casting made a huge difference, neat!!

Have you checked how generators and async would do?

Also, about the sorting: the end result is much faster indeed, but you haven't kept the sorting, so it isn't exactly the same output, is it?

In any case; thanks for sharing and inciting people to experiment with performance tracing!

Risal Hidayat

What a write-up, I learned a lot from your post. Huge thanks!

Valentin Iljaž

Bravo. I really appreciated the effort. PHP deserves some love and caring; it can pay back tenfold :-)

Chief Technical Officer

Wow you gave me a great idea for my own application lol. Use threads!

Vinicius Dias

Did you consider using SPL or even ext-ds instead of arrays? Maybe it would help a little bit as well.

Florian Engelhardt

That is a good idea, thanks for letting me know. I'll see if I find time. If you'd like to give it a shot, you can find a GitHub repo holding the source at github.com/realFlowControl/1brc

Although profiling shows me that most of the time (>80% of wall time) is spent reading the data from the file via fgets() and then converting it (splitting on ; and type casting). Nevertheless, shaving off some time with a better data structure would be nice, and it's also a nice opportunity to dive into ext-ds.
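
An untested sketch of what the ext-ds variant could look like (requires pecl install ds); Ds\Map stands in for the PHP array keyed by station name, everything else here is an assumption:

<?php
$stations = new \Ds\Map();

$fp = fopen('measurements.txt', 'rb');
while (($line = fgets($fp)) !== false) {
    [$city, $temp] = explode(';', trim($line));
    $temp = (float) $temp;
    if ($stations->hasKey($city)) {
        $agg = $stations->get($city);
        $agg[0] = min($agg[0], $temp);
        $agg[1] = max($agg[1], $temp);
        $agg[2] += $temp;
        $agg[3]++;
        $stations->put($city, $agg);
    } else {
        $stations->put($city, [$temp, $temp, $temp, 1]); // min, max, sum, count
    }
}
fclose($fp);

Whether the extra hasKey()/get()/put() calls can beat PHP's native hash table is exactly the open question here.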

Emmanuel Eneche

Great insights. This approach to optimization will definitely come in handy, particularly in data-driven PHP applications that implement big data processing and ETL operations.

A classic example is receiving seller data from Amazon Seller Central, where one seller/user could have millions of records that Amazon will forward to the PHP application.

It becomes a wild goose chase when the server has to process these millions of records for ~25 minutes. The result is often server timeouts or crashed executions.

With this approach, I can see how performance can be refined to serve results at improved speed.

Eckehard

Is there any test file to download? As far as I understand, the source data can be created from a script, but for a first test a simple file would be easier.

Florian Engelhardt

Hey there,
I created a repo with the source code for PHP at github.com/realFlowControl/1brc
This also has a PHP script to generate the measurements.txt:

php createMeasurements.php 10000 # to create a file with 10k entries

Hope this helps!

Eckehard

Hi Florian,

Works great! The timing is reported in ms, but it seems to be seconds.

It took about 160s to write a 132 MB file on my local drive, but the time to read the file will greatly depend on the file system, caching and the reading strategy. Depending on your system, the reading will be much faster the second time you open the file.

Notepad++ opens the file without any delay, as it only reads the part that is displayed. The standard editor takes about one minute to do the same, as it reads the whole file at once. We can do the same by reading the file in chunks; since file operations are handled in the background by the drive, they usually do not put much strain on the processor.

So if we do that, we get a nice contest of file operations, but as you do not have much control over the details in most high-level languages, what does the contest really measure?

Ashish Pawaskar

Beautiful!! Your method refined my thought-process significantly... thanks for this!!

Marllon Gomes

Great article! It's always good to learn new performance techniques.

Matteo Mazzoli

@realflowcontrol Thanks for the fantastic article! If I may, there is a typo:
CFLAGS and CLFAGS

Florian Engelhardt

Fixed it, thank you and awesome you liked it!

Michael Kimsal

Could you post the code to a GitHub repo? It would be nice to get a link to this on 1brc.dev - however, the convention seems to be to link to a GitHub repo, not a blog post.

Excellent work!

Florian Engelhardt

Hey there,
thanks for the hint, I created a GitHub repository and added the link to the issue you opened already! Thanks for that!

Michael Kimsal

@realflowcontrol The repo is now linked from 1brc.dev. I was about to submit a PR but I think you beat me to it.

Once again, great work!

Dat Vo

Great post. Thanks for your insight and approach.

sina

Thank you so much for this post, it was really great!

Albert Angmor

This was incredible. Your work has given some credibility to my favorite language. Hats off to you.

Eckehard

Interesting task. It's just that the dataset is not very realistic: usually you would have a timestamp for each reading, so you would need to process the timestamps too.

In a real-life case, you would try to avoid this kind of "brute force" reading:

  • Put your data in a time series database like InfluxDB
  • Use separate tables for separate weather stations to reduce the amount of data to be processed
  • Use some "pre-aggregation": if you know the average for each hour, you can calculate a yearly average from 8760 values instead of from every reading.
Jack • Edited

This is the right answer, and even simple business metrics get segmented like this. The actual "solutions" you end up needing in the real world come down to efficient queries and table structures working together with caching. Intelligent database design is what you need to process billions of rows... lowly SQL. So it also confuses me when people compare how fast a language is at performing tasks which, in real life, are going to be handled by the database. I appreciate the thought exercise, but we end up arguing about which orange soda tastes best at the apple pie fair.

enrique frances

Hey! It's a great attempt from you!

Md. Hazzaz Bin Faiz

Which tool are you using for profiling and visualization of profiling output?

Florian Engelhardt • Edited

In my day job I am working on the PHP Profiler at Datadog. The screenshots you can see in the blog post are from the Datadog backend.

Frantisek Heca

If, after the PHP optimizations, the speed is capped mostly by the line reading, I do wonder whether it is allowed to install any extension for this test? The idea would be to "hack" a customized version of stream_get_line just for this scenario.

Florian Engelhardt

I mean, I had to install ext-parallel in order to run threads in user land. I guess something specifically targeted at the 1brc challenge might count as cheating, but extensions that already exist, like ext-ds for data structures, why not? I am currently unaware of an extension that bypasses the PHP stream layer, but something like that could help.
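
For readers who haven't seen ext-parallel, here is a minimal sketch of the fan-out (requires a ZTS build of PHP and pecl install parallel). The chunking and the reduced per-line work are assumptions, and the byte ranges are not aligned to line boundaries here, which the real solution has to take care of:

<?php
use parallel\Runtime;

// Each thread counts the lines in its byte range of the file.
$worker = static function (string $file, int $start, int $end): int {
    $fp = fopen($file, 'rb');
    fseek($fp, $start);
    $lines = 0;
    while ($start < $end && ($line = fgets($fp)) !== false) {
        $start += strlen($line);
        $lines++; // ...the real code aggregates min/mean/max here...
    }
    fclose($fp);
    return $lines;
};

$file    = 'measurements.txt';
$size    = filesize($file);
$threads = 8;
$chunk   = (int) ceil($size / $threads);

$futures = [];
for ($i = 0; $i < $threads; $i++) {
    $runtime   = new Runtime();
    $futures[] = $runtime->run($worker, [$file, $i * $chunk, min(($i + 1) * $chunk, $size)]);
}

$total = 0;
foreach ($futures as $future) {
    $total += $future->value(); // blocks until that thread is done
}
echo "lines: $total\n";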

sina

Hi, thank you so much for this test. I wanted to try it, but creating a dataset with 1,000,000,000 rows takes so much time, maybe a week, and the final size would be about 20 GB. I have created a bash script that generates the data; is there a more optimized way?

Florian Engelhardt

You are welcome!
The resulting file should be around 13 GB and there is a faster version of the generator available in the original repo at github.com/gunnarmorling/1brc in create_measurements_fast.sh

Ivan Zanev

Great step-by-step optimization. Two things come to mind: using explode or a simple for loop to iterate through the characters might cut off some cycles. Also, I'm wondering about the ksort(): since it sorts by city, maybe there can be an array and then you just iterate through the sub-arrays?

Ravavyr

This was a pretty cool write-up, giving people an idea of how to approach optimizing their code. Really enjoyed that.
A couple of questions:

  • What is your hardware? [RAM? CPU?]
  • Have you tried outputting the processed result? [I'm pretty sure rendering a billion rows would crash just about any browser.]
Florian Engelhardt • Edited

Hey Ravavyr,
glad that you liked it!
I ran the code on an M1 MacBook Pro with 64 GB of RAM.
The code just writes to stdout, which I redirect to a file on disk when invoking it:

php run.php > out.json

Hope this helps 🖖

rujorgensen

This was interesting in spite of being on the topic of PHP 😄

R. Rsai - RD

I’ve never used a profiler in PHP before, so maybe could we get a link to the profiler?

Florian Engelhardt

Hey there,
the profiler I used is the one I am working on in my day job. You can find it at github.com/DataDog/dd-trace-php/tr...
This profiler sends the gathered samples to the Datadog backend, which is where I took the screenshots from.

Collapse
 
daredzik profile image
Przemysław Lazarek

Hi!
Maybe you could use Swoole for coroutines and async::read for asynchronous reads?

sina

Thank you, what PHP profiler did you use?

Florian Engelhardt

In my day job I am working on the PHP Profiler at Datadog. The screenshots you can see in the blog post are from the Datadog backend.

chx

I wonder, would it be possible to use fibers here (which are core since 8.1) instead of parallel?

Florian Engelhardt

Fibers are cooperatively scheduled, meaning they run in your current thread and you have to schedule them manually. We can't use fibers to do parallel processing the way it is possible with threads.
There is a nice writeup about it at php.watch/versions/8.1/fibers
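
To make the "cooperative" part concrete, here is a tiny, self-contained illustration; it is not from the post, just a sketch of the Fiber API:

<?php
// A Fiber only makes progress when the caller starts or resumes it,
// and it runs on the same thread as the caller.
$fiber = new Fiber(function (): void {
    echo "fiber: started\n";
    Fiber::suspend();             // hand control back to the caller
    echo "fiber: resumed\n";
});

$fiber->start();                  // runs until the suspend() above
echo "main: the fiber is paused while the main code keeps running on this thread\n";
$fiber->resume();                 // runs the rest of the fiber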

jaspersmitNL

Which tool did you use to create the timeline view of the performance of the PHP scripts?

Florian Engelhardt

In my day job I'm building a PHP profiler for Datadog, where this timeline visualisation is relatively new.