Pavel Murzakov

Posted on Oct 20, 2022

PHP preload VS running as a daemon (benchmarks)

#php

PHP-FPM, Apache mod_php, and similar ways of running PHP scripts and processing requests (which run the vast majority of sites and services; for the sake of simplicity, I will call them classic PHP) work on the principles of shared-nothing, as a general concept:

State is not shared between PHP workers;
State is not shared between different requests (even in case they’re handled by the same worker).

Let's take a look at an example of a simple script:

// init
$app = \App::init();
$storage = $app->getCitiesStorage();

// logic
$name = $storage->getById($_COOKIE['city_id']);

echo "Your city: {$name}";

For each request, the script is executed from the first to the last line. Initialization normally does not differ from request to request and could be performed only once to save resources, yet it has to be repeated for each request. We can't simply keep variables (like $app) between requests because of the way classic PHP works. (APCu can cover some cases, but it is still not the same as local memory.)

What might it look like if we went beyond classic PHP?

For example, our script could:

Run outside of the scope of request before any request is handled;
Perform initialization and have a request handling loop, inside which it would wait for the next request;
Process a request and repeat the loop without clearing the environment (I will call this solution PHP as a daemon).

// init
$app = \App::init();
$storage = $app->getCitiesStorage();

$cities = $storage->getAll();

// request handling loop
while ($req = getNextRequest()) {
    $name = $cities[$req->getCookie('city_id')];

    echo "Your city: {$name}";
}

Not only could we get rid of initialization repeating for each request, but we also stored the list of cities once in the $cities variable. Now we can use it for different requests accessing it directly in memory (which is the fastest way to access data from PHP), unlike when we fetch it each time from an external source.

The performance of such a solution is potentially significantly higher compared to classic PHP. But usually, you have to pay the price for a performance increase. Let's see what price we might have to pay in our case.

To do so, we’ll modify our script a bit in the following way — instead of printing the $name variable, we will fill in the array:

-    $name = $cities[$req->getCookie('city_id')];
+    $names[] = $cities[$req->getCookie('city_id')];

In the case of classic PHP, there will be no problems. At the end of the request, the $names variable will be destroyed, and each subsequent request will work as expected. In the case of running PHP as a daemon, each request will add the next city to this variable, leading to uncontrolled growth of the array until the server runs out of memory.
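To make the difference tangible, here is a minimal simulation. It does not use a real server: the $requests array and both handler functions are hypothetical stand-ins for the request loop, assumed only for illustration.

```php
<?php
// Hypothetical stand-ins for the article's $cities array and incoming requests.
$cities = [1 => 'Paris', 2 => 'Berlin', 3 => 'Madrid'];
$requests = [1, 2, 3, 1, 2];

/**
 * Daemon style: $names is created once and survives every request.
 */
function handleAsDaemon(array $cities, array $requests): int
{
    $names = [];
    foreach ($requests as $cityId) {
        $names[] = $cities[$cityId]; // never reset between requests
    }
    return count($names);
}

/**
 * Classic style: every request starts from an empty state.
 */
function handleAsClassic(array $cities, array $requests): int
{
    $largest = 0;
    foreach ($requests as $cityId) {
        $names = [];                 // fresh state, as after a worker reset
        $names[] = $cities[$cityId];
        $largest = max($largest, count($names));
    }
    return $largest;
}

$daemonSize = handleAsDaemon($cities, $requests);   // grows with the number of requests
$classicSize = handleAsClassic($cities, $requests); // never exceeds one element
echo "daemon: {$daemonSize}, classic: {$classicSize}\n";
```

With five simulated requests, the daemon-style array ends up holding five entries, while the classic-style array never holds more than one.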

In general, not only may memory run out, but some other errors may occur that will crash the process. Classic PHP handles all of these issues by default. In the case of running PHP as a daemon, we need to monitor this daemon somehow and restart it in case of a crash.

These types of errors are unpleasant, but there are effective solutions for them. It is much worse when the script does not crash due to an error but unpredictably changes values of some variables instead (for example, it rewrites $cities array with something unexpected e.g., with some resource). In this case, all subsequent requests will see invalid data, while in classic PHP, this error would be isolated within one request.

To summarize, it is easier to write code for classic PHP (PHP-FPM, Apache mod_php, and others alike) as it protects us from a number of problems and errors. However, we have to pay a price for this and see a decline in performance.

As we can see from the examples above, classic PHP has a performance overhead: it repeatedly runs parts of the code for each request, although some of them could be run only once for all requests. We can break them down roughly into the following categories:

Import of PHP-files (include, require, etc.);
Initialization (framework, libraries, DI container, etc.);
Requesting data from external storage (DB, Memcache, Redis, etc).

PHP has been around for many years and has become popular in large part due to this model. At the same time, there have been a great number of new methods to tackle the downsides of this model. In this article, I focus on two of them: preload (introduced as a part of PHP 7.4) and running PHP as a daemon (I’ll be using RoadRunner as an example, but everything below applies to all other similar frameworks, such as AMPHP, Swoole, and so on, regardless of whether asynchronous architecture is being used or not).

Preload

Preload was designed to address the first problem from that list — to eliminate overhead on importing PHP files. At first glance, this may seem strange and pointless since PHP already has OPcache, which was created exactly for this purpose. To figure everything out, let's use perf to profile the real code with OPcache enabled and a 100% hit rate.

Despite OPcache covering imports of all PHP files, we see persistent_compile_file taking up 5.84% of request execution time.

In order to understand why this is happening, we can look at the source code of zend_accel_load_script. Despite OPcache, each include / require call requires signatures of classes and functions to be copied from the shared memory of OPcache to the memory of the worker process. Some other supporting work has to be done as well. All of this must be carried out for each request, as the memory of the worker process is cleared whenever request handling is finished. On the next request, all the work has to be repeated.

This problem is compounded by a large number of include/require calls typically made while handling a single request. For example, Symfony includes about 310 files before executing the first useful line of code. Sometimes this happens implicitly — to create an instance of class A below, PHP will autoload all other classes (B, C, D, E, F, G).

A bad situation turns worse whenever we deal with Composer's dependencies that declare standalone functions. To ensure that these functions are available during user code execution, Composer is constantly forced to include them regardless of whether they are being called or not. This happens because PHP autoload doesn’t support autoloading of functions, so they cannot be loaded on the fly upon call.

class A extends \B implements \C {
    use \D;

    const SOME_CONST = \E::E1;
    private static $someVar = \F::F1;

    private $anotherVar = \G::G1;
}

How preload works

Preload has a single main setting, opcache.preload, that accepts the path to a PHP script. This script is executed once when PHP-FPM/Apache/etc. is started. All signatures of classes, methods, and functions declared in this file will become available to all scripts processing requests from the first line of their execution. Please note that, in contrast to this, values of variables and global constants (which are not declared as members of any class) will be reset after the end of the preload phase. There is no more need to make include/require calls and copy function/class signatures from shared memory to process memory: they are declared immutable, and due to this, all processes can refer to the same chunks of memory containing them.

Usually, the classes and functions are declared in different files upon development. It would have been inconvenient to combine them into one preload script. Fortunately, this is not necessary. Since the preload script is a normal PHP script, we can simply use include/require or opcache_compile_file() from the preload script to preload all the other files we need. In addition, since all these files are preloaded only once and become immutable, it allows PHP to perform additional optimizations which were impossible to make when these files were included individually at the time of request handling. Usually, PHP performs optimizations only within each individual file, but in the case of preload it’s possible to do it for all code loaded in the preload phase like it was compiled into one file.
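As a rough sketch, a preload script that compiles a whole directory could look like this. The src directory name and the collectPhpFiles() helper are my assumptions for illustration; only opcache_compile_file() and the opcache.preload setting come from PHP itself.

```php
<?php
// preload.php: a sketch; the "src" directory name is an assumption.

/**
 * Recursively collects all *.php files under $dir, sorted for a stable order.
 */
function collectPhpFiles(string $dir): array
{
    $files = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        if ($file->isFile() && $file->getExtension() === 'php') {
            $files[] = $file->getPathname();
        }
    }
    sort($files);
    return $files;
}

$sourceDir = __DIR__ . '/src';

// Only compile when OPcache is loaded and the directory actually exists.
if (function_exists('opcache_compile_file') && is_dir($sourceDir)) {
    foreach (collectPhpFiles($sourceDir) as $path) {
        opcache_compile_file($path); // compiles the file without executing it
    }
}
```

The script would then be referenced from php.ini through opcache.preload. Keeping the preload script itself outside the scanned directory avoids compiling it a second time.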

Preload benchmarks

To demonstrate the benefits of preload in practice, I took one CPU-bound endpoint on the production server that handles user requests and experimented with it. The load on this backend is generally CPU-bound. The same is applicable for the vast majority of other setups and projects that implement at least some minimal logic apart from making requests to services, querying databases, and so on. That's why I mentioned that in this case, we could compare the performance of preload with any "daemon-style" framework, whether it's asynchronous or not, as the asynchronous one doesn't provide any advantages when it comes to CPU-bound load. Apart from this, the use of asynchronous frameworks should be carefully considered as they complicate the code even further (as the code needs to be written in different ways in this case). Also, they require special asynchronous libraries to work with a network, I/O, or other.

To get the most out of preload, I preloaded all files that the experimental endpoint needs to include. As a benchmarking tool, I used wrk2, a more advanced analog of Apache Benchmark, to keep things simple while having more flexibility to generate loads similar to real-life ones.

Additionally, I compared it to an older version of PHP (to be more specific, PHP 7.2).

Here are the results:

As you can see, the transition from PHP 7.2 to PHP 7.4 gives a +10% performance boost on the experimental endpoint, and preload gives an additional +10%.

The results achieved with preload depend significantly on two factors. First, the simpler the logic of the included files, the bigger the performance gain will be. Second, the more files are included, the bigger the gain.

Preload pitfalls

As it often happens, performance gains come with downsides. You have to keep in mind a lot of nuances when it comes to preload. Let’s try to examine some of them. In my opinion, only the first one might be an actual dealbreaker, while the rest should be taken into account.

Restart on any change of the code

Since all preloaded files are compiled during the startup, marked as immutable and will never be recompiled again, the only way to apply changes to these files is to reload or restart PHP-FPM/Apache/other.

If reload is used, PHP will try to restart as carefully as possible: user requests will not be interrupted. However, while the preload phase is in progress, all new requests will wait for it to be completed. This is not a big deal if the preload script is small, but if you try to preload the entire application, requests might have to wait a significant amount of time while the application reloads.

Also, it is worth noting that reload and restart clear OPcache memory so all subsequent requests will work with a cold opcode cache. That can increase the response time even further.

Unresolved symbols

If we take the class below as an example, all other classes (B, C, D, E, F, G), the $someGlobalVar variable, and the SOME_CONST constant must be declared before this class is compiled. Since the preload script is normal PHP code, we can declare an autoloader to simplify the task. It will allow us not to worry about loading everything related to other classes beforehand, as the autoloader will take care of that. Unfortunately, this trick does not work with variables and constants not belonging to classes: we must ensure they are defined before the first use.

class A extends \B implements \C {
    use \D;

    const SOME_CONST = \E::E1;
    private static $someVar = \F::F1;

    private $anotherVar = \G::G1;
    private $varLink = $someGlobalVar;
    private $constLink = SOME_CONST;
}

Fortunately, preload comes with debug tools that help to check if a symbol has been resolved during the preload phase. It will display warning messages with information about which ones were not preloaded and why:

PHP Warning: Can't preload class MyTestClass with unresolved initializer for constant RAND in /local/preload-internal.php on line 6
PHP Warning: Can't preload unlinked class MyTestClass: Unknown parent AnotherClass in /local/preload-internal.php on line 5

Also, preload provides additional debug info as a separate section of opcache_get_status(), listing what was successfully loaded in the preload phase.
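A minimal sketch of reading that section follows. The summarizePreload() helper is hypothetical. I am assuming the section is exposed under the preload_statistics key with memory_consumption, functions, classes, and scripts entries, and that any of the lists may be missing when empty.

```php
<?php
/**
 * Summarizes the preload section of an opcache_get_status() result.
 * Returns null when no preload information is present.
 */
function summarizePreload($status): ?array
{
    if (!is_array($status) || !isset($status['preload_statistics'])) {
        return null;
    }
    $stats = $status['preload_statistics'];
    return [
        'memory'    => $stats['memory_consumption'] ?? 0,
        'functions' => count($stats['functions'] ?? []),
        'classes'   => count($stats['classes'] ?? []),
        'scripts'   => count($stats['scripts'] ?? []),
    ];
}

// opcache_get_status() may return false when OPcache is unavailable or disabled.
$status = function_exists('opcache_get_status') ? opcache_get_status(false) : false;
$summary = summarizePreload($status);

echo $summary === null
    ? "No preload information available\n"
    : sprintf(
        "Preloaded %d classes, %d functions, %d scripts\n",
        $summary['classes'],
        $summary['functions'],
        $summary['scripts']
    );
```

Run from a process that has no preload configured, this simply reports that no preload information is available.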

Class fields/constants optimization

As mentioned above, preload resolves the values of a class's fields / constants and saves them for all subsequent requests. It allows dynamic values to be computed just once, saving some resources by not recalculating them on each request. But this can also lead to counter-intuitive results, for example:

const.php:

<?php
define('MYTESTCONST', mt_rand(1, 1000));

preload.php:

<?php

include 'const.php';
class MyTestClass {
    const RAND = MYTESTCONST;
}

script.php:

<?php

include 'const.php';
echo MYTESTCONST, ', ', MyTestClass::RAND;
// 32, 154

At first glance, the result might look weird. You might have expected the constants to be equal since one of them was assigned the value of the other. But this was not the case because global constants, unlike class constants/fields, are forcibly cleared after the preload phase and have to be redefined on each request. Meanwhile, class constants/fields are resolved only once during the preload phase and then stored for subsequent requests. As the global constants are defined repeatedly during request handling, they may receive a different value from the same expression compared to how it was resolved during the preload phase.

Cannot redeclare someFunc()

Dealing with classes is simple since we usually do not explicitly include them but use an autoloader instead. So, if a class is defined in the preload phase, then the autoloader will not be executed during the request. We will not try to include this class again.

Functions are different. We must include them explicitly. Those who use preload often accidentally include files with functions twice: once in the preload phase and once while handling the request. This leads to a fatal error (Cannot redeclare someFunc()).

There are multiple ways to solve this issue. For example, if you are working with Composer's loader, you can include everything from it in the preload script and not include it at all while handling requests. Another solution is not to include files with functions directly. Instead, we can do it through a proxy file that checks function_exists() before including the real one with the function definition. A lot of libraries do the latter by default, for example, Guzzle HTTP.
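The proxy approach can be sketched roughly like this. It is not the code of any particular library: requireFunctionsOnce() and city_label() are hypothetical names, and the functions file is generated in a temporary directory only to keep the example self-contained.

```php
<?php
/**
 * Includes $file only if $probeFunction is not defined yet.
 * Returns true when the file was actually included.
 */
function requireFunctionsOnce(string $probeFunction, string $file): bool
{
    if (function_exists($probeFunction)) {
        return false; // already declared, for example during the preload phase
    }
    require $file;
    return true;
}

// Demo: a throwaway functions file declaring a hypothetical city_label() function.
$functionsFile = sys_get_temp_dir() . '/city_functions_' . uniqid() . '.php';
file_put_contents(
    $functionsFile,
    '<?php function city_label(string $name): string { return "Your city: {$name}"; }'
);

$first = requireFunctionsOnce('city_label', $functionsFile);  // includes the file
$second = requireFunctionsOnce('city_label', $functionsFile); // skipped, no fatal error

echo city_label('Paris'), "\n";
unlink($functionsFile);
```

The second call is a no-op, which is exactly what we want when the same functions were already declared during preload.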

PHP as a daemon

I am going to use the RoadRunner framework to run PHP as a daemon. You can use any other similar framework (AMPHP, Swoole, and so on) as they’re expected to provide more or less the same results for CPU-bound loads.

RoadRunner is a daemon written in Go. On the one hand, it creates and monitors PHP workers (starts/stops/restarts them). On the other hand, it receives requests and passes them to these workers for handling. In both respects, its work is no different from the work of PHP-FPM (which also has a master process that monitors the workers). But there are huge differences between them. The key one is that RoadRunner does not reset the script state after the request processing is completed.

Using RoadRunner, we can potentially optimize all the areas from the list above:

Import of PHP-files (include, require, etc.);
Initialization (framework, libraries, DI container, etc.);
Requesting data from external storage (DB, Memcache, Redis, etc.).

Here’s what the “Hello World” application looks like in RoadRunner:

$relay = new Spiral\Goridge\StreamRelay(STDIN, STDOUT);
$psr7 = new Spiral\RoadRunner\PSR7Client(new Spiral\RoadRunner\Worker($relay));

while ($req = $psr7->acceptRequest()) {
    $resp = new \Zend\Diactoros\Response();
    $resp->getBody()->write("hello world");
    $psr7->respond($resp);
}

Let’s try to run our demo endpoint (previously tested on plain PHP and PHP with preload enabled) on RoadRunner. Before that, we’ll modify the “Hello world” example in the following way.

First, we do not want the worker to crash in case of an error. To do this, we need to wrap everything in a global try..catch.

Secondly, since our script does not know anything about Zend Diactoros, we will need to convert its output into a Zend Diactoros response. To do this, we’ll use the ob_* functions.

Thirdly, our script does not know anything about the PSR-7 request. So we’ll create standard PHP environment variables from PSR-7 entities.

Lastly, as our script won’t die after request handling, we must clear the state ourselves.

Here’s the rough result:

while ($req = $psr7->acceptRequest()) {
    try {
        $uri = $req->getUri();

        $_COOKIE = $req->getCookieParams();
        $_POST = $req->getParsedBody();
        $_SERVER = [
            'REQUEST_METHOD' => $req->getMethod(),
            'HTTP_HOST' => $uri->getHost(),
            'DOCUMENT_URI' => $uri->getPath(),
            'SERVER_NAME' => $uri->getHost(),
            'QUERY_STRING' => $uri->getQuery(),

            // ...
        ];

        ob_start();

        // our logic here

        // Read the buffer and close it in one step, so buffers do not pile up across requests
        $output = ob_get_clean();

        $resp = new \Zend\Diactoros\Response();
        $resp->getBody()->write($output);
        $psr7->respond($resp);
    } catch (\Throwable $Throwable) {
        // some error handling logic here
    }

    \UDS\Event::flush();
    \PinbaClient::sendAll();
    \PinbaClient::flushAll();
    \HTTP::clear();
    \ViewFactory::clear();
    \Logger::clearCaches();

    // ...
}
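One detail of the loop above deserves care in a long-running worker: if the request logic throws, the output buffer opened for that request should still be closed, otherwise buffers accumulate from request to request. A minimal sketch of a helper that handles this, assuming a hypothetical captureOutput() function that wraps the request logic:

```php
<?php
/**
 * Runs $handler and returns everything it echoed.
 * The buffer is closed even if the handler throws.
 */
function captureOutput(callable $handler): string
{
    $level = ob_get_level();
    ob_start();
    try {
        $handler();
        return (string) ob_get_contents();
    } finally {
        // Close every buffer opened since we started, including nested ones.
        while (ob_get_level() > $level) {
            ob_end_clean();
        }
    }
}

// Normal request: the echoed text is captured.
$output = captureOutput(function () {
    echo 'Your city: Paris';
});

// Failing request: the buffer level is the same before and after.
$levelBefore = ob_get_level();
try {
    captureOutput(function () {
        echo 'partial output';
        throw new RuntimeException('boom');
    });
} catch (RuntimeException $e) {
    // the buffer opened for the failed request is already closed here
}
$levelAfter = ob_get_level();
```

Inside the request loop, the captured string would then be written to the response body, and the catch branch would not need to worry about leftover buffers.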

Benchmarks

Let’s run our benchmarks.

The result doesn’t look the way we might have expected. RoadRunner should have eliminated more of the areas that cause performance overhead, but it did not happen. Let’s find out why with perf.

In the perf results we can see the phar_compile_file() because we include some files during script execution inside the request handling loop. Since OPcache is not enabled (RoadRunner runs scripts as CLI where OPcache is disabled by default), these files are recompiled on each request.

Let's edit the RoadRunner configuration enabling OPcache.
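For the CLI, OPcache is controlled by the opcache.enable_cli directive, which can be set in php.ini or passed with -d on the php command line. As a safety net, the worker script can check at startup whether the setting actually took effect. This is only a sketch, and isOpcacheEnabledForCli() is a hypothetical helper name:

```php
<?php
/**
 * Reports whether OPcache is loaded and enabled for the CLI SAPI.
 */
function isOpcacheEnabledForCli(): bool
{
    if (!extension_loaded('Zend OPcache')) {
        return false;
    }
    return filter_var(ini_get('opcache.enable'), FILTER_VALIDATE_BOOLEAN)
        && filter_var(ini_get('opcache.enable_cli'), FILTER_VALIDATE_BOOLEAN);
}

if (PHP_SAPI === 'cli' && !isOpcacheEnabledForCli()) {
    // Write to STDERR: STDOUT is used by the relay in the worker example above.
    fwrite(STDERR, "Warning: OPcache is not enabled for CLI; included files will be recompiled\n");
}
```

A check like this makes a misconfigured worker visible in the logs instead of showing up only as a slower benchmark.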

These results are better. RoadRunner started showing better performance than classic PHP with preload.
Let’s profile it further to find out if we can potentially gain more.

perf doesn’t show anything suspicious anymore, so let's look at the PHP code itself. The easiest way to profile it is to use phpspy. It doesn't require any modification to the PHP code. You need to run it in the console on the same server where the code is run. Let's do it and build a flame graph.

We will not modify the business logic itself since we need to keep the experiments' conditions equal for the results to be fair. Let's look into the branches related to RoadRunner.

The biggest part of those branches leads to fread(), and we can hardly do anything about it. But apart from that, we can see other branches in \Spiral\RoadRunner\PSR7Client::acceptRequest(). Let’s look at the source code to find out what they are for.

    /**
     * @return ServerRequestInterface|null
     */
    public function acceptRequest()
    {
        $rawRequest = $this->httpClient->acceptRequest();
        if ($rawRequest === null) {
            return null;
        }

        $_SERVER = $this->configureServer($rawRequest['ctx']);

        $request = $this->requestFactory->createServerRequest(
            $rawRequest['ctx']['method'],
            $rawRequest['ctx']['uri'],
            $_SERVER
        );

        parse_str($rawRequest['ctx']['rawQuery'], $query);

        $request = $request
            ->withProtocolVersion(static::fetchProtocolVersion($rawRequest['ctx']['protocol']))
            ->withCookieParams($rawRequest['ctx']['cookies'])
            ->withQueryParams($query)
            ->withUploadedFiles($this->wrapUploads($rawRequest['ctx']['uploads']));

The source code shows that RoadRunner is trying to create a PSR-7-compatible request object from the serialized array. If your framework works with PSR-7 request objects directly (for example, vanilla Symfony does not), then this is completely justified. Otherwise, PSR-7 becomes an extra step before the request is converted into something your application can work with. Let's remove this intermediate step and look at the results again.
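Skipping PSR-7 could look roughly like the sketch below: the classic environment arrays are built straight from the raw request context. The buildGlobalsFromContext() function is hypothetical, and the context keys (method, uri, rawQuery, cookies, protocol) are assumed from the excerpt above rather than taken from RoadRunner's documentation.

```php
<?php
/**
 * Builds classic superglobal-style arrays from a raw request context.
 * The $ctx keys mirror those read in the acceptRequest() excerpt.
 */
function buildGlobalsFromContext(array $ctx): array
{
    $uri = parse_url($ctx['uri'] ?? '') ?: [];
    parse_str($ctx['rawQuery'] ?? '', $query);

    return [
        'server' => [
            'REQUEST_METHOD'  => $ctx['method'] ?? 'GET',
            'HTTP_HOST'       => $uri['host'] ?? '',
            'DOCUMENT_URI'    => $uri['path'] ?? '/',
            'QUERY_STRING'    => $ctx['rawQuery'] ?? '',
            'SERVER_PROTOCOL' => $ctx['protocol'] ?? 'HTTP/1.1',
        ],
        'get'    => $query,
        'cookie' => $ctx['cookies'] ?? [],
    ];
}

// Example context with made-up values.
$globals = buildGlobalsFromContext([
    'method'   => 'GET',
    'uri'      => 'http://example.com/city?id=7',
    'rawQuery' => 'id=7',
    'cookies'  => ['city_id' => '7'],
    'protocol' => 'HTTP/1.1',
]);
```

In the request loop, the resulting arrays would be assigned to $_SERVER, $_GET, and $_COOKIE directly, without creating any intermediate request object.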

The experimental endpoint was light enough, and we managed to gain +17% using RoadRunner (compared to plain PHP 7.4 without preload).

RoadRunner pitfalls

RoadRunner is a much more serious change to the architecture than just enabling preload. Consequently, it comes with more significant areas to take care of.

Firstly, RoadRunner runs PHP code in daemon mode (as described above), which means that it is subject to all problems I wrote about at the beginning of the article. Specifically, it allows a whole new class of errors to be made, making the code more difficult to write and debug.

Secondly, if we want to get the most out of RoadRunner, it’s not enough just to run the classic code on it: we need to write code specifically for it from the very beginning. In this case, we will avoid hacks with the request / response transformation between the RoadRunner and application formats. We can write the code in a way that does not require cleanup at the end of the request, and we will also be able to take advantage of memory shared between different requests, for example, by caching something.

Lastly, remember that all results overwhelmingly depend on the particular endpoint on which benchmarks are performed.

Conclusion

We looked at the architecture of the classic PHP and figured out how the preload feature can help gain more performance and how the RoadRunner architecture differs from it.

Classic PHP (PHP-FPM, Apache mod_php, and others) helps simplify development and avoid several problems. It is the most sensible choice if the application has no special performance requirements. Even then, there are ways to achieve higher performance, such as the PHP preload feature and JIT.

Suppose you can tell from the start that the application will be under heavy load. In that case, it would make sense to consider using RoadRunner (or other frameworks like AMPHP, Swoole, etc.) as it is often superior when it comes to performance.

Summary of the results of the experiment:

PHP 7.2 - 845 RPS;
PHP 7.4 - 931 RPS;
RoadRunner without optimizations - 987 RPS;
PHP 7.4 + preload - 1030 RPS;
RoadRunner after optimization - 1089 RPS.
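The percentages quoted in the article can be rechecked from these figures. Here is a small sketch that uses PHP 7.4 as the baseline; gainPercent() is a hypothetical helper, and the numbers are copied from the summary above.

```php
<?php
// RPS figures copied from the summary above.
$rps = [
    'PHP 7.2' => 845,
    'PHP 7.4' => 931,
    'RoadRunner without optimizations' => 987,
    'PHP 7.4 + preload' => 1030,
    'RoadRunner after optimization' => 1089,
];

/**
 * Relative gain of $value over $baseline, in percent, rounded to one decimal.
 */
function gainPercent(int $value, int $baseline): float
{
    return round(($value / $baseline - 1) * 100, 1);
}

$baseline = $rps['PHP 7.4'];
foreach ($rps as $setup => $value) {
    printf("%-34s %5d RPS  %+6.1f%% vs PHP 7.4\n", $setup, $value, gainPercent($value, $baseline));
}
```

Against the PHP 7.4 baseline, preload comes out at about +10.6% and the optimized RoadRunner setup at about +17.0%, while the step from PHP 7.2 to PHP 7.4 is about +10.2%.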

DEV Community