Here's a little something I did in Scribe last week: before and after.
The original piece of code iterates over all database connections (MySQL, SQLite, and so on) configured in a Laravel app and tries to start a transaction in each one. It worked fine, except that it took a long time, because Laravel ships with a bunch of preconfigured connections that you may never use, and Scribe has no way of knowing which ones you're actually using.
So I had to speed things up. First I thought of adding a timeout, so we could quickly exit if it took more than a few milliseconds to connect to any database. But I soon realised the problem: I/O in PHP is blocking.
Now that's something people say a lot, but it's in moments like this that you realise what it actually means. "Blocking" means that while one thing is executing, nothing else can (i.e. the currently executing task is "blocking" others). I/O in PHP being blocking means that an input or output operation has to complete before you can do anything else.
By itself, that's nothing strange, until you realise the alternative. In a language like JavaScript (non-blocking I/O), I could have done something like this:
```js
db.startTransaction();
setTimeout(() => {
  // continue after 1.5 seconds
}, 1500);
```
In this case, the code in the setTimeout would run after 1.5s, regardless of whether the database connection had been completed or not. That's because `db.startTransaction()` is an I/O operation, and is non-blocking. It gets started, but it doesn't have to finish before the next things can run. This is why:

- we often pass a callback or `Promise.then()` handler containing the code that should only be run after the I/O is done
- doing a lot of I/O in a single Node.js request will be faster than in PHP, because the operations don't have to be done one after the other
Note that the non-blocking thing only applies to truly asynchronous functions (like I/O). If I had a function that was completely synchronous, like taking an input and calculating a result, it would have to finish executing before the timeout is even set.
So, yeah: PHP I/O is blocking, which meant my timeout idea was out of the question. There are a number of workarounds, but they're not very robust.
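To make that concrete, here's a contrived sketch (the unreachable host and credentials are made up): the connection attempt occupies the entire script, so there's nowhere to run a "give up after a few milliseconds" check alongside it.

```php
<?php
$start = microtime(true);
try {
    // If this host is unreachable, this one line can hang for many seconds,
    // and no other code in the script runs while it waits.
    $pdo = new PDO('mysql:host=10.0.0.99;dbname=test', 'user', 'pass');
} catch (PDOException $e) {
    echo 'Gave up after ', round(microtime(true) - $start, 1), " seconds\n";
}
// Anything down here only runs after the line above finishes or throws.
```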
But then I remembered Amp. Amp (and ReactPHP) are frameworks for asynchronous programming in PHP. I personally prefer Amp, because it lets you write async PHP code within a regular synchronous PHP app, and I've found it easier to wrap my head around. The best part about these is that you don't need to install any PHP extensions; you just require them with Composer.
So I decided to switch from my timeout idea to running the requests in parallel instead. Amp has a nice package for this. And so I ended up with the second version. It's essentially the equivalent of `await Promise.all()` in JavaScript, and it sped things up immensely.
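The actual code is in the before/after above, but a minimal sketch of the approach, assuming the amphp/parallel-functions package (v1 API) and a made-up `startDbTransaction()` helper, looks like this:

```php
<?php
use function Amp\ParallelFunctions\parallelMap;
use function Amp\Promise\wait;

// Each callback runs in its own worker process; wait() blocks until
// all of them have finished, like await Promise.all() in JavaScript.
$results = wait(parallelMap($connections, function ($connectionName) {
    // Hypothetical stand-in for the real per-connection work
    return startDbTransaction($connectionName);
}));
```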
> Just took a process down from 4 minutes to 3 seconds with @asyncphp 🙌
>
> — jukai (樹海) (@theshalvah) November 11, 2020
How does it work internally? Haven't looked at the code, but my guess (simplified):
- For each value in your list (`$connections`), Amp creates a wrapper function like this:

```php
function runThisTaskInNewProcess() {
    // Your variables

    $yourFunction = // your function code

    echo serialize($yourFunction());
}
```
- The "your variables" part contains all the data your function needs (in my case,
$connection
). Amp serialises them and the wrapper function will useunserialise()
to parse them. -
"your function code" also contains your serialised function, wrapped with
unserialise
to turn it into a PHP variable. In PHP, closures aren't serialisable, so Amp uses a library for that. So the wrapper function in my case would probably look something like this:
```php
function runThisTaskInNewProcess() {
    $connection = unserialize('O:8:"stdClass":0:{}');
    $yourFunction = unserialize('O:12:"Opis\\Closure":0:{}');
    echo serialize($yourFunction($connection));
}
```
- For each value, Amp spins up a new process in your OS with `exec`:

```php
exec("php -r 'the wrapper function code'");
```
- The final `echo serialize($yourFunction());` is so Amp can get the return value of your function from the output, unserialise it, and pass it back to you.
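As an aside, the closure-serialisation piece looks roughly like this with opis/closure (a sketch assuming its v3 API; the greeting closure is just an example):

```php
<?php
use Opis\Closure\SerializableClosure;

// Closures can't be passed to serialize() directly, but wrapping
// them in SerializableClosure makes the round trip work.
$closure = function ($name) {
    return "Hello, $name";
};

$serialized = serialize(new SerializableClosure($closure));
$restored = unserialize($serialized)->getClosure();

echo $restored('Scribe'); // "Hello, Scribe"
```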
Serialisation is key here. It's like encoding variables in a specific text format (think of `JSON.stringify()`) so you can pass them around and unserialise (decode) them to get the exact PHP value back. JSON encoding only supports the JSON data types, but `serialize()` supports nearly all PHP values (closures, as we saw, need extra help).
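For instance (the array here is purely illustrative, and the serialised string is abridged):

```php
<?php
$data = ['name' => 'Scribe', 'connections' => ['mysql', 'sqlite']];

$encoded = serialize($data);
// 'a:2:{s:4:"name";s:6:"Scribe";s:11:"connections";a:2:{...}}'

$decoded = unserialize($encoded);
var_dump($decoded == $data); // bool(true): the exact value, round-tripped
```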
Of course, you have to take other things into consideration when doing this. For instance:
- State management/race conditions: Since I'm running multiple processes at once, I have to be careful about two different processes trying to do the same thing. In my current implementation, there's a potential race condition for when two different connections use the same database.
- Debugging: Debugging is harder because Amp spins up new processes in your OS, and I don't think Xdebug can follow them. And if you're doing dump-and-die, the process you kill might be the wrong one.
- Output: Obviously. Since things are running in parallel, you can't be sure of the order of output anymore.
- Error handling: Amp wraps errors in a MultiReasonException, and calling `getMessage()` simply tells you "Multiple errors occurred". You have to iterate over each wrapped exception and get its message (see the sketch after this list).
- Unserialisable data: I ran into this issue early on, because at first, I was trying to run the whole Scribe application in parallel. But the variables I needed in my function had closures that couldn't be serialised, so I was stuck, until I reduced the scope to run only that part in parallel.
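Here's a sketch of that error handling, assuming Amp v1's `MultiReasonException` API and the hypothetical `parallelMap` call from earlier:

```php
<?php
use Amp\MultiReasonException;
use function Amp\ParallelFunctions\parallelMap;
use function Amp\Promise\wait;

try {
    $results = wait(parallelMap($connections, $startTransaction));
} catch (MultiReasonException $e) {
    // $e->getMessage() just says "Multiple errors occurred";
    // the useful detail is in each wrapped exception.
    foreach ($e->getReasons() as $reason) {
        echo $reason->getMessage(), "\n";
    }
}
```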
UPDATE: Turns out in my use case, Amp wasn't really working, because of the last reason I stated here^^^: a database connection can't be serialised.😅 But it's still a viable approach if you're only dealing with PHP-native objects.
Top comments (2)
Hey, somewhat old post, but I still want to ask: why didn't you go further, e.g. serialising the connection details and having the processing function connect to the database itself? It would consume some of the DB pool, but 4 processes are still better than 1.
Interesting suggestion. I'm not sure how much time it would have saved in this case though, because it turns out the bottleneck was waiting for a (failing) database connection. The complexity of trying to serialise and instantiate a DB connection isn't worth it at this point.
I still have Amp at the back of my mind, though. There may be other places where it might help me improve speed.