Victor
How I built the architecture of an uptime monitoring service

Table of contents

Intro
Requirements
VPS Provider and Servers
Technologies I used
How a node works
How the core server works
To sum up
What didn't work
Why did it take so long?

Intro

Hi there!

This is my first article, so it might not be perfect 😇

Almost a year ago I decided to build an uptime monitoring service, which resulted in https://pingr.io.

It is a web app that continuously checks that your URL responds with 200 OK or any other response code you prefer.

So the idea is pretty straightforward. You might think that it's not that hard to make an HTTP request after all.

But it took me a year. 😱

In this article I want to describe how it works.

I don't claim that the architecture is good.

The architecture is somewhat of a compromise between launching your product ASAP and having an excellent one.

I didn't want to spend yet another year making it perfect, since it's pretty hard to stay motivated during development.

I'd love to hear what can be improved and where I did it completely wrong.

Let's dive in.

Requirements

At first, it seems like it's an easy task to make an HTTP request. However, there are a lot of requirements for what the service should do. Here are some of them:

  1. The URL should be checked from multiple nodes distributed around the world.
  2. Minimum check frequency is 1 minute
  3. You should be able to add new nodes
  4. Every node should make a decent number of HTTP requests per minute. At first, I was okay with 500-1000, but as the number of users grows, it might be much more.
  5. Apart from HTTP checks, the service should also check SSL certificates (whether they are valid, expire soon, etc.)
  6. The UI showing URL status should be updated in real time
  7. The user should be notified of uptime/downtime events in different ways, depending on their preference

VPS Provider and Servers

I think that if you want to have your product up and running as soon as possible, it's okay to pay more for services you are comfortable with.

So I tried both Scaleway and DigitalOcean but eventually moved all my servers to DigitalOcean since I liked it more.

What I have:

  1. Ubuntu VPS with MySQL. 3 GB / 1 vCPU
  2. 5 Ubuntu VPS which are located in different parts of the world. 3 GB / 1 vCPU
  3. Ubuntu VPS for the core server. 4 GB / 2 vCPUs
  4. Redis database: 1 GB RAM / 1 vCPU

Regarding MySQL: in the beginning, I used their managed database offering (their equivalent of RDS), since it's meant to be used as a database, and everything should've been cool. But when I wanted to change my.cnf, I realized that I couldn't. And I like to have control over everything, at least in the beginning while I'm learning.

So, for now, I decided to have just a VPS with MySQL installed since it gives me more control.

Why MySQL? No idea. It's just the only database I've ever worked with.
Why Ubuntu? No idea. It's just the only OS I've ever worked with. 😁

Technologies I used

  1. Laravel for backend
  2. VueJS for frontend
  3. MySQL for Database
  4. Redis for queues
  5. cURL for sending requests (Guzzle library)
  6. Supervisord for keeping my workers up and running
  7. ...
  8. PROFIT!

Listing of artisan commands

When I was writing this article, I realized that it might help a lot if I list all the Laravel commands I use here:

  1. php artisan monitor:run-uptime-checks {--frequency=1} - Dispatches uptime check jobs. Run by cron every frequency minutes
  2. php artisan checks:push - Fetches check results from the local Redis database and pushes them into the temporary MySQL table. Kept running continuously by Supervisord
  3. php artisan checks:pull - Fetches check results from the temporary MySQL table and calculates monitor status/uptime/other indexes. Kept running continuously by Supervisord.

How a node works

This is one of the most important parts of the article, which describes how a single node works.

I should note that I have a Monitor entity, representing a URL that should be checked.

Getting monitors ready for uptime check


The php artisan monitor:run-uptime-checks command fetches monitors from the database based on some conditions.

One of the requirements is uptime check frequency, which means how often we should check the monitor. Not every user wants to check their sites every minute.

Then, using Laravel's scheduling mechanism, it's easy to set up running this command at different frequencies.

Passing the frequency as an argument helps me fetch only the monitors I need to check, depending on the frequency the user has set.

// In the RunUptimeChecks command, we fetch monitors by frequency specified by the user

$schedule->command(RunUptimeChecks::class, ['--frequency=1'])
         ->everyMinute();

$schedule->command(RunUptimeChecks::class, ['--frequency=5'])
         ->everyFiveMinutes();

$schedule->command(RunUptimeChecks::class, ['--frequency=10'])
        ->everyTenMinutes();

$schedule->command(RunUptimeChecks::class, ['--frequency=15'])
         ->everyFifteenMinutes();

$schedule->command(RunUptimeChecks::class, ['--frequency=30'])
         ->everyThirtyMinutes();

$schedule->command(RunUptimeChecks::class, ['--frequency=60'])
         ->hourly();
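For the curious, here is a minimal sketch of what the fetching side of RunUptimeChecks could look like. The column names (is_active, uptime_check_frequency) are my assumptions for illustration, not the actual schema:

// Inside the RunUptimeChecks command (hypothetical handle() method).
// Column names are assumptions; the real query may differ.
public function handle()
{
    $frequency = (int) $this->option('frequency');

    $monitors = Monitor::query()
        ->where('is_active', true)
        ->where('uptime_check_frequency', $frequency)
        ->get();

    // ...dispatch a RunUptimeCheck job for each monitor (next section)
}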

Filling up the queue

Then I put the uptime check jobs into the Redis queue for every monitor I have. This is what happens in the RunUptimeChecks command:

foreach ($monitors as $monitor) {
    RunUptimeCheck::dispatch(
        (object) $monitor->toArray(),
        $node->id
    );
}

👉 You may notice something strange here: (object) $monitor->toArray().

At first, I passed the Monitor model to the job. However, there is a significant difference: when you pass a model to a job, Laravel stores just the model id in the queue. Then, when the job is executed, Laravel connects to the database to fetch the model, which resulted in hundreds of unnecessary connections.

This is why I passed a plain object instead of the actual model, which serializes quite well.

Another possible approach is removing the SerializesModels trait from the job, which also might work, but I haven't tried that.
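To make this concrete, here is a rough, hypothetical sketch of what such a job could look like. The property names, the Redis list key, the timeout, and the payload shape are my assumptions for illustration, not Pingr's actual code:

<?php

use GuzzleHttp\Client;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Support\Facades\Redis;

class RunUptimeCheck implements ShouldQueue
{
    use Dispatchable, Queueable;

    public function __construct(
        public object $monitor,  // plain snapshot of the monitor, not an Eloquent model
        public int $nodeId
    ) {}

    public function handle()
    {
        // http_errors=false so 4xx/5xx responses don't throw and we can read the code.
        $client = new Client(['timeout' => 10, 'http_errors' => false]);

        try {
            $code = $client->get($this->monitor->url)->getStatusCode();
        } catch (\Throwable $e) {
            $code = 0; // connection error, timeout, etc.
        }

        // Push the raw result to the local Redis list; checks:push picks it up later.
        Redis::rpush('uptime:check-results', json_encode([
            'monitor_id'    => $this->monitor->id,
            'node_id'       => $this->nodeId,
            'response_code' => $code,
            'checked_at'    => now()->toDateTimeString(),
        ]));
    }
}

Because the job carries a plain snapshot instead of an Eloquent model, the worker never has to hit MySQL just to rebuild the monitor.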

So after this operation, we have some number of jobs in the queue ready to be executed.

Running uptime checks


To execute the jobs, we need to run the php artisan queue:work command.

What we also need is:

  1. Many instances of the command running, because we need to do as many checks as we can per minute
  2. If the command fails, we need to restart it and keep it running.

For this purpose, I used Supervisord.

What it does is spawn N PHP processes, each running the queue:work command. If one of them fails, Supervisord reruns it.

Depending on the VPS memory and CPU cores, we can vary the number of processes and thus increase the number of uptime checks performed per minute.
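For reference, a Supervisord program section for such workers might look roughly like this; the program name, paths, and numprocs value below are placeholders, not my actual config:

[program:pingr-worker]
process_name=%(program_name)s_%(process_num)02d
command=php /var/www/pingr/artisan queue:work redis --sleep=1 --tries=3
autostart=true
autorestart=true
numprocs=8
redirect_stderr=true
stdout_logfile=/var/log/pingr-worker.log

Raising numprocs is the knob that trades memory/CPU for more checks per minute.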

Storing the temporary result


Now, after we run a check, we need to store the result somewhere.

First, I store the check result in the local Redis database.
Since there might be many processes continuously pushing data, Redis fits perfectly for this purpose: it's an in-memory database and is very fast.

Then I have another command, php artisan checks:push, that fetches the checks from the local Redis database and does a batch insert into the raw_checks MySQL table.
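Here is a minimal sketch of how such a push loop could work, assuming the results sit in a Redis list; the key name, batch size, and column names are assumptions:

// Hypothetical core of checks:push: drain a batch of results from the
// local Redis list and bulk-insert them into the raw_checks table.
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Redis;

$batch = [];

while (count($batch) < 500 && ($raw = Redis::lpop('uptime:check-results'))) {
    $batch[] = json_decode($raw, true);
}

if (! empty($batch)) {
    DB::table('raw_checks')->insert($batch);
}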

So I have two tables: monitor_checks and raw_checks. The first one contains the latest successful check of a monitor and all of its failed checks. I do not store every check of every node per minute, since that would result in billions of records and wouldn't provide much value for end users.

The raw_checks table serves as a bridge between the core server and all the node servers.

After each check we need to do a lot of stuff:

  1. Recalculate uptime of every node
  2. Recalculate the average response time of every node
  3. Recalculate each monitor's uptime/response time based on the nodes' data
  4. Send notifications if needed

It would be much more reasonable to fetch many checks at once and do all calculations on a server located close to the MySQL server.

For example, the connection between the Indian node and the MySQL server located in Germany is relatively slow.

So the aim is to make the nodes as independent as possible.

All they do over the MySQL connection is fetch the monitors that should be checked and store the check results. That's it.

How the core server works


On the core server, I run the php artisan checks:pull command, which behaves like a daemon: it has an infinite loop that fetches checks from the raw_checks table and calculates things such as average uptime, response time, and a few others.

Apart from that, it is responsible for queueing jobs for downtime notifications.
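As a rough, simplified sketch (the batch size, ordering, and table/column names here are assumptions), the loop could look like this:

// Hypothetical core of the checks:pull daemon loop.
use Illuminate\Support\Facades\DB;

while (true) {
    $checks = DB::table('raw_checks')->orderBy('id')->limit(500)->get();

    foreach ($checks as $check) {
        // Update the monitor's status, uptime and average response time,
        // and queue downtime notification jobs when a monitor goes offline.
    }

    // Remove the processed rows so the bridge table stays small.
    DB::table('raw_checks')->whereIn('id', $checks->pluck('id'))->delete();

    if ($checks->isEmpty()) {
        sleep(1); // nothing to process yet, wait for new checks
    }
}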

That's actually it: we have monitors with updated status, uptime & response time attributes.

Updating realtime data

In order to update monitor status on the web app, I use Pusher and Laravel's broadcasting feature.

So the setup is straightforward.

  1. php artisan checks:pull gets a check from the raw table, sees if the monitor is online, and if it's not, fires the MonitorOffline event, which is broadcast using Pusher (a sketch of such an event is shown below the list).

  2. The web app receives the new event from Pusher and marks the monitor as offline.
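For illustration, a broadcastable Laravel event can be as small as the following; the channel name and payload are my assumptions:

use Illuminate\Broadcasting\PrivateChannel;
use Illuminate\Contracts\Broadcasting\ShouldBroadcast;

// Hypothetical MonitorOffline event: implementing ShouldBroadcast is enough
// for Laravel to push it through the configured Pusher driver.
class MonitorOffline implements ShouldBroadcast
{
    public function __construct(public int $monitorId) {}

    public function broadcastOn()
    {
        return new PrivateChannel('monitors.' . $this->monitorId);
    }
}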

To sum up

So, to sum up:

  1. Every node has a cron job, which fetches the monitor list and puts the uptime check jobs into the local Redis queue
  2. A number of worker processes run by Supervisord pick jobs from the queue and make the HTTP requests
  3. The result is stored in Redis
  4. Then the batch of checks stored in local Redis is moved to the raw_checks table on the MySQL server
  5. The core server fetches the checks and does the calculations.

Every node works the same way. Now, having a constant stream of checks from the raw table, the core server can do many things, like calculating the average response time of a node, etc.

If I want to add more nodes, I'll just clone a node server, do some small configuration, and that's it.

What didn't work

  1. At first, I stored ALL checks in the database, both failed and successful. It resulted in billions of records, but users mostly don't need them. Now, instead, I have an aggregation table which stores uptime and response time by hour

  2. At first, the uptime check job did all the logic itself: I didn't have any temporary/bridge storage. It connected to the MySQL server, calculated the new uptime, etc. This stopped working as soon as the number of monitors grew to ~100, because every check job did maybe 10-20 queries: 10-20 queries * 5 nodes * 100 monitors. So yeah, it wasn't scalable at all

  3. Using a dedicated Redis server instead of the raw_checks table. Since the raw checks table behaves like a cache, it might seem reasonable to use Redis for this purpose. But for some reason, I kept losing check data.

I tried both the pub/sub feature of Redis and just storing the data. So I gave up and used a mechanism that is familiar to me: MySQL.

Also, I think I've read somewhere that Redis is not the best solution if you need 100% confidence that your data won't be lost.

Why did it take so long?

Because of multiple factors.

  1. I have a full-time job, so I could only work on this in the evenings.
  2. I hadn't had this kind of experience before. Every project is unique.
  3. I had some psychological problems, which I've described in this article
  4. I tried many different ways of making this work with a high number of monitors, so I was kind of rebuilding the same thing many times
  5. Take into account that I also built the UI and did everything alone.

Top comments (19)

Venkatesh KL

Wow man, you're amazing. I just visited your website and it looks very elegant. No nonsense, very simple, and it shows what it does. Wonderful.

That gives a lot of motivation to build something that could help people, even when it looks like a simple, trivial job. I think you've done a wonderful job, and that too while working at a full-time job. You really have a great productive mindset.

I'm sure you must have invested a good amount of time planning it so well.

In short, wonderful idea. All the best.
Cheers 🍺

Victor

Thank you! :)

At first, I was tracking the time I spent on the project, and it was like 620 hours. Then I stopped. So in total, I spent maybe 1000+ hours :)

Karthik R

Woah, amazing.
I have also taken up a similar project myself, and it's been in progress for the past couple of months.
Although mine is based on ping checks only,
and it doesn't include features like yours XD

Hope to complete it very soon 💗

Cheers

Victor

Just don't give up. It took me a year to build this!

priyajain665

Great work! I would like to read more articles related to how architectures like this are developed.
Thanks for sharing.
Happy coding 🙂.

Victor

Thank you

This is what I need too. It's not hard to find something like "how to center a div" or "jquery send form", but it's hard to find out how big products work inside.

For example, if people shared their Database structure or something :)

Victor

In code we trust 😎

Ben Osborne

Hi Viktor,

Nice work here, and as you said, it's not as easy as it looks.

You could use Golang, which is superb for its concurrency; each goroutine only takes up 2 kilobytes!

I am also taking on this type of project in my spare time and am at the planning stage at the moment.

I am using Golang for processing the queues and actually sending out the HTTP requests / pings, etc., and Laravel for the rest.

Question Time

I am stuck at one thing: the part of the process where the system fetches checks to be processed on the queue. What if you have 5 million checks to be processed?

  • You cannot fetch all checks at once; you would run out of memory.

  • You could batch the checks, i.e. fetch 200 checks at a time from the database. With this approach, let's say it takes 0.1s to fetch 200 checks and push them to the queue.

That would be 5,000,000 (checks) / 200 (batch) * 0.1 (seconds) = 2,500 (seconds)

2,500 / 60 ≈ 41 minutes...

How is this scalable, and have you hit such an issue yet?

Thanks
Ben

Meat Boy

Love the concept and simplicity of the website. It would be great if the user could check any TCP port and/or more popular protocols. I'm missing at least DNS, POP and IMAP.

Victor

Well, I have this in my plans. But most users need HTTP checks.

Hmm, I now realize that it's not easy to integrate them, since they have different attributes, I guess.

So I'll need a monitor "type" and a bunch of additional columns for the monitor_checks table.

The implementation of the requests should be easy though; there are so many nice libraries out there :)

Nikola

Great summary, Viktor!
Always a pleasure to read your stuff :)

Victor

Oh, I know you! :) Thank you!

Pablo Tejada

Are you really making a profit?

Victor

Started to make it in September

Khandaker Jim

What should I use instead of Redis? RabbitMQ or Kafka or do you have any recommendations based on your experience?

yellow1912

I wonder if we could use coroutines, from Swoole for example, and whether the performance would be better.

Victor

Woah, never heard of coroutines, will check it out.
Performance is crucial here, since the more checks that need to be done, the more memory/CPU I'll need.

tanghoong

Thanks for sharing. I just started working on my own version, and reading your post, it looks like there are quite a lot of scenarios to take care of.

Olavo Amorim Santos

Nice read and great idea.

Did you consider a serverless approach? You could probably save some bucks on infra using Lambdas with SQS queues to ping the URLs.