If you have sent push notifications with Parse Server, you are probably familiar with the issues that come with having a larger audience. Parse Server's push notifications work as follows: your Installations are fetched in batches, and these batches are then sent to FCM (Android) and/or APNS (iOS).
One of the issues is that the progress of these batches is not tracked. Let's say your application has 2 million installations. Currently, the Parse Server would take roughly 5 to 10 minutes to send push notifications to all of them. In that time a number of things can happen to your Parse Server, e.g. it might crash or restart due to a new deployment. This would halt the sending process in an undefined state, and some users would never receive the notifications.
Moreover, the process is not distributed. This means that you can't take advantage of horizontal scaling because only one instance of your application is processing the workload at a time. Oh yeah, and let's face it, Node.js doesn't do really well under high loads.
What were our options? We could:
- Create a new Parse Server adapter to address the points above - That sounds like a good plug-and-play idea, but we would ultimately still be limited to Node.js and that wouldn't be ideal in the case of millions of Installations per application.
- Use an external service, such as OneSignal or Amazon SNS - That would be a good option, however, each service has its own set of limitations and we would have to work around them. Also, synchronizing Installations to the chosen service and getting a detailed response for each notification would be far from ideal.
- Create our own service - Have complete control over the whole process, which would allow us to provide a great experience to our customers and also give us the opportunity to iterate on the solution over time. This way we would also completely take the load off the Parse Server.
We made our choice - create an external service that will be able to handle the demands of our customers. There are a few prerequisites that we had to meet:
- Send push notifications as quickly as possible
- No notification can be dropped, ever
- Stick to the SashiDo no vendor lock-in policy - we must not change the way the Parse Server is working. Should you choose to host your Parse Server somewhere else, the push notifications would continue working, just without the benefits described in this article
- Perform well under high loads and don't consume too many resources
- Be able to scale each component of the system according to demand
- Keep database reads and writes within a tolerable range since we are using each Parse application's database (SashiDo no vendor lock-in policy, our customers own their data!)
First of all, a fitting language had to be chosen. We could choose from Rust, Go and maybe Elixir. Because of our experience with Go, that's what we picked. We all know what Go brings to the table with its great concurrency model and easy deployment, so I will not go deep into the rationale behind this choice. Suffice it to say that we are very happy with the results. If you are not familiar with the language, here's a great article: About Go Language - An Overview.
To be able to scale each component of the system independently, we went with a microservice architecture, of course. Interestingly enough, we started with only 2 microservices. Their number quickly began to grow and we ended up with a total of 8 microservices. Below, you can see a simplified schematic of the architecture.
Let's go through the workflow real quick. When a new push notification gets to the Parse Server, it's sent to the Push Notifications Service's REST API. This request then gets to the Installation batchers, a group of microservices that read installations from the respective application's database in batches. "Why in batches?" one would ask. There are two main reasons for this. First, we want to distribute the process across microservice instances for fault tolerance. Second, this allows us to control how many items are read from the database at a time and per query. With this approach, we are able to read millions of installations with no significant impact on the database. After each batch of installations is fetched, each installation is sent to the respective Sender. Currently, we have two Senders - iOS and Android, which use APNS2 and FCM respectively. When each push is delivered to either APNS2 or FCM, the response is passed to the Status workers. In a similar fashion to the Installation batchers, they make sure not to stress the database too much while saving the statuses. This way we can scale the Senders as we see fit without worrying about the workload put on the database.
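To make the batching and fan-out step more concrete, here is a minimal sketch in Go. It is not our production code: the `Installation` type, the `fetchBatch` function (which stands in for a database query) and the `dispatch` function are all hypothetical names made up for this illustration, and the "senders" only count what they receive instead of talking to APNS2 or FCM. The sketch assumes installations have positive, ascending IDs.

```go
package main

import (
	"fmt"
	"sync"
)

// Installation is a simplified, hypothetical stand-in for a Parse Installation.
type Installation struct {
	ID         int
	DeviceType string // "ios" or "android"
}

// fetchBatch is a hypothetical stand-in for a database query. It returns at
// most limit installations with an ID greater than afterID. The slice is
// assumed to be sorted by ascending ID.
func fetchBatch(all []Installation, afterID, limit int) []Installation {
	var batch []Installation
	for _, inst := range all {
		if inst.ID > afterID {
			batch = append(batch, inst)
			if len(batch) == limit {
				break
			}
		}
	}
	return batch
}

// dispatch reads installations batch by batch and routes each one to the
// sender for its platform. It returns how many items each sender received.
func dispatch(all []Installation, batchSize int) map[string]int {
	ios := make(chan Installation)
	android := make(chan Installation)
	counts := map[string]int{}
	var mu sync.Mutex
	var wg sync.WaitGroup

	// A real sender would call APNS2 or FCM here; this one only counts.
	sender := func(name string, in <-chan Installation) {
		defer wg.Done()
		for range in {
			mu.Lock()
			counts[name]++
			mu.Unlock()
		}
	}
	wg.Add(2)
	go sender("ios", ios)
	go sender("android", android)

	afterID := 0
	for {
		batch := fetchBatch(all, afterID, batchSize)
		if len(batch) == 0 {
			break // nothing left to read
		}
		for _, inst := range batch {
			switch inst.DeviceType {
			case "ios":
				ios <- inst
			case "android":
				android <- inst
			}
		}
		// Remember where this batch ended so the next query starts after it.
		afterID = batch[len(batch)-1].ID
	}
	close(ios)
	close(android)
	wg.Wait()
	return counts
}

func main() {
	var all []Installation
	for id := 1; id <= 10; id++ {
		deviceType := "ios"
		if id%2 == 0 {
			deviceType = "android"
		}
		all = append(all, Installation{ID: id, DeviceType: deviceType})
	}
	counts := dispatch(all, 3)
	fmt.Println("ios:", counts["ios"], "android:", counts["android"])
}
```

The batch size caps how much is read per query, and because each batch is identified only by the ID it starts after, different instances could in principle pick up different batches.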
"This is great, but what are Redis and NATS Streaming doing there?" - Glad you asked. We are using Redis for caching, of course, but not only for that. Each microservice uses Redis to store the progress of its operations. This ensures that if an instance dies unexpectedly or another failure occurs, the next one that handles the operation will continue from the same place.
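The idea of storing progress so that another instance can pick up where the previous one stopped can be sketched like this. Again, this is only an illustration under assumptions: `ProgressStore`, `memoryStore` and `processJob` are hypothetical names, and the in-memory map stands in for Redis so that the example runs without any external dependency. The crash is simulated with the `failAfter` parameter.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ProgressStore is a hypothetical interface for saving job checkpoints.
// In a real deployment it would be backed by Redis.
type ProgressStore interface {
	Load(jobID string) (lastID int, ok bool)
	Save(jobID string, lastID int)
}

// memoryStore is an in-memory stand-in for Redis, used for illustration only.
type memoryStore struct {
	mu   sync.Mutex
	data map[string]int
}

func newMemoryStore() *memoryStore {
	return &memoryStore{data: map[string]int{}}
}

func (s *memoryStore) Load(jobID string) (int, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	lastID, ok := s.data[jobID]
	return lastID, ok
}

func (s *memoryStore) Save(jobID string, lastID int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[jobID] = lastID
}

var errCrash = errors.New("simulated crash")

// processJob handles items with IDs 1..total in batches, starting after the
// last saved checkpoint. If failAfter is greater than zero, the function
// returns errCrash once that many batches have been completed.
func processJob(store ProgressStore, jobID string, total, batchSize, failAfter int) ([]int, error) {
	var processed []int
	next, _ := store.Load(jobID) // zero when no checkpoint exists yet
	batches := 0
	for next < total {
		if failAfter > 0 && batches == failAfter {
			return processed, errCrash
		}
		end := next + batchSize
		if end > total {
			end = total
		}
		for id := next + 1; id <= end; id++ {
			processed = append(processed, id) // real work would happen here
		}
		store.Save(jobID, end) // checkpoint only after the batch is done
		next = end
		batches++
	}
	return processed, nil
}

func main() {
	store := newMemoryStore()

	first, err := processJob(store, "push-1", 10, 3, 2)
	fmt.Println("first instance handled", len(first), "items, error:", err)

	second, err := processJob(store, "push-1", 10, 3, 0)
	fmt.Println("second instance handled", len(second), "items, error:", err)
}
```

Note that in this sketch the checkpoint is written only after a whole batch is handled, so a failure in the middle of a batch would cause that batch to be processed again by the next instance.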
Let me tell you the story behind why we chose to include NATS Streaming in our stack. For those of you who don't know, NATS Streaming is a data streaming system/message queue built on top of NATS Server. But why did we choose it? Well, it turns out that when you want a message queue capable of processing 200 000 messages per second, your options are kind of limited. For example, we tried RabbitMQ at first, but it quickly confirmed one of our expectations: reaching these numbers while keeping high availability would require some pretty solid hardware. You can read a great article on how to reach 1 million messages per second with RabbitMQ by using 32 machines, 30 of which had 8 vCPUs and 30 GB of RAM - RabbitMQ Hits One Million Messages Per Second on Google Compute Engine. NATS Streaming, on the other hand, is simple, built with Go and crazy fast. One drawback is that it still doesn't support clustering, however, our DevOps guys were able to cast some black magic to make it work with the fault tolerance capabilities it currently provides. The results? We were able to get 100 000 incoming and 100 000 outgoing messages per second out of just 3 VMs. We also saw that NATS was very stable under high loads. Below you can see some stats from our tests.
After switching to the Push Notifications service, which is enabled by default with Parse Server 2.3.3 on SashiDo, you will be able to send push notifications to your clients about 20 times faster than before, without pushing the limits of your Parse Server and slowing down other requests. You will also no longer need to worry about restarts, crashes or deployments.
Read more about the new Parse Server version on SashiDo here: Our new Parse Server Version comes with new service for Push Notifications