Joe Wilm for OneSignal

Posted on Apr 19, 2021 • Originally published at onesignal.com on Apr 15, 2021

Lessons Learned From 5 Years of Scaling PostgreSQL

#engineering #architecture #database #postgres

For nearly a decade, the open-source relational database PostgreSQL has been a core part of OneSignal. Over the years, we've scaled up to 75 terabyte (TB) of stored data across nearly 40 servers. Our real-time segmentation features have benefited greatly from PostgreSQL’s performance, but we've also struggled at times due to bloat caused by our heavy write load and limitations of the PostgreSQL upgrade path.

In this article, I’ll explain some of the challenges we’ve dealt with while scaling on PostgreSQL and the solutions we've put in place. We're excited to share our lessons learned in this area because we had to figure out much of this the hard way, and we hope that sharing our experience will make things easier for others who are scaling PostgreSQL.

There's a lot of information in this article — you can read it in order or jump around to different sections based on your interests. I recommend reading the High Level Overview of Data section first, then checking out the other sections listed below:

Bloat
Database Upgrades
XID Wraparound
Replica Promotion
Partitioning
Sharding

One final note: our goal is to share our lessons learned at a high level rather than providing a detailed how-to guide. Reference materials on the subjects we cover are provided as appropriate.

High Level Overview of Data

As a multichannel messaging platform, our primary data set is subscribers. In terms of push notifications, a subscriber is identified as a push token , subscription status , and data tags (custom key : value pairs of string or number data that can be added to devices through our SDK) that support user segmentation.

We have over one billion monthly active subscriptions, and tens of billions of subscribers whose subscription status is unsubscribed. These records are written to very frequently — every time an app is opened, we will update the last time we saw that subscriber. On the read side, we support both transactional sending (ie, send to a particular subscriber at a particular engagement milestone) and sending to large audiences with specific characteristics (i.e. segments). When notifications are sent to a large segment defined by a variety of parameters, the queries can quickly become complex and take minutes to execute because they may be returning millions of records from a set of tens of millions.

Second to subscribers, notifications is our next largest data set. Records vary greatly in size from very small (e.g. a “send to all my users” notification) and very large when they include lists of specific subscriber IDs. The bulk of these records are written once upon creation, and then various counters and timestamps are added or updated throughout the delivery process. There are very few reads of this data — almost all access post creation is a targeted UPDATE or a query from the OneSignal Dashboard to get an overview of the most recent notifications. There are also occasional exports of the notification data for client apps, but these constitute a very small portion of access. Finally, we run bulk deletes on this data to enforce retention policies. The notification data set is partitioned and sharded similarly to subscribers.

There are some other data sets consuming relatively large amounts of space (~10 percent of overall data stored), but they are less interesting from a dynamics perspective.

In summary, there are two data sets we’ll reference in this article:

subscribers is write-heavy both in terms of INSERT and UPDATE and has the added challenge of frequent, long-running analytical queries to support delivering to segments.
notifications often has quite large records in addition to a heavy UPDATE load and frequent bulk deletions to enforce retention policies.

Bloat

Let’s talk about bloat. First off, what is it? In a broad sense, software bloat is a term used to describe the process whereby a program becomes slower, requires more hardware space, or uses more processing power with each successive version. There are a couple of different kinds of bloat within PostgreSQL.

Table Bloat

Table bloat is disk space consumed by dead tuples within a table that may or may not be reusable by that table and are not reusable by other tables or indices.

Imagine you create a table and insert ten records that occupy one page of disk space per record, without going over. If you delete the first nine records, the space those records occupied does not become available for reuse! Those entries are now considered “dead tuples” because they are not observable by any transactions.

Now, running a VACUUM on this table would allow space within that table to be reused for future INSERTs or UPDATEs, but those pages won’t be available if, for instance, you have a second large table that could use some additional space.

Updates are another source of bloat in PostgreSQL because updates are implemented as a DELETE plus an INSERT. Even if deletes are not common on a data set, a heavily updated table can also become the victim.

So when is vacuum not a sufficiently good solution? This will depend on the shape of your data and corresponding access patterns. For some of our data sets, we started off persisting indefinitely or for a long period of time and later decided to add a retention policy. If such a policy causes the volume of stored data within a table to be reduced from 300GB to 10GB, running a vacuum would allow all that space to be reused by the table. If the steady state storage is around 10 to 15GB, most of that is wasted space. In this situation, using a vacuum won’t help you. For details on how to fix that, skip ahead to the discussion about pg_repack.

Index Bloat

Before trying to understand how index bloat is generated, let’s first review how PostgreSQL indexing works at a very high level.

PostgreSQL indices are direct indices — an index entry contains information about its associated tuple’s position on disk. Couple this with the fact that each UPDATE is effectively a DELETE plus an INSERT, and that means that every time a column is updated, the index entries have to be updated as well, regardless of whether the indexed value was changed.

But wait, there’s more! Due to PostgreSQL’s MVCC approach, the index entries can’t simply be deleted or updated; new index entries must also be added. This causes the same challenge that we encounter with table bloat — dead index entries accumulate over time as rows are updated and deleted. Because tables may have many indices, each write can cascade into many index writes, a phenomenon known as write amplification. The wasted space within indices caused by updates to the table is index bloat.

Before going further, I’d like to call out that there are cases and optimizations where dead space is not created, such as the Heap Only Tuple (HOT) optimization, which allows tuples to be stored adjacent to their previous version and for indices to not always need to be updated. However, HOT comes with some performance trade-offs that affect read performance for index scans.

Concretely

Back to our use case. We said that our subscribers data set is heavily updated and heavily read from. There are 21 indices, which means each update is creating roughly 20 dead entries. The net result of this is rapidly growing disk footprints for the table and its indices.

For notifications, we don’t have quite as many indices, but records are updated pretty frequently once they get to the delivery stage. Coupled with retention policy enforcement, this is a recipe for lots of bloat!

Preventing Bloat

“The best offense is a good defense” rings true when it comes to dealing with bloat. If you can avoid creating it in the first place, you won’t need any fancy solutions to get rid of it.

Tuning autovacuum

autovacuum is a feature wherein the database will automatically spawn VACUUM processes on your behalf. What is vacuuming, though? From the documentation:

VACUUM reclaims storage occupied by dead tuples. In normal PostgreSQL operation, tuples that are deleted or obsoleted by an update are not physically removed from their table; they remain present until a VACUUM is done. Therefore it's necessary to do VACUUM periodically, especially on frequently-updated tables.

From this description, we can surmise that, the more frequently a table is vacuumed, the lower the total storage needed for that relation will be. Vacuuming isn’t free though, and databases often have many relations needing attention. It’s important that you tune autovacuum to run often enough to keep dead space at an acceptable level.

Tuning autovacuum is a large topic deserving of its own article, and thankfully the great folks over at 2ndQuadrant have already written a detailed blog post covering this exact topic.

Schema Optimizations

The first optimization I'll cover addresses how to avoid creating bloat caused by a data retention policy. Using PostgreSQL table partitioning, you can turn one table into several tables and still have the appearance of a single table to your application. There are a few performance considerations to consider when executing table partitioning, so do your research before you begin.

Say that you have a date column in your data table such as created_at and you only want to retain the last 30 days worth of data. To do so, you could create up to 30 partitions, each with a specific date range that will be retained. When enforcing the retention policy, a simple DROP TABLE is used to remove a single partitioned table from the database rather than attempting targeted deletions from the table as a whole. This strategy can prevent bloat from being created in the first place. The pg_partman extension can even automate this process for you!

The next optimization is a bit more nuanced. Let’s say you have a table with two data columns, big_column and int_column. The data stored in big_column is typically around one kilobyte per record and the int_column is updated very frequently. Every update to int_column is going to cause big_column to be copied as well. Because these data columns are linked, updates will create a lot of wasted space, in the order of 1kb per update (modulo disk paging mechanics).

What you can do in this case is split the int_column off to a separate table. When updating it in that separate table, no duplicates of big_column will be generated. Although splitting these columns means that you'll need to use a JOIN to access both tables, it may be worth the trade-off depending on your use case. We use this trick both for the subscribers and notifications data sets. The data tags on subscribers can be multiple kilobytes and columns like last_seen_time are updated very frequently. This significantly reduces bloat rate.

Reducing Bloat

Sometimes bloat happens despite our best attempts to avoid it. Thankfully, there is a wide array of tools both first and third party for dealing with it.

VACUUM FULL

FULL is one of the possible arguments for the VACUUM command. According to the PostgreSQL documentation:

Selects “full” vacuum, which can reclaim more space, but takes much longer and exclusively locks the table. This method also requires extra disk space, since it writes a new copy of the table and doesn't release the old copy until the operation is complete.

They go onto say that this option should only be used in extreme circumstances where lots of space needs to be reclaimed. Indeed, this is the nuclear option. It rewrites the whole table along with all of its indices. The resulting table will have no wasted space, but it comes at the cost of an exclusive lock on the table while the operation is in progress, and this can easily lead to 503 responses from your HTTP layer due to queries timing out while waiting for that lock to be released. DO NOT do this unless you absolutely know what you are doing.

pg_repack

pg_repack is a 3rd party tool built specifically to combat the problems of bloat for which automatic vacuuming is unable to address. There are a couple of different modes of pg_repack, including:

table repack, including all indices.
index repack, specific index.

pg_repack’s table repack function is effectively VACUUM FULL CONCURRENTLY. This means that, rather than holding an exclusive lock on the tables for the entire duration of the operation, an exclusive lock is only used at the very end when replacing the original table with the newly created, bloat-free table.

The index repack mode involves concurrently creating a new identical index to the one being repacked and then replacing the old index once the new one is ready. You can think of it like a REINDEX CONCURRENTLY (which is actually available in PostgreSQL 12).

Most, if not all, of the usage stats, etc. are copied over from the original relation for both modes. However, there are a few drawbacks:

Despite limited locking, table repacking can still be problematic on tables with long running queries and frequent transactional access. If you’re not careful, this tool can still restrict table availability.
Repacking is akin to vacuuming in that you can point it at a relation and have it run. However, unlike vacuuming, there’s no repack daemon to handle automatically repacking bloated relations. This means that it’s a manual process by default.
As part of the table repack, updates to the original table are stored in a log table and applied later. The end of this process is single threaded, so if your write rate is high enough, you could end up in a situation where the repack is never able to finish because the log fills up faster than it can be processed.

We use pg_repack extensively at OneSignal to manage bloat. To get around the lack of a repack daemon, we built one. The daemon currently only automates index repacks, but we hope to extend it to support tables in the future. Index repacks are generally safe because they don’t impose a heavy-weight lock on the table whose index is being repacked. The daemon functions by scanning for indices which appear to be bloated and then simply repacking the next one on the list, ad infinitum.

We don't automate table repacking because our subscriber tables are very risky to repack given the frequency of long-running analytical queries. However, we would love to use this method for our notification tables, which are generally safe to repack. In order to do so safely, we would need to alter our current table configuration to prevent specific tables from being repacked.

There’s one other feature we would love to have, but it’s blocked by either pg_repack or PostgreSQL itself. We’ve found that concurrent index repacks will block one another and cause one of the repacks to fail. Although this is easy to recover from, it means that the repack concurrency is limited to one.

The daemon we created detects when a table is blocking traffic (i.e. high waiting queries) or a database backup is running, and will automatically cancel/idle during those times. We hope to open source this daemon in the future, but we don’t feel that it’s generally ready for external consumption.

pgcompacttable

Finally, we come to pgcompacttable. This tool is excellent for reducing table bloat in a completely nonblocking manner. It works by reordering tuples from the end of a table towards the front of a table, which allows the table size to shrink. We use this on our subscribers tables in place of pg_repack because the latter often causes availability issues. pgcompacttable is slower than pg_repack, which is why we don’t reach for it as a first line of defense.

Database Upgrades

Major Upgrades

Major upgrades of PostgreSQL are used as opportunities to change the on-disk format of data. In other words, it’s not possible to simply turn off version 12 and turn on version 13. Upgrading requires the data to be rewritten in a new format.

There are two upgrade approaches that provide different allowances for service availability. The first option is pg_upgrade. This tool rewrites the database from the old format to the new. It requires that the database is offline while the upgrade is occurring. This requirement is a big problem if you have even a modestly sized data set and availability requirements — which is why we've never used this method to upgrade our database.

Instead, we use logical replication to perform our major version upgrades. Logical replication is an extension of streaming replication typically used for hot standbys. Streaming replication works by writing raw disk block changes from the upstream server to the replica, which makes it unsuitable for performing a major upgrade. The reason why logical replication can be used is that changes are decoded and applied as if a stream of SQL statements was sent to the replica (as opposed to simply writing the page changes to disk).

From a high level, the process looks like:

Set up a new server with the upgraded PostgreSQL version.
Set up logical replication , effectively creating a hot standby on the new version.
Either cut over or gracefully switch to the hot standby. To achieve a graceful switch, the pglogical extension offers more knobs to tweak how replication stream is applied and how conflicts are handled than the built-in logical replication functionality.

There is one major caveat, though. The decoding process on the destination database is single-threaded. If the write load on your database is high enough, it will overwhelm the decoding process and cause the lag to increase until some limit is reached (typically, available disk space).

If you find yourself in the situation where logical replication can’t “keep up,” you basically have one option: move data to another database one table at a time (using logical replication because it supports this type of fine-grained replication). The replication target could be on an upgraded version of PostgreSQL. This means that your application must be able to select different databases for different tables and requires you to handle the switchover in application code.

To get started with logical replication, I recommend first reviewing the official PostgreSQL manual and also check out the pglogical extension, which provides more sophisticated control over conflict resolution under logical replication.

Minor Upgrades

Minor upgrades are almost a footnote after the major upgrade section. Minor upgrades can be performed simply by updating the PostgreSQL binary and restarting the process. Most SLAs will provide plenty of buffer to support a quick restart of a database process. In cases where you need extreme availability, streaming replication and switchover may allow zero downtime minor upgrades.

Process restarts take in the order of tens of seconds with our data set, server size, and load, and we take this simple approach to keep our databases on the latest minor version.

XID Wraparound

Early on in our journey, another issue caused some loss of service: a failure mode known as transaction ID (also referred to as TXID or XID ) wraparound prevention.

PostgreSQL’s MVCC implementation relies on a 32-bit transaction ID. That XID is used to track row versions and determine which row versions can be seen by a particular transaction. If you’re handling tens of thousands of transactions per second, it doesn’t take long to approach the XID max value. If the XID counter were to wrap around, transactions that are in the past would appear to be in the future, and this would result in data corruption.

The phrasing “max value” is simple, but the concept is a bit more nuanced. XIDs can be viewed as lying on a circle or circular buffer. As long as the end of that buffer does not jump past the front, the system will function correctly.

To prevent running out of XIDs and avoid wraparound, the vacuum process is also responsible for “freezing” row versions that are over a certain age (tens of millions of transactions old by default). However, there are failure modes which prevent it from freezing extremely old tuples and the oldest unfrozen tuple limits the number of past IDs that are visible to a transaction (only two billion past IDs are visible). If the remaining XID count reaches one million, the database will stop accepting commands and must be restarted in single-user mode to recover. Therefore, it is extremely important to monitor remaining XIDs so that your database never gets into this state.

The remaining XIDs can be fetched with this query:

SELECT power(2, 31) - age(datfrozenxid) AS remaining
FROM pg_database
WHERE datname = current_database();

We trigger alerts if this value is ever less than 250 million. One of the most important autovacuum parameters influencing vacuum freeze is autovacuum_freeze_max_age. This is the maximum XID age required before a forced vacuum. The number of remaining XIDs when vacuums are triggered is 2^31 - autovacuum_freeze_max_age.

We have gotten into or near XID exhaustion in the past when a wraparound vacuum appears to be stuck running indefinitely. We think this was due to some extremely long-running transactions on the relation being vacuumed, but we can not say conclusively why this occurred. We have not run into this issue for a few years and it’s possible that some of the more recent changes to autovacuum have fixed the issue we encountered. Nonetheless, we keep tabs on this number because of its ability to force a database out of service.

Regretfully, we didn’t post a public post-mortem of our incident with this failure mode. The great folks over at sentry.io actually experienced a similar issue several years ago and wrote a fantastic post-mortem which covers the same issue and recovery in more detail.

For additional information, check out the manual section about preventing XID wraparound failures. If you plan to operate PostgreSQL yourself, I would consider this section of the manual required reading.

Replica Promotion

There are times when you need to promote a replica, possibly emergent. Perhaps you are performing a planned major upgrade, or perhaps you’ve just had a hardware failure on your write node. In either case, you don’t want the promotion to take long.

The way we achieve this is by putting our databases and its replicas behind haproxy. We configure two backends for each logical database: one that is read-write and one that is read-only. The health check for the read-write backend identifies if the database is in recovery mode, indicating whether it’s a replica or not. The read-only pool simply checks that each server is streaming and not lagging. A promotion in this configuration must indicate to haproxy that the current read-write node is unhealthy and simultaneously promote one of the other replicas. This entire process takes a couple of seconds.

This section wouldn’t be complete without mentioning Patroni, which does a lot of the lifting for you to provide a High Availability (HA) PostgreSQL cluster. We don’t use the tool, but if we were starting from scratch today, we'd strongly consider it.

Partitioning

Large tables can be problematic for a lot of reasons, including: disk size, CPU count to serve queries, time taken for index scans, time taken for autovacuums, your ability to manage bloat, and so on. To address these issues, you can partition your tables.

Newer versions of PostgreSQL come with great support for splitting up your tables using their built-in partitioning feature. One advantage of using the built-in support is that you can query one logical table and get results, or split data between a number of underlying tables.

We currently partition both our subscribers and notifications data by tenant across 256 partitions. Due to increasing scale, we will likely be upping that to 4096 partitions — both for reasons of wanting to utilize more servers and also for improving the efficiency of queries and maintenance processes.

When we first increased from 16 to 256 partitions, we considered making the jump immediately to 4096. However, all of the partitions were living on a single server at the time and there were some concerns about file system performance with so many files in a single directory. On modern file systems like EXT4, this isn't an issue because a hash index is used for the directory contents (compared to older file systems which use an unordered list). We have always used EXT4, so this turned out to be an unfounded concern. If you find yourself growing quickly and needing to partition, I recommend creating a lot of partitions upfront to save yourself some trouble later on.

Sharding

Sharding is a natural extension of partitioning, though there is no built-in support for it. Briefly, sharding means to split your data across multiple database processes, typically on separate servers. This means more storage capacity, more CPU capacity, and so on.

Typical applications only have a few big data sets, and thus only those data sets need to be sharded. You might first partition those data sets and then spread those partitions across a number of servers.

If you have more than one application, it's generally a good idea to keep knowledge of the database topology out of your application (including both partition and shard level). We didn't do this initially and we're still dealing with technical debt from that decision. We are, however, making big strides towards creating a data proxy that is the sole application aware of the partition and shard topology.

We originally sharded our partitions by tenant ID — a process that was handled deterministically based on ID ranges. Our tenant IDs are v4 UUIDs, which means that we have roughly an even distribution of tenants on each partition. This was sufficient early on, but we now want the flexibility to move partitions around as part of incremental database upgrades and to isolate larger tenants. The data proxy initiative we have in-progress will support this in the future.

Conclusion

We covered a lot here — and only from a high level. The crazy part is that there’s a lot more we could have covered. The topics we selected are close to our hearts. If there’s another PostgreSQL scaling topic you’re interested in hearing about or have PostgresSQL scaling questions, please don’t hesitate to reach out to let us know. If we don’t know the answer to your question, we will try and direct you towards people who do.

Join Our Team

We're hiring Engineers! If you're interested in joining our team, we'd love to hear from you. Check out our current job openings for more information.

DEV Community