Shiv Sarthak Sabhlok

Surviving the Scale: Inside Spur's Database Partitioning Makeover with PostgreSQL

Introduction

In the fast-paced world of eCommerce, communication is the backbone that enables a seamless customer experience.
At Spur, we've always been at the forefront of this, using technology to make eCommerce communication not just efficient but magical. However, as our platform has grown, so too have the technical challenges we face. The scalability of our database, particularly the Messages table, became a bottleneck that needed immediate attention.

At the helm of this ambitious project were two dynamic individuals: Rohan Rajpal, Co-Founder of Spur, and Shiv Sarthak Sabhlok, Founding Engineer and the author of this article.

In this article, we'll delve into the intricacies of this scaling operation: from the challenges we faced in implementing partitioning with Prisma to the hurdles we overcame in migrating from our staging to production environments. By sharing our experiences, we hope to provide valuable insights for those looking to undertake similar scaling challenges.


Background

Here at Spur, PostgreSQL serves as our primary database, with Prisma as the accompanying ORM. While this duo is often the first choice for developers looking to quickly bootstrap an application—ourselves included—we've found ourselves in a love-hate relationship with Prisma. Alongside various performance concerns discussed in this blog, we discovered that Prisma initially seemed to hamper our table partitioning ambitions. In this article, we'll delve into our journey of implementing table partitioning for our Messages table at Spur. We'll cover the challenges we faced, the missteps we took, the solutions we enacted, and the significant performance gains we ultimately achieved. Buckle up for an exhilarating ride through the world of database performance optimization.

Inception

As Spur's user base expanded, so did the sheer volume of messages we processed daily. While this uptick was a promising indicator of our platform's growing appeal, it presented us with a dilemma—degraded query performance and extended load times.

The most pronounced impact was on our inbox module, which struggled due to the burgeoning number of rows in the Messages table and the way we retrieved the latest conversations for users. We resorted to SQL query refactoring and judicious index additions as immediate but temporary fixes. Even with these hotfixes, query times remained suboptimal, taking between one and two minutes for some users with extensive message histories.
Spur's Inbox

Though we were acquainted with the theory of table partitioning—a feature that PostgreSQL provides to segment large tables into smaller, more manageable "partitions"—none of us had hands-on experience implementing this technique in a PostgreSQL environment.

Thus, our odyssey into the realm of database partitioning commenced, sparked by a simple message in an otherwise nondescript Slack thread.

Where it all started

El Plan

Given our lack of hands-on experience with table partitioning, we first set out to validate its applicability to our specific situation. Data analysis revealed that we were accumulating roughly 2 million messages each month—a number that would only escalate as we expanded.

Distribution of existing messages

In the ensuing days, we rigorously researched partitioning, grasping its mechanics and identifying the best way to adapt it to our use case. The implementation steps appeared deceptively straightforward:

  1. Create a parent table as usual, specifying the partitioning key and method.
  2. Create and attach child partitions to the parent table.

That's it: you can then query the parent table as you normally would, and Postgres handles the rest.

Here's a basic example:

CREATE TABLE messages (
    id         int NOT NULL,
    createdAt  date NOT NULL,
    content    text
) PARTITION BY RANGE (createdAt);

CREATE TABLE messages_2023_01 PARTITION OF messages
    FOR VALUES FROM ('2023-01-01') TO ('2023-02-01');
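
To see this in action, running EXPLAIN against the parent table with a filter on the partition key should show only the matching partition being scanned:

-- Partition pruning: only messages_2023_01 should appear in the resulting plan
EXPLAIN SELECT * FROM messages
WHERE createdAt >= '2023-01-10' AND createdAt < '2023-01-20';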

Important considerations included:

  • Existing tables in Postgres cannot be directly partitioned.
  • Creating indexes on the parent table will also generate indexes for each attached child partition.
  • Unique constraints (and primary keys) on the parent table must include the partition key, since each partition enforces uniqueness only within itself (illustrated just below).
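
A minimal illustration on the example table above (not our actual schema):

-- Rejected: the constraint does not include the partition key (createdAt)
-- ALTER TABLE messages ADD CONSTRAINT messages_id_key UNIQUE (id);

-- Accepted: the key includes the partition key, so each partition can enforce it
ALTER TABLE messages ADD CONSTRAINT messages_id_createdat_key UNIQUE (id, createdAt);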

Taking all factors into account, we resolved to proceed with partitioning. While some could argue that query optimization and database refactoring might yield temporary gains, we believed that partitioning was an inevitable step, SPURred (pun intended) partially by our sheer enthusiasm to try something new.

Our implementation plan was crafted to ensure both the retention of existing messages and the successful insertion of new ones. We chose the createdAt date as our partitioning key and defined each partition's range as one month.

The Plan:

  1. Rename our existing Messages table to Messages_Legacy.
ALTER TABLE "Messages" RENAME TO "Messages_Legacy";
  2. Create a new partitioned Messages table with the original name, partitioned by range on the createdAt column.
CREATE TABLE "Messages" (
   -- Columns
) PARTITION BY RANGE ("createdAt");
  3. Create and attach a new partition for the current month's messages, defining the range to encapsulate all messages for this specific month.
CREATE TABLE "Messages_2023_10" PARTITION OF "Messages"
FOR VALUES FROM ('2023-10-01') TO ('2023-11-01');
  4. Migrate the data from Messages_Legacy to this new current month's partition.
WITH moved_rows AS (
    DELETE FROM "Messages_Legacy"
    WHERE "createdAt" >= '2023-10-01' AND "createdAt" < '2023-11-01'
    RETURNING *
)
INSERT INTO "Messages_2023_10" SELECT * FROM moved_rows ON CONFLICT DO NOTHING;
  5. Attach the Messages_Legacy table as a partition, covering the remaining date range, from the inception of the database to the end of the previous month.
ALTER TABLE "Messages" ATTACH PARTITION "Messages_Legacy" FOR VALUES FROM ('2023-01-01') TO ('2023-10-01');
  6. Asynchronously move older messages from Messages_Legacy to their respective new partitions at a later stage, allowing us the flexibility to manage this data migration at our own pace (one possible shape for this backfill is sketched below).

The elegance of this approach lay in its near-instantaneous execution for all steps except data migration, thus promising zero downtime.
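
For step 6, one possible shape for the deferred backfill is to peel one month at a time off the legacy partition. This is only a sketch under assumptions: the names follow the plan above, the dates are illustrative, and the exact batching was left open at this point.

BEGIN;

-- Temporarily detach the legacy partition so its range can be narrowed afterwards
ALTER TABLE "Messages" DETACH PARTITION "Messages_Legacy";

-- Create the partition for the month being backfilled
CREATE TABLE "Messages_2023_09" PARTITION OF "Messages"
    FOR VALUES FROM ('2023-09-01') TO ('2023-10-01');

-- Move that month's rows out of the legacy table
WITH moved_rows AS (
    DELETE FROM "Messages_Legacy"
    WHERE "createdAt" >= '2023-09-01' AND "createdAt" < '2023-10-01'
    RETURNING *
)
INSERT INTO "Messages_2023_09" SELECT * FROM moved_rows ON CONFLICT DO NOTHING;

-- Reattach the legacy table with a range that no longer covers September
ALTER TABLE "Messages" ATTACH PARTITION "Messages_Legacy"
    FOR VALUES FROM ('2023-01-01') TO ('2023-09-01');

COMMIT;

The trade-off is that the non-concurrent DETACH takes an ACCESS EXCLUSIVE lock on the parent for the duration of the transaction, so in practice this is something to run off-peak or in smaller batches.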

Implementation

We rely on Prisma for our database migrations. At the time of this writing, Prisma doesn't natively support table partitioning. However, it does offer a --create-only flag that allows us to inject custom SQL into migration files. Accordingly, we divided our partitioning strategy into two separate migrations:

  1. Creating a partitioned message table.
  2. Migrating data from the existing message table to the new partitions.

I initiated the process by writing migration scripts for both steps and running them on a local test database.

Here's how Prisma operates: it replays all migration SQL files against a temporary shadow database. It then introspects the resulting schema and compares it with the schema defined in the schema.prisma file. If they match, you're good to go. If not, Prisma automatically generates a new migration file to reconcile the differences. Our task was straightforward: implement partitioning in a way that aligned with the table definition in the schema.prisma file.

Upon executing my migrations, Prisma calculated a schema diff and generated a new migration to resolve it. I was expecting this; mastering table partitioning on the first attempt is unlikely. When I inspected the new migration, I noticed it was creating additional indexes on the partitioned table, despite the fact that I'd already defined indexes during the initial migration.

This discrepancy occurred because Prisma tracks indexes and foreign keys based on their names. The old message table had indexes and foreign keys with identical names to those in the new partitioned table. To resolve this, I renamed the indexes and foreign keys in the old table before proceeding.
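
Concretely, the renames were plain ALTER statements along these lines; the index and constraint names here are hypothetical stand-ins for the real ones:

-- Free up the old names so the partitioned table's indexes and foreign keys can reuse them
ALTER INDEX "Message_conversationId_idx" RENAME TO "Message_Legacy_conversationId_idx";
ALTER TABLE "Message_Legacy" RENAME CONSTRAINT "Message_conversationId_fkey" TO "Message_Legacy_conversationId_fkey";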

Renaming Indexes and foreign keys

While this issue was specific to Prisma's tracking mechanism, it warranted attention.

During subsequent runs, another auto-generated migration tried to set a primary key on the id column of the Message table. This presented a challenge: PostgreSQL requires that any primary key (or unique constraint) on a partitioned table include every column of the partition key. We resolved this by establishing a composite primary key on (id, createdAt).
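
In SQL terms, the distinction looks like this; the error text is approximate:

-- Rejected: the primary key omits the partition key column
-- ALTER TABLE "Message" ADD PRIMARY KEY (id);
-- ERROR: unique constraint on partitioned table must include all partitioning columns

-- Accepted: a composite primary key that includes the partition key
ALTER TABLE "Message" ADD PRIMARY KEY (id, "createdAt");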

This led us to an oversight in our table definition. Initially, our SQL looked like this:

CREATE TABLE "Message"
(
    id                       serial,
    content                  jsonb,
    "sentViaSpur"            boolean DEFAULT false NOT NULL,
    ...
)
PARTITION BY RANGE ("createdAt");

Declaring id as serial here would have created a brand-new sequence, so newly generated IDs could collide with those already assigned in the legacy table. Although Prisma's findUnique on id alone loses its guarantee, we worked around this by pointing the column's default at the existing message ID sequence, preserving uniqueness in practice. We also added an index on the id column to mimic the lookup performance of a primary key.
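
That extra index is a one-liner; the index name below is hypothetical, and creating it on the parent automatically creates a matching index on every partition:

CREATE INDEX "Message_id_idx" ON "Message" (id);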

The final table partition definition became:

CREATE TABLE "Message"
(
    id                       integer DEFAULT nextval('"Message_id_seq"'::regclass) NOT NULL,
    content                  jsonb,
    "sentViaSpur"            boolean DEFAULT false NOT NULL,
    ...
    PRIMARY KEY (id, "createdAt")
)
PARTITION BY RANGE ("createdAt");

Consequently, we updated the schema.prisma file to remove the primary key attribute from the id column and instead declared a composite primary key on (id, createdAt).

After another round of meticulous checks and tests on the local database, everything finally fell into place.

The Dry Run

Before deploying the migrations to our production database, we knew it was crucial to conduct a thorough test in a staging environment. Therefore, we spun up an instance on AWS using the most recent snapshot of our production database.

Executing the first migration was swift, concluding almost instantaneously as anticipated. The second migration, which involved transferring data to the new partitions, was the real litmus test. It took approximately 5.4 hours to complete on the staging database.

Simultaneously, we carried out tests to insert new messages into the system. Importantly, these operations were neither blocked nor did they fail during the migration process. This implied that we could carry out the migration without any service downtime. The only trade-off was that users would temporarily lose visibility into their historical messages due to the detachment of the original message table.

After the migrations were completed, the performance gains were immediately evident. The staging database was remarkably fast, with inboxes loading in less than a second. The application's overall performance saw a marked improvement. For us, this was a preliminary victory, confirming that we were on the right track.

Moment of Truth

All systems were go after a thorough round of testing the previous night. We strategically chose to merge the pull request and initiate the migrations on our production database at 9pm. This timing was selected to capitalize on lower user activity, with the expectation that the migration would complete before dawn, minimizing user impact.

The first migration executed flawlessly, and our production database was now partitioned. For the initial hour, I vigilantly monitored system stability. All signals were green, prompting us to adjourn for the night, planning to revisit in the morning.

Upon checking at 7am, the migration was still in progress. By this time, it had been running for approximately 10 hours—nearly twice as long as it had in our staging environment. AWS monitoring showed no excessive load on the database instance, leading us to extend our timeline. By 11am, we notified our users about the temporary message unavailability.

Through PostgreSQL query monitoring, we observed both row-level and advisory locks awaiting grants, specifically tied to the second migration. We hypothesized that frequent incoming writes to the production database were inducing a lock-wait cycle, thereby slowing down the migration process.
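
The kind of check we kept running looks roughly like this generic pg_locks/pg_stat_activity query (not our exact monitoring tooling):

-- List backends whose lock requests have not yet been granted, with the query they are running
SELECT a.pid, a.state, l.locktype, l.mode, l.granted, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE NOT l.granted
ORDER BY a.pid;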

As a contingency, we decided to let the migration run until the next morning while also preparing a fallback plan. Upon waking, the first thing I did was check the tmux session running the migration. After 36 hours, it was still chugging along, compelling us to halt the process.

Our remediation strategy was a single SQL transaction with three steps (sketched after the list):

  1. Detach the most recent message partition.
  2. Reattach it with an updated date range, spanning from the timestamp of the earliest message it contains to the end of the current month.
  3. Attach the old message table as a new partition.
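
A minimal sketch of that transaction, with hypothetical partition names and an illustrative mid-October cutoff in place of the real one:

BEGIN;

-- 1. Detach the partition that has been receiving new writes
ALTER TABLE "Message" DETACH PARTITION "Message_2023_10";

-- 2. Reattach it so its range starts at the earliest "createdAt" it actually contains
ALTER TABLE "Message" ATTACH PARTITION "Message_2023_10"
    FOR VALUES FROM ('2023-10-18') TO ('2023-11-01');

-- 3. Attach the legacy table to cover everything before that cutoff
ALTER TABLE "Message" ATTACH PARTITION "Message_Legacy"
    FOR VALUES FROM ('2023-01-01') TO ('2023-10-18');

COMMIT;

One caveat worth noting: ATTACH PARTITION validates the range by scanning the table unless a matching CHECK constraint already exists, so attaching a large legacy table is not instantaneous on its own.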

After executing these steps without errors, we manually fixed the failed Prisma migration. Upon launching our application, all messages were intact and, surprisingly, performance had already improved. Given that the newest partition had already accumulated a substantial number of records over the past day, the long-running effort to move old messages began to feel somewhat redundant. Now we simply needed to manage October's messages across two separate partitions.

An attempt at something

In Retrospect

Our journey through this database scaling exercise was a humbling experience that reaffirmed the age-old adage: anything that can go wrong will go wrong, especially at scale. Fortunately, PostgreSQL proved to be an incredibly resilient and robust system; despite our numerous modifications on live production data, it maintained the data's integrity impeccably.

In a live production setting, numerous unforeseen factors come into play, making it distinct from a controlled staging environment. For instance, the row-level locks taken during query execution were acquired and released without causing any data issues, yet the migration queries still appeared to hang. It underscored how concurrent operations, lock contention, and resource limitations can significantly alter query behavior and performance in production compared to staging.

So, the key takeaway is that while planning and testing are crucial, one must also be prepared for the unexpected complexities that only a live environment can throw at you. Scaling is never just a 'lift-and-shift' operation; it's a complex endeavor requiring contingency planning, real-time problem-solving, and a deep understanding of the technologies you're working with.

As a testament to our journey and the hard work that went into overcoming these challenges, we captured a moment of triumph. Here's our victory photo, taken right after everything fell into place.

Rohan and Shiv after successfully scaling our App
