Several months ago we wanted to move an internal Rails application out of the Heroku ecosystem because it had grown too important and resource intensive to maintain there. This application (test-failures) collects, stores, and displays CI test results for a variety of our projects. Since we take a strong TDD approach in all teams at Doctolib, test-failures is the beating heart of all our devs work. The most critical—and trickiest—asset to migrate was our very large (>500GiB) and very active (>7000 commits per second) PostgreSQL database.
Since test-failures has become such an irreplaceable part of the day to day work for our entire tech team and in crisis situations, we wanted to minimize downtime as much as possible, even on nights and weekends. Testing showed that a full dump and restore of the database takes at least 16 hours! To migrate this way would require a full day of downtime, which would have been unacceptable. Obviously the ideal would be to use native PostgreSQL replication to mirror the old and new databases. However, we were disappointed (but not surprised) to learn that Heroku does not support PostgreSQL replication to another server outside of their infrastructure, in our case to Amazon Aurora PostgreSQL-compatible RDS. Not taking “no” for an answer, we decided to produce an applicative solution for migrating the data to our shiny new Aurora cluster without downtime.
The idea was to start by running a
pg_dump (primary, Heroku) →
pg_restore (replica, Aurora) during normal database operations. This will populate the replica database schema and bring it to a state close to the primary database. Next we need to continually update the replica database with the Δ from primary database. By leveraging the “updated at” timestamp for each row, we can query all rows which have “updated at” timestamps later than the start of a dump and insert those rows with an
ON CONFLICT clause to update with new values (a bulk “upsert” operation). Repeating the query again against the time we started the first upsert and looping we can achieve a result very similar to logical replication. This only works because we never
DELETE rows only
UPDATE them, otherwise we would have no record of what rows need to be deleted from the replica, however we would have been able to “soft-delete” rows by setting a “deleted at” timestamp and ignoring those rows in our main Rails application.
ON CONFLICT clause used to perform an upsert operation:
INSERT INTO employee (id, created_at, name, salary) VALUES (1, '2018-06-13 21:10:34.769555+08', 'Jane Doe', 123456) ON CONFLICT (id) DO UPDATE SET created_at=EXCLUDED.created_at, name=EXCLUDED.name, salary=EXCLUDED.salary;
In the end, this allowed us to keep a replica of the Heroku PostgreSQL database in sync with our Aurora database with less than 15s of latency under full load! Our required downtime was reduced from more than 8 hours to do a dump-restore offline to less than 5 minutes just to switch the database configuration in our application. The Go source code is available here, it was useful for us and I hope it can be useful for someone else too. There’s not full support for all modern PostgreSQL data types, but there was enough for us. Of course, I'd never suggest this over native replication unless you have no choice. Use it wisely if you use it at all ... and have a backup on hand. 😉