Kasey Speakman

Posted on May 16, 2018

Fixing relational DB mistakes: wrong keys

#fix #data #modeling #sql

We have a number of legacy applications which are integrated through a database. If an Integration Database anti-pattern that wasn't bad enough, inexperience led to a design which was not modeled well. Here are a couple of things we fixed and the process we used to do it.

Our scenarios

The first application was written as a companion to the accounting system. So it used data straight out the the accounting database. As it evolved, it had its own database for certain things, but still depended on the data from the accounting program.

Today we are in a situation where new courses must be entered first in our accounting system, then synced over to our system database. We want to reverse this dependency (enter the course in our system and then later sync it to accounting or better yet just make it a line item on generated invoices), but the accounting database ID is what we use as the primary key. And every table which references Course also has its accounting ID. The accounting ID can only be generated inside the accounting system when a new Course is added. So it is not easy to fix. We will have to switch over to using our own ID column that we control.

This technique also works in another case: where we have used the wrong column(s) as the key on a table. For example, there is one table where we always join against it using 5 columns. In the child table these columns are only used to match the parent table. It could easily be replaced by a surrogate key and the columns could be eliminated in the child table. In another case, we are using a name as a key, and this name could change. We want to use an ID instead which is not user-changeable. (If we need a user-changeable ID it will be addition to our ID.)

The process

This process does its best to ensure that the changes can be deployed without breakage after each step. It is a good idea to work with one child table (or family of tables) at a time. Additionally, I suggest using a test environment (e.g. a dev's machine) with a copy of the production data and a copy of the applications which are pointed to the test database.

We are actually in Step 7, so I will update this guide if we learn any significant lessons between 7 and 8.

🛡️ As ever, backup your production database before applying changes. And it doesn't count as a backup process until you have successfully proven the "backup" can be restored.

Asinine legal stuff because short-sighted, abusive people exist: This is provided as-is. There is no warranty. I am not responsible for data loss or quantum singularities that may occur subsequent to (or simultaneously with) reading (or any other method of putting into your brain pan) this guide.

Step 1: Create a new ID column

Parent table

Assuming you do not already have a column you can use as the new ID. You must have a column to use to identify each entity. In our example, we created a UUID column named CourseId on the Course table. We set it to NOT NULL, and use a DEFAULT constraint which generates a new UUID if none is present. For MSSQL the statement looks like this.

ALTER
TABLE Course
  ADD CourseId uniqueidentifier
      NOT NULL
      DEFAULT newid()
      CONSTRAINT UK_Course_CourseId UNIQUE
;

The Accounting ID will stay as the primary key until we have completed the change-over. But we want to still guarantee that CourseId is unique until then.

Child tables

The situation is a little different with the child tables. We aren't going to know what CourseId should go into them at the time that we add the column. So we must allow NULL values for now. We are going to skip creating indexes for now. This is simply going to provide a place for the CourseId to be stored. The MSSQL statement looks like this.

ALTER
TABLE Registration
  ADD CourseId uniqueidentifier
;

⚠️💣 If you have any applications which SELECT * and access columns by index, you may want to double check that this does not break them. If using an ORM, you may also have to update its configuration to include the new column.

Step 2: Change applications to populate the new column

So after Step 1, we have the parent table (Course) populated with the new IDs. But CourseId is NULL on every row of the child table (Registration). We could back-fill the CourseId on existing Registration records. But new records would still get written with NULL values, because applications are not yet aware of it. So we need to make sure that any writes to the child tables are filling in the new CourseId column.

Usually application data is read a lot more frequently than it is written. In our case, we have a handful of places where Registrations are created versus the plethora that read them. So this change was relatively small in scope.

We had to go through the applications and modify INSERT code to also set the CourseId. In some cases, that data was not being provided by the client. So we also had to change the client to pass that data in. These changes are generally easy to test and not very risky. But what you have to do in your applications may vary.

So at the end of this step, the data is not actually being used for anything. But we are just making sure it is getting populated by any application which writes to our child table.

Step 3: Verify all writers are populating the new column

It could be that we missed some applications which write to our child table. For example, maybe it runs automatically on a schedule so we forgot about it. Or maybe it is an integration with another system that we have not touched in 6 months, so it slipped the mind.

So my advice here is to wait some time. Keep a watch on the child table to see if any recently-written records have a NULL id column. If so, track down the application which wrote to it, then repeat Step 2.

Once you feel confident that all writers are populating the new column, proceed to step 4.

Step 4: Back fill the NULL values

At this point, we have the new column being populated in the future, but many past entries still are NULL. So we write an UPDATE statement to fill them in. The records are linked by the improper key currently (CourseAccountingId here), so we can use it to populate the new key.

UPDATE Registration
   SET CourseId = c.CourseId
  FROM Registration r
  JOIN Course c
    ON c.CourseAccountingId = r.CourseAccountingId
 WHERE r.CourseId IS NULL
;

Step 5: Add constraints on the new column in child tables

⚠️💣 If you missed any applications in Step 2, they will start failing after this step. So you probably want to perform one last check that no records have NULL in the new column.

So now we have the column getting populated in the future and all the old data filled in. But we need to guarantee data integrity after new features are implemented later. So we need to add whatever constraints and indexes make sense.

In our case, we first make the column required.

ALTER TABLE Registration
ALTER COLUMN CourseId uniqueidentifier NOT NULL
;

One-to-Many child table

We need to add an index that will be needed for many of our queries, and especially JOIN statements.

CREATE INDEX IX_Registration_CourseId ON Registration (CourseId);

One-to-One child table

In the case of 1-to-(0/1) child tables, you want to make the new column a Primary Key instead of adding a normal index. Here is a query to remove a previous primary key.

ALTER TABLE ChildTable
DROP CONSTRAINT PK_ChildTable
;

You probably want to add a unique index on the old column to ensure good performance for apps which are still using this value for queries.

Here is a statement to change the new column to have a primary key.

ALTER TABLE ChildTable   
ADD CONSTRAINT PK_ChildTable PRIMARY KEY (ParentId)
;

If your child table has any duplicate records, adding this primary key will fail.

Fixing one-to-one child table data

If your child table was supposed to have a 1-to-(0/1) relationship, but you weren't enforcing it with a UNIQUE/PRIMARY key on the old id, then there is a high likelihood that you have duplicate records where there should only be one. So you won't be able to add the primary key until you perform (dangerous!) fixes. How you fix the duplicates is up to you. You could, for example, remove all but one or merge them. I can't help you there, but I can provide a query to help find duplicates in a table where only one of something should exist.

SELECT c.*
  FROM ChildTable c
  JOIN (

-- Get IDs of duplicates in child table
SELECT ParentId
  FROM ChildTable
 GROUP BY ParentId
HAVING COUNT(*) > 1

       ) d
    ON d.ParentId = c.ParentId
 ORDER BY c.ParentId
;

Step 6: Add foreign keys

Now we have guaranteed that the new column is being populated (or else the INSERT will fail). However, it could still happen that a (parent) Course is deleted, which would leave orphan records in the (child) Registration table. So we need to create a foreign key to ensure that the Registration table cannot contain orphans. Here is an example statement to add a foreign key in MSSQL.

ALTER TABLE Registration
ADD CONSTRAINT FK_Registration_Course
       FOREIGN KEY (CourseId)
    REFERENCES Course (CourseId)
;

If the child table has any orphan records, creating this foreign key will fail.

Fixing Data

If you previously didn't have foreign keys on the old id, there is a pretty high likelihood that you have some orphans in the child table. So before you can actually add foreign keys, you have to fix the orphans (remove or link to a new parent). The specific process to do this will vary based on your situation. But I will show you a query that will help find orphans in the child table. This is the MSSQL version.

SELECT r.*
  FROM Registration r
  LEFT
  JOIN Course c
    ON c.CourseId = r.CourseId
 WHERE c.CourseId IS NULL
;

On the face of it, this query doesn't really make sense. But what is happening is we are starting from the Registration table. We are using a LEFT JOIN, which makes sure that every record from the Registration table is retained in the results even if there was no match in the Course table. In those cases where there is no Course match, all columns from the Course table will be NULL. So then the WHERE clause kicks in and says only show me the records where the CourseId was NULL (no match).

Step 7: Fix applications to read the new column

For us, this is the bulk of the process. We have many applications and many different queries against the data. We have to go through and change them all to read from the new column. We also need to make sure we stop referencing any old columns. Remember to update your ORM configuration to remove the old column if applicable.

Test. Test. Test.

Recognize that there are applications (sometimes called Tier 1) which will have a significant impact to your business if they are down. Spend some extra time to go through these thoroughly. But also understand that there are applications which have limited effect on your business (for example, email notifications which will just be delivered later when they fail). If I can be real for a second, for non-critical services it may be faster to deploy them and see exactly how they fail.

After all your affected apps are updated, verify that none of the applications are unexpectedly accessing the old column. One way to do this: in your test environment, just drop the old column (even if you want to keep it in production) and run the programs. Unless you are using the old column on purpose -- in our case, we now only use it to sync to the accounting software -- then all the programs should operate normally without it. And you can just ignore the failure in apps that knowingly depend on the column.

Step 8: Remove or change the old column

Finally we are where we want to be. In our case, we want to change the old column to no longer be a primary key and also to be nullable. Then make our new column the primary key. That will allow us to have full control over the course data, and only populate the accounting ID when we push course changes into the accounting software.

⚠️💣 If there are any applications which haven't been updated to use the new column, they will start failing after this step.

Cleanup

Remove indexes on the old column in the child table:

DROP INDEX IX_Registration_CourseAccountingId ON Registration;

Remove the old ID from the child table:

ALTER TABLE Registration
DROP COLUMN CourseAccountingId
;

Swapping keys

Remove the primary key from the old ID on the parent table.

ALTER TABLE Course
DROP CONSTRAINT PK_Course
;

Remove the unique constraint on the new ID (because we will change it to a primary key).

ALTER TABLE Course
DROP CONSTRAINT UK_Course_CourseId
;

Add the primary key on the new ID.

ALTER TABLE Course   
ADD CONSTRAINT PK_Course PRIMARY KEY (CourseId)
;

I do not use CLUSTERED here, since this is bad for random UUIDs.

Make the old ID nullable.

ALTER TABLE Course
ALTER COLUMN CourseAccountingId varchar(255) NULL
;

Add a unique index onto the old ID because we use it for our accounting integration. This MSSQL version requires a conditional index, but other databases may use a standard unique constraint.

CREATE UNIQUE INDEX UX_Course_CourseAccountingId
    ON Course (CourseAccountingId)
 WHERE CourseAccountingId IS NOT NULL
;

The old accounting integration continues to work, but now we have everything in place in the database to take Course ownership away from the accounting system.

Summary

We walked through the step-by-step process of performing major surgery on our database to try to improve overall system health. All the while we try to perform our surgery in such a way that the patient can keep working after each step. It is not easy, nor quick. So the benefits better be worth it. In our case, having to enter data in the accounting system first is a major blocker to many use cases we want to implement. So the cost is deemed worth it.

I'd like to note here that drastic amounts of time would have been saved if things had been thought through, filtered through the lens of experience, and decided correctly the first time. But experience (and the budget it would require) was unavailable at the time the system started. In absence of experience, the unfortunate reality is that learning is iterative -- we usually don't know better until AFTER we make the mistakes and feel the pain. Even when the inexperienced use patterns "because somebody smart said so," they really don't know what they are doing and can apply patterns very wrongly. So I believe most of the mistakes couldn't be helped. And the fact that we now really need to fix it due to growth: "Good problems to have."

Top comments (3)

Martin Häusler • May 18 '18

When I see something like this, I get a very strong inner urge to run as far away as humanly possible and hide under the next rock. These things are no fun, I've been in similar situations myself, sometimes it cannot be avoided. Thanks for the thorough explanation.

Kasey Speakman • May 18 '18 • Edited

We avoided making this change for literally years because it was too hard. But once we sat down and mapped out how we could pull it off without taking the system down, it turned out it wasn't so bad. Definitely a lot of work, but it will let us alleviate many different pain points for our users. And package it for the cloud.

Barry O Sullivan • May 27 '18

Great article Kasey.

DEV Community