DEV Community

Cover image for Database constraints considered harmful?
Jon Lauridsen
Jon Lauridsen

Posted on

Database constraints considered harmful?

In relational databases it's common practice to create constraints to lock down the data and its relationship to other tables. So, a Users table might have an email column that’s marked as unique because everyone has their own email, right? And a Subscription table gets a non-nullable User column because every Subscription must be associated with a User. Yeah, that also sounds reasonable.

But I'm here to argue against constraints! To some that might sound heretical (or at least very stupid), but hear me out:

We use constraints for two reasons: For performance (it's a lot quicker to search a uniquely-indexed column), and to guarantee data integrity (so we physically can’t enter invalid data, e.g. the business requires a subscription to be tied to a user).

Here's the problem: We programmers often optimize too early, and SQL constraints don't actually express the business requirements!

Let me break that down a bit:

  • On the topic of performance, there are of course valid cases where constraints speed up queries by several magnitudes, but unless we see that performance problem and its subsequent remediation with a constraint we run the risk of prematurely optimizing. Fewer constraints would make the code more malleable, and create less friction for making subsequent changes to the database structure.

    All I'm proposing here is that we look for opportunities to defer database-optimizations until they are truly needed, to counteract our tendency for premature optimizations.

  • Constraints that ensure data integrity really are challenged by an inability to actually capture all business requirements. I mean, in theory database-constraints could fully express any business requirement because e.g. the PostgreSQL-dialect is Turing complete, but that'd be mad! We don't program in SQL, we program in C#, Python, Javascript, etc. Actual business functionality belong in our general purpose languages, not the data-storage language. So what constraints should we put in the database?

    I propose no business-constraints at all. Does that sound barking mad? The more I work with SQL the more I get bit by half-baked "business constraints" that almost-kinda-sort-of duplicates what the code already handles. What does it really give us to mark a Subscription's User column as non-nullable, when the code already ensures a Subscription is only ever created in relation to a User? It amounts to no help at all because constraints are simply not the actual source of truth.

    "But wait" I hear some protest, "it's so simple to add non-nullable to the User column". Yes, but it still does harm, because it fundamentally isn't the source of truth for the business requirement.

At this point we need to discuss the drawbacks of constraints: The more of them there are the more difficult the system becomes to test and adjust.

  • More difficult to test, because to test a Subscription we now also have to create a User, and associate them correctly. Even if a test has nothing to do with the User entity it must be created. And then imagine another handful of entities that have to be put together just right to create a Subscription, it quickly becomes very painful to work with! Then we might start hiding all that test-complexity in entity factories, but those are just abstractions on top of complexity that we have caused for ourselves. Why not get rid of the complexity entirely?

  • And more difficult to adjust because introducing new features ends up requiring careful migrations to keep the constraints synced with code: What if we actually can have subscriptions without a user? What if an email column doesn't have to be unique? If the data-storage layer imposes artificial constraints it slows down quickly experimenting with such ideas.

I think many of us create constraints because we were trained and/or told to do so. But none of us are realistically implementing 100% of business rules in SQL, so, where do we draw the line? Constraints are of course good when they give real, tangible performance benefits, but what would happen if we loosen up the system to let those entities be more malleable and easier to work with and test?

At the end of the day only you know your domain, so there are no general answers to my assertions. What I'm trying to get at though, is a certain level of dogma I've felt that sneaks into relational storage design, where it is taken as fact that we must create constraints that sounds business-logic-compatible because that's just how we do it. Instead, I propose ONLY adding a constraint where and when it is truly and provably necessary. That way databases are allowed to focus just on storing and retrieving data, which is the role they truly excel at.

Photo by Sam Moqadam on Unsplash

Discussion (19)

Collapse
tsvetanganev profile image
Tsvetan Ganev

Interesting take. I've heard the same arguments about foreign keys being evil because they degraded performance. If we follow this logic, we don't need RDBS at all and all applications should store their data in plain binary files. I agree not all software systems require strong data consistency/integrity but we can't scrap 50 years of DB development so easily. Critical systems that work with money in some form must be modeled strictly on the data layer as well as on the application layer.

Collapse
jonlauridsen profile image
Jon Lauridsen Author • Edited on

Yeah fully agreeing. All I really want is to challenge the dogma I’ve sensed at times of always creating constraints. FWIW if we sat down and you argued against foreign keys I’d be fine with that, I trust the code to handle business requirements and see no issue joining manually. I’ve done that before and found no issues.

I see where you’re coming from with your financial comment and that might be true: The data storage constraints can act as a kind of extra bookkeeping check, further enhancing correctness on top of test-automation. But I suspect I’d still want to see how far we can guarantee correctness with layers of test automation… it’d at least be an interesting topic to explore :)

Collapse
mauricebrg profile image
Maurice Borgmeier

In defense of constraints for data integrity: If you need to absolutely ensure that a given value is unique in a column (e.g., your email) and you're operating in an environment where multiple entities change the data (e.g., multiple threads/servers) at the same time, a database constraint is the way to go to avoid race conditions and other funky distributed systems behavior.

Constraints can make testing more complex and impact performance at scale. However, not any workload needs to be scalable to the moon (premature optimization), and often constraints can be a cheap (as in economical) way to solve a problem.

Collapse
fjones profile image
FJones

The database is not the tool to prevent race conditions. It's an easier solution, and works as a stopgap, absolutely, but vulnerability to race conditions on the data layer are an architectural flaw that should be remedied elsewhere.

I often notice this with asynchronous microservices - where the solution is an execution engine or message broker, not relying on the database to resolve race conditions.

Collapse
citizen428 profile image
Michael Kohl • Edited on

This sort of assumes that the main app is the only place where data enters the system. From 2010-2012 I worked on a Rails app where we ingested several GB of data each week from external sources and there was simply no point in instantiating ActiveRecord instances for every row of a CSV when a simple COPY FROM did the job in a fraction of the time. So in this specific case duplicating the most important constraints in SQL made a lot of sense.

Another app I worked on used data collection agents written in different languages that ran on other machines and sent data back to a central system. It was much easier to ensure at least basic data integrity on the DB layer than having different implementations with a potential for drift in each agent.

But I've also worked on a big distributed system composed of many microservices where everything was eventually consistent and there we did indeed not bother much with data integrity on that level but instead relied on reporting and reconciliation scripts.

what would happen if we loosen up the system to let those entities be more malleable and easier to work with and test?

Sounds like you'd enjoy using document stores/NoSQL databases. In the projects I worked on that used those I saw a lot more ad-hoc validations littered throughout the codebase. They weren't always the nicest projects (happens a lot to consultants that mostly work on existing apps), but it was something that stood out to me.

Collapse
moopet profile image
Ben Sinclair

I see your points, and even agree in principle, but disagree with most of them in the real world.

A lot of assumptions people have about anything are wrong. For instance, a lot of services use a unique email as an identifier, but there's no one-to-one mapping between people and email addresses. I have lots (ben@my.work.address, ben@my.play.address...), some people share one (support@my.work.address, thesmithfamily@example.com) and enforcing uniqueness then becomes a business problem instead or you'll get a buttload of users who're breaking your Ts&Cs.

In the real world, code gets deployed which may or may not have been tested well, and tests may or may not cover every eventuality. The QA user buys negative PI beers.

Without constraints, and especially without transactions, data will get corrupted. It's like using C to manage strings as character arrays. You'll screw it up at least once while you're writing your program and see some sort of delicious segfault. If you have the chance to use a string type which handles all that for you, why not do it?

The difference between my string example and the RDBMS of the question is that the database persists longer than your new code. Errors in it go unnoticed, code crashes part-way through processing a "DROP user" and you have dangling subscriptions. Then you need to write a new function called removeOrphanSubscriptions which you call on cron every night at the software witching hour.

And the problem with optimisations is that choosing to not use a constraint because it migh degrade performance is the optimisation, not the other way around.

Collapse
jonlauridsen profile image
Jon Lauridsen Author

I hadn't considered transactions as part of my argument, we're definitely in agreement that transactions form a backbone of reliable writes. There should never be a dangling Subscription because a service crashed halfway through, and if what I wrote implies I'm against transactions then please point out the offending text so I can fix it :)

I think we both agree that the correct place to prevent buying negative beer should be in the code, right? Maybe we'd argue exactly where and how to do it, but from what you write I think we'd both argue the code should be responsible. The code is the responsible system. You describe the case where a mistake in the code would let an error through that'd get caught in the database, and yes you're right constraints might catch such errors, but it's kind of based on luck: Because business requirements can't be (or certainly shouldn't be) modeled entirely in the database, there are no guarantees that errors in the code layer will be caught in the database. It is possible to get lucky and catch an error, but in my opinion & experience it can just as easily be a regex-constraint that's gone out of sync with the code and now produces hard-to-verify errors because of some edge-case where a certain combination of characters result in a database error. Hope is not a strategy, and if the code should be responsible I'm all for focusing all efforts on that layer instead of adding partial safety nets where they don't really belong.

Perhaps constraints as you describe it made a lot more sense when code was managed by different people than the database, because those constraints allow the database-context to establish rules to make sense of its own data. But if one team manages both code and DB? I'm really suspicious of the value a constraint adds in that context, other than taking up even more minutes to writing and verifying a migration.

That said I'm not here to tell anyone how to work, you make good points and as long as we both can work on systems we enjoy I'm all the happier :)

Collapse
aarone4 profile image
Aaron Reese

WOW!
Where to start....
in a properly structured RDBMS where business entities are modelled correctly, constraints are never going to slow down your application. There are plenty of other reasons (missing indexes, partitioning, non SARGable queries) but check constraints and foreign keys are not going to be it.
SOMEONE needs to be responsible for the integrity of the business data. I agree that overly strict RDBMS constraints can APPEAR to slow down development but in reality they do not; they are like the canary in the mine and act as an early warning system that the business entities are not correctly modeled. If you don't do it at the database level you have to do it somewhere else. Most business solutions these days will have multiple entry points: Desktop, mobile and external applications. It is now possible to move the integrity checks to the API layer if it is common to all applications but DO NOT leave it up to the front end devs to encapsulate the business rules.
Strict RDBMS integrity rules FORCE changes in business requirements to be properly considered and investigated. You can't create a Subscription without the User, but without referential integrity built into the database model who is responsible for checking that you can delete the customer and not leave the Subscription orphaned.
I am going to assume that your example of the unique constraint for a User being the email address was contrived, but I fear it may not have been. There is a difference between the entity key and a unique constraint although there are often interchangable. the entity key should be unique AND idempotent (i.e. cannot change) AND ubiquitous (i.e. every entity must have one). For many reasons this will often become a NONCE key; either a UUID or an IDENTITY INT; it has no business significance outside the database. This fundamental failure of understanding is why you MUST engage a database architect and not leave entity definitions to the developers (sorry devs, but strings aren't strings. They are CHARS, VARCHARS, NVARCHARS, TEXT and have a length and a collation - otherwise you waste space and performance in my database). As a Database guy, I would not expect to know all the subtelties of your programming domain so why should you expect to know all the GOTACHAs in mine.
At the end of the day it will be the database people that are tasked with sorting out any crap data that creeps in through the front end and API layers.

NOONE puts data directly into my tables, all APIs call stored procedures for C_UD and will use a stored procedure or view for the R(ead). This is the only way to really validate that the data being entered is fully kosher. You say that we don't do turing complete code in SQL - Sorry. I do (MSSQL) and in the Oracle world PL/SQL is specifically designed to do this.

Collapse
fjones profile image
FJones

A few points:

  • Performance is indeed a touchy subject, and I do agree that database constraints should (at least theoretically) have similar or even better performance than application layer constraints - because the database is in fact optimized specifically for that.
  • Modelling business entities is not the job of the data layer. I know, people will disagree on this, but I firmly believe that the data layer should not be opinionated about business logic. It is storage, and it exists to model data structures, not business structures.
  • As I said in my other comment, the constraints the data layer should be responsible for are relationships (in a relational DBMS), and internal integrity (that is, valid types, limits, and so on).
  • While I agree that a lot of people eschew an API layer, that's bad practice as well. There should absolutely be an API core that models the actual business logic.
Collapse
jonlauridsen profile image
Jon Lauridsen Author • Edited on

It is now possible to move the integrity checks to the API layer if it is common to all applications but DO NOT leave it up to the front end devs to encapsulate the business rules.

Fully agree here. I didn't even realize I was implying anything but keeping the API layer responsible for business requirements.

As for the rest, I really do like your clear angle and rebuttal. I think fundamentally I oppose the sentence "my database", because I want to break down silos and optimize for feeling ownership of the product, not any of the subsystems. That's my positive spin on my angle, but you could just as well say I'm an API-centric chauvinist who wants the world to revolve around the API and you wouldn't be entirely wrong.

Anyway,
I think we're fundamentally pitted against each other because I kinda want developers to not care about CHARS, VARCHARS, etc. To me those are details that distracts the code from being focused on just modeling the business requirements.

Thankfully its fine we're "pitted against each other" because I love hearing your angle, it's so different from mine and I can only learn from that. It err… might be best we don't work on the same project 😅, but thank you for taking the time to elucidate your angle.

Collapse
aarone4 profile image
Aaron Reese

Its not a war, at the end of the day we are all on the same team trying to provide value to the business. My point is that constraints in the RDBMs database are there for a reason; it prevents rogue application code from destroying fundamental business data. If you are not going to use them you might as well use a document database or flat text files. Just like your C# / java / python classes are trying to model the real world, so is the relational database. As it is the last line of defense, it HAS to be correct. If front end and middleware developers are creating workarounds then this is an indication that the database model is out of date.
I once did a lignting talk at our local geek-meet about how code-first ORMs (specifically Entity Framework, but the same applies to Eloquent) can help you to rapidly prototype the application and supporting database, but the data types it uses are not optimised (unless you use Fluent, in which case you might as well use DDL in Management Studio) Using a repository layer and Active Record pattern can help but it is still no substitue for a properly configured database.

Some of the crap I have seen:
Dates stored in datetime fields.
Date and time stored in SEPARATE datetime fields.
dates stored as strings
Y/N held in VARCHAR(1000)
Phone number 1, Phone number 2, Phone number 3 ....

Collapse
jeremyf profile image
Jeremy Friesen

I used to work in insurance, and the very thought of having data integrity enforced at the application layer sends shudders down my spine.

Why? Because what happens if two duplicate emails exist in the users table for the same human being? It creates problems. Which user record gets the bill? Are there two different bills?

I advocate for validations at the layer where if they were missing, they would cause the most mayhem.

Collapse
fjones profile image
FJones

I definitely agree with keeping business constraints out of the database. One common issue we face with business constraints is testability. "This should only run once per day" translates to "let's add the date to our unique key". As a result, we can't run the same process more than once per day, because it clashes with the business constraint in the database, while the code path for testing bypasses those checks intentionally.

Database constraints should verify internal integrity. Even not-null checks are, at times, a hindrance both in operation and in changes of logic. Ensure that relationships are valid, ensure that types are valid - don't go beyond that.

Collapse
jonlauridsen profile image
Jon Lauridsen Author

I hear you state "With constraints you can trust the data".

But I don't think that can be true outside trivial cases, because business requirements aren't enforced in the database. You can constrain some business requirements (e.g. Subscription requires a User), but such constraints are simultaneously non-exhaustive (so the code still needs to validate if you don't trust what it's in the database) and aren't authoritative (because the code is where business requirements are supposed to be modeled).

If instead we say the database's role is only to store data (that is: a single service reads and writes from it), then there is plenty trust in the data because you know exactly what the service writes. A lack of trust there implies someone is messing with production data manually, in which case you either need to change that culture or add validation to all read-actions to ensure business requirement correctness (because database constraints would at best give a shallow appearance of correctness but might stil fail catastrophically inside the code because constraints don't model all business requirements).

I'd heavily suggest changing the culture though.

Collapse
joelbonetr profile image
JoelBonetR • Edited on

Didn't read all the comments (sorry if I duplicate a concern here).

A priority in software is robustness.
You can have a bug executing a query (so you'll need to rollback).
You can have a bug in code that alters the data wrongly.

Having a layer to ensure robustness in each step is preferrable, specially when working with users that pay you money.

You can define a model that can or cannot fullfill the formalization (I would expect, at least 1, 2, 3 and BC), but those normal forms are here for a reason plus they imply constraints.

If you really need to do something fast you can just use an ORM (Code First) without specifying any constraint and migrate in a single command instead doing the create tables in SQL commands.

I would anyway encourage you to analyze the business needs for each feature before coding a single line or executing a single query.

Analyzing the business needs you can find inconsistences and you can define and get to know the big picture and the forecast for the roadmap so you can take early decisions. Let's use the email column as example.

You can state that every user will have an email and that emails are unique and it will be a good assumption, but have you asked and defined with client if they are about to let (at any point in the future) the users to change their email from their account?

If it's a yes, simply add an auto increment ID and let the email as required but not as PK.
If it's a no, then you can use email as PK or ignore the client statement and do the thing defined in the line above, it will work anyway and you protected your future yourself from a possible bad day throwing alter tables and sync-ing the code with the new model.

Collapse
jonlauridsen profile image
Jon Lauridsen Author

If I may push back a bit, what does it matter if the client may consider letting users change their email some time in the future? If they say yes, would you design for functionality that isn't needed now? What if they say yes, but never get to it? What if they say yes, but actually what they need next year is something else? What if they say no, but actually meant yes?

We can both agree it's crucial to talk to our customers :), but I'll assert that in the majority of cases a discussion about "any point in the future" is too vague and too theoretical to be of use. I'd rather suggest we build what is needed right now: A User table with an email column. That's it. In data-storage terms I don't care if it's unique or not as I'll handle that in the more flexible code, all I know in data-storage terms is I need to store an email and right now the need is for it to be associated with a User. No constraints to mimic business logic, no concessions for theoretical futures, just what's needed to make the app work right now.

Collapse
aarone4 profile image
Aaron Reese

I don't care if it's unique or not as I'll handle that in the more flexible code,

Now you have to write it several times if there are multiple entry paths to the database

all I know in data-storage terms is I need to store an email and right now the need is for it to be associated with a User

So let's model the database with a UserContactMethod table:
ID (Primary Key)
UserID (Foreign key to the User table)
Contact method (enumerated value of phone/email/Twitter handle)
ContactMethodValue

Not a lot of effort to design up front and now caters for 0..N email address against a single User, the same email against multiple users and automatic extendability for other contact methods.. NOT modelling this according to the real world entities will likely result in significant remodeling in the future.