Databases, while something I don't work with every day, have always been an interest of mine. I used to spend a lot of time on the StackOverflow MySQL tag just for fun. I've also spent a lot of time discussing database design with my roommate, and noticed a reoccurring theme among these conversations - we spent a lot of time discussing how to structure the database.
Questions that usually arise are "how should I organize this table" or "how do I handle a relationship between these two entities?" These questions are very important, because if we don't design our database properly, we can end up with something that will not be easily maintainable in the long run.
Let's begin with an example of an improperly designed table, and the risks we face as a result. Consider being asked to design a database table to record a student's name, year, and the classes they're taking. We may end up with one like this:
This table meets the requirements listed, but has a couple problems. This table has redundant data, and also lacks integrity. These two concepts are essential to proper database design.
Data is redundant if it appears more than once. I'm not referring to a specific word or identifier appearing multiple times. In the example above, there are two rows that say Adam is a freshman. This is redundant data that doesn't need to be captured twice, or even a third time if Adam took another class.
Lifting an explanation from Wikipedia, "data integrity is the maintenance of, and the assurance of the accuracy and consistency of, data over its entire life-cycle."
In simple terms, I should be able to update, insert, and remove data without any negative effects on the other data. Consider our students table in the last section. It has the possibility for three data anomalies:
- Update Anomaly - If I updated row 2 to modify the year column to be "Sophomore", I would have one record where Adam is a freshman and one where he is a sophomore. This inconsistency would not allow me to determine which row is correct.
- Insertion Anomaly - If I wanted to insert just a new student, I couldn't do so. At least, not without setting the "class" column to null. Similarly, I can't add a record for a class without a corresponding student, and year. This limitation prevents us from properly recording data that may otherwise be important to us.
- Deletion Anomaly - If I were to delete row 3, I have two problems. I've lost any record of Prince being a student, and I've lost any record of History being a class. This is a problem because I lose two pieces of information, even if I only want to delete one of them.
So how do we design our databases to avoid these two problems? The answer is normalization. Database normalization is a process of organizing a database to avoid redundant data and ensure data integrity. A normalized database would not be at risk of these insert, update, and delete anomalies.
Normalization isn't defined by one single criterion. In other words, you don't just say "if we have X, then our database is normalized."
Instead, normalization is defined through a series of normal forms, each one building upon the last to ensure data integrity and avoid redundancy. There are several normal forms, and throughout this series we are going to work through three of them: first normal form (1NF), second normal form (2NF), third normal form (3NF).
In the next post we'll learn about this first normal form and how to design tables that comply with 1NF.