Next XYZ Courses (9 Part Series)
Normalization is a technique for organizing data in a database. It is important that a database is normalized to minimize redundancy (duplicate data) and to ensure only related data is stored in each table. It also prevents any issues stemming from database modifications such as insertions, deletions, and updates.
The stages of organization are called normal forms. In this tutorial we will be redesigning a database for a construction company and ensuring that it satisfies the three normal forms:
- Data is stored in tables with rows uniquely identified by a primary key
- Data within each table is stored in individual columns in its most reduced form
- There are no repeating groups
- Everything from 1NF
- Only data that relates to a table’s primary key is stored in each table
- Everything from 2NF
- There are no in-table dependencies between the columns in each table
Note that there are actually six levels of normalization; however, the third normal form is considered the highest level necessary for most applications so we will only be discussing the first three forms.
Let's get started!
This tutorial is adapted from Next Tech's Database Fundamentals course which comes with an in-browser MySQL database and interactive tasks and projects to complete. You can get started here for free!
The database we will be working with in this tutorial is for Codey's Construction company (Codey is a helpful coding bot that works with you in the course mentioned earlier). As you can see from the schema above, the database contains the tables
project_employees. Recently, they have decided to add the table
customers to store customer data.
Unfortunately, this table has not designed in a way that satisfies the three forms of normalization... Let's fix that!
First normal form relates to the duplication and over-grouping of data in tables and columns.
Codey’s Construction's table
customers violates all three rules of 1NF.
- There is no primary key! A user of the database would be forced to look up companies by their name, which is not guaranteed to be unique (since unique company names are registered on a state-by-state basis).
- The data is not in its most reduced form. The column
contact_person_and_rolecan be further divided into two columns, such as
- There are two repeating groups of columns - (
project1_feedback) and (
The following SQL statement was used to create the
CREATE TABLE customers ( name VARCHAR(255), industry VARCHAR(255), project1_id INT(6), project1_feedback TEXT, project2_id INT(6), project2_feedback TEXT, contact_person_id INT(6), contact_person_and_role VARCHAR(300), phone_number VARCHAR(12), address VARCHAR(255), city VARCHAR(255), zip VARCHAR(5) );
By modifying some columns, we can help redesign this table so that it satisfies 1NF.
First, we need to add a primary key column called
id with data type
ALTER TABLE customers ADD COLUMN id INT(6) AUTO_INCREMENT PRIMARY KEY FIRST;
With this statement, we added an automatically incrementing primary key as the first column in the table.
To satisfy the second condition, we need to split the
ALTER TABLE customers CHANGE COLUMN contact_person_and_role contact_person VARCHAR(300); ALTER TABLE customers ADD COLUMN contact_person_role VARCHAR(300) AFTER contact_person;
Here, we simply renamed it as
contact_person, and added a column
contact_person_role immediately after it.
To satisfy the third condition, we need to move the columns containing project IDs and project feedback to a new table called
project_feedbacks. First, let's drop these columns from the
ALTER TABLE customers DROP COLUMN project1_id, DROP COLUMN project1_feedback, DROP COLUMN project2_id, DROP COLUMN project2_feedback;
And then create the
CREATE TABLE project_feedbacks ( id INT(6) AUTO_INCREMENT PRIMARY KEY, project_id INT(6), customer_id INT(6), project_feedback TEXT );
As you can see, there are no more repeating groups in either the
project_feedbacks table or the
customers table. We still know which customer said what since
project_feedbacks.customer_id refers back to the
customers table satisfies 1NF! Let's move on to second normal form.
To achieve second normal form, a database must first satisfy all the conditions for 1NF. After this, satisfying 2NF requires that all data in each table relates directly to the record that the primary key of the table identifies.
We are in violation of 2NF because the
phone_number columns track data that relate to the contact person, not the customer. If the contact person for a customer changes, we would have to edit all of these columns, running the risk that we will change the values in one of the columns but forget to change another.
To help Codey's Construction fix this table to satisfy 2NF, these columns should be moved to a table containing data on the contact person. First, let's remove the columns in
customers that are not related to our primary key:
ALTER TABLE customers DROP COLUMN contact_person, DROP COLUMN contact_person_role, DROP COLUMN phone_number;
Note that we kept the
contact_person_id so we still know who to contact. Now, let's create our new table
contact_persons so we have somewhere to store data about each contact.
CREATE TABLE contact_persons ( id INT(6) PRIMARY KEY, name VARCHAR(300), role VARCHAR(300), phone_number VARCHAR(15) );
Now, if the contact person for a customer changes, the construction company just has to insert a record into the
contact_persons table and change the
contact_person_id in the
For a database to be in third normal form, it must first satisfy all the criteria for 2NF (and therefore, also 1NF).
Then, each column must be non-transitively dependent on the table’s primary key. This means that all columns in a table should rely on the primary key and no other column. If
column_a relies on the primary key and also on
column_a is transitively dependent on the primary key so the table does not satisfy 3NF.
Does your brain hurt from reading that? Don't worry! It's explained more below.
The table currently has transitively dependent columns. The transitively dependent relationship is between
zip. The city in which a customer is located relies on the customer, so this satisfies 2NF; however, the city also depends on the zip code. If a customer relocates, there may be a chance we update one column but not the other. Because this relationship exists, the database is not in 3NF.
To fix our database to satisfy 3NF, we need to drop the
city column from
customers, and create a new table
zips to store this data:
ALTER TABLE customers DROP COLUMN city; CREATE TABLE zips ( zip VARCHAR(5) PRIMARY KEY, city VARCHAR(255) );
That's it! Finding issues that violate 3NF can be difficult, but it's worth it to ensure that your database is resilient to errors caused by only partially updating data.
I hope you enjoyed this tutorial on database normalization! Codey's Construction's database now satisfies the three forms of normalization.
If you'd like to continue learning about databases, Next Tech's Database Fundamentals course covers all you need to know to get started with databases and SQL. By helping an interactive coding bot named Codey, you will learn how to create and design databases, modify data, and how to write SQL queries to answer business problems. You can get started for free here!