Before we go any further, let's look at some commonly used terms in normalization and denormalization.
Primary key: Column or group of columns that can be used to uniquely identify every row of the table.
Composite key: A primary key made up of multiple columns.
Super key: Set of all keys that can uniquely identify all the rows present in a table.
Candidate key: Attributes that identify rows uniquely in a table.
Foreign key: It is a reference to a primary key of another table.
Alternate key: Keys that are not primary keys are known as alternate keys.
Surrogate key: A system-generated value that uniquely identifies each entry in a table when no other column was able to hold properties of a primary key.
Partial dependency: Occurs when the primary key determines some other attributes.
Functional dependency: It is a relationship that exists between two attributes, typically between the primary key and non-key attribute within a table.
Transitive functional dependency: Occurs when some non-key attribute determines some other attribute.
Database anomaly happens when there is a flaw in the database due to incorrect planning or storing everything in a flat database. This is generally addressed by the process of normalization.
There are three types of database anomalies:
Insertion anomaly: Occurs when we are not able to insert certain attributes in the database without the presence of other attributes.
Update anomaly: Occurs in case of data redundancy and partial update. In other words, a correct update of the database needs other actions such as addition, deletion, or both.
Deletion anomaly: Occurs where deletion of some data requires deletion of other data.
Let's consider the following table which is not normalized:
Let's imagine, we hired a new person "John" but they might not be assigned a team immediately. This will cause an insertion anomaly as the team attribute is not yet present.
Next, let's say Hailey from Team C got promoted, to reflect that change in the database, we will need to update 2 rows to maintain consistency which can cause an update anomaly.
Finally, we would like to remove Team B but to do that we will also need to remove additional information such as name and role, this is an example of a deletion anomaly.
Normalization is the process of organizing data in a database. This includes creating tables and establishing relationships between those tables according to rules designed both to protect the data and to make the database more flexible by eliminating redundancy and inconsistent dependency.
The goal of normalization is to eliminate redundant data and ensure data is consistent. A fully normalized database allows its structure to be extended to accommodate new types of data without changing the existing structure too much. As a result, applications interacting with the database are minimally affected.
Normal forms are a series of guidelines to ensure that the database is normalized. Let's discuss some essential normal forms:
For a table to be in the first normal form (1NF), it should follow the following rules:
- Repeating groups are not permitted.
- Identify each set of related data with a primary key.
- Set of related data should have a separate table.
- Mixing data types in the same column is not permitted.
For a table to be in the second normal form (2NF), it should follow the following rules:
- Satisfies the first normal form (1NF).
- Should not have any partial dependency.
For a table to be in the third normal form (3NF), it should follow the following rules:
- Satisfies the second normal form (2NF).
- Transitive functional dependencies are not permitted.
Boyce-Codd normal form (or BCNF) is a slightly stronger version of the third normal form (3NF) used to address certain types of anomalies not dealt with by 3NF as originally defined. Sometimes it is also known as the 3.5 normal form (3.5NF).
For a table to be in the Boyce-Codd normal form (BCNF), it should follow the following rules:
- Satisfied the third normal form (3NF).
- For every functional dependency X → Y, X should be the super key.
There are more normal forms such as 4NF, 5NF, and 6NF but we won't discuss them here. Check out this amazing video that goes into detail.
In a relational database, a relation is often described as "normalized" if it meets the third normal form. Most 3NF relations are free of insertion, update, and deletion anomalies.
As with many formal rules and specifications, real-world scenarios do not always allow for perfect compliance. If you decide to violate one of the first three rules of normalization, make sure that your application anticipates any problems that could occur, such as redundant data and inconsistent dependencies.
Here are some advantages of normalization:
- Reduces data redundancy.
- Better data design.
- Increases data consistency.
- Enforces referential integrity.
Let's look at some disadvantages of normalization:
- Data design is complex.
- Slower performance.
- Maintenance overhead.
- Require more joins.
Denormalization is a database optimization technique in which we add redundant data to one or more tables. This can help us avoid costly joins in a relational database. It attempts to improve read performance at the expense of some write performance. Redundant copies of the data are written in multiple tables to avoid expensive joins.
Once data becomes distributed with techniques such as federation and sharding, managing joins across the network further increases complexity. Denormalization might circumvent the need for such complex joins.
Note: Denormalization does not mean reversing normalization.
Let's look at some advantages of denormalization:
- Retrieving data is faster.
- Writing queries is easier.
- Reduction in number of tables.
- Convenient to manage.
Below are some disadvantages of denormalization:
- Expensive inserts and updates.
- Increases complexity of database design.
- Increases data redundancy.
- More chances of data inconsistency.
This article is part of my open source System Design Course available on Github.
Learn how to design systems at scale and prepare for system design interviews
System Design Course
Hey, welcome to the course. I hope this course provides a great learning experience.
Table of contents
- N-tier architecture
- Message Brokers
- Message Queues
- Enterprise Service Bus (ESB)
- Monoliths and Microservices
- Event-Driven Architecture (EDA)
- Event Sourcing
- Command and Query Responsibility Segregation (CQRS)
- API Gateway
- REST, GraphQL, gRPC
- Long polling, WebSockets, Server-Sent Events (SSE)