In database design, one of the most fundamental and basic concepts is normalization. It gets confusing at some point, and it has taken some time for me to grasp how to normalize a database through recurrent practice.
The primary goal of normalization, I believe, is to eliminate redundancy and ensure data integrity. In the end, normalization leads to efficient data retrieval and simplifies database maintenance.
What is Database Normalization?
Database normalization involves structuring a relational database to reduce repetitive data/redundancy and improve data integrity. It's typically done by splitting large tables into smaller tables, related using foreign keys, which ensures that data is stored only once.
Normalization is carried out through a series of stages/normal forms. Each form introduces more rules and restrictions to further organize the data.
There are five major normal forms (1NF - 5NF), but in practice, the first three normalize a database in a real-world scenario.
Why is normalization important?
Normalizing databases helps avoid these anomalies, especially with systems that perform CRUD operations:
Data redundancy:
- Repititing the same data in multiple places (1 or more tables).
- Update anomalies: Owing to having multiple entries, it poses difficulty updating redundant data in multiple places and can lead to inconsistency with data.
- Insert/Delete anomalies: You cannot trust adding or removing certain data without having other entries already present.
Normal Forms Overview
Normalization is done in forms/stages, each referred to as a normal form.
First
- Normal Form (1NF): Ensures each column contains indivisible values and that there are no repeating groups. This is called ATOMICITY
- Second Normal Form (2NF): Builds on 1NF, ensuring that all non-key attributes are fully dependent on the primary key.
- Third Normal Form (3NF): Ensures that non-key attributes are not only dependent on the primary key but also independent of each other.
Normalization in an E-commerce Database: An Example
Let us consider a basic e-commerce system where customers place orders for products. I usually will advise having all required fields written out on a pen or a wordpad.
An initial database structure would look like this:
Unnormalized Table or the Flat Structure
CustomerName (varchar), CustomerEmail (varchar), ProductID (varchar), quantity, price (decimal), orderdate (date), etc
The Three Normal Forms
- First Normal Form (1NF): Eliminate Repeating Groups A table is in 1NF when: All columns contain
- atomic (indivisible) values.
- Each record has a unique identifier, known as a primary key.
Example:
Table 1
Table 2
Consider the two tables above:
Table 1 is in its normalized form, while Table 2 has the Products and Quantity columns, which contain multiple values; this violates 1NF.
To normalize Table 2, we split the table into two separate tables: orders and order_items.
After 1NF we have tables 1 and 2 as follows:
First normal form
In 1NF, each field contains atomic values, and we've eliminated repeating groups of products within a single order.
Table 1 seems straightforward but has some problems:
Data redundancy: CustomerName, CustomerEmail is repeated for every order. Ideally, since there will be multiple entries, they can go into a table, say the customer information table, but we will be back to this.
- Second Normal Form (2NF): Eliminate Partial Dependency A table is in 2NF when: It meets all the requirements of 1NF/Passed 1NF. All non-primary-key attributes are fully dependent on the primary key.
In order_items table, from the 1NF, the OrderID and ProductName together can form a composite key (data modeling technique that combines multiple columns to create a unique identifier for each record), but Quantity is already dependent on the combination of the two attributes. (OrderID and ProductName).
This means, however, that if we have data in the table that doesn't relate to the order, we would need to move it to a different table.
ProductPrice is dependent on ProductName, and not OrderID. Just like the price of products in a store does not determine orders made.
Therefore, to achieve 2NF, we need to separate product-related data from order-related data. We create a new products table.
By moving ProductPrice and ProductName to a new Products table, we ensure that every non-key attribute in the OrderItems table depends on the the composite key (OrderID, ProductID). This satisfies the goal of the second normal form.
After 2NF:
Table 2 in 2NF
Now, ProductPrice is stored in the products table, which keeps product data separate from order data, ensuring that each piece of data is only stored once.
- Third Normal Form (3NF): Eliminate Transitive Dependency A table is in 3NF when: It is already in 2NF/Passed 2NF. All non-key attributes are dependent only on the primary key, and there are no transitive dependencies (i.e., non-primary-key attributes should not depend on other non-primary-key attributes).
Table 2 in 3NF
Suppose we have a CustomerAddress, for orders, it will be in the Customers Table and is dependent on CustomerName, not OrderID. This is the concept of transitive dependency, meaning the address is indirectly dependent on the order.
To achieve 3NF, we move the customer data into CustomersTable, we have an OrdersTable, ProductsTable, and OrderDetailsTable at this point.
Advantages of Normalization
- - Data Integrity: Normalization ensures that the data is stored consistently. So, if a customer changes their name, you only need to update it in the Customers table and it reflects on all related tables.
- - Reduction in Redundancy: By eliminating duplicate data, storage requirements are reduced. This is crucial in asystem where thousands of products, customers, and orders are stored.
- - Efficient Queries: Normalization helps in writing efficient queries and data is stored logically in separate tables, reducing the chances of anomalies.
Disadvantages of Normalization
While normalization offers many advantages, it can also introduce some challenges:
Complex Querie
- s: As data is spread across multiple tables, queries may become more complex and require more JOIN operations, potentially impacting performance.
- Overhead: Managing relationships between normalized tables requires proper indexing and careful query optimization.
Denormalization: A Trade-Off for Performance
In certain or specific scenarios, especially in high-performance systems, some denormalization (reintroducing redundancy) may be necessary to optimize read performance. For example, storing the product name directly in the OrderDetails table can eliminate the need for a JOIN in read-heavy applications.
Conclusion
Normalization is an essential process in designing a relational database that is efficient, scalable, and maintainable. Normalization ensures that data such as customer details, products, and orders are stored in a way that minimizes redundancy, prevents anomalies, and maintains data integrity, create a well-structured database that supports the smooth operation.
However, while it is great to normalize, keep in mind the trade-off between normalization and performance, especially where necessary to optimize read-heavy applications.
Top comments (0)