Let's say that your parents take you to the supermarket and when they get to the cashier they forgot about an specific item (for example: ketchup) and they send you to look for it really quick so they don't lose their turn on the waiting line.
So, you start running around looking for that specific ketchup bottle that you and your parents love, but every time you ask an employee they can't give you a straight answer like:
"Go to hall #4"
Instead, they tell you:
"Oh, hmm... well. You can try getting it from hall #1, #3, #5, and maybe #8."
In this situation you would have to check all 4 halls looking for the exact ketchup bottle, since you know your parents would be mad if you come back with a different one that they expect you to bring.
That means there's something wrong with the way they organize their items.
Normalization is basically the way you organize your data so that there's only a single place where to look for it when you need it. This helps avoid duplication as well as how fast your applications can request it.
But there are different levels of normalization, sometimes you can over normalize your data, and that would affect your workflow negatively, because of over complexity. So, you may wanna do some research about that with step-by-step examples, that's the best way I found to understand this topic.
You're 5. You're enrolled at pre-school. They take your name, your address, your date-of-birth, your parents, and your parents' phone numbers, and they stick them in a big table, called "Contacts and Info".
Student
Address
DOB
Parent 1
Parent 2
Parent Phone 1
Parent Phone 2
Your parents also donate some money each month to pay for extra-curricular outings and other things, so they take the payment info for that, in another big table called "Donors".
Parent
Address
Credit Card Info
Phone Number
The Break-Up
At some point, your parents split up for some reason, and one of them starts seeing someone else. Which is fine, but now you have 3 parents the school can call, and they only have two places for parent info.
So, instead, they decide to split up the first big table:
Student
DOB
Student
Parent
Parent Phone
Parent Address
So now they can store as many parents as people have.
The Work Phone
One of your parents starts to work on a military base. They're not allowed to keep their normal phone on them all the time, because it's not TOP SECRET-worthy and doesn't shoot lasers, but they are allowed to give the school the phone number of their awesome laser-shooting phone as well. So now the school needs to store two phone numbers for one person.
They could make two rows in the table with the addresses, but then they'd be storing the address twice. If they wanted to change the address for someone in the future, they'd have to make very sure they updated both copies, or else they wouldn't know where to send things.
They also look at the table for Donors, where they're storing phone numbers as well, (so they'd have the same problem) and decide they can make this simpler:
Student
DOB
Student
Parent
Parent
Address
Parent
Phone Number
Parent
Credit Card Info
Like this, they can store any number of phone numbers, addresses, or credit cards per parent, and they'll only ever need to change the information in one place.
What'll now be a pain is changing someone's name, so they might also give people a number internally and use those instead:
ID
Name
Student ID
DOB
Student ID
Parent ID
Parent ID
Address
Parent ID
Phone Number
Parent ID
Credit Card Info
The Cost
The cost of doing all of this is that if you want to look up, for example, what address corresponds to which phone number, you have to look up the Parent ID for both, joining those tables together. You've also had to store these ID numbers which you didn't have before, but hopefully you've saved more storage in not copying other information.
Normalisation
This process of breaking down your data into exact minimal relationships and storing those separately is database normalisation. There are various numbered normal forms which describe how much you've done this, satisfying various constraints and conditions.
NB also: in some databases, the speed cost of joining all this data back together is not what you want, and storing all the data for every row is better!
Suppose you have an e-commerce store, specialized in selling fine beers.
Imagine you have those tables:
beer for the beers themselves
client are the clients
review are reviews left by clients on beers
order are the orders
When a client leaves a review, you don't want to embed the client's data inside the review but just to create a link. This way when the client changes his profile picture it gets updated on all his comments at once.
However when a client makes an order you want to copy both the client's information and the product's information (like the price). This way even if the client changes his address later or the product changes its price then the order still is accurately describing what was paid and where it was shipped.
All you're looking for is the truth, nothing more. If a single item of data were to appear twice in a database, which one would you trust? On the other hand, you need to make sure that the data you're storing is the truth.
All the above are quite explanatory, but let me add a few things.
When a database has tables with data being repeated that data that is repeated is said to be redundant(Not needed and should be eliminated or better still be organized in a better way).
That better way of data organization in order to reduce REDUNDANCY is called NORMALIZATION.
Normalization is the reduction of data duplication i.e redundancy. This concept was introduced by DR Edgar F. Codd. and he formulated 4 forms in which data redundancy can be archived. The forms are as follows : 1NF, 2NF, 3NF and 4NF(formulated by Codd and Boyce A.K.A. BCNF).
1NF (First Normalization form)
Now this forms says that if your data is redundant in the following ways:
*one table cell containing multiple values
*one column/attribute having repeating groups i.e phoneNumber1, phoneNumber2 etc
*your rows dont have primary keys that uniquely identifies them
You should do this:
*Each should cell contain single values
*Move repeating groups i.e phoneNumber1, phoneNumber2 to seprate tables
*Have primary keys that uniquely identifies each row.
2NF (Second Normalization form)
Now this forms says that if your data is redundant in the following ways:
*If you have a composite key which your other non primary keys dont depend on.
. . . it basically says that all non-prime keys(keys that are not primary, composite or candidate) should depend on the composite keys that exist in a table else, move the non dependent ones to another table. This form focuses on Functional dependency i.e B is functionally dependent on A if A affects B.
3NF (Third Normalization form)
This form is quite similar to the Third normal form but it goes on to say that there shouldnt be any existence of Referential Dependency i.e a non-prime key should not depend on another non prime key instead, it should depend on only the existing primary or composite keys.
NOTE:
The 2nd, 3rd and 4th forms require that forms prior to them be first fulfilled.
i.e for a table to be in its fourth form, it must first pass the 1st, 2nd and 3rd forms.
Top comments (8)
Supermarket
Let's say that your parents take you to the supermarket and when they get to the cashier they forgot about an specific item (for example: ketchup) and they send you to look for it really quick so they don't lose their turn on the waiting line.
So, you start running around looking for that specific ketchup bottle that you and your parents love, but every time you ask an employee they can't give you a straight answer like:
"Go to hall #4"
Instead, they tell you:
"Oh, hmm... well. You can try getting it from hall #1, #3, #5, and maybe #8."
In this situation you would have to check all 4 halls looking for the exact ketchup bottle, since you know your parents would be mad if you come back with a different one that they expect you to bring.
That means there's something wrong with the way they organize their items.
Normalization is basically the way you organize your data so that there's only a single place where to look for it when you need it. This helps avoid duplication as well as how fast your applications can request it.
But there are different levels of normalization, sometimes you can over normalize your data, and that would affect your workflow negatively, because of over complexity. So, you may wanna do some research about that with step-by-step examples, that's the best way I found to understand this topic.
Great explanation!!
School Records
So!
You're 5. You're enrolled at pre-school. They take your name, your address, your date-of-birth, your parents, and your parents' phone numbers, and they stick them in a big table, called "Contacts and Info".
Your parents also donate some money each month to pay for extra-curricular outings and other things, so they take the payment info for that, in another big table called "Donors".
The Break-Up
At some point, your parents split up for some reason, and one of them starts seeing someone else. Which is fine, but now you have 3 parents the school can call, and they only have two places for parent info.
So, instead, they decide to split up the first big table:
So now they can store as many parents as people have.
The Work Phone
One of your parents starts to work on a military base. They're not allowed to keep their normal phone on them all the time, because it's not TOP SECRET-worthy and doesn't shoot lasers, but they are allowed to give the school the phone number of their awesome laser-shooting phone as well. So now the school needs to store two phone numbers for one person.
They could make two rows in the table with the addresses, but then they'd be storing the address twice. If they wanted to change the address for someone in the future, they'd have to make very sure they updated both copies, or else they wouldn't know where to send things.
They also look at the table for Donors, where they're storing phone numbers as well, (so they'd have the same problem) and decide they can make this simpler:
Like this, they can store any number of phone numbers, addresses, or credit cards per parent, and they'll only ever need to change the information in one place.
What'll now be a pain is changing someone's name, so they might also give people a number internally and use those instead:
The Cost
The cost of doing all of this is that if you want to look up, for example, what address corresponds to which phone number, you have to look up the
Parent ID
for both, joining those tables together. You've also had to store these ID numbers which you didn't have before, but hopefully you've saved more storage in not copying other information.Normalisation
This process of breaking down your data into exact minimal relationships and storing those separately is database normalisation. There are various numbered normal forms which describe how much you've done this, satisfying various constraints and conditions.
NB also: in some databases, the speed cost of joining all this data back together is not what you want, and storing all the data for every row is better!
Edits: Argh, no Markdown table support!
I love your school example! Great analogy.
It's all about having a single source of truth.
Suppose you have an e-commerce store, specialized in selling fine beers.
Imagine you have those tables:
beer
for the beers themselvesclient
are the clientsreview
are reviews left by clients on beersorder
are the ordersWhen a client leaves a review, you don't want to embed the client's data inside the review but just to create a link. This way when the client changes his profile picture it gets updated on all his comments at once.
However when a client makes an order you want to copy both the client's information and the product's information (like the price). This way even if the client changes his address later or the product changes its price then the order still is accurately describing what was paid and where it was shipped.
All you're looking for is the truth, nothing more. If a single item of data were to appear twice in a database, which one would you trust? On the other hand, you need to make sure that the data you're storing is the truth.
All the above are quite explanatory, but let me add a few things.
When a database has tables with data being repeated that data that is repeated is said to be redundant(Not needed and should be eliminated or better still be organized in a better way).
That better way of data organization in order to reduce REDUNDANCY is called NORMALIZATION.
Normalization is the reduction of data duplication i.e redundancy. This concept was introduced by DR Edgar F. Codd. and he formulated 4 forms in which data redundancy can be archived. The forms are as follows : 1NF, 2NF, 3NF and 4NF(formulated by Codd and Boyce A.K.A. BCNF).
1NF (First Normalization form)
Now this forms says that if your data is redundant in the following ways:
*one table cell containing multiple values
*one column/attribute having repeating groups i.e phoneNumber1, phoneNumber2 etc
*your rows dont have primary keys that uniquely identifies them
You should do this:
*Each should cell contain single values
*Move repeating groups i.e phoneNumber1, phoneNumber2 to seprate tables
*Have primary keys that uniquely identifies each row.
2NF (Second Normalization form)
Now this forms says that if your data is redundant in the following ways:
*If you have a composite key which your other non primary keys dont depend on.
. . . it basically says that all non-prime keys(keys that are not primary, composite or candidate) should depend on the composite keys that exist in a table else, move the non dependent ones to another table. This form focuses on Functional dependency i.e B is functionally dependent on A if A affects B.
3NF (Third Normalization form)
This form is quite similar to the Third normal form but it goes on to say that there shouldnt be any existence of Referential Dependency i.e a non-prime key should not depend on another non prime key instead, it should depend on only the existing primary or composite keys.
The fourth one is quite confusing cant explain it
(Here is a link )[en.wikipedia.org/wiki/Fourth_norma...]
NOTE:
The 2nd, 3rd and 4th forms require that forms prior to them be first fulfilled.
i.e for a table to be in its fourth form, it must first pass the 1st, 2nd and 3rd forms.
i hope this helps
I didn't know about NFs. Thank you!
make data from single source