Optimization is something I've always been in love with. This mostly takes the form of me trying to squeeze as much time for sleep as possible, so I utilize scheduling shortcuts so I can spend as much time as possible warm and snug in bed. This definitely leaks over into my interests because I genuinely enjoy doing things as fast and consistently as possible while maintaining quality. So today I'm going to venture into a topic that isn't exactly associated with words like "fun" or "interesting", but I've found that even being aware that normalization is a thing that exists has helped me in other aspects of object orientated programming. So what is normalization and how does it relate to OOP?
Normalization is a process in which we optimize our databases, making them more flexible, expandable and easier to use. It is an imperative competency for someone working directly with databases. This process can be applied at any time during development and works well to have in mind during the initial creation of your object models and can be reapplied as you tweak and discover new relationships for your objects. This may sound daunting but I'll hold your hand through this whole process so you can come back and reference this guide on your own projects.
A few definitions before we start:
To achieve "normalization" or to consider our databases "normalized" we need to take our databases through a few steps and for each step we consider our databases as on a new level or "form". I'll take you through the first 3 forms because going any further than that you start floating off into the realm of mathematical possibility instead of real-world practicality. Once we've hit the 3rd form you can safely consider your database to be normalized. These steps will be shortened as 1NF, 2NF and 3NF respectively.
A Primary Key candidate requires two things:
Never allowed to be Null/Nil.
Must be a Unique Value.
May never change.
Luckily in our Object Oriented language our instance objects all have a unique key called it's Object ID. Our databases on the other hand like SQLite, automatically include a unique auto-incrementing row.
Foreign keys are just primary keys from other tables. Generally a linking table will have it's own primary key and then two or more foreign keys. These connect two separate tables together.
Through the Forms
Taking your databases through the first three normalization steps will drastically improve the quality of the data that you are storing. Luckily utilizing object oriented language features such as classes make it extremely easy to have an efficient database without having to think about what's going on behind the veil. Before normalization starts you need to have a primary key, with an object oriented language this will be the Object ID itself (found with .self in Ruby), or when stored in the database will automatically be added as the "id" column that is usually auto-incremented.
As you may know, attributes on a class map to columns in a database quite neatly. Taking your data through the first normal form means that each column and each cell should contain one value only. This means that the instance object of the Toy class (lets call him "Woody") has a column in the database with the title "Name" with the value "Woody". But what happens if we want to store another Toy with the name "Buzz Lightyear"? In the table below I've done just that and for good measure lets also put in a third Toy with the name "Action Lightyear Aldrin".
The trouble starts when we start asking our databases questions like "Hey database, can you get me all of the Toys with the last name of 'Lightyear'?" Well, you don't have a column titled "Last Name" so you have to query the "Name" column with something like "Hey database, get me all the Toys that have the name 'Lightyear'." As I'm sure you would expect it will return both Buzz and Aldrin which isn't what you want. This may seem trivial for a Toybox, but if you're querying a database with hundreds of thousand or millions of rows, you can see how having data not broken down into smaller pieces can quickly become completely infeasible to manage and manipulate.
Alternatively, something to look out for is when values are all jammed into one cell. As shown below.
This spits in the face of 1st Normal Form and must be corrected, though, this is a situation unlikely to happen if you're using your classes correctly.
This is the first thing 1NF is about: Atomicity.
In object oriented languages, this problem is generally taken care of by erroring out when you try to put more than one value in a cell and suggesting you use an array.
Continuing on, lets fix this problem real quick and add a row splitting up the name into three parts and letting the additional names optionally be null/nil, like in the case of Woody.
We'll also add an accessory column because we want to keep track of each Toy's accessories that they are boxed with.
But not every toy only comes with one accessory so we need more columns to hold each item the toy comes with.
The problem with this strategy is that for each new instance of our Toy class (newly renamed from ToyBox), the object might have a different number of accessories. If you notice that you have column names that end in incrementing numbers, there is a great chance that you have a problem that can be solved by taking those columns out and making a new class that has a relationship with the toy and maybe storing the relationship in a linking table if necessary (like if you discover you actually have a many-to-many relationship).
Second form normalization is one that as far as I can tell is taken care of quite well with OOP without any human intervention. It mostly has to deal with Composite Primary Keys, where two columns are used to uniquely identify a row in a table. I'm sure there are cases when this is good to keep in mind but I'm willing to admit that at this point I haven't run into a situation where this was necessary.
Basically 2NF states that any non-key field should be dependent on the entire primary key. This is situation usually presents itself on linking tables.
Third normal form states:
Can you find the value of a field from any other non-key field?
Below you can see that you can always find out the total of a toy order if you know how the price of the toy and the amount sold.
Best practice would be to eliminate the "Total" row and use a method or a built in database feature to calculate that information.
As you may have noticed object oriented languages and modern databases keep these principals in mind from the get go but it can be tempting to try to make your own shortcuts. Try to avoid those traps and take the time to analyze what data you're storing and the relationships between that data.