JTK

Posted on Sep 18, 2022 • Edited on Sep 20, 2022

Data modeling crash course

#sql #interviewing #career #database

Intro

Data modeling questions are a fairly normal part of the interview process, and it is worth knowing some basic guidelines for how to organize data. Although in this piece I will primarily talk about this at the database level, it is good knowledge to have across the stack - knowing when a piece of logic should be broken out into a reusable component is something frontend engineers will frequently run into, and it gets into similar logical territory because we are thinking about how it makes sense to organize our information, what belongs together, and what should be broken up.

Because it is the predominant type of database, we are mainly going to talk about relational databases and how to model data using good practices for relational databases specifically. But it is worth knowing that there are many other types of database out there, and what count as “good practices” can be different between different database systems. In addition to relational databases, there are:

NoSQL
Key/value stores
Graph databases
Network databases
Object oriented databases
Columnar databases

Among others. Before you feel pressure to learn all that, it is worth saying that I’ve been doing this a long time and have really only ever significantly dealt with relational databases, key value stores, and NoSQL. There is also some crossover in functionality between databases, at times. One of my favorite things about MongoDB is the ability to store data structures similar to JSON, but PostgreSQL, a relational database, also allows you to store JSON if you need to. Some NoSQL databases also support query languages like SQL (structured query language), the language of relational databases.

Basic ideas

I feel like data modeling is one of those terms that can sound intimidating before you know what it is. I had several of these as a junior developer where when I would hear the term, for some reason my mind would go blank. (”Scripting language” was one of these, even though all I knew was scripting languages 🤷‍♀️).

If hearing data modeling makes you think of training AI models, or complex statistical models like they do for political elections, you can put that to the side. You should be happy to know, data modeling for relational databases more or less follows common sense.

Relational databases and SQL have actually come to be a favorite technology of mine, because the principles so closely echo common sense. It is also somewhat unique for a technology as old as relational databases to have survived the rapidly changing technological landscape since the theories behind them were created by by E.F. Codd between the 1960s-1980s. It is somewhat rare for a technology so old to still work so well and be so well liked by the people using it. The history behind this technology is interesting, and you can read more about it here.

If I could summarize the “common sense” ideas that govern good data modeling for a relational database for someone with zero background, I’d probably highlight the following:

We want to follow separation of concerns by breaking up data as much as possible into distinct tables
We want to avoid duplicating data, because this makes more work for us if we need to delete or update anything
We want the components of our tables to be clear and independently meaningful (the value in column B should not change in meaning in relation to column A, it should stand on its own)
We want to structure our data to have safeguards in place to avoid “junk data” - for instance, do we want to allow a user with a typo on a state abbreviation to send the value “NZ” instead of “NC” to the database? Probably not, because it may cause our application not to work as expected, and we may have to clean up bad data later

Not too crazy, right?

Quick detour into SQL-specific lingo

Some of the things in our toolkit to make the above guidelines happen are indexes, keys and constraints. You can think of these like rules. When you set up these rules, if data comes to your database that violates the rules, instead of writing that data an error will be thrown. This is actually what you want. Bad data can cause all kinds of headaches.

Let’s quickly go into these concepts before we head into some examples and rules of thumb.

Primary Keys

Primary keys are a way for us to uniquely identify our records. This is typically done by creating a numeric ID field. Databases have logic around primary keys so that for each new record, the ID field is incremented by 1 for every new record. They also enforce logic so that you could not have two records with id 2, for instance. This makes dealing with our data much easier, because it is far more convenient for your application to be built around thing with ID 2 than thing with name x and address y and state z and… every time you need to pull up a record

Foreign Keys

It will be no surprise that in a relational database, we have representation of the relationships between our data. In simplest possible terms, a foreign key is when one table references another. In addition to describing the relationships between tables, foreign keys can be used to enforce data quality.

Lets consider that we have a table that collects address info, like the following:

Address	State	Zip code
123 tree street	AZ	87678
4 Cary Ct	NC	27617

What happens if the user inputs a typo in their state? We would have bad data that might cause our application not to work as expected. However if we have an address table and a state table, we could use foreign keys to say that whenever we try to add a new address, we will check that the state exists in the state table, indicating it is valid input.

Indexes

When you create a primary key field on a table, the database also creates what is called an index. The idea is a little like an index in a book, and makes finding things easier. Indexes make searching faster for the database engine, the underlying logic that carries out SQL queries. If you have a table with ID fields like this:

ID
1
2
…
1000000

And you are searching for record with ID 200, the database can use binary search to get to the correct record with great efficiency. How this works is the database might select the middle ID value of all the records, and see that it is 500,000. That’s way too high, so we can ignore the second half of those results, anything from 500001 onwards. We can pick from the middle again, and get ID 250,000. Still way too high, so we can ignore the higher half of those results and pick from the middle again of the remaining lower half of ID numbers. So on and so forth. Binary search is a very efficient algorithm, so database indexes are an important tool for fast searches and good database performance.

Other constraints and keys

We have other ways of applying rules to our database to ensure we have only good data, and no repeated data. We can use something called UNIQUE constraints to create a rule that every record must have values in a certain column that don’t already exist in the database. If you had a database with usernames for instance, you could prevent someone registering with a username value that someone else had. (Side note, when you create a primary key, in addition to it automatically being indexed, it automatically also has a unique constraint out of the gate).

Sometimes, we also have situations where a traditional number ID field does not fit our needs. So we can also make what is called a composite key, and make a primary key that is the combination of two columns.

There are many other types of keys, constraints, and indexes, but it is out of the scope of this article which is for beginners.

Data modeling interviews: brainstorming

Data modeling is something that comes up from time to time in interviews. It may be part of some system design questions, or it may be an exercise of its own. In my experience, data modeling interview discussions are actually one of my favorites types of interviews. They are almost always common sense, straightforward, no trick questions, and I like data so I wind up enjoying them. They are also usually pretty collaborative. If you have gotten an interview like this, I don’t want to call something “easy” but I will say data modeling interviews are among the more pleasant interviews I’ve been part of, so go in with good vibes knowing you won’t be grilled on something random or asked some obscure factoid (ime).

In my experience, brainstorming is an important stage of ALL interviews that ask you to approach a technical problem, and these are no different. With data modeling, there are some guiding questions that reflect good relational data modeling principles, and are good to move through to show that you understand this space and can reason about data and how data is related.

Good brainstorming questions to ask yourself (and say out loud in interviews!):

What are the main entities in this problem? (Often, what you would have as a model in backend code directly corresponds to what you would have as a table or “entity” in the database)
When determining “data entities”, think about whether data is meaningful on its own, or only in the context of other data. It is generally good to split data up - good separation of concerns, a good way to keep tables smaller and more granular, good all around - but if data is not meaningful on its own, it may belong under the umbrella of another entity
In school learning backend code, sometimes we learn to develop models thinking about “is a…” and “has a…”. We can say, a car is a vehicle, and a motorcycle is a vehicle. Maybe we build a Vehicle class for that reason. We can say a motorcycle has a engine, a car has a engine. So maybe we need to track engine data and relate it back to our Vehicles. These concepts translate to the database too, and are worth thinking about planning your models
When you have related data, it is good to ask yourself what entity it is most related to. Sometimes data is equally related to multiple entities. In those cases, sometime it is worth breaking out a separate table reflecting the relationship back to both of two equally related entities
Relational databases deal with, you guessed it, relationships! Some relationships are hierarchical, like a boss with many reports. Some relationships are 1:1, like a website user would probably only have one account ID. Some relationships are many to many, like how a student can be in multiple classes and a class contains multiple students
When you have your main entities figured out, it is worth thinking about the smallest appropriate data type you can use for your data. Booleans and integers generally take up less space than text and can be faster to search. You have to be careful though, since people can run into trouble using INTEGER ID fields for databases with too many records, it is possible to hit the limit of integer values. For this reason sometimes there is another type, BIGINT to allow greater flexibility. There are rarely trick questions here, but it is worth demonstrating you understand the tradeoffs of various data types

Rules of thumb with examples

One thing I’ll cover quickly is that in some of the examples I’ll show you next I borrow concepts from the rules of database normalization. There are various “levels” of database normalization, and usually the higher the level your data conforms to, the more it is seen to follow good relational database practices. You can read more about normalization here.

💡Side note: I have never been asked to remember the definition of a specific form of normalization in an interview, and probably couldn’t if asked. Since most of the rules are common sense, I apply them constantly, but it is intuitive rather than the result of memorization. While it may come up in some rare, rare circumstance, I think it is worth mentioning that I have never been asked to regurgitate definitions of normalization. So if I were you, especially as a new coder, I would not worry about that tbh 🤷

Rule 1: No multi-value records

Consider we have an app like GoodReads where users can “favorite” books. What do you think if we have a table structured like what’s below?

user_id	favorites
1	The Snow Child, Pride and Prejudice, Eleni, The Poisonwood Bible
2	The Revenant, Ender’s Game, The Hobbit

With SQL syntax, if the user “unfavorited” a book in this list, you would have to re-write the entire record to apply that update. There is also not an easy way to insert a new favorite without re-writing this entire comma-separated list. Searching also gets slower, since instead of looking for an exact match if you want to find users who favorited a certain book, you would have to essentially loop through all the favorites of each record and see if the book title is among them.

The way to structure this data with better practices might involve splitting it up, and taking advantage of foreign keys which we chatted about before.

To summarize what we did here, and why it is good:

We can now do a simple search for user favorites with logic like where user = x and book = y on the user favorite table
If a user adds a new favorite, we add a tiny new record to the user favorite table
If there was a typo on the book title, instead of having to go through that giant list of favorites we had in the previous design and updating each record, we can simply update the title field of one single row in the books table, which is much simpler, and the user favorite table remains completely correct
Because we use foreign keys, we won’t have any “junk data” made for users that don’t exist or books that don’t exist as the result of application or user error

There’s also something cool you can do with SQL called CASCADE ON DELETE which means that if a user is deleted or a book is deleted, all the records related to it in the user favorite table also get deleted. There is also something called CASCADE ON UPDATE, so in the scenario of a typo potentially you could correct it in one place, and have it reflected everywhere. Cool, right!?

Rule 2: Deduplicate data

You might see the example above and think, “I don’t really feel like querying three tables, why don’t I add more user info to the user favorite table because we need it all to display on the favorites page”. You might wind up with a table something like this:

user_id	book_id	user_name	book_name
1	2	jtkaufman737	Homegoing
1	5	jtkaufman737	Shadow and Bone
1	40	jtkaufman737	The Ninth House
1	3	jtkaufman737	Infidel
1	24	jtkaufman737	The Big Machine

Look at how many records we have repeating that username! What happens if I want to update my username? What happens if there was a typo in the book name?

The reason it is better to stick to the ID numbers and keep this table small are:

there is less to update
IDs/key values don’t really change, ever, so we don’t care so much if they are repeated, its not as if it will ever lead to a bunch of updates
Integers are a small data type, so it is better for storage and quicker for searching

Overall, it is almost always the case that tables are not the correct place for combining your data into a convenient format. Tables need to be a pure source of truth. (SQL does give you an option for combining your tables “virtually” though, with something called a view).

Your data needs to be really, REALLY independent

At the level of a single record, and the value of a single column, we really need our data to stand on its own two feet for good data modeling. If you make a change in column A, it should not change the meaning of column B. Nor should the meaning of columns be different if you were to change their order. Either of these things is a sign that your data is not structured properly. You should not need to worry about having to update field a just because something was changed in field b. In that case, those data are not truly independent, and some logical concession needs to be made so that they can be treated like the related pieces of data they are.

Let’s look at another variant of our GoodReads data. Say we had a table like this:

book_id	genre_id	genre_type	author	published
1	2	Historical Fiction	Jane Eyre	1800
2	8	Sci-Fi	Bill Murray	2001

A change in the genre_id would also require a change to genre_type. These columns have a dependency, which means you cannot cleanly update one field, which in a perfect world you would be able to do no problem. For this reason, it is again, better to break up our data:

book table

book_id	genre_id	author	published
1	2	Jane Eyre	1800
2	8	Bill Murray	2001

genre table

genre_id	genre_name
1	Historical Fiction
8	Sci-Fi

Your data should be as regulated as possible

This is something that bites me from time to time in application code. The database permits more values than I would expect, so we have unexpected data variance. Maybe that means there are a mix of abbreviations and full state names in a table pertaining to addresses. Maybe the falsey value for a column can be either {} or NULL, so a condition I wrote looking for NULL is not behaving as expected, since it was really supposed to catch all falsey values (that one happened to me recently). Good ways to keep your data quality high:

Rely on lookup tables as much as possible to act as enums, forcing only data conforming to valid options to be created
Pick column data types that enforce quality: should an email field ever allow NULL values when users register for your site? Probably not. You can tell the database not to allow NULL values. There’s also no reason to pick a data type that allows text characters if the column should only ever expect numeric input.
You should use CASCADE options so that there is never “orphan” data, child data to some parent entity left over after the parent entity is deleted

In closing

There is much more to get into on this topic, but I sat down to write this for a newer frontend engineer with an upcoming data modeling interview and I wondered to myself “what are the bare basics that I’d want them to know?”

Data modeling, database design, and performance are deep, fascinating subjects in their own right and if you liked this article, there is a ton more to read and learn. But for someone new in the field juggling learning on multiple fronts, what I’ve outlined here is a perfectly solid foundation for a data modeling discussion while interviewing.

If you want to know more about the building blocks of SQL, I also gave a workshop which you can find here that discusses SQL basics and terminology more in depth, check it out! 🤓 I also highly recommend this tutorial on SQL commands and syntax if you would like some practice learning structured query language.