SeattleDataGuy

Posted on Feb 16, 2022

3 SQL Interview Tips For Data Scientists And Data Engineers

#sql #database #datascience #analytics

SQL has become a common skill requirement across industries and job profiles over the last decade.

Companies like Amazon and Google will often demand that their data analysts, data scientists, and product managers are at least be familiar with SQL. This is because SQL remains the language of data.

This has lead to SQL being a staple part of many data professionals interview loops. Meaning you better plan to include SQL in your study sessions. That being said, just walking through SQL problems doesn't always cut.

In this article I will go over three tips to help you improve your SQL both for interviews and in general.

Attempt To Solve A SQL Problem In Multiple Ways

There is more than one way to solve most SQL problems.

But often we resort to the methods we have most recently used.

For example, if you just learned about analytical functions it can pigeonhole a lot of your future solutions. This is why it is important to try to solve SQL problems in multiple ways.

In order to do all of our problems I will be using problem sets from InterviewQuery which is a site that can be used by data scientists to practice much more than SQL.

But for now let's look at a simple problem they offer.

We're given two tables, a users table with demographic information and the neighborhood they live in and a neighborhoods table.

Write a query that returns all neighborhoods that have 0 users.

The tables we have are listed below.

Here we have two input tables and an expected output. Before reading further do you have any thoughts on how to get neighborhood names that have 0 users?

What is your first answer?

Regardless, et's look at this first example of a possible solution.


SELECT N.name
FROM neighborhoods N 
LEFT JOIN users U 
ON N.id = U.neighborhood_id
WHERE U.id is null

This solution relies on the LEFT JOIN to return any neighborhoods without users having a u.id that is NULL.

But this example can be a little confusing and not as readable for some as it's not explicit. The LEFT JOIN kind of makes it a little difficult to read. The reason for this is that you will have to mentally manage the logic in your head.

Let's take a quick look at the table the LEFT JOIN would create "WITHOUT THE FILTER".

Above you will see that user_id has null values. This is because we are LEFT joining with the left table being neighborhood. Meaning that where there are no users, there will still be neighborhoods.

But, this is just one way to solve this problem. Let's write this query differently.

Have any ideas?

You can check out the query below which is the same thing really but in a more explicit way.

SELECT  n.name
FROM neighborhoods n 
LEFT JOIN users u 
ON n.id = u.neighborhood_id
GROUP BY n.name
HAVING COUNT(u.id) = 0

This query uses a more explicit HAVING clause to only count u.ids. COUNT only counts when u.id is not null. So in that way, the query truly operates the same way.

So if there isn't a matching ID for neighborhood id in the user table, then it will replace the user-id with null. Thus, if I write where user id is NULL, I will get the same response as COUNT(u.id).

Break Down Questions And Logic

Many questions you get will require you to break down the logic into multiple steps. A straightforward example is often when questions ask you to get the first instance of an event.

InterviewQuery has a question like this where they ask:

Given a table of song_plays and a table of users, write a query to extract the earliest date each user played their third unique song.

The table are listed below:

Here, they have written out the fact that they want the third "unique" song. This should be a dead giveaway that there are multiple instances of song plays that would mess up your solution if you were just to implement a quick ROW_NUMBER() function.

A great way to help simplify this is to start rotating some shapes.

To start let's create a data set that only got a unique instance of every play.

What is interesting here is I disagree with the solution provided by InterviewQuery, or at least I am not 100% behind it. To get a unique song and first play date you should be able to just use MIN() to get the first date played.

You can do this with the query below.

 SELECT song_name, user_id, min(created_at) created_at
 FROM song_plays
 GROUP BY song_name, user_id

As you can see, all we need to do is get the minimum play date for the song id and user. This cleans up any of the messy repeated plays in a row. Making the data much easier to work with.

InterviewQuery suggests using row_number() to get the first play of a song. I find this a little less clean as you will now be forced to do a second query to get the first value. Whereas with the MIN() method you have one query and you have shrunk the data set down.

But as I referenced earlier, there is more than one way to get a lot of SQL answers.

In no way do I hate the row_number() function. In fact, in the next step, we will be using it. You can see it in the query below.

with t1 as (
    SELECT song_name, user_id, min(created_at) created_at
    FROM song_plays
    GROUP BY song_name, user_id
)
  SELECT u.name, t1.song_name, t1.created_at,
  row_number() over (partition by u.name order by created_at) rn 
  FROM t1 
  JOIN users u 
  ON t1.user_id = u.id

From here, the rest should be self explanatory. We need to query for the 3rd row number. This is show below.

with t1 as (
    SELECT song_name, user_id, min(created_at) created_at
    FROM song_plays
    GROUP BY song_name, user_id
),
t2 as(
    SELECT u.name, t1.song_name, t1.created_at,
    row_number() over (partition by u.name order by created_at) rn 
    FROM t1 
    JOIN users u 
    ON t1.user_id = u.id 
)
SELECT name, song_name, created_at
FROM t2 
WHERE rn = 3

There is one more problem. This actually won't solve InterviewQuery's question.

This brings up the other issue I have with InterviewQuery's current solution for this problem. InterviewQuery wants us to return all users even if they didn't have a third song play. This wasn't referenced in the question and I don't see the value. But these are all NITs.

They have a different solution that uses a LEFT JOIN and then returns both users with and without a third song play. So do be aware of this finicky detail.

Learn How To Do None Standard Joins

Many people are accustomed to straightforward joins. With a basic "=".

This is great, but the truth is JOINs can be used like WHERE clauses.

There was a point where JOINs didn't exist and WHERE clauses were used as JOINs. I still occasionally see people who JOIN using a long string of WHERE clauses instead because they either work currently with older systems or learned SQL at a time JOINs weren't commonplace(explicit JOINs were added in SQL-92).

What this means is you can use JOINs in a lot of interesting ways. For example, if you have ever had to calculate a rolling sum without a window function, you will know you can use the "> and <" signs to create duplicate rows that you will then aggregate.

In our final question we are asked:

Given a table of students and their SAT test scores, write a query to return the two students with the closest test scores with the score difference.

If there are multiple students with the same minimum score difference, select the student name combination that is higher in the alphabet.

This is what the table looks like:

Take a moment to figure out how you might tackle this question.

Don't rush here. There are a lot of ways you could overcomplicate this question in particular.

The answer is pretty short.

But it all starts with you needing to create a combination of every student and every student's score.

Let's look at a possible solution and break down why it works.

SELECT 
t1.student as one_student,
t2.student as other_student,
abs(t1.score - t2.score) as score_diff
FROM scores t1
JOIN scores t2 
   ON  t1.id < t2.id
ORDER BY 3,1,2 
limit 1

So why does this work? We are telling the query to do a self join in all cases where the ID is smaller than the ID we are joining on.

Meaning we are creating a table that looks like the table below:

Now that we have a combination of all the possible student IDs we can just find the ABS() score different and properly set the order.

That's it.

One final point I will add is that I did end up eventually deciding to use the "!=" vs "<". The less than only seemed to work with the specific set of names and I am just not 100% sure if the names were changed that the ordering would work as expected. Using the "!=" guarantees that we will test every combination of names in every order. Meaning that if the more "alphabetic" solution were actually in a combination of "3--2" vs "2--3" we check and order both sets.

How Will You Take Your SQL To The Next Level

SQL looks like it is here to stay. It seems if anything it is picking up speed. The challenge is how do you take your SQL to the next level. It's not just about learning new syntax.

It's about learning how the data you're working on can be manipulated using even simple clauses. In doing so you will be creating SQL at a much higher level and better prepare you for your SQL interviews.

Whether you're a data scientists, data engineer or analysts I wish you the best of luck!

If you enjoyed this article, then check out some of my other content:

Which Managed Version Of Airflow Should You Use?

5 Tips To Become A Great Analyst

What Is Trino And How It Manages Big Data

What I Learned From 100+ Data Engineering Interviews - Interview Tips