SeattleDataGuy

Posted on Jun 29, 2020

How To Write Better SQL: Advanced SQL Episode 1

#sql #datascience #database

Photo by Caspar Camille Rubin on Unsplash

SQL remains the language for data. Developed back in the 1970s, it is one of the few technologies that has remained constant. Regardless of what drag and drop tools come around or what new query paradigms try to overtake it.

SQL remains the most widely used technologies to interact with data. Even with the advent of databases that utilize NOSQL or (Not Only SQL), but layers such as Presto and Hive have also been developed on top of them to provide a friendly SQL interact.

Not only that.

But the usage of SQL has far expanded just data engineers and analysts.

Product managers, analytical partners, and software engineers at large tech companies commonly use SQL to access data and answer questions quickly.

The point being, SQL is worth knowing.

But once you know the basics of SQL, how do you progress?

What takes a SQL user from novice to advanced?

During the past few years, we have spent a lot of time writing SQL for data pipelines, dashboards, data products, and other odds and ends.

We don't think advanced SQL is about syntax. There aren't too many fancy clauses after you learn about analytic clauses.

Sure, you can loop in SQL and even edit files.

However, all of that is also actions that can occur in code.

So what separates basic SQL users from advanced SQL users?

We believe it's more about thinking big picture. Advanced SQL developers think long-term vs short-term. They develop SQL that is maintainable, easy to read, and requires more time and consideration.

In this article, we will focus on a lot of the design decisions that we believe separate novice SQL developers from senior and advanced SQL developers.

You will notice that this will go beyond just SQL, a lot of it will go into more conceptual problems where there aren't definite answers to the best solutions.

The format of the tips will go to problem or behavior and solution or improved methods. Truth be told, some of the solutions could be considered design preferences.

We are sure some of you might even disagree with the tips we give here. Please provide comments and your beliefs. We would love to discuss them further. You can also watch the video version here.

And with that, let's get into learning!

Putting Logic In Your SQL vs In A Table

Case clauses are very useful. However, there are two specific ways that some people use them that can bite you later in the future.

First, duplicative logic.

What do we mean by duplicative logic? We mean using the same case statement in multiple queries, views, and pipelines.

For example, let's look at the below. We have the same logic in multiple places. In this case in an ad-hoc query that is used in a data analyses analysis they repeat every month and in a BI developers Tableau dashboard.

--this query is being used in a tableau dashboard by a BI Developer
SELECT patient_id
    ,total_cost DATE
    ,CASE 
        WHEN PROCEDURE_CODE = BETWEEN '9990'
                AND '10000'
            AND age BETWEEN 30
                AND 40
            THEN 'category 1'
        WHEN PROCEDURE_CODE = BETWEEN '9980'
                AND '9990'
            AND age BETWEEN 40
                AND 50
            THEN 'category 2'
        ELSE 'No Category'
        END
FROM patient_claims


-- this query is an adhoc query just meant to filter out category 2 by a data analyst on a different team
SELECT patient_id
    ,total_cost DATE
FROM patient_claims
WHERE PROCEDURE_CODE = BETWEEN '9980'
        AND '9990'
    AND age BETWEEN 40
        AND 50

Now, let's imagine the logic for the category needs to be updated for any reason. You might not even know all the places an update needs to occur. What if different teams are managing all of these different queries.

Who is going to tell them to update it and how will you inform them of the change?

Not every company has great tools to help find repetitive logic like this.

Another similar issue is engineers decide they want to use case statements as enumerations vs using a table.
For example let's look at the query below.

SELECT CASE 
        WHEN id = 1
            THEN 'category 1'
        WHEN id = 2
            THEN 'category 2'
        WHEN id = 3
            THEN 'category 3'
        WHEN id = 4
            THEN 'category 4'
        WHEN id = 5
            THEN 'category 5'
        WHEN id = 6
            THEN 'category 6'
        WHEN id = 7
            THEN 'category 7'
        WHEN id = 8
            THEN 'category 8'
        WHEN id IN (
                9
                ,10
                ,11
                )
            THEN 'category 9'
        END
FROM dim_procedures

Solution

Instead, in this case you should consider using a table. This allows you to add more IDs to the table without changing your code. This is cleaner and easier to test with out making a mistake

This goes for both issues. Whether you are having duplicative logic like in the first query or using your SQL to act like a table.

In both cases you should consider the trade-offs of creating the table.

It is important to note that there are trade-offs for this solution. Adding a new table means new code to maintain. Part of becoming more advanced at SQL is thinking in the mindset of trade-offs.

Stay-tuned for our next article where we talk design trade-offs.

How Will You Take Your SQL To The Next Level

SQL looks like it is here to stay. In fact, it seems if anything it is picking up speed. The challenge is how do you take your SQL to the next level. It's not just about learning new syntax.

It's about learning how to make your SQL more maintainable, readable, and trustable. In doing so you will be creating SQL at a much higher level. You will be creating data sets, dashboards, and metrics that everyone in the company can rely on for more than just a few months.

More and more roles in companies are needing to become at least conversationally fluent in SQL. From product managers to data analysts. The ability to extract insights from data is valuable in any position.

If you enjoyed this content consider checking out some of our other great videos and posts.

Automate Data Extracts From Google Sheets With Python

4 Simple Python Ideas To Automate Your Workflow

Data Engineering 101: An Introduction To Data Engineering

What Are The Different Kinds Of Cloud Computing

5 Great Libraries To Manage Big Data With Python

Joining Data in DynamoDB and S3 for Live Ad Hoc Analysis

Top comments (1)

Maxi Di Rienzo • Jul 1 '20

I believe that for the provided cases -it's clearer in the 1st example where you have ranges, instead of the last 3 IDs that maps to the same label in the 2nd example- a much better solution would be to use a function instead of one or two (many to many) tables, this of course could be coded in the same DB engine or in the app layer.