loading...
Cover image for Cleaning and Transforming Data with SQL
Next Tech

Cleaning and Transforming Data with SQL

lorrli274 profile image Lorraine ・6 min read

One of the first tasks performed when doing data analytics is to create clean the dataset you're working with. The insights you draw from your data are only as good as the data itself, so it's no surprise that an estimated 80% of the time spent by analytics professionals involves preparing data for use in analysis.

SQL can help expedite this important task. In this tutorial, we will discuss different functions commonly used to clean, transform, and remove duplicate data from query outputs that may not be in the form we would like. This means you'll learn about:

  • CASE WHEN
  • COALESCE
  • NULLIF
  • LEAST / GREATEST
  • Casting
  • DISTINCT

We will be using the following sample table, employees, throughout this tutorial to illustrate how our functions work:

id first_name last_name title age wage hire_date
1 Amy Jordan Ms 24 15 2019-04-27
2 Bill Tibb Mr 61 28 2012-05-02
3 Bill Sadat 18 12 2019-11-08
4 Christine Riveles Mrs 36 20 2018-03-30
5 David Guerin Honorable 28 20 2016-11-02

This data is preloaded into a Next Tech sandbox for you to experiment and test the below queries. Connect to the database here for free!

Let’s get started!

This tutorial is adapted from Next Tech’s full SQL for Data Analysis course, which includes an in-browser sandboxed environment and interactive activities and challenges using real datasets. You can get started with this course here!


CASE WHEN

CASE WHEN is a function that allows a query to map various values in a column to other values. The general format of a CASE WHEN statement is:

CASE
    WHEN condition1 THEN value1
    WHEN condition2 THEN value2
    ...
    WHEN conditionX THEN valueX
    ELSE else_value
END

Here, condition1 and condition2, through conditionX, are Boolean conditions; value1 and value2, through valueX, are values to map the Boolean conditions; and else_value is the value that is mapped if none of the Boolean conditions are met.

For each row, the program starts at the top of the CASE WHEN statement and evaluates the first Boolean condition. The program then runs through each Boolean condition from the first one. For the first condition from the start of the statement that evaluates as true, the statement will return the value associated with that condition. If none of the statements evaluate as true, then the value associated with the ELSE statement will be returned.

As an example, let's say you wanted to return all rows for employees from the employees table. Additionally, you would like to add a column that labels an employee as being an New employee if they were hired after 2019-01-01. Otherwise, it will mark the employee as a Standard employee. This column will be called employee_type. We can create this table by using a CASE WHEN statement as follows:

SELECT
    *,
    CASE
        WHEN hire_date >= '2019-01-01' THEN 'New'
        ELSE 'Standard'
    END AS employee_type
FROM
    employees;

This query will give the following output:

id first_name last_name title age wage hire_date employee_type
1 Amy Jordan Ms 24 15 2019-04-27 New
2 Bill Tibb Mr 61 28 2012-05-02 Standard
3 Bill Sadat 18 12 2019-11-08 New
4 Christine Riveles Mrs 36 20 2018-03-30 Standard
5 David Guerin Honorable 28 20 2016-11-02 Standard

The CASE WHEN statement effectively mapped a hire date to a string describing the employee type. Using a CASE WHEN statement, you can map values in any way you please.

COALESCE

Another useful technique is to replace NULL values with a standard value. This can be accomplished easily by means of the COALESCE function. COALESCE allows you to list any number of columns and scalar values, and, if the first value in the list is NULL, it will try to fill it in with the second value. The COALESCE function will keep continuing down the list of values until it hits a non-NULL value. If all values in the COALESCE function are NULL, then the function returns NULL.

To illustrate a simple usage of the COALESCE function, let's say we want a list of the names and titles of our employees. However, for those with no title, we want to instead write the value 'NO TITLE'. We can accomplish this request with COALESCE:

SELECT
    first_name,
    last_name,
    COALESCE(title, 'NO TITLE') AS title
FROM
    employees;

This query produces the following results:

first_name last_name title
Amy Jordan Ms
Bill Tibb Mr
Bill Sadat NO TITLE
Christine Riveles Mrs
David Guerin Honorable

When dealing with creating default values and avoiding NULL, COALESCE will always be helpful.

NULLIF

NULLIF is, in a sense, the opposite of COALESCE. NULLIF is a two-value function and will return NULL if the first value equals the second value.

As an example, imagine that we want a list of the names and titles of our employees. However, this time, we want to replace the title 'Honorable' with NULL. This could be done with the following query:

SELECT
    first_name,
    last_name,
    NULLIF(title, 'Honorable') AS title
FROM
    employees;

This will blot out all mentions of 'Honorable' from the title column and give the following output:

first_name last_name title
Amy Jordan Ms
Bill Tibb Mr
Bill Sadat
Christine Riveles Mrs
David Guerin

LEAST / GREATEST

Two functions often come in handy for data preparation are the LEAST and GREATEST functions. Each function takes any number of values and returns the least or the greatest of the values, respectively.

A simple use of this variable would be to replace the value if it's too high or low. For example, say the minimum wage increased to $15/hour and we need to change the wages of any employee earning less than that. We can create this using the following query:

SELECT
    id,
    first_name,
    last_name,
    title,
    age,
    GREATEST(15, wage) as wage,
    hire_date
FROM
    employees;

This query will give the following output:

id first_name last_name title age wage hire_date
1 Amy Jordan Ms 24 15 2019-04-27
2 Bill Tibb Mr 61 28 2012-05-02
3 Bill Sadat 18 15 2019-11-08
4 Christine Riveles Mrs 36 20 2018-03-30
5 David Guerin Honorable 28 20 2016-11-02

As you can see, Bill Sadat’s wage has increased from $12 to $15.

Casting

Another useful data transformation is to change the data type of a column within a query. This is usually done to use a function only available to one data type, such as text, while working with a column that is in a different data type, such as a numeric.

To change the data type of a column, you simply need to use the column::datatype format, where column is the column name, and datatype is the data type you want to change the column to. For example, to change the age in the employees table to a text column in a query, use the following query:

SELECT
    first_name,
    last_name,
    age::TEXT
FROM
    employees;

This will convert the age column from an integer to text. You can now apply text functions to this transformed column. There is one final catch; not every data type can be cast to a specific data type. For instance, datetime cannot be cast to float types. Your SQL client will throw an error if you ever make an unexpected strange conversion.

DISTINCT

Often, when looking through a dataset, you may be interested in determining the unique values in a column or group of columns. This is the primary use case of the DISTINCT keyword. For example, if you wanted to know all the unique first names in the employees table, you could use the following query:

SELECT
    DISTINCT first_name
FROM
    employees;

This gives the following result:

first_name
Amy
Bill
Christine
David

You can also use DISTINCT with multiple columns to get all distinct column combinations present.


I hope you enjoyed this tutorial on data cleaning and transformation with SQL. This is just the beginning of what you can use SQL for in data analysis. If you’d like to learn more, Next Tech’s SQL for Data Analysis course covers:

  • More functions used for data preparation and cleaning
  • Aggregate functions and window functions
  • Importing and exporting data
  • Analytics using complex data types
  • Writing performant queries

You can get started for free here!

Posted on by:

lorrli274 profile

Lorraine

@lorrli274

Data Scientist @ Next Tech

Next Tech

Helping you build what's next in tech.

Discussion

markdown guide