## DEV Community

Lorraine for Next Tech

Posted on

# Cleaning and Transforming Data with SQL

One of the first tasks performed when doing data analytics is to create clean the dataset you're working with. The insights you draw from your data are only as good as the data itself, so it's no surprise that an estimated 80% of the time spent by analytics professionals involves preparing data for use in analysis.

SQL can help expedite this important task. In this tutorial, we will discuss different functions commonly used to clean, transform, and remove duplicate data from query outputs that may not be in the form we would like. This means you'll learn about:

• `CASE WHEN`
• `COALESCE`
• `NULLIF`
• `LEAST` / `GREATEST`
• Casting
• `DISTINCT`

We will be using the following sample table, `employees`, throughout this tutorial to illustrate how our functions work:

id first_name last_name title age wage hire_date
1 Amy Jordan Ms 24 15 2019-04-27
2 Bill Tibb Mr 61 28 2012-05-02
3 Bill Sadat 18 12 2019-11-08
4 Christine Riveles Mrs 36 20 2018-03-30
5 David Guerin Honorable 28 20 2016-11-02

This data is preloaded into a Next Tech sandbox for you to experiment and test the below queries. Connect to the database here for free!

Let’s get started!

This tutorial is adapted from Next Tech’s full SQL for Data Analysis course, which includes an in-browser sandboxed environment and interactive activities and challenges using real datasets. You can get started with this course here!

## `CASE WHEN`

`CASE WHEN` is a function that allows a query to map various values in a column to other values. The general format of a `CASE WHEN` statement is:

``````CASE
WHEN condition1 THEN value1
WHEN condition2 THEN value2
...
WHEN conditionX THEN valueX
ELSE else_value
END
``````

Here, `condition1` and `condition2`, through `conditionX`, are Boolean conditions; `value1` and `value2`, through `valueX`, are values to map the Boolean conditions; and `else_value` is the value that is mapped if none of the Boolean conditions are met.

For each row, the program starts at the top of the `CASE WHEN` statement and evaluates the first Boolean condition. The program then runs through each Boolean condition from the first one. For the first condition from the start of the statement that evaluates as true, the statement will return the value associated with that condition. If none of the statements evaluate as true, then the value associated with the `ELSE` statement will be returned.

As an example, let's say you wanted to return all rows for employees from the `employees` table. Additionally, you would like to add a column that labels an employee as being an `New` employee if they were hired after 2019-01-01. Otherwise, it will mark the employee as a `Standard` employee. This column will be called `employee_type`. We can create this table by using a `CASE WHEN` statement as follows:

``````SELECT
*,
CASE
WHEN hire_date >= '2019-01-01' THEN 'New'
ELSE 'Standard'
END AS employee_type
FROM
employees;
``````

This query will give the following output:

id first_name last_name title age wage hire_date employee_type
1 Amy Jordan Ms 24 15 2019-04-27 New
2 Bill Tibb Mr 61 28 2012-05-02 Standard
3 Bill Sadat 18 12 2019-11-08 New
4 Christine Riveles Mrs 36 20 2018-03-30 Standard
5 David Guerin Honorable 28 20 2016-11-02 Standard

The `CASE WHEN` statement effectively mapped a hire date to a string describing the employee type. Using a `CASE WHEN` statement, you can map values in any way you please.

## COALESCE

Another useful technique is to replace `NULL` values with a standard value. This can be accomplished easily by means of the `COALESCE` function. `COALESCE` allows you to list any number of columns and scalar values, and, if the first value in the list is `NULL`, it will try to fill it in with the second value. The `COALESCE` function will keep continuing down the list of values until it hits a `non-NULL` value. If all values in the `COALESCE` function are `NULL`, then the function returns `NULL`.

To illustrate a simple usage of the `COALESCE` function, let's say we want a list of the names and titles of our employees. However, for those with no title, we want to instead write the value `'NO TITLE'`. We can accomplish this request with `COALESCE`:

``````SELECT
first_name,
last_name,
COALESCE(title, 'NO TITLE') AS title
FROM
employees;
``````

This query produces the following results:

first_name last_name title
Amy Jordan Ms
Bill Tibb Mr
Christine Riveles Mrs
David Guerin Honorable

When dealing with creating default values and avoiding `NULL`, `COALESCE` will always be helpful.

## NULLIF

`NULLIF` is, in a sense, the opposite of `COALESCE`. `NULLIF` is a two-value function and will return `NULL` if the first value equals the second value.

As an example, imagine that we want a list of the names and titles of our employees. However, this time, we want to replace the title `'Honorable'` with `NULL`. This could be done with the following query:

``````SELECT
first_name,
last_name,
NULLIF(title, 'Honorable') AS title
FROM
employees;
``````

This will blot out all mentions of `'Honorable'` from the `title` column and give the following output:

first_name last_name title
Amy Jordan Ms
Bill Tibb Mr
Christine Riveles Mrs
David Guerin

## LEAST / GREATEST

Two functions often come in handy for data preparation are the `LEAST` and `GREATEST` functions. Each function takes any number of values and returns the least or the greatest of the values, respectively.

A simple use of this variable would be to replace the value if it's too high or low. For example, say the minimum wage increased to \$15/hour and we need to change the wages of any employee earning less than that. We can create this using the following query:

``````SELECT
id,
first_name,
last_name,
title,
age,
GREATEST(15, wage) as wage,
hire_date
FROM
employees;
``````

This query will give the following output:

id first_name last_name title age wage hire_date
1 Amy Jordan Ms 24 15 2019-04-27
2 Bill Tibb Mr 61 28 2012-05-02
3 Bill Sadat 18 15 2019-11-08
4 Christine Riveles Mrs 36 20 2018-03-30
5 David Guerin Honorable 28 20 2016-11-02

As you can see, Bill Sadat’s wage has increased from \$12 to \$15.

## Casting

Another useful data transformation is to change the data type of a column within a query. This is usually done to use a function only available to one data type, such as text, while working with a column that is in a different data type, such as a numeric.

To change the data type of a column, you simply need to use the `column::datatype` format, where `column` is the column name, and `datatype` is the data type you want to change the column to. For example, to change the age in the `employees` table to a text column in a query, use the following query:

``````SELECT
first_name,
last_name,
age::TEXT
FROM
employees;
``````

This will convert the `age` column from an integer to text. You can now apply text functions to this transformed column. There is one final catch; not every data type can be cast to a specific data type. For instance, `datetime` cannot be cast to float types. Your SQL client will throw an error if you ever make an unexpected strange conversion.

## DISTINCT

Often, when looking through a dataset, you may be interested in determining the unique values in a column or group of columns. This is the primary use case of the `DISTINCT` keyword. For example, if you wanted to know all the unique first names in the `employees` table, you could use the following query:

``````SELECT
DISTINCT first_name
FROM
employees;
``````

This gives the following result:

first_name
Amy
Bill
Christine
David

You can also use `DISTINCT` with multiple columns to get all distinct column combinations present.

I hope you enjoyed this tutorial on data cleaning and transformation with SQL. This is just the beginning of what you can use SQL for in data analysis. If you’d like to learn more, Next Tech’s SQL for Data Analysis course covers:

• More functions used for data preparation and cleaning
• Aggregate functions and window functions
• Importing and exporting data
• Analytics using complex data types
• Writing performant queries