One of the first tasks performed when doing data analytics is to create clean the dataset you're working with. The insights you draw from your data are only as good as the data itself, so it's no surprise that an estimated 80% of the time spent by analytics professionals involves preparing data for use in analysis.
SQL can help expedite this important task. In this tutorial, we will discuss different functions commonly used to clean, transform, and remove duplicate data from query outputs that may not be in the form we would like. This means you'll learn about:
CASE WHEN
COALESCE
NULLIF
-
LEAST
/GREATEST
- Casting
DISTINCT
We will be using the following sample table, employees
, throughout this tutorial to illustrate how our functions work:
id | first_name | last_name | title | age | wage | hire_date |
---|---|---|---|---|---|---|
1 | Amy | Jordan | Ms | 24 | 15 | 2019-04-27 |
2 | Bill | Tibb | Mr | 61 | 28 | 2012-05-02 |
3 | Bill | Sadat | 18 | 12 | 2019-11-08 | |
4 | Christine | Riveles | Mrs | 36 | 20 | 2018-03-30 |
5 | David | Guerin | Honorable | 28 | 20 | 2016-11-02 |
This data is preloaded into a Next Tech sandbox for you to experiment and test the below queries. Connect to the database here for free!
Let’s get started!
This tutorial is adapted from Next Tech’s full SQL for Data Analysis course, which includes an in-browser sandboxed environment and interactive activities and challenges using real datasets. You can get started with this course here!
CASE WHEN
CASE WHEN
is a function that allows a query to map various values in a column to other values. The general format of a CASE WHEN
statement is:
CASE
WHEN condition1 THEN value1
WHEN condition2 THEN value2
...
WHEN conditionX THEN valueX
ELSE else_value
END
Here, condition1
and condition2
, through conditionX
, are Boolean conditions; value1
and value2
, through valueX
, are values to map the Boolean conditions; and else_value
is the value that is mapped if none of the Boolean conditions are met.
For each row, the program starts at the top of the CASE WHEN
statement and evaluates the first Boolean condition. The program then runs through each Boolean condition from the first one. For the first condition from the start of the statement that evaluates as true, the statement will return the value associated with that condition. If none of the statements evaluate as true, then the value associated with the ELSE
statement will be returned.
As an example, let's say you wanted to return all rows for employees from the employees
table. Additionally, you would like to add a column that labels an employee as being an New
employee if they were hired after 2019-01-01. Otherwise, it will mark the employee as a Standard
employee. This column will be called employee_type
. We can create this table by using a CASE WHEN
statement as follows:
SELECT
*,
CASE
WHEN hire_date >= '2019-01-01' THEN 'New'
ELSE 'Standard'
END AS employee_type
FROM
employees;
This query will give the following output:
id | first_name | last_name | title | age | wage | hire_date | employee_type |
---|---|---|---|---|---|---|---|
1 | Amy | Jordan | Ms | 24 | 15 | 2019-04-27 | New |
2 | Bill | Tibb | Mr | 61 | 28 | 2012-05-02 | Standard |
3 | Bill | Sadat | 18 | 12 | 2019-11-08 | New | |
4 | Christine | Riveles | Mrs | 36 | 20 | 2018-03-30 | Standard |
5 | David | Guerin | Honorable | 28 | 20 | 2016-11-02 | Standard |
The CASE WHEN
statement effectively mapped a hire date to a string describing the employee type. Using a CASE WHEN
statement, you can map values in any way you please.
COALESCE
Another useful technique is to replace NULL
values with a standard value. This can be accomplished easily by means of the COALESCE
function. COALESCE
allows you to list any number of columns and scalar values, and, if the first value in the list is NULL
, it will try to fill it in with the second value. The COALESCE
function will keep continuing down the list of values until it hits a non-NULL
value. If all values in the COALESCE
function are NULL
, then the function returns NULL
.
To illustrate a simple usage of the COALESCE
function, let's say we want a list of the names and titles of our employees. However, for those with no title, we want to instead write the value 'NO TITLE'
. We can accomplish this request with COALESCE
:
SELECT
first_name,
last_name,
COALESCE(title, 'NO TITLE') AS title
FROM
employees;
This query produces the following results:
first_name | last_name | title |
---|---|---|
Amy | Jordan | Ms |
Bill | Tibb | Mr |
Bill | Sadat | NO TITLE |
Christine | Riveles | Mrs |
David | Guerin | Honorable |
When dealing with creating default values and avoiding NULL
, COALESCE
will always be helpful.
NULLIF
NULLIF
is, in a sense, the opposite of COALESCE
. NULLIF
is a two-value function and will return NULL
if the first value equals the second value.
As an example, imagine that we want a list of the names and titles of our employees. However, this time, we want to replace the title 'Honorable'
with NULL
. This could be done with the following query:
SELECT
first_name,
last_name,
NULLIF(title, 'Honorable') AS title
FROM
employees;
This will blot out all mentions of 'Honorable'
from the title
column and give the following output:
first_name | last_name | title |
---|---|---|
Amy | Jordan | Ms |
Bill | Tibb | Mr |
Bill | Sadat | |
Christine | Riveles | Mrs |
David | Guerin |
LEAST / GREATEST
Two functions often come in handy for data preparation are the LEAST
and GREATEST
functions. Each function takes any number of values and returns the least or the greatest of the values, respectively.
A simple use of this variable would be to replace the value if it's too high or low. For example, say the minimum wage increased to $15/hour and we need to change the wages of any employee earning less than that. We can create this using the following query:
SELECT
id,
first_name,
last_name,
title,
age,
GREATEST(15, wage) as wage,
hire_date
FROM
employees;
This query will give the following output:
id | first_name | last_name | title | age | wage | hire_date |
---|---|---|---|---|---|---|
1 | Amy | Jordan | Ms | 24 | 15 | 2019-04-27 |
2 | Bill | Tibb | Mr | 61 | 28 | 2012-05-02 |
3 | Bill | Sadat | 18 | 15 | 2019-11-08 | |
4 | Christine | Riveles | Mrs | 36 | 20 | 2018-03-30 |
5 | David | Guerin | Honorable | 28 | 20 | 2016-11-02 |
As you can see, Bill Sadat’s wage has increased from $12 to $15.
Casting
Another useful data transformation is to change the data type of a column within a query. This is usually done to use a function only available to one data type, such as text, while working with a column that is in a different data type, such as a numeric.
To change the data type of a column, you simply need to use the column::datatype
format, where column
is the column name, and datatype
is the data type you want to change the column to. For example, to change the age in the employees
table to a text column in a query, use the following query:
SELECT
first_name,
last_name,
age::TEXT
FROM
employees;
This will convert the age
column from an integer to text. You can now apply text functions to this transformed column. There is one final catch; not every data type can be cast to a specific data type. For instance, datetime
cannot be cast to float types. Your SQL client will throw an error if you ever make an unexpected strange conversion.
DISTINCT
Often, when looking through a dataset, you may be interested in determining the unique values in a column or group of columns. This is the primary use case of the DISTINCT
keyword. For example, if you wanted to know all the unique first names in the employees
table, you could use the following query:
SELECT
DISTINCT first_name
FROM
employees;
This gives the following result:
first_name |
---|
Amy |
Bill |
Christine |
David |
You can also use DISTINCT
with multiple columns to get all distinct column combinations present.
I hope you enjoyed this tutorial on data cleaning and transformation with SQL. This is just the beginning of what you can use SQL for in data analysis. If you’d like to learn more, Next Tech’s SQL for Data Analysis course covers:
- More functions used for data preparation and cleaning
- Aggregate functions and window functions
- Importing and exporting data
- Analytics using complex data types
- Writing performant queries
You can get started for free here!
Top comments (0)