One of the first tasks performed when doing data analytics is to create clean the dataset you're working with. The insights you draw from your data are only as good as the data itself, so it's no surprise that an estimated 80% of the time spent by analytics professionals involves preparing data for use in analysis.
SQL can help expedite this important task. In this tutorial, we will discuss different functions commonly used to clean, transform, and remove duplicate data from query outputs that may not be in the form we would like. This means you'll learn about:
We will be using the following sample table,
employees, throughout this tutorial to illustrate how our functions work:
This data is preloaded into a Next Tech sandbox for you to experiment and test the below queries. Connect to the database here for free!
Let’s get started!
This tutorial is adapted from Next Tech’s full SQL for Data Analysis course, which includes an in-browser sandboxed environment and interactive activities and challenges using real datasets. You can get started with this course here!
CASE WHEN is a function that allows a query to map various values in a column to other values. The general format of a
CASE WHEN statement is:
CASE WHEN condition1 THEN value1 WHEN condition2 THEN value2 ... WHEN conditionX THEN valueX ELSE else_value END
conditionX, are Boolean conditions;
valueX, are values to map the Boolean conditions; and
else_value is the value that is mapped if none of the Boolean conditions are met.
For each row, the program starts at the top of the
CASE WHEN statement and evaluates the first Boolean condition. The program then runs through each Boolean condition from the first one. For the first condition from the start of the statement that evaluates as true, the statement will return the value associated with that condition. If none of the statements evaluate as true, then the value associated with the
ELSE statement will be returned.
As an example, let's say you wanted to return all rows for employees from the
employees table. Additionally, you would like to add a column that labels an employee as being an
New employee if they were hired after 2019-01-01. Otherwise, it will mark the employee as a
Standard employee. This column will be called
employee_type. We can create this table by using a
CASE WHEN statement as follows:
SELECT *, CASE WHEN hire_date >= '2019-01-01' THEN 'New' ELSE 'Standard' END AS employee_type FROM employees;
This query will give the following output:
CASE WHEN statement effectively mapped a hire date to a string describing the employee type. Using a
CASE WHEN statement, you can map values in any way you please.
Another useful technique is to replace
NULL values with a standard value. This can be accomplished easily by means of the
COALESCE allows you to list any number of columns and scalar values, and, if the first value in the list is
NULL, it will try to fill it in with the second value. The
COALESCE function will keep continuing down the list of values until it hits a
non-NULL value. If all values in the
COALESCE function are
NULL, then the function returns
To illustrate a simple usage of the
COALESCE function, let's say we want a list of the names and titles of our employees. However, for those with no title, we want to instead write the value
'NO TITLE'. We can accomplish this request with
SELECT first_name, last_name, COALESCE(title, 'NO TITLE') AS title FROM employees;
This query produces the following results:
When dealing with creating default values and avoiding
COALESCE will always be helpful.
NULLIF is, in a sense, the opposite of
NULLIF is a two-value function and will return
NULL if the first value equals the second value.
As an example, imagine that we want a list of the names and titles of our employees. However, this time, we want to replace the title
NULL. This could be done with the following query:
SELECT first_name, last_name, NULLIF(title, 'Honorable') AS title FROM employees;
This will blot out all mentions of
'Honorable' from the
title column and give the following output:
Two functions often come in handy for data preparation are the
GREATEST functions. Each function takes any number of values and returns the least or the greatest of the values, respectively.
A simple use of this variable would be to replace the value if it's too high or low. For example, say the minimum wage increased to $15/hour and we need to change the wages of any employee earning less than that. We can create this using the following query:
SELECT id, first_name, last_name, title, age, GREATEST(15, wage) as wage, hire_date FROM employees;
This query will give the following output:
As you can see, Bill Sadat’s wage has increased from $12 to $15.
Another useful data transformation is to change the data type of a column within a query. This is usually done to use a function only available to one data type, such as text, while working with a column that is in a different data type, such as a numeric.
To change the data type of a column, you simply need to use the
column::datatype format, where
column is the column name, and
datatype is the data type you want to change the column to. For example, to change the age in the
employees table to a text column in a query, use the following query:
SELECT first_name, last_name, age::TEXT FROM employees;
This will convert the
age column from an integer to text. You can now apply text functions to this transformed column. There is one final catch; not every data type can be cast to a specific data type. For instance,
datetime cannot be cast to float types. Your SQL client will throw an error if you ever make an unexpected strange conversion.
Often, when looking through a dataset, you may be interested in determining the unique values in a column or group of columns. This is the primary use case of the
DISTINCT keyword. For example, if you wanted to know all the unique first names in the
employees table, you could use the following query:
SELECT DISTINCT first_name FROM employees;
This gives the following result:
You can also use
DISTINCT with multiple columns to get all distinct column combinations present.
I hope you enjoyed this tutorial on data cleaning and transformation with SQL. This is just the beginning of what you can use SQL for in data analysis. If you’d like to learn more, Next Tech’s SQL for Data Analysis course covers:
- More functions used for data preparation and cleaning
- Aggregate functions and window functions
- Importing and exporting data
- Analytics using complex data types
- Writing performant queries
You can get started for free here!