Pre-requisites: This article assumes basic SQL knowledge and CRUD commands such as:
SQL, Structured Query Language, is a programming language used for manipulating and managing data in a relational database. Data Scientists use it to extract insights from data. A large amount of data used by data scientists lives in a relational database. This data can be extracted using SQL commands. SQL servers such as MySQL and PostgreSQL use SQL.
This article covers the essential SQL commands that data scientists rely on to effectively clean and filter data:
- Data Retrieval
- Conditions for Data Retrieval
- Data Aggregation
- Changing Data Types
- Joining Data From Different Tables
- Complex Conditions
This is the simplest method of data retrieval in a relational database.
It can be combined with conditional statements such as WHERE, ORDER BY, and GROUP BY to filter, sort, and group data.
-- To select specific columns in a table: SELECT column1, column2, column3 FROM table_name; -- To select everything in a table: SELECT * FROM table_name;
DISTINCT is used with SELECT to view unique values in a column.
For example, to know all the departments appearing in the column
department, we use DISTINCT. It returns a table of the departments appearing in that table.
SELECT DISTINCT department FROM employees;
This is a conditional statement used to filter data according to a specific condition.
SELECT column1, column2, column3 FROM table_name WHERE condition; -- for example: SELECT * FROM employees WHERE age >= 45; -- We can also filter data with more than one condition: SELECT employee_name, department, salary FROM employees WHERE department = 'Sales' AND salary >= 50000; SELECT * FROM employees WHERE department IN ('Finance', 'IT', 'HR');
This statement is used to group data based on one or more columns.
SELECT department, salary FROM employees GROUP BY department;
This is used to sort the results of a query either alphabetically or numerically.
The default sorting order in sql is
ASC. Therefore, you do not have to specify
ASC in your query.
SELECT employee, salary FROM employees ORDER BY salary;
However, to sort the results in a descending order, use the keyword
SELECT employee, salary FROM employees ORDER BY salary DESC;
When the records in a table are many, we may want to limit the number of records we get. For example, to view only the top 10 earners in the Finance department:
SELECT employee_name, department, salary FROM employees WHERE salary > 50000 ORDER BY salary LIMIT 10;
Aggregations are summaries of data used to gain insights on a dataset. They are often used with the GROUP BY clause.
Count returns the total number of rows. In the example below, we are displaying the number of employees in each department.
SELECT COUNT(employee_id) FROM employees GROUP BY department;
SUM() & AVG()
Sum returns the sum of all the values. In the example below, we use the GROUP BY statement to group the employees by department and calculate the total salary for each department:
SELECT department, SUM(salary) as total_salary FROM employees GROUP BY department;
Avg returns the average value. In the example below, we use the GROUP BY statement to group the employees by department and calculate the average salary for each department:
SELECT department, AVG(salary) as avg_salary FROM employees GROUP BY department;
Having is used to add additional conditions after calculating a grouped aggregation.
For example, the above query can be conditioned further to only show departments with an average salary above 50000.
SELECT department, AVG(salary) as avg_salary FROM employees GROUP BY department HAVING AVG(salary) > 50000;
MIN() & MAX()
To know the lowest or highest values in a column, we can use
SELECT MIN(salary) AS lowest_salary FROM employees; SELECT MAX(salary) AS highest_salary FROM employees;
SQL sees numeric values as numbers even when dealing with money. We can change
salary values to dollar amounts using the CAST function:
SELECT department, CAST(SUM(salary) as money) FROM employees GROUP BY department ORDER BY SUM(salary) DESC;
We can also change numbers into floats, text, or date and time.
When aggregations cause many decimal points, we can round off the decimal points:
SELECT department, ROUND(AVG(salary), 2) as avg_salary FROM employees GROUP BY department;
Working with a single table limits the number of manipulations we can do with data. This is where JOINs come in. We are able to join data from multiple tables.
Before we go any further, we need to distinguish a promary key from a foreign key. A primary key is a column used to uniquely identify records in a table. For example, the primary key in the
employees table is employee_id. On the other hand, a foreign key is used to relate two tables.
A foreign key is usually a primary key in the other table. A separate table having information about when employees take vacation days (
employee_vacation table) will have a column
employee_id to relate to the employee table. Therefore, employee_id is a primary key in the employee table but a foreign key in the employee_vacation table.
There are different types of SQL joins which are best illustrated using venn diagrams.
The following examples will feature a customer database with
An inner join is used to view data where records in two tables match on two columns. The example below shows the order_id and customer_name
if the customer_id on the orders table and the customer_id on the customers table are the same.
SELECT orders.order_id, customers.customer_name FROM orders INNER JOIN customers ON orders.customer_id = customers.customer_id;
An INNER JOIN is also known as a JOIN and therefore, the code above can be written as:
SELECT orders.order_id, customers.customer_name FROM orders JOIN customers ON orders.customer_id = customers.customer_id; -- We can filter the data to not show a specific customer: SELECT orders.order_id, customers.customer_name FROM orders JOIN customers ON orders.customer_id = customers.customer_id WHERE customers.customer_name != 'Lucy Lucy';
You can also work with more than two tables:
SELECT orders.order_id, customers.customer_name, shippers.shipper_name FROM orders JOIN customers ON orders.customer_id = customers.customer_id JOIN shippers ON orders.shipper_id = shippers.shipper_id;
The table before the statement
LEFT JOINis the left table while the one after is the right table.
A LEFT JOIN will return
all the records in the left table and the matching records in the right table. If there are no matching records, the result will contain
-- customers = left table SELECT customers.customer_name, orders.order_id FROM customers LEFT JOIN orders ON customers.customer_id = orders.customer_id;
The table before the statement
JOINis the right table while the one after is the left table.
A RIGHT JOIN will return
all the records in the right table and the matching records in the left table. If there are no matching records, the result will contain
-- customers = right table SELECT customers.customer_name, orders.order_id FROM customers RIGHT JOIN orders ON customers.customer_id = orders.customer_id;
This is a query within another query, also known as a Nested query. It is usually embedded within the WHERE clause.
-- showing the highest paid employees SELECT * FROM employees WHERE salary = (SELECT MAX(salary) FROM employees);
This can be used when you need to add a category where the values are determined by an
if...else statement(CASE statement)
SELECT order_id, order_total, CASE WHEN order_total < 20 THEN 'Order total is less than $20' ELSE 'Order total is greater than $20' END AS sales_threshold FROM orders;
Common Table Expressions (CTEs)
CTEs are used to create temporary tables that are then used to extract the information we need.
-- weekly_orders is the temporary table WITH weekly_orders AS( SELECT customer_id, DATE_PART('week', order_date) AS week, COUNT(order_id) AS order_numbers FROM orders GROUP BY customer_id, week) SELECT customer_id, AVG(order_numbers) FROM weekly_rentals GROUP BY customer_id
I hope you found this article helpful in your SQL journey. Practice questions will definitely help you retain all the information you have learned. Use platforms like Hackerranck to level up your SQL skills.
If you found this article helpful, make sure to like it or leave a comment.