DEV Community

Aspersh Upadhyay
Aspersh Upadhyay

Posted on • Originally published at Medium on

SQL For Data Analysis -Basic To Advanced Queries

Master data analysis with this practical SQL guide

SQL for Data Analysis — Basic to Advanced Queries
Image by Author

Learn how to leverage SQL’s powerful features like aggregates , joins , subqueries , window functions , and CTEs to query, manipulate , and gain insights from relational data. Whether you’re a beginner looking to learn SQL basics or an expert wanting to strengthen your data analysis skills, this tutorial will take your SQL queries to the next level.

Follow along with clear examples demonstrating each concept so you can apply these data manipulation techniques in your own projects. By the end, you’ll have the SQL proficiency to extract game-changing business intelligence from your company’s data.

What is SQL?

SQL stands for Structured Query Language, and it’s a programming language that’s used to manage and manipulate relational databases. Relational databases are collections of data that are organized into tables with rows and columns. SQL allows you to extract data from these tables, modify the data, and insert new data.

Basic SQL Syntax

The basic syntax of SQL consists of commands that are used for creating, modifying, and querying databases. The syntax consists of commands such as SELECT, INSERT, UPDATE, DELETE, and CREATE. SQL commands are not case sensitive, but it is common practice to write them in uppercase. Now further real story begins.

Retrieving Data from a Single Table

In SQL, the SELECT statement is used for retrieving data from a table. The FROM clause specifies which table you will use. The WHERE clause is used for filtering data, the ORDER BY clause is used for sorting data, and the LIMIT clause is used for limiting the number of rows returned.

SELECT & FROM

The SELECT statement specifies the columns that you want to retrieve, and it is followed by the column names that you want to select. If you want to retrieve all the columns, you can use the *symbol in place of the column names.

The FROM clause specifies the table or tables from which you want to retrieve the data, and it is used after the SELECT statement. By combining the SELECT statement and FROM clause, you can retrieve specific columns or all columns from one or more tables.

SELECT first_name, last_name
FROM employees
Enter fullscreen mode Exit fullscreen mode

WHERE

The WHERE clause is used to filter data based on specific conditions. You can use various operators such as =, <>, <, >, <=, >=, LIKE, IN, BETWEEN, IS NULL, and IS NOT NULL to define the conditions

SELECT * FROM employees
WHERE salary > 50000 AND department = 'IT';
Enter fullscreen mode Exit fullscreen mode

ORDER BY

The ORDER BY clause is used to sort data in ascending or descending order based on one or multiple columns. You can specify the ASC keyword for ascending order and DESC keyword for descending order. If no keyword is provided, the default is ascending order

SELECT * FROM employees
ORDER BY last_name ASC, first_name DESC;
Enter fullscreen mode Exit fullscreen mode

LIMIT

The LIMIT clause is used for limiting the number of rows returned. It is used to limit the number of rows returned to a certain number.

SELECT id, name, salary
FROM employees
LIMIT 30;
Enter fullscreen mode Exit fullscreen mode

Joins — Retrieve data from mulitple tables

In SQL, the JOIN operation is used for retrieving data from multiple tables. There are four types of JOIN operations — INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN.

INNER JOIN

An INNER JOIN returns only the rows that have matching values in both tables. It is the most common type of JOIN operation. In this case, the result set will only include the records where there is a match in both tables.

SELECT employees.name, departments.department_name
FROM employees
INNER JOIN departments 
ON employees.department_id = departments.department_id;
Enter fullscreen mode Exit fullscreen mode

LEFT JOIN

A LEFT JOIN returns all the rows from the left table (table1) and the matched rows from the right table (table2). If there is no match, NULL values are returned for the columns from the right table.

SELECT employees.name, departments.department_name
FROM employees
LEFT JOIN departments 
ON employees.department_id = departments.department_id;
Enter fullscreen mode Exit fullscreen mode

RIGHT JOIN

A RIGHT JOIN works similarly to a LEFT JOIN, but it returns all the rows from the right table (table2) and the matched rows from the left table (table1). If there is no match, NULL values are returned for the columns from the left table.

SELECT employees.name, departments.department_name
FROM employees
RIGHT JOIN departments 
ON employees.department_id = departments.department_id;
Enter fullscreen mode Exit fullscreen mode

FULL OUTER JOIN

A FULL OUTER JOIN combines the results of both LEFT JOIN and RIGHT JOIN. It returns all the rows from both tables, with NULL values for unmatched columns.

SELECT employees.name, departments.department_name
FROM employees
FULL OUTER JOIN departments 
ON employees.department_id = departments.department_id;
Enter fullscreen mode Exit fullscreen mode

Bonus — SELF JOIN

A self join is a type of SQL join where a table is joined to itself. It is used to combine rows from the same table based on a related column. Self joins are useful when you want to compare rows within the same table or find relationships between the rows of the same table. To perform a self join, you need to use aliases to differentiate the two instances of the same table.

SELECT e1.employee_name AS Employee, e2.employee_name AS Manager
FROM employees e1
JOIN employees e2 ON e1.manager_id = e2.employee_id;
Enter fullscreen mode Exit fullscreen mode

Filtering and Sorting Data

In SQL, the GROUP BY clause is used for grouping data, and the HAVING clause is used for filtering data after grouping. WHERE and ORDER BY We have already covered it in above section👆

GROUP BY clause

The GROUP BY clause is used for grouping data based on one or more columns. It is used with aggregate functions such as COUNT, SUM, AVG, MIN, and MAX.

SELECT department, COUNT(*) as employee_count
FROM employees
GROUP BY department;
Enter fullscreen mode Exit fullscreen mode

HAVING clause

The HAVING clause is a useful feature in SQL that allows you to filter the results of a query based on a condition applied to aggregated data. It is often used in conjunction with the GROUP BY clause, which groups rows with similar values together. The HAVING clause is particularly helpful when you want to retrieve specific groups based on a certain criteria, such as the sum or average of a column.

SELECT product_id, SUM(quantity_sold) as total_units_sold
FROM sales
GROUP BY product_id
HAVING total_units_sold > 100;
Enter fullscreen mode Exit fullscreen mode

LIKE operator with wildcards

The LIKE operator is used in conjunction with wildcards to filter data based on patterns. The % wildcard represents any number of characters, while the _ wildcard represents a single character.

SELECT * FROM employees
WHERE last_name LIKE 'Smi%';
Enter fullscreen mode Exit fullscreen mode

Aggregate Function

Aggregate functions are used to perform calculations on a set of values, and they return a single value as the result. They are super helpful when you need to analyze and summarize data in your database.

Some common aggregate functions in SQL are:

  • COUNT(): This function returns the number of rows that match a specified condition.
  • SUM(): This function returns the total sum of a numeric column.
  • AVG(): This function returns the average value of a numeric column.
  • MIN(): This function returns the smallest value of a selected column.
  • MAX(): This function returns the largest value of a selected column.

Let me show you some examples of how to use these aggregate functions in SQL:

COUNT():

Suppose you have a table called ‘employees’ and you want to know the total number of employees. You can use the COUNT() function like this:

SELECT COUNT(*) FROM employees;
Enter fullscreen mode Exit fullscreen mode

SUM():

If you want to calculate the total salary of all employees, you can use the SUM() function:

SELECT SUM(salary) FROM employees;
Enter fullscreen mode Exit fullscreen mode

AVG():

To find the average salary of all employees, you can use the AVG() function:

SELECT AVG(salary) FROM employees;
Enter fullscreen mode Exit fullscreen mode

MIN():

If you want to find the employee with the lowest salary, you can use the MIN() function:

SELECT MIN(salary) FROM employees;
Enter fullscreen mode Exit fullscreen mode

MAX():

vvSimilarly, to find the employee with the highest salary, you can use the MAX() function:

SELECT MAX(salary) FROM employees;
Enter fullscreen mode Exit fullscreen mode

You can also use aggregate functions with the GROUP BY clause to group the results by one or more columns. For example, if you want to find the total salary for each department, you can do this:

SELECT department_id, SUM(salary) 
FROM employees 
GROUP BY department_id;
Enter fullscreen mode Exit fullscreen mode

J ust a Quick Reminder : Feel free to follow Aspersh Upadhyay for more such interesting topics. If you want free learning resources related to Data Science and related field. Join our Telegram Channel Bit of Data Science

Subqueries

Subqueries are a powerful feature in SQL that allows you to perform complex queries by breaking them down into smaller, more manageable parts. In simple terms, a subquery is a query embedded within another query, often used to filter or modify the results of the outer query. They can be used in various parts of a SQL query, such as the SELECT, WHERE, and JOIN clauses.

Let’s dive into the details of subqueries with a few examples:

Basic subquery:

A basic subquery is a query that is enclosed within parentheses and used within the SELECT, WHERE, or JOIN clause of another query. It returns a result set that can be used as a value or a condition for the outer query.

Example:

SELECT * FROM employees 
WHERE salary > (SELECT AVG(salary) FROM employees);
Enter fullscreen mode Exit fullscreen mode

Correlated subquery:

A correlated subquery is a subquery that refers to a value in the outer query. It is used to filter or modify the results of the outer query based on the value of a specific column in the outer query.

Example:

SELECT * 
FROM employees 
WHERE department_id = (SELECT department_id FROM departments WHERE name = 'IT');
Enter fullscreen mode Exit fullscreen mode

Nested subquery:

A nested subquery is a subquery that is used within another subquery. This can be useful when you need to perform complex calculations or comparisons based on multiple levels of data.

Example:

SELECT * 
FROM employees 
WHERE salary > (SELECT AVG(salary) 
FROM (SELECT * FROM employees WHERE department_id = 1));
Enter fullscreen mode Exit fullscreen mode

Common Table Expressions (CTEs)

Common Table Expressions, or CTEs, are a powerful feature in SQL that allows you to create temporary result sets that can be referenced within a SELECT, INSERT, UPDATE, or DELETE statement. CTEs are particularly useful for simplifying complex queries, breaking them down into smaller, more manageable parts. They can also be used to create recursive queries for hierarchical data.

To create a CTE, you use the WITH keyword followed by the CTE name and the column names in parentheses. Then, you provide the query that defines the CTE. Let’s take a look at a simple example:

WITH customer_total_orders (customer_id, total_orders) AS (
  SELECT customer_id, COUNT(*) as total_orders
  FROM orders
  GROUP BY customer_id
)
SELECT *
FROM customer_total_orders
WHERE total_orders > 10;
Enter fullscreen mode Exit fullscreen mode

CTEs can also be chained together, allowing you to reference one CTE in another. Here’s an example:

WITH customer_total_orders (customer_id, total_orders) AS (
  SELECT customer_id, COUNT(*) as total_orders
  FROM orders
  GROUP BY customer_id
),
top_customers (customer_id) AS (
  SELECT customer_id
  FROM customer_total_orders
  WHERE total_orders > 10
)
SELECT c.*, t.total_orders
FROM customers c
JOIN top_customers tc ON c.customer_id = tc.customer_id
JOIN customer_total_orders t ON c.customer_id = t.customer_id;
Enter fullscreen mode Exit fullscreen mode

Finally, CTEs can be used to create recursive queries. For example, if you have a table storing hierarchical data, such as employees and their managers, you can use a recursive CTE to retrieve the entire hierarchy:

WITH RECURSIVE employee_hierarchy (employee_id, manager_id, level) AS (
  SELECT employee_id, manager_id, 1 as level
  FROM employees
  WHERE manager_id IS NULL

  UNION ALL

  SELECT e.employee_id, e.manager_id, eh.level + 1
  FROM employees e
  JOIN employee_hierarchy eh ON e.manager_id = eh.employee_id
)
SELECT *
FROM employee_hierarchy;
Enter fullscreen mode Exit fullscreen mode

Window Function

Window functions in SQL are a powerful tool that allows you to perform calculations across a set of rows related to the current row. They are called “ window functions ” because they provide a “ window ” into the surrounding data. Here are commonly used window functions:

ROW_NUMBER()

ROW_NUMBER() assigns a unique number to each row within the result set. It’s useful when you want to assign a sequential order to rows, like ranking or pagination.

Example:

SELECT name, age, 
ROW_NUMBER() OVER (ORDER BY age) as row_number
FROM people;
Enter fullscreen mode Exit fullscreen mode

RANK()

RANK() assigns a unique rank to each row within the result set, with the same rank assigned to rows with equal values. Rows with equal values get the same rank, and the next rank will be skipped.

Example:

SELECT name, score, 
RANK() OVER (ORDER BY score DESC) as rank
FROM exam_results;
Enter fullscreen mode Exit fullscreen mode

DENSE_RANK()

DENSE_RANK() is similar to RANK(), but it doesn’t skip any rank numbers. It assigns a unique rank to each row within the result set, with the same rank assigned to rows with equal values.

Example:

SELECT name, score, 
DENSE_RANK() OVER (ORDER BY score DESC) as dense_rank
FROM exam_results;
Enter fullscreen mode Exit fullscreen mode

SUM()

SUM() calculates the cumulative sum of a specific column’s value across the rows in the window frame. It’s useful when you want to find running totals.

Example:

SELECT date, sales, 
SUM(sales) OVER (ORDER BY date) as cumulative_sales
FROM daily_sales;
Enter fullscreen mode Exit fullscreen mode

AVG()

AVG() calculates the average of a specific column’s value across the rows in the window frame. It’s useful when you want to find the moving average.

Example:

SELECT date, temperature, 
AVG(temperature) 
OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as moving_average
FROM daily_temperatures;
Enter fullscreen mode Exit fullscreen mode

Conclusion:

SQL is a powerful and versatile programming language that can help you perform data analysis on relational databases. In this article, you learned the basics of SQL syntax, how to join data from multiple tables, how to filter and sort data, how to use aggregate functions, subqueries, common table expressions, and window functions. By applying these skills, you can query and manipulate data in various ways and gain insights from your data.

I hope you enjoyed this article and found it useful for your data analysis projects. If you have any questions or feedback, please let me know in the comments below. Thank you for reading!

For more exciting insights into the world of data and machine learning, be sure to follow me on Medium and connect with me on LinkedIn. I publish articles regularly and would love to continue the conversation with you!

Wanna Free Resources for Data Science, Artificial Intelligence, Machine Learning, Python, SQL and more. Join our Telegram channel Bits of Data Science

If you want to support me for writing, you can buy me a cup of coffee . Would be greatly appreciated.

Happy Learning SQL! 😀

Top comments (0)