Marriane Akeyo

Posted on Apr 17, 2023 • Edited on May 4, 2023

SQL 102:Intermediate SQL

#sql #database #postgres #dataengineering

This post is a continuation of the Introduction to SQL. However, if you are familiar with basic SQL concepts you can take a read through this article for more advance concepts.

CASE WHEN AND THEN

Evaluates a set of conditions just like an if else statement and return a corresponding result when the condition is true.
The corresponding syntax is as shown:

SELECT column1,
       column2,
       CASE WHEN condition1 THEN result1
            WHEN condition2 THEN result2
           ...(other when conditions)
           ELSE default_result
       END AS new_column_name
FROM table_name;

Lets look at a more concrete example. Assuming we have a matches table with different teams and how they played and we want to determine if the won when they played at home, we shall have the following query.

SELECT id,
       name,
       CASE WHEN home_goal > away_goal THEN 'Its a home win'
            WHEN home_goal < away_goal THEN 'Its a home loss'
            ELSE 'Its a tie'
       END AS home_scores
INTO home_analysis
FROM matches;

The output of the above code is three columns id, name and home_scores which are stored inside the home_analysis table.
We used the CASE statement to filter the results of the home_scores column.

You can also use a CASE statement to aggregate data based on the results of a WHERE clause, since aggregate functions cannot be used directly in a WHERE clause.

Let's say you have a table called "employees" with columns "name", "salary", and "department". You want to create a report that shows the total salary for each department, and categorize each department as either "High Paying" or "Low Paying" based on the total salary. The results should include only employees hired on or after January 1, 2023

SELECT department,
       SUM(salary) as total_salary
       CASE WHEN SUM(salary) >= 1000000 THEN 'High paying dpt'
            ELSE 'Low paying dpt'
       END AS dpt_salary
FROM employees
WHERE hire_date >= '2023-01-01'
GROUP BY department;

The sum of the salaries will only be placed in any category if they satisfy the where clause first.

SUBQUERIES

They are queries found inside other queries hence also known as nested queries.The subquery is executed first, and the results are then used in the main query. This allows you to perform more complex queries that involve multiple tables and conditions.
The syntax is as shown:

SELECT column
FROM (SELECT column
      FROM table) AS subquery

A subquery can be placed at any part of your query, such as the SELECT, FROM, WHERE or GROUP BY
The allow you to compare summarized values to detailed data and also reshape your data as desired.
A subquery is also used to combine data that cannot be joined.

A subquery in the where clause

It is used to filter data based on the results of another table.For example, if you have a table of customers and a table of orders, you can use a subquery to find all customers who have made at least one order.

SELECT *
FROM customers
WHERE id IN (SELECT c_id FROM orders);

A subquery in the FROM clause

They are used to restructure and transform the data to be selected. Ensure you alias the subquery for easy reference.
Lets use the customer and order table for reference and see the orders for each customer in the 2022/2023 season.

SELECT name,
       order
FROM (SELECT c.name AS name 
             o.id AS order
      FROM customers AS c
      INNER JOIN orders AS o
      ON c.order = o.id)
      AS customer_order
WHERE season = '2022/2023';

We can also create more than one subquery inside a from statement.

SELECT customers.customer_id,
      customers.first_name, '
      customers.last_name, 
      orders.total
FROM (
   SELECT customer_id, 
          SUM(price * quantity) AS total
   FROM order_items
   GROUP BY customer_id
   ) AS orders
INNER JOIN customers 
ON orders.customer_id = customers.customer_id
WHERE orders.total > (
   SELECT AVG(total)
       FROM (
          SELECT customer_id, 
                 SUM(price * quantity) AS total
          FROM order_items
          GROUP BY customer_id
         ) AS order_totals
)

In this example, we're using two subqueries in the FROM clause. The first subquery calculates the total amount spent by each customer on all of their orders. The second subquery calculates the average total amount spent by all customers.

The main query then joins the orders subquery with the customers table to retrieve the customer's ID, first name, and last name, as well as their total amount spent. Finally, the WHERE clause filters the results to only include customers whose total amount spent is greater than the average total amount spent by all customers.

Subquries in the select statement

They can be used to perform calculations based on data from another table. For example, if you have a table of orders and a table of products, you can use a subquery to calculate the total price of each order.

SELECT id, 
       (SELECT SUM(price * quantity)
        FROM products 
         WHERE order.id = products.order_id) AS total_price
FROM orders;

It can also be used to retrieve information from a table as we saw above we got the total_price of our commodities from products table.
The big question however is, when is it advisable to use subqueries considering that each subquery requires additional computing power. There are certain factors to consider when selecting a subquery:
1.Data complexity
Subqueries are useful when working with complex data queries that require multiple tables and conditions. If the query involves multiple subqueries or multiple nested subqueries, it can become very difficult to read and maintain. In this case, it may be better to break the query into smaller parts or consider using a different approach.

2.Data volume
Subqueries can be slow to execute when dealing with large datasets. If the query involves a large amount of data, it may be better to use a join or a different type of query.

3.Performance
The performance of a subquery can depend on the database engine being used. Some database engines can optimize subqueries for better performance. It's important to consider the performance implications when deciding whether to use a subquery.

4.Maintainability
Subqueries can make queries more difficult to read and maintain. It's important to consider whether the use of a subquery will make the query more or less maintainable in the long run.
5.Query reuse
If the same subquery is going to be used multiple times in different parts of a query or across multiple queries, it may be more efficient to create a view or a temporary table instead of using a subquery.

Corelated Subqueries

It is a type of subquery which uses the values from the outer query to generate results, by referencing one or more columns in the main query. It is always re-run for every new row generated.

SELECT *
FROM employees AS e
WHERE e.salary > (
     SELECT AVG(salary)
     FROM employees
     WHERE dpt_id = e.dpt_id
)

It is also very possible to have a nested subquery. Lets select an employee's name and the average salary he received in the first quarter of the year.

SELECT
     e.name AS employee,
     (SELECT AVG(jan_pay + feb_pay + mar_pay),
      FROM salary AS s
      WHERE s.empl_id = e.id
          AND id IN (
               SELECT id 
               FROM salary 
               WHERE year == 2023
               ) AS avg_salary
FROM employees AS e
GROUP BY employee;

The subquery in the SELECT clause is dependent on the equality of the salary and the employees id, which is also dependent on the year column in the salary table.

Common Table Expressions(CTEs)

Subquries are a good way of bringing multiple unrelated tables together and also simplifying calculations. However, having multiple nested subqueries can be hard to read and understand as discussed above. That is why a solution to this can be using CTEs.
CTEs are declared ahead of the main query using a WITH statement and referenced later in the FROM statement.
It can be referenced within a SELECT, INSERT, UPDATE, or DELETE statement in SQL.

WITH s AS (
   SELECT emp_id, id,
         AVG(jan_pay + feb_pay + mar_pay) AS q1
   FROM salary
   WHERE year == 2023
   GROUP BY id
)
SELECT e.name AS employee,
       s.q1
FROM employees AS e
INNER JOIN s
ON e.id = s.emp_id

CTEs are better than subqueries because they are run only once and then stored in memory, hence reducing the amount of time required to run the query.
You can also have multiple CTEs, separated by a comma but make sure you join them in their respective order.

WITH max_salary AS (
  SELECT department_id, MAX(salary) AS max_salary
  FROM salary
  GROUP BY department_id
), 
avg_salary AS (
  SELECT department_id, AVG(salary) AS avg_salary
  FROM salary
  GROUP BY department_id
)
SELECT employees.*, max_salary.max_salary, avg_salary.avg_salary
FROM employees
JOIN max_salary ON employees.department_id = max_salary.department_id
JOIN avg_salary ON employees.department_id = avg_salary.department_id
WHERE employees.salary = max_salary.max_salary

Window Functions

Perform a set of operations that are somehow related to the current row, similar to group by but rows are not groupd into a single row.
They are called "window" functions because they operate on a "window" of rows, defined by an OVER clause, rather than on individual rows.
Some use cases of window functions include:

Fetching values before and after the current row.
Ranking row values.
Calculating averages See the example below:

SELECT transaction_id, 
       sales_amount, 
       SUM(sales_amount) OVER() AS total_sales
FROM sales_transactions

This sums the sales_amount per amount and displays it as total_sales.The first row will contain its own value in the total sales column, the second row will contain a sum of the first and second row and so on.
At any given point we might want to give each of our rows a value in order to easily access it. This can be done using the
ROW_NUMBER() function.

SELECT transaction_id, 
       sales_amount, 
       ROW_NUMBER() OVER() AS row_n
FROM sales_transactions

The output of the above query will contain an extra row called row_n with row numbers from 1 and so on.
It is possible that we want the values in the row_n columns to be ordered in a certain way. We can thus use the ORDER BY clause in OVER() function as shown:

SELECT transaction_id, 
       sales_amount, 
       ROW_NUMBER() OVER(ORDER BY sales_amount DESC) AS row_n
FROM sales_transactions

The row_n values will now be sorted starting from the one with the highest sales order to the lowest.

We can generate ranks instead of ordering as well using the RANK() window function.

SELECT transaction_id, 
       sales_amount, 
       SUM(sales_amount) RANK() OVER() AS total_sales
FROM sales_transactions

Note that window functions are produced after other queries have been processed. It uses the generated table to find its results.
As stated earlier we can also use window functions to compare the current row and the one before it. This is made possible using the LAG() function. Example:

Assuming we have a games table with year, champion,medals,match and other columns in it. We can find out which champion won recurrently.

SELECT 
      year,match, champion
      LAG(champion) OVER(
      ORDER BY year ASC) AS last_champion
FROM games
ORDER BY year ASC;

We can now compare the champion and last_champion columns and find recurrent champions.
The above query will produce the values for last_champion regardless of weather the matches are similar or not. For example if we had matches for hockey and rollball within the same year it will not provide the last_champion for each separate match but just generalize them.
To solve this , we use a PARTITION BY clause as shown below.

SELECT 
      year,match, champion
      LAG(champion) OVER(
      PARTITION BY match
      ORDER BY year ASC) AS last_champion
FROM games
ORDER BY year ASC;

Window partitions also group the result set of a row into a smaller group so that window functions can be applied to those subsets separately.
Lets look at another example:

SELECT department_id, 
       employee_id, 
       salary, 
       AVG(salary) OVER (PARTITION BY department_id
                              ) AS avg_salary
FROM employees

The results will be first partitioned then average salary calculated.

Sliding Windows

They enable us to perform calculations relative to the current window of the dataset.
The window is defined by a rage of rows that "slide " or move through the partition based on a specified ordering.
The syntax is as shown:

ROWS BETWEEN <start> AND <finish>

The start and finish can be replaced with PRECEDING(comes before), FOLLOWING(comes after), UNBOUNDED PRECEDING(all rows in the partition up to and including the current row), UNBOUNDED FOLLOWING, CURRENT ROW

SELECT date,
       jan_pay,
       SUM (jan_pay)
       OVER(ORDER BY date ROWS BETWEEN
            UNBOUNDED PRECEDING AND CURRENT ROW)
       AS jan_total
FROM salary
WHERE year = 2023;

Fetching

They include functions that enable us to get certain values from different parts of the table or partition into one row.
The LAG() function discussed above is also one of them.
The LEAD() function is used return the values after the current row. Lets use the LAG() example shown above.

SELECT 
      year,match, champion
      LEAD(champion) OVER(
      ORDER BY year ASC) AS following_champion
FROM games
ORDER BY year ASC;

The above query compares the current champion and the champion after him.
FIRST_VALUE returns the first value in the table or partition.

SELECT 
      year,match, champion
      FIRST_VALUE(champion) OVER(
      ORDER BY year ASC) AS first
FROM games
ORDER BY year ASC;

LAST_VALUE returns the last value in the table or partition.

SELECT 
      year,match, champion
      LAST_VALUE(champion) OVER(
      ORDER BY year ASC) AS last
FROM games
ORDER BY year ASC;

NOTE

A sliding window can also be defined as a frame. Without frame the LAST_VALUE() function would return the current row. Efficient example of the LAST_VALUE() would be:

SELECT 
      year,match, champion
      LAST_VALUE(champion) OVER(
      ORDER BY year ASC
      ROWS BETWEEN UNBOUNDED PRECEDING
      AND CURRENT ROW)) AS last
FROM games
ORDER BY year ASC;

There are two types of frames, RANGE BETWEEN and ROWS BETWEEN. RANGE BETWEEN however lists duplicate values when used with the OVER's ORDER BY subclause as a single entity. Hence ROWS BETTWEEN is mostly used over RANGE BETWEEN.

Ranking

It involves arranging the data in a partition or table in a particular order.
The ranking functions include:
ROW_NUMBER(): It assigns each row a unique number even if the row values are the same. Check the window functions example for more clarification.

RANK(): It orders the values of a table or partition giving rows with identical values the same number and skipping one number over to the next value.

DENSE_RANK(): It ranks values in a particular table or partition giving the same number for identical values but does not skip a number to the next value, hence mostly preferred.

SELECT transaction_id, 
       sales_amount, 
       SUM(sales_amount) DENSE_RANK() OVER() AS rank_sales
FROM sales_transactions

This will rank the total sales giving identical values the same rank then move to the next total_sales value without skipping a number.

transaction_id	sales_amount	rank_sales
iutio	64	1
ghjkl	48	2
frtyu	48	2
lkjhg	36	3

Paging

This includes splitting data into approximately equal chunks which enables us to easily analyze data piece by piece.
Paging in sql is done using the NTILE(n) which splits data into approximately n equal pages.

SELECT transaction_id, 
       sales_amount, 
       NTILE(15) OVER() AS page1
FROM sales_transactions

In the above query the rows are divided into 15 buckets when called.

Pivoting

It makes it easier to understand and analyze data by converting columns into rows. It is very valuable when preparing data for visualization.

CREATE EXTENSION IF NOT EXISTS tablefunc;
SELECT * FROM CROSSTAB ( $$
     SELECT country,
            year,
            COUNT(*) :: INTEGER AS awards
     FROM medals
     WHERE year IN (2022, 2023)
     GROUP BY country, year
     ORDER BY country ASC;
      $$) AS ct ( country VARCHAR, 
                 "2022" INTEGER,
                  "2023" INTEGER)
 ORDER BY country ASC;

The above query converts the country and year column in the medal table into rows and splits the year into 2022 and 2023.
Each row is also given their respective datatype.

Conclusion

In conclusion, this article has covered several intermediate SQL topics that can help you gain a deeper understanding of SQL and data manipulation. By mastering these concepts, you will be able to write more complex SQL queries and work with large datasets more efficiently. I hope you found this article helpful in your SQL journey and that it has provided you with valuable insights and knowledge. Keep practicing and exploring the world of SQL to become a proficient data analyst or developer. Stay curious and stay golden!

DEV Community