DEV Community

MD ARIFUL HAQUE
MD ARIFUL HAQUE

Posted on

Understanding GROUP BY vs. PARTITION BY in SQL

Both GROUP BY and PARTITION BY are used to organize and work with data, but they serve different purposes and are used in different contexts. Let’s dive into the details of each and explore how they differ.


1. GROUP BY

The GROUP BY clause is used to aggregate data across multiple rows into groups based on column(s) and apply aggregate functions like COUNT(), SUM(), AVG(), etc., to each group. It reduces the number of rows by aggregating them.

How GROUP BY Works:

  • Purpose: To group rows that have the same values in specified columns and apply aggregate functions to the grouped rows.
  • Output: One row for each group.
  • Aggregate Functions: Common functions used with GROUP BY include COUNT(), SUM(), MIN(), MAX(), and AVG().

Example of GROUP BY:

SELECT department, COUNT(employee_id) AS total_employees
FROM employees
GROUP BY department;
Enter fullscreen mode Exit fullscreen mode

In this example:

  • What it does: Groups employees by department and counts the total number of employees in each department.
  • Result: A single row for each department, showing the department name and the count of employees.

Key Points:

  • GROUP BY reduces the number of rows in the result set.
  • It’s typically used with aggregate functions to summarize or aggregate the data.
  • It cannot be used with window functions.

2. PARTITION BY

The PARTITION BY clause is used with window functions to divide result sets into partitions (similar to groups), but it does not collapse rows like GROUP BY. Instead, it allows the application of aggregate functions to individual rows while still keeping the original data intact.

How PARTITION BY Works:

  • Purpose: To define partitions (groups) of data within a result set so that window functions (e.g., ROW_NUMBER(), RANK(), SUM(), AVG()) can be applied to each partition.
  • Output: Each row retains its individuality, but calculations like running totals, ranks, or averages are applied within each partition.
  • Window Functions: Common window functions used with PARTITION BY include ROW_NUMBER(), RANK(), NTILE(), LAG(), LEAD(), and windowed aggregates like SUM().

Example of PARTITION BY:

SELECT employee_id, department, salary,
       RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rank_in_dept
FROM employees;
Enter fullscreen mode Exit fullscreen mode

In this example:

  • What it does: Partitions the data by department and ranks employees within each department based on their salary, with the highest-paid employee getting the rank 1.
  • Result: The original list of employees is preserved, but now each employee has an additional column showing their rank within their department.

Key Points:

  • PARTITION BY is used with window functions and does not reduce the number of rows.
  • It allows calculations like cumulative sums, running totals, and rankings within each partition.
  • Unlike GROUP BY, PARTITION BY keeps all rows visible in the result set while applying the function to partitions.

Key Differences Between GROUP BY and PARTITION BY:

Aspect GROUP BY PARTITION BY
Purpose Group rows and aggregate data Partition rows for window functions
Result Reduces the number of rows Keeps all rows, applies calculations within partitions
Used With Aggregate functions (COUNT(), SUM(), etc.) Window functions (ROW_NUMBER(), RANK(), SUM(), etc.)
Common Use Case Summarizing data (e.g., total sales per category) Ranking or applying calculations over individual rows while retaining the dataset
Row Visibility Shows one row per group Shows all rows
Performance Often reduces rows, potentially faster Retains rows, can be computationally heavier for large datasets
Order of Operation Aggregation happens first, then filtering via HAVING Calculation happens over the result set, can be used with ORDER BY for complex ranking

Use Cases

When to Use GROUP BY:

  • Summarizing data: If you need to calculate total sales per customer, count employees per department, or find the average salary per team, GROUP BY is the right choice.
  • Reducing data: When you want to collapse multiple rows into a single summary row per group.

When to Use PARTITION BY:

  • Ranking data: If you want to rank employees based on their salary within each department without losing individual employee rows, use PARTITION BY.
  • Running totals/averages: When you need cumulative calculations like a running total, PARTITION BY enables this while keeping all rows visible.
  • Comparing data within groups: You can compare rows within partitions (e.g., calculating the difference in salary between the highest and lowest paid employees in each department).

Example: GROUP BY vs PARTITION BY in Action

GROUP BY Example: Summarizing sales per region

SELECT region, SUM(sales) AS total_sales
FROM sales_data
GROUP BY region;
Enter fullscreen mode Exit fullscreen mode
  • Result: One row per region showing the total sales for that region.

PARTITION BY Example: Ranking salespeople by performance within regions

SELECT salesperson_id, region, sales,
       RANK() OVER (PARTITION BY region ORDER BY sales DESC) AS rank_in_region
FROM sales_data;
Enter fullscreen mode Exit fullscreen mode
  • Result: All salespeople are shown, but they are ranked within their regions based on sales performance.

Conclusion

  • GROUP BY is used for summarizing and aggregating data, reducing rows by collapsing them into groups.
  • PARTITION BY is used with window functions to perform calculations (like ranking, running totals, etc.) on partitions of data without reducing the number of rows.

Both are essential for different types of data processing tasks in SQL, and understanding when to use each is key to efficient querying.

Top comments (0)