DEV Community

loading...
Cover image for Oracle SQL Analytic Functions - Part 1

Oracle SQL Analytic Functions - Part 1

Mark Dsouza
Taking control of my career & constantly learning to be a better Full Stack Developer.
・10 min read

Do you have a bunch of raw data stored in your Oracle Database and do you need to derive something meaningful out of the raw data?

Depending on your DB structure, you might only need basic Joins and Filters. For more complex data sets and reporting requirements, you will probably need to use Analytic functions to convert your data into something more meaningful.
There are a huge bunch of these functions and using them can become a little bit tricky if it's the first time you are working on some sort of DB data manipulation.

In this series of posts, I'm going to try and slowly expose you to the basic analytic functions(at least the commonly used ones), describe them and break down how you can really configure them to suit your exact needs.
By the end of this, I hope to empower you with the knowledge to convert Data set A(your source data) into Data set B(your target data)

Part 1 will look at the 3 easier and commonly used functions: rank(), row_number() and dense_rank()

Let's take this sample Data set to explain these functions

We have a Table EMPLOYEE with 7 rows of data.
Let's run a basic select statement to see what data is present in this table

select EMP.* from EMPLOYEE EMP
Enter fullscreen mode Exit fullscreen mode
EMPLOYEE_ID EMPLOYEE_NAME SALARY
100 MICHAEL 9000
101 PAM 6000
102 KEVIN 6000
103 DWIGHT 7500
104 ANGELA 6000
105 JIM 7500
106 ANDY 4000

Here EMPLOYEE_ID is the primary key of the table. We have 7 employes with their EMPLOYEE_NAME and SALARY stored.
All 3 functions will assign a number to each row based on the criteria we give it, but how it gets assigned for edge cases is what is different about each. Let's take a look at them one by one.

rank()

The syntax for applying this would be

select EMP.*, 
rank() over(order by SALARY desc) as SAL_RANK 
from EMPLOYEE EMP
Enter fullscreen mode Exit fullscreen mode

Now let's take a look at what we just wrote - we called the rank() function but we also mentioned some additional text over (order by SALARY).
So why do we need this and what are we doing here?
We basically need to specify how to do the ranking so Oracle knows how rank() needs to be applied.
over is basically telling Oracle - Over what dataset and how to apply the rank. This is going to be used for all analaytic functions to add configuration.
order by SALARY tells Oracle to sort the data in ascending order. By default all sorts are ascending.
The 3 ways of writing this are
rank() over (order by SALARY)
rank() over (order by SALARY ASC)
rank() over (order by SALARY DESC)
Since the default sort is Ascending, Line 1 and Line 2 would give you the same result.

Now that we have gone over the syntax. Let's look at what the target dataset looks like after calling this and what really happens.

select EMP.*, 
rank() over(order by SALARY desc) as SAL_RANK 
from EMPLOYEE EMP
Enter fullscreen mode Exit fullscreen mode
EMPLOYEE_ID EMPLOYEE_NAME SALARY SAL_RANK
100 MICHAEL 9000 1
101 PAM 6000 4
102 KEVIN 6000 4
103 DWIGHT 7500 2
104 ANGELA 6000 4
105 JIM 7500 2
106 ANDY 4000 7

Let me order the dataset as per the Rank just to understand the data better.

select * from 
    (select EMP.*, 
    rank() over(order by SALARY desc) as SAL_RANK 
    from EMPLOYEE EMP)
order by SAL_RANK 
Enter fullscreen mode Exit fullscreen mode
EMPLOYEE_ID EMPLOYEE_NAME SALARY SAL_RANK
100 MICHAEL 9000 1
105 JIM 7500 2
103 DWIGHT 7500 2
102 KEVIN 6000 4
104 ANGELA 6000 4
101 PAM 6000 4
106 ANDY 4000 7

As mentioned, this function assigns a number to each and every row.
We also see that if there are rows in our table with the same value for Salary, they get assigned the same rank!
Lets quickly look at our data. We have 1 employee MICHAEL who gets paid 9000 and is assigned Rank 1. We have 2 Employees JIM and DWIGHT who are both assigned Rank 2 because they are both earning 7500. Now, since we have both of them assigned to 2, the rank function skips assigning a value 3 and next assigns 4 to whoever is next in the data after sorting. We see there are 3 people with salary 6000. All 3 are assigned the Rank 4. Finally Rank 7 is assigned to the person with 4000.
Say in this case ANDY was not a part of your dataset, the maximum rank would be 4. Hence it is not necessary that the lowest rank equals the total number of rows in the dataset.

dense_rank()

The syntax for applying dense_rank() would be similar to rank()

select EMP.*, 
rank() over(order by SALARY desc) as SAL_RANK,
dense_rank() over(order by SALARY desc) as SAL_DENSE_RANK
from EMPLOYEE EMP
Enter fullscreen mode Exit fullscreen mode

Dense Rank is very similar to Rank. The only difference is that it does not skip any numbers when assigning a value to a row. Hence, for our example, even though we have 7 employees, our dense_rank value will be 1,2,3 and 4.

EMPLOYEE_ID EMPLOYEE_NAME SALARY SAL_RANK SAL_DENSE_RANK
100 MICHAEL 9000 1 1
105 JIM 7500 2 2
103 DWIGHT 7500 2 2
102 KEVIN 6000 4 3
104 ANGELA 6000 4 3
101 PAM 6000 4 3
106 ANDY 4000 7 4

I personally have never used dense_rank but it is a very common basic Oracle interview question. So it surely is good to keep in mind.

So where can dense_rank() be used?
Say we want to find the 3rd highest salary being paid (NOT the person with the 3rd highest salary. Just the 3rd Highest salary as number)

select * from (
    select EMP.*, 
    dense_rank() over (order by SALARY desc) as SAL_DENSE_RANK 
    from EMPLOYEE EMP
) where SAL_DENSE_RANK = 3
Enter fullscreen mode Exit fullscreen mode
EMPLOYEE_ID EMPLOYEE_NAME SALARY SAL_DENSE_RANK
102 KEVIN 6000
104 ANGELA 6000
101 PAM 6000

We would get all 3 rows. The 3rd highest salary is 6000 (After 9k to MICHAEL and 7.5k to JIM and DWIGHT)

In case we want to get the employee who is paid the 3rd Highest amount in the employee table, we could write the query

select * from 
    (select EMP.*, 
    rank() over(order by SALARY desc) as SAL_RANK 
    from EMPLOYEE EMP) 
WHERE SAL_RANK = 3;
Enter fullscreen mode Exit fullscreen mode

Now, this will return 0 rows.
Why? You guessed it. It's because no row is actually assigned the value 3 for its rank even though we have 7 employees.

If we try this with dense_rank,

select * from 
    (select EMP.*,
    dense_rank() over(order by SALARY desc) as DENSE_SAL_RANK
    from EMPLOYEE EMP)
WHERE DENSE_SAL_RANK = 3
Enter fullscreen mode Exit fullscreen mode

We would find a value 3, but we would get 2 rows but these are employees with salary 6000. And by looking at the dataset we know 3 is either Jim or Dwight.
This is the perfect use case of the last function we are looking at.

row_number()

Let's see it in action and how it assigns data differently for the exact same dataset

select EMP.*,
rank() over(order by SALARY desc) as SAL_RANK,
dense_rank() over(order by SALARY desc) as SAL_DENSE_RANK,
row_number() over(order by SALARY desc) as SAL_ROW_NUMBER
from EMPLOYEE EMP
Enter fullscreen mode Exit fullscreen mode
EMPLOYEE_ID EMPLOYEE_NAME SALARY SAL_RANK SAL_DENSE_RANK SAL_ROW_NUMBER
100 MICHAEL 9000 1 1 1
105 JIM 7500 2 2 2
103 DWIGHT 7500 2 2 3
102 KEVIN 6000 4 3 5
104 ANGELA 6000 4 3 4
101 PAM 6000 4 3 6
106 ANDY 4000 7 4 7

So what happens when 2 rows have the same value? row_number randomly assigns a value to each of these conflicting rows. This means the next time you run your query, you could be seeing JIM with 3 and DWIGHT with 2.

So for the use case we were talking about - finding the person with the 3rd Highest Salary in the company, the below query would work perfectly.

select * from 
    (select EMP.*,
    row_number() over(order by SALARY desc) as SAL_ROW_NUMBER 
    from EMPLOYEE EMP) 
WHERE SAL_ROW_NUMBER = 3;
Enter fullscreen mode Exit fullscreen mode
EMPLOYEE_ID EMPLOYEE_NAME SALARY SAL_ROW_NUMBER
103 DWIGHT 7500 3

But(yes there is a but), this can lead to some issues when it comes to reports.
Just for the above case, sometimes you get JIM, sometimes DWIGHT. And that is surely not ideal.

Using Multiple Sort Conditions

Since we do not get a consistent match, we could set an additional sort condition. Let's assume EMPLOYEE_ID is assigned when an employee joins the organization. So a lesser employee_Id means they have been the firm longer. We could add an additional SORT based this so in case the Salary the same, seniority is the next sort.
Note : It is not ideal to use the EMPLOYEE_ID in the sort condition. If there was a column of joining date, that would be more appropriate to be sorted on.

select * from 
    (select EMP.*, 
    row_number() over(order by SALARY desc, EMPLOYEE_ID asc) as SAL_ROW_NUMBER 
    from EMPLOYEE EMP) 
WHERE SAL_ROW_NUMBER = 3;
Enter fullscreen mode Exit fullscreen mode
EMPLOYEE_ID EMPLOYEE_NAME SALARY SAL_ROW_NUMBER
105 JIM 7500 2

Note: it is always safe to sort your data such that the dataset never changes each time the query runs. This eliminates any randomness in behavior which can eventually lead to issues later on.
By Doing so, something interesting happens. We see that rank() = row_number() = dense_rank()

Running the same query with an additional sort condition

select EMP.*, 
rank() over(order by SALARY desc, EMPLOYEE_ID asc) as SAL_RANK,
dense_rank() over(order by SALARY desc, EMPLOYEE_ID asc) as SAL_DENSE_RANK,
row_number() over(order by SALARY desc, EMPLOYEE_ID asc) as SAL_ROW_NUMBER  
from EMPLOYEE EMP
Enter fullscreen mode Exit fullscreen mode
EMPLOYEE_ID EMPLOYEE_NAME SALARY SAL_RANK SAL_DENSE_RANK SAL_ROW_NUMBER
100 MICHAEL 9000 1 1 1
103 DWIGHT 7500 2 2 2
105 JIM 7500 3 3 3
101 PAM 6000 4 4 4
102 KEVIN 6000 5 5 5
104 ANGELA 6000 6 6 6
106 ANDY 4000 7 7 7

Well will you look at that ! Just by adding additional sort conditions, we are able to uniquely assign a value to each and every row. And because of this we do not see any difference in the value assigned by rank(), row_number() and dense_rank() since there are no conflicts.
The real difference between these functions come into play ONLY when 2 or more rows have the same value for the row that is being sorted on.

Using partition by

Phew. Now that that's out of the way, let's keep it up a notch shall we?
So far all we have done is rank the entire Table. But Oracle allows us to assign ranks for smaller subsections as well.
Let's add another column to the base dataset and change some numbers that we had to see the use of partitioning.

EMPLOYEE_ID EMPLOYEE_NAME SALARY TEAM
100 MICHAEL 9000 LEADERSHIP
101 PAM 6000 SALES
102 KEVIN 5000 ACCOUNTING
103 DWIGHT 7500 SALES
104 ANGELA 6000 ACCOUNTING
105 JIM 7500 SALES
106 ANDY 4000 SALES

We have 3 Teams in the Table - Leadership, Sales and Accounting.
Now, what Oracle lets us do is basically apply all the above 3 functions only for each team.
Let's look at how rank() would work with this dataset

select * from (
    select EMP.*, 
    rank() over (order by SALARY desc) as SAL_RANK,  
    rank() over (partition by TEAM order by SALARY desc) as TEAM_SAL_RANK
    from EMPLOYEE EMP
) order by TEAM
Enter fullscreen mode Exit fullscreen mode
EMPLOYEE_ID EMPLOYEE_NAME SALARY TEAM SAL_RANK TEAM_SAL_RANK
104 ANGELA 6000 ACCOUNTING 4 1
102 KEVIN 5000 ACCOUNTING 6 2
100 MICHAEL 9000 LEADERSHIP 1 1
103 DWIGHT 7500 SALES 2 1
105 JIM 7500 SALES 2 1
101 PAM 6000 SALES 4 3
106 ANDY 4000 SALES 7 4

I have ordered the dataset by the TEAM name. If you see, for each team I have my own set of rankS. Accounting has a rank 1 and 2. LEADERSHIP is a single employee so there is only 1 rank. There are bunch of people in the SALES team, and there are 2 with the same highest salary, so both get assigned 1.

Seeing what row_number would do in the same situation

select * from (
    select EMP.*, 
    row_number() over (order by SALARY desc) as SAL_ROW_NUMBER,  
    row_number() over (partition by TEAM order by SALARY desc) as TEAM_SAL_ROW_NUMBER
    from EMPLOYEE EMP
) order by TEAM
Enter fullscreen mode Exit fullscreen mode
EMPLOYEE_ID EMPLOYEE_NAME SALARY TEAM SAL_ROW_NUMBER TEAM_SAL_ROW_NUMBER
104 ANGELA 6000 ACCOUNTING 4 1
102 KEVIN 5000 ACCOUNTING 6 2
100 MICHAEL 9000 LEADERSHIP 1 1
103 DWIGHT 7500 SALES 2 1
105 JIM 7500 SALES 3 2
101 PAM 6000 SALES 5 3
106 ANDY 4000 SALES 7 4

So what if we wanted the details of the highest salary per Team? If you didn't have analytic functions you would probably need to know what are the various teams and then write a bunch of the below statements.

select EMP.* 
from EMPLOYEE 
WHERE TEAM = "SALES" 
AND SALARY = 
    (select MAX(SALARY) 
    from EMPLOYEE 
    WHERE TEAM = "SALES");
Enter fullscreen mode Exit fullscreen mode

Doing this without analytic functions gets exponentially harder when you want say the 2nd or 3rd highest Salary.

Using analytic functions, irrespective of the number of teams, I can do the below

select * from (
    select EMP.*, 
    row_number() over (partition by TEAM order by SALARY desc) as TEAM_SAL_ROW_NUMBER
    from EMPLOYEE EMP
) WHERE  TEAM_SAL_ROW_NUMBER = 1;
Enter fullscreen mode Exit fullscreen mode
EMPLOYEE_ID EMPLOYEE_NAME SALARY TEAM TEAM_SAL_ROW_NUMBER
104 ANGELA 6000 ACCOUNTING 1
100 MICHAEL 9000 LEADERSHIP 1
103 DWIGHT 7500 SALES 1

Handling NULL

Another common edge case that can lead to random or unexpected behaviour is when we are trying to sort on a column, and some rows have the value NULL. In such cases, it is advisable to add an NVL condition to the sort statement so as to correctly order the data. In the above example of salary, in case due to an error in the system 1 row has a value NULL, and you sorting the salarys decending, you could use the below work around.

select EMP.*, 
rank() over(order by nvl(SALARY,0) desc) as SAL_RANK 
from EMPLOYEE EMP
Enter fullscreen mode Exit fullscreen mode

If any value is NULL, it will substitue the value with 0 only for the sorting. This will handle the situation where any row does not have salary populated.


Questions

Below are a few examples for you to try out to see if you have got the hang of it.

Source Data set

EMPLOYEE_ID EMPLOYEE_NAME SALARY TEAM
100 MICHAEL 9000 LEADERSHIP
101 PAM 6000 SALES
102 KEVIN 5000 ACCOUNTING
103 DWIGHT 7500 SALES
104 ANGELA 6000 ACCOUNTING
105 JIM 7500 SALES
106 ANDY 4000 SALES

Question 1 :

Write a query using an Analytic Function to get the details of the employee with the 2nd lowest Salary

Question 2 :

Write a query using an Analytic Function to get the details of the employee with the highest salary from the group of employees who earn less than or equal to 7000


select *, 
rank() over(order by SALARY desc) as SAL_RANK,
dense_rank() over(order by SALARY desc) as SAL_DENSE_RANK,
row_number() over(order by SALARY desc) as SAL_ROW_NUMBER
from EMPLOYEE
WHERE SAL_RANK= 1 
Enter fullscreen mode Exit fullscreen mode

Question 3

Is the above syntax correct?


I would love comments, feedback on this article and suggestions for future ones. I hope this helps someone somewhere!

Discussion (0)