Do you have a bunch of raw data stored in your Oracle Database and do you need to derive something meaningful out of the raw data?
Depending on your DB structure, you might only need basic Joins and Filters. For more complex data sets and reporting requirements, you will probably need to use Analytic functions to convert your data into something more meaningful.
There are a huge bunch of these functions and using them can become a little bit tricky if it's the first time you are working on some sort of DB data manipulation.
In this series of posts, I'm going to try and slowly expose you to the basic analytic functions(at least the commonly used ones), describe them and break down how you can really configure them to suit your exact needs.
By the end of this, I hope to empower you with the knowledge to convert Data set A(your source data) into Data set B(your target data)
Part 1 will look at the 3 easier and commonly used functions: rank(), row_number() and dense_rank()
Let's take this sample Data set to explain these functions
We have a Table EMPLOYEE with 7 rows of data.
Let's run a basic select statement to see what data is present in this table
select EMP.* from EMPLOYEE EMP
Here EMPLOYEE_ID is the primary key of the table. We have 7 employes with their EMPLOYEE_NAME and SALARY stored.
All 3 functions will assign a number to each row based on the criteria we give it, but how it gets assigned for edge cases is what is different about each. Let's take a look at them one by one.
The syntax for applying this would be
select EMP.*, rank() over(order by SALARY desc) as SAL_RANK from EMPLOYEE EMP
Now let's take a look at what we just wrote - we called the
rank() function but we also mentioned some additional text
over (order by SALARY).
So why do we need this and what are we doing here?
We basically need to specify how to do the ranking so Oracle knows how rank() needs to be applied.
over is basically telling Oracle - Over what dataset and how to apply the rank. This is going to be used for all analaytic functions to add configuration.
order by SALARY tells Oracle to sort the data in ascending order. By default all sorts are ascending.
The 3 ways of writing this are
rank() over (order by SALARY)
rank() over (order by SALARY ASC)
rank() over (order by SALARY DESC)
Since the default sort is Ascending, Line 1 and Line 2 would give you the same result.
Now that we have gone over the syntax. Let's look at what the target dataset looks like after calling this and what really happens.
select EMP.*, rank() over(order by SALARY desc) as SAL_RANK from EMPLOYEE EMP
Let me order the dataset as per the Rank just to understand the data better.
select * from (select EMP.*, rank() over(order by SALARY desc) as SAL_RANK from EMPLOYEE EMP) order by SAL_RANK
As mentioned, this function assigns a number to each and every row.
We also see that if there are rows in our table with the same value for Salary, they get assigned the same rank!
Lets quickly look at our data. We have 1 employee MICHAEL who gets paid 9000 and is assigned Rank 1. We have 2 Employees JIM and DWIGHT who are both assigned Rank 2 because they are both earning 7500. Now, since we have both of them assigned to 2, the rank function skips assigning a value 3 and next assigns 4 to whoever is next in the data after sorting. We see there are 3 people with salary 6000. All 3 are assigned the Rank 4. Finally Rank 7 is assigned to the person with 4000.
Say in this case ANDY was not a part of your dataset, the maximum rank would be 4. Hence it is not necessary that the lowest rank equals the total number of rows in the dataset.
The syntax for applying dense_rank() would be similar to rank()
select EMP.*, rank() over(order by SALARY desc) as SAL_RANK, dense_rank() over(order by SALARY desc) as SAL_DENSE_RANK from EMPLOYEE EMP
Dense Rank is very similar to Rank. The only difference is that it does not skip any numbers when assigning a value to a row. Hence, for our example, even though we have 7 employees, our dense_rank value will be 1,2,3 and 4.
I personally have never used dense_rank but it is a very common basic Oracle interview question. So it surely is good to keep in mind.
So where can dense_rank() be used?
Say we want to find the 3rd highest salary being paid (NOT the person with the 3rd highest salary. Just the 3rd Highest salary as number)
select * from ( select EMP.*, dense_rank() over (order by SALARY desc) as SAL_DENSE_RANK from EMPLOYEE EMP ) where SAL_DENSE_RANK = 3
We would get all 3 rows. The 3rd highest salary is 6000 (After 9k to MICHAEL and 7.5k to JIM and DWIGHT)
In case we want to get the employee who is paid the 3rd Highest amount in the employee table, we could write the query
select * from (select EMP.*, rank() over(order by SALARY desc) as SAL_RANK from EMPLOYEE EMP) WHERE SAL_RANK = 3;
Now, this will return 0 rows.
Why? You guessed it. It's because no row is actually assigned the value 3 for its rank even though we have 7 employees.
If we try this with dense_rank,
select * from (select EMP.*, dense_rank() over(order by SALARY desc) as DENSE_SAL_RANK from EMPLOYEE EMP) WHERE DENSE_SAL_RANK = 3
We would find a value 3, but we would get 2 rows but these are employees with salary 6000. And by looking at the dataset we know 3 is either Jim or Dwight.
This is the perfect use case of the last function we are looking at.
Let's see it in action and how it assigns data differently for the exact same dataset
select EMP.*, rank() over(order by SALARY desc) as SAL_RANK, dense_rank() over(order by SALARY desc) as SAL_DENSE_RANK, row_number() over(order by SALARY desc) as SAL_ROW_NUMBER from EMPLOYEE EMP
So what happens when 2 rows have the same value? row_number randomly assigns a value to each of these conflicting rows. This means the next time you run your query, you could be seeing JIM with 3 and DWIGHT with 2.
So for the use case we were talking about - finding the person with the 3rd Highest Salary in the company, the below query would work perfectly.
select * from (select EMP.*, row_number() over(order by SALARY desc) as SAL_ROW_NUMBER from EMPLOYEE EMP) WHERE SAL_ROW_NUMBER = 3;
But(yes there is a but), this can lead to some issues when it comes to reports.
Just for the above case, sometimes you get JIM, sometimes DWIGHT. And that is surely not ideal.
Since we do not get a consistent match, we could set an additional sort condition. Let's assume EMPLOYEE_ID is assigned when an employee joins the organization. So a lesser employee_Id means they have been the firm longer. We could add an additional SORT based this so in case the Salary the same, seniority is the next sort.
Note : It is not ideal to use the EMPLOYEE_ID in the sort condition. If there was a column of joining date, that would be more appropriate to be sorted on.
select * from (select EMP.*, row_number() over(order by SALARY desc, EMPLOYEE_ID asc) as SAL_ROW_NUMBER from EMPLOYEE EMP) WHERE SAL_ROW_NUMBER = 3;
Note: it is always safe to sort your data such that the dataset never changes each time the query runs. This eliminates any randomness in behavior which can eventually lead to issues later on.
By Doing so, something interesting happens. We see that rank() = row_number() = dense_rank()
Running the same query with an additional sort condition
select EMP.*, rank() over(order by SALARY desc, EMPLOYEE_ID asc) as SAL_RANK, dense_rank() over(order by SALARY desc, EMPLOYEE_ID asc) as SAL_DENSE_RANK, row_number() over(order by SALARY desc, EMPLOYEE_ID asc) as SAL_ROW_NUMBER from EMPLOYEE EMP
Well will you look at that ! Just by adding additional sort conditions, we are able to uniquely assign a value to each and every row. And because of this we do not see any difference in the value assigned by rank(), row_number() and dense_rank() since there are no conflicts.
The real difference between these functions come into play ONLY when 2 or more rows have the same value for the row that is being sorted on.
Phew. Now that that's out of the way, let's keep it up a notch shall we?
So far all we have done is rank the entire Table. But Oracle allows us to assign ranks for smaller subsections as well.
Let's add another column to the base dataset and change some numbers that we had to see the use of partitioning.
We have 3 Teams in the Table - Leadership, Sales and Accounting.
Now, what Oracle lets us do is basically apply all the above 3 functions only for each team.
Let's look at how rank() would work with this dataset
select * from ( select EMP.*, rank() over (order by SALARY desc) as SAL_RANK, rank() over (partition by TEAM order by SALARY desc) as TEAM_SAL_RANK from EMPLOYEE EMP ) order by TEAM
I have ordered the dataset by the TEAM name. If you see, for each team I have my own set of rankS. Accounting has a rank 1 and 2. LEADERSHIP is a single employee so there is only 1 rank. There are bunch of people in the SALES team, and there are 2 with the same highest salary, so both get assigned 1.
Seeing what row_number would do in the same situation
select * from ( select EMP.*, row_number() over (order by SALARY desc) as SAL_ROW_NUMBER, row_number() over (partition by TEAM order by SALARY desc) as TEAM_SAL_ROW_NUMBER from EMPLOYEE EMP ) order by TEAM
So what if we wanted the details of the highest salary per Team? If you didn't have analytic functions you would probably need to know what are the various teams and then write a bunch of the below statements.
select EMP.* from EMPLOYEE WHERE TEAM = "SALES" AND SALARY = (select MAX(SALARY) from EMPLOYEE WHERE TEAM = "SALES");
Doing this without analytic functions gets exponentially harder when you want say the 2nd or 3rd highest Salary.
Using analytic functions, irrespective of the number of teams, I can do the below
select * from ( select EMP.*, row_number() over (partition by TEAM order by SALARY desc) as TEAM_SAL_ROW_NUMBER from EMPLOYEE EMP ) WHERE TEAM_SAL_ROW_NUMBER = 1;
Another common edge case that can lead to random or unexpected behaviour is when we are trying to sort on a column, and some rows have the value NULL. In such cases, it is advisable to add an NVL condition to the sort statement so as to correctly order the data. In the above example of salary, in case due to an error in the system 1 row has a value NULL, and you sorting the salarys decending, you could use the below work around.
select EMP.*, rank() over(order by nvl(SALARY,0) desc) as SAL_RANK from EMPLOYEE EMP
If any value is NULL, it will substitue the value with 0 only for the sorting. This will handle the situation where any row does not have salary populated.
Below are a few examples for you to try out to see if you have got the hang of it.
Source Data set
Write a query using an Analytic Function to get the details of the employee with the 2nd lowest Salary
Write a query using an Analytic Function to get the details of the employee with the highest salary from the group of employees who earn less than or equal to 7000
select *, rank() over(order by SALARY desc) as SAL_RANK, dense_rank() over(order by SALARY desc) as SAL_DENSE_RANK, row_number() over(order by SALARY desc) as SAL_ROW_NUMBER from EMPLOYEE WHERE SAL_RANK= 1
Is the above syntax correct?
I would love comments, feedback on this article and suggestions for future ones. I hope this helps someone somewhere!