Removing duplicates from a table can be a frustrating process. Various methods for identifying, and removing records exist. Creating a query to remove certain records and keep others requires lots of careful analysis. I recently implemented a time saving, simple solution using SQL Server's RANK
function to identify and mark records for deletion within a duplicate set.
The SQL Server RANK function returns the rank, or row number, of each row within the partitioned grouping of a result set. I'm not going to dive into the RANK function in this article. You can find more information about it here.
Using a query with RANK, I create a query that creates sets of duplicate records and assigns a number to each row in duplicate set. The row number will reset for each duplicate set in the table. I then use the number to identify rows to be deleted.
Let's set up a quick sample. Run the queries below to create a table which contains duplicate records. The test data sets up a scenario where duplicate orders were imported on a specific date, 4/20/2020.
create table duplicates (
Id int identity(1,1)
,CustomerId int
,OrderAmount numeric(7,2)
,OrderDate datetime
)
insert into duplicates (CustomerId, OrderAmount, OrderDate)
values (1, 100.45, '4/20/2020'), (1, 100.45, '4/20/2020'), (1, 33.12, '4/14/2020'), (1, 75.45, '4/2/2020'),
(2, 10.99, '4/20/2020'), (2, 10.99, '4/20/2020'), (2, 100.45, '4/14/2020'), (2, 12.75, '4/2/2020'),
(3, 55.66, '4/20/2020'), (3, 55.66, '4/20/2020'), (3, 31.44, '4/14/2020'), (3, 85.22, '4/2/2020'),
(4, 46.89, '4/20/2020'), (5, 88.33, '4/20/2020'), (6, 90.13, '4/14/2020'), (7, 3.77, '4/2/2020')
select * from duplicates
As you can see from the query results, there are duplicate orders on 4/20/2020 for CustomerId's 1, 2, and 3.
Id | CustomerId | OrderAmount | OrderDate |
---|---|---|---|
1 | 1 | 100.45 | Apr 20 2020 |
2 | 1 | 100.45 | Apr 20 2020 |
3 | 1 | 33.12 | Apr 14 2020 |
4 | 1 | 75.45 | Apr 2 2020 |
5 | 2 | 10.99 | Apr 20 2020 |
6 | 2 | 10.99 | Apr 20 2020 |
7 | 2 | 100.45 | Apr 14 2020 |
8 | 2 | 12.75 | Apr 2 2020 |
9 | 3 | 55.66 | Apr 20 2020 |
10 | 3 | 55.66 | Apr 20 2020 |
11 | 3 | 31.44 | Apr 14 2020 |
12 | 3 | 85.22 | Apr 2 2020 |
13 | 4 | 46.89 | Apr 20 2020 |
14 | 5 | 88.33 | Apr 20 2020 |
15 | 6 | 90.13 | Apr 14 2020 |
16 | 7 | 3.77 | Apr 2 2020 |
To identify the duplicate records, I create a query which uses the RANK function to assign a number to each record in each duplicate set. The PARTITION BY
clause specifies groupings of data which the function is applied. The query creates groupings of CustomerId and OrderAmount. For each grouping, a new number is assigned. In the query below, the new number is aliased as RowNumberWithinDuplicateSet.
SELECT Id, CustomerId, OrderAmount, OrderDate,
RANK() OVER (PARTITION BY CustomerId, OrderAmount ORDER BY Id DESC) AS RowNumberWithinDuplicateSet
FROM duplicates
WHERE OrderDate = '4/20/2020'
The query returns the following results:
Id | CustomerId | OrderAmount | OrderDate | RowNumberWithinDuplicateSet |
---|---|---|---|---|
2 | 1 | 100.45 | Apr 20 2020 | 1 |
1 | 1 | 100.45 | Apr 20 2020 | 2 |
6 | 2 | 10.99 | Apr 20 2020 | 1 |
5 | 2 | 10.99 | Apr 20 2020 | 2 |
10 | 3 | 55.66 | Apr 20 2020 | 1 |
9 | 3 | 55.66 | Apr 20 2020 | 2 |
13 | 4 | 46.89 | Apr 20 2020 | 1 |
14 | 5 | 88.33 | Apr 20 2020 | 1 |
Once RowNumberWithinDuplicateSet is assigned, I use it to remove one of the duplicate records in each set. Using a query like the one below, I use the modulus of RowNumberWithinDuplicateSet to find one row in each set to delete.
delete
d
from
duplicates d
inner join (
select Id,
RANK() OVER (PARTITION BY CustomerId, OrderAmount ORDER BY Id DESC) AS RowNumberWithinDuplicateSet
from duplicates
where
OrderDate = '4/20/2020'
) x on d.id = x.id
where
x.RowNumberWithinDuplicateSet % 2 = 0
After this delete
is executed, rows with Ids 1, 5, and 9 are removed, eliminating the duplicate records. This technique is really effective when the PARTITION BY columns are key fields and the duplicates can be filtered from the main table with a WHERE clause.
SQL Server's RANK function makes the complex process of identifying and removing duplicate records a cinch. The next time you need to write a query to remove duplicate rows from a table, think of the RANK function. Using the RANK function, you can create simple queries to assist you in removing duplicate records.
Top comments (2)
also useful for this ..
e.g.
We can do it by group by easily :). Btw nice approach