In SQL handlingn duplicate records is essential for maintaining data accuracy, optimizing query performance, and ensuring consistent results.This article explores some practicle techniques to identify and delete duplicate rows using SQL queries.
Delete Duplicates Using a Unique Identifier
Consider the above code block
DELETE FROM cars
WHERE id IN (
SELECT MAX(id)
FROM cars
GROUP BY model, brand
HAVING COUNT(1) > 1
);
The above code selects the records from cars and in the above code we want to delete the record that are duplicate in the model and brand we use the id to identify the MAX(we can only get one max) and we delete the only max record
Delete Duplicates Using a Self-Join
In this approach, we use a self-join to identify and delete duplicate rows in the cars table, keeping only the row with the smallest id for each unique combination of model and brand
DELETE FROM cars
WHERE id IN (
SELECT c2.id
FROM cars c1
JOIN cars c2 ON c1.model = c2.model
AND c1.brand = c2.brand
WHERE c1.id < c2.id
);
Delete Duplicates Using a Window Function
DELETE FROM cars
WHERE id IN (
SELECT id
FROM (
SELECT id,
ROW_NUMBER() OVER (PARTITION BY model, brand ORDER BY id) AS rn
FROM cars
) AS x
WHERE x.rn > 1
);
In the above case in the inner subquery, the ROW_NUMBER() function assigns a unique row number to each row within each group of duplicates (defined by model and brand)
Using MIN function
delete from cars
where id not in ( select min(id)
from cars
group by model, brand);
The inner subquery SELECT MIN(id) FROM cars GROUP BY model, brand finds the lowest id for each unique combination of model and brand. This ensures that only one record for each car model and brand pair is retained
The DELETE FROM cars WHERE id NOT IN (...) statement removes records with IDs that aren't the minimum for their model and brand group. Essentially, this keeps only the oldest record and removes duplicates.
Top comments (0)