Eliminating Duplicate Emails Efficiently: A Comprehensive Guide to Data Cleanup Using Pandas and SQL
With this article, I will be covering the delete duplicate emails Leetcode problem.
Leetcode describes this problem as easy. That's a super reasonable evaluation as the solution requires only basic SQL or Pandas knowledge.
The problem description is as follows:
Write a solution to delete all duplicate emails, keeping only one unique email with the smallest
For SQL users, please note that you are supposed to write a
DELETEstatement and not a
For Pandas users, please note that you are supposed to modify
After running your script, the answer shown is the
Persontable. The driver will first compile and run your piece of code and then show the
Persontable. The final order of the
Persontable does not matter.
+-------------+---------+ | Column Name | Type | +-------------+---------+ | id | int | | email | varchar | +-------------+---------+ id is the primary key (column with unique values) for this table. Each row of this table contains an email. The emails will not contain uppercase letters.
Using the Pandas library this can be achieved by first in-place sorting by the
id field and then dropping the duplicates from
def delete_duplicate_emails(person: pd.DataFrame) -> None: person.sort_values(by='id', inplace=True) person.drop_duplicates(subset='email', keep='first', inplace=True)
Voila! I love these simple questions.
In SQL we want to run a delete query. We will grab two copies of the
person table and match them based on the
id value. This keeps the
DELETE p1 FROM person p1, person p2 WHERE p1.email = p2.email AND p1.id > p2.id;
As with many problems, there are multiple ways to solve them. These Pandas and SQL solutions are but one way of approaching the delete duplicate question.
Originally published at https://blog.seancoughlin.me.