Originally published at wrighters.io

Removing duplicate data in Pandas

It can be very common when dealing with time series data to end up with duplicate data. This can happen for a variety of reasons, and I’ve run into it more than once and tried different approaches to eliminate the duplicate values. There’s a gem of a solution on Stack Overflow, and I thought it would be helpful to walk through the possible solutions to this issue.

To keep things simple, I’ll just work with a Series of floating point data. This could be anything, but we could pretend it’s something that’s manually maintained, like an earnings estimate for a stock, a temperature reading, or sales for a store on a given date.

>>> import pandas as pd
>>> import numpy as np
>>>
>>> items = pd.Series(np.random.random_sample(10) * 100, pd.date_range('2020-01-01', periods=10))
>>>
>>> items
2020-01-01    65.685092
2020-01-02    15.376536
2020-01-03    89.002061
2020-01-04    38.614841
2020-01-05    39.274465
2020-01-06     6.886290
2020-01-07    96.453986
2020-01-08    56.488125
2020-01-09    61.667168
2020-01-10    18.023665
Freq: D, dtype: float64

At this point, we have 10 periods of data, and the index (a DatetimeIndex with 10 days) has all unique values. But let’s say that corrected data shows up later in the same source file. I’ll do something a bit contrived here and concatenate two Series that have some of the same dates in them, but in real life you can imagine any number of ways that data with duplicated time stamps can end up in your sources.

>>> corrected = pd.Series(np.random.random_sample(3) * 75, pd.date_range('2020-01-04', periods=3))
>>> combined = pd.concat([items, corrected])
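
If we look at combined, we can see the problem: the three corrected dates show up a second time at the end, out of chronological order.

>>> combined
2020-01-01    65.685092
2020-01-02    15.376536
2020-01-03    89.002061
2020-01-04    38.614841
2020-01-05    39.274465
2020-01-06     6.886290
2020-01-07    96.453986
2020-01-08    56.488125
2020-01-09    61.667168
2020-01-10    18.023665
2020-01-04    28.990089
2020-01-05    22.371910
2020-01-06    21.548301
dtype: float64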

Now, how do we get rid of this duplicated data? Let’s say that we want to only keep the most recent data in our file, assuming that it was a correction or updated value that we prefer to use. Instead of going right to the accepted solution on Stack Overflow, I’m going to work through the pandas documentation to see what the possible solutions are, and hopefully end up in the same place!

First, let’s see if we can answer the question of whether our data has duplicate items in the index. In the pandas docs, we see a few promising methods, including a duplicated method, and also a has_duplicates property. Let’s see if those report what we expect.

>>> combined.index.has_duplicates
True
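
The index also has an is_unique property, which answers the inverse question and can be a handy one-line sanity check:

>>> combined.index.is_unique
False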

Now the methods available to look at are duplicated and drop_duplicates. For duplicated, the method will return an array of boolean values, where True indicates a duplicate. You can use the keep argument to keep either the first (the default) or the last occurrence of the value in your index. With drop_duplicates, you get back an Index with the duplicates already removed, and you can pass in the same keep argument with the same meaning.

>>> combined.index.duplicated(keep='last')
array([False, False, False, True, True, True, False, False, False,
       False, False, False, False])
>>> combined.index.drop_duplicates(keep='last')
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-07',
               '2020-01-08', '2020-01-09', '2020-01-10', '2020-01-04',
               '2020-01-05', '2020-01-06'],
              dtype='datetime64[ns]', freq=None)
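
For comparison, the default keep='first' flags the later occurrences (our corrections at the end of combined) as the duplicates instead, which is not what we want here:

>>> combined.index.duplicated()
array([False, False, False, False, False, False, False, False, False,
       False,  True,  True,  True])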

Ok, so what do we do now with these two options? The first option, the boolean array, can be used to pick out the values that we want to keep, but the True values mark the ones we want to drop. That’s easy to fix: just invert the array with ~. The inverted array can then be used to select the rows we want out of the Series, which gets us to a good solution. We just need to sort the index afterwards, since the result is not in chronological order.

>>> ~combined.index.duplicated(keep='last')
array([ True, True, True, False, False, False, True, True, True,
        True, True, True, True])
>>> combined[~combined.index.duplicated(keep='last')].sort_index()
2020-01-01    65.685092
2020-01-02    15.376536
2020-01-03    89.002061
2020-01-04    28.990089
2020-01-05    22.371910
2020-01-06    21.548301
2020-01-07    96.453986
2020-01-08    56.488125
2020-01-09    61.667168
2020-01-10    18.023665
dtype: float64
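
As an aside, duplicated also accepts keep=False, which marks every occurrence of a repeated time stamp as a duplicate. With the same inversion trick, that would leave only the dates that never had a correction, in case that’s the behavior you want:

>>> combined[~combined.index.duplicated(keep=False)]
2020-01-01    65.685092
2020-01-02    15.376536
2020-01-03    89.002061
2020-01-07    96.453986
2020-01-08    56.488125
2020-01-09    61.667168
2020-01-10    18.023665
dtype: float64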

Now if we want to use the second method, drop_duplicates, we need to find a way to use it to grab the values we want to keep out of our Series. This is a bit more complicated. First, we can use the reset_index method, which is a handy way to temporarily turn the index (in our case a DatetimeIndex) into a column of a DataFrame, with a new regular, non-repeating index in its place. Our data looks a little different now, but the new ‘index’ column can be used to remove duplicates.

>>> combined.reset_index()
         index          0
0   2020-01-01  65.685092
1   2020-01-02  15.376536
2   2020-01-03  89.002061
3   2020-01-04  38.614841
4   2020-01-05  39.274465
5   2020-01-06   6.886290
6   2020-01-07  96.453986
7   2020-01-08  56.488125
8   2020-01-09  61.667168
9   2020-01-10  18.023665
10  2020-01-04  28.990089
11  2020-01-05  22.371910
12  2020-01-06  21.548301

Now we can use drop_duplicates, but this time it’s the DataFrame version of the method, which has a subset argument that can be used to consider only a certain column (our new ‘index’ column) when looking for duplicates to drop. Since this is now a DataFrame and not a Series, we then set the index back to our ‘index’ column using set_index and select column 0 to get a Series again. This gives us the same result as the earlier method, but in a much more roundabout way. I also find it a lot more confusing.

>>> combined.reset_index().drop_duplicates(subset='index', keep='last').set_index('index')[0].sort_index()
index
2020-01-01    65.685092
2020-01-02    15.376536
2020-01-03    89.002061
2020-01-04    28.990089
2020-01-05    22.371910
2020-01-06    21.548301
2020-01-07    96.453986
2020-01-08    56.488125
2020-01-09    61.667168
2020-01-10    18.023665
Name: 0, dtype: float64

One other way to do this is to use groupby and a grouping function (in this case last) to select the value we want to keep for each date. This method gives us sorted output and also looks simple.

>>> combined.groupby(combined.index).last()
2020-01-01    65.685092
2020-01-02    15.376536
2020-01-03    89.002061
2020-01-04    28.990089
2020-01-05    22.371910
2020-01-06    21.548301
2020-01-07    96.453986
2020-01-08    56.488125
2020-01-09    61.667168
2020-01-10    18.023665
dtype: float64
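
A small variation on the same idea is to group on the index level rather than passing the index object in explicitly; the two spellings should produce identical results:

>>> combined.groupby(level=0).last().equals(combined.groupby(combined.index).last())
True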

What’s the best way to do this? Like the accepted solution on Stack Overflow, I prefer the first method for readability, but the last one is also pretty simple. One good argument for choosing the first method is speed.

%timeit combined[~combined.index.duplicated(keep='last')].sort_index()
282 µs ± 15.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit combined.reset_index().drop_duplicates(subset='index', keep='last').set_index('index')[0].sort_index()
1.56 ms ± 28.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit combined.groupby(combined.index).last()
578 µs ± 5.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
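
As a quick sanity check (the variable names here are just throwaway labels), Series.equals can confirm that the boolean mask approach and the groupby approach really do produce the same Series:

>>> deduped_mask = combined[~combined.index.duplicated(keep='last')].sort_index()
>>> deduped_groupby = combined.groupby(combined.index).last()
>>> deduped_mask.equals(deduped_groupby)
True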

Well, after digging through all that, I hope you understand a bit more about how to remove duplicate items from a Series or DataFrame, and why some methods might be better choices than others.
