loading...
Cover image for Data Munging with Common Pandas Operations

Data Munging with Common Pandas Operations

charlesdlandau profile image Charles Landau Originally published at charlesdlandau.io ・18 min read

Why?

Why am I writing the billionth article on Pandas data munging? Aren't the other articles enough? Maybe! If you already read a few of those and you feel like you know a lot about Pandas maybe your time would be better spent moving on to some of these materials:

Getting started

Quick note: throughout this article I'm going to refer to features and observations. When you start working with multidimensional problems it's an important distinction, but in this article a feature is the same as a spreadsheet column, and an observation is the same as a spreadsheet row.

import numpy as np
import pandas as pd

We're going to work with some random data, which you can construct directly. Pandas DataFrames are constructed from Pandas Series objects, which in turn use numpy objects and operations for many of their behaviors.

# Create a DataFrame with some randomized columns
df = pd.DataFrame({
    "str_feature": [np.random.choice(["dog", "cat", "snake"]) for _ in range(10000)],
    "int_feature": np.arange(10000)
})

The .head method can grab the top n rows, five by default.

df.head()
str_feature int_feature
0 snake 0
1 dog 1
2 cat 2
3 cat 3
4 dog 4
df.head(10)
str_feature int_feature
0 snake 0
1 dog 1
2 cat 2
3 cat 3
4 dog 4
5 snake 5
6 snake 6
7 dog 7
8 cat 8
9 dog 9

Of course in some circumstances we don't want to take the first 5 or 10. We can take a pseudo-random sample of the data using .sample. Note that .sample defaults to 1 observation.

df.sample(5)
str_feature int_feature
2476 cat 2476
2793 dog 2793
3120 snake 3120
9338 snake 9338
3593 dog 3593

We can get the underlying types by accessing the .dtypes attribute.

df.dtypes
str_feature    object
int_feature     int32
dtype: object

It should be stressed that object features are less memory efficient than int features. A discussion of the basic dtypes is included in the official getting started guide.

Descriptive statistics can be reported using .describe -- note that only numerical features will be summarized.

df.describe()
int_feature
count 10000.00000
mean 4999.50000
std 2886.89568
min 0.00000
25% 2499.75000
50% 4999.50000
75% 7499.25000
max 9999.00000

The descriptive statistics in this report have their own methods, for example here's std:

df.std()
int_feature    2886.89568
dtype: float64

pandas-profiling is a new-ish package that creates much more detailed reports from a DataFrame. You can check it out here.

.size returns the number of observations multiplied by the number of features.

df.size
20000

.info returns metadata about the DataFrame

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 2 columns):
str_feature    10000 non-null object
int_feature    10000 non-null int32
dtypes: int32(1), object(1)
memory usage: 78.2+ KB

Selection and Assignment

We tend to access data from the DataFrame using the following patterns:

  1. df[condition]
  2. df.iloc & df.loc
  3. df.at & df.iat
# First pattern
df[df['str_feature'] == 'snake'].head()
str_feature int_feature
0 snake 0
5 snake 5
6 snake 6
14 snake 14
18 snake 18
# Second pattern
df.loc[:, "int_feature"]
0          0
1          1
2          2
3          3
4          4
        ... 
9995    9995
9996    9996
9997    9997
9998    9998
9999    9999
Name: int_feature, Length: 10000, dtype: int32

Note that here the lone : means to access everything on that axis, the same way my_list[:] tells a plain Python list to access everything.

.iloc behaves similarly but works on axes only.

df.iloc[0:5]
str_feature int_feature
0 snake 0
1 dog 1
2 cat 2
3 cat 3
4 dog 4
df.iloc[[0,5]]
str_feature int_feature
0 snake 0
5 snake 5

.at and .iat are used similarly to .loc and .iloc but they can't retrieve more than one observation.

df.at[0:5, 'str_feature'] will give a ValueError because of 0:5.

df.at[0] will give a TypeError because there is no col specified.

df.at[0, 'str_feature']
'snake'

There is a discussion on SO about how you can get some performance gains by accepting the limitations of .at.

There's one more method I want to cover for accessing data from our DataFrame:

df.int_feature.head()
0    0
1    1
2    2
3    3
4    4
Name: int_feature, dtype: int32

This works fine until:

  1. Your feature is named my super feature and you try to access df.my super feature
  2. You forget that the reverse operation is illegal.

That's right, you cannot do this:

df.new_feature = df[df['int_feature'] == 1]
c:\users\cdl\documents\personalproject\common_pandas_blog\.venv\lib\site-packages\ipykernel_launcher.py:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
  """Entry point for launching an IPython kernel.

Trying to will trigger a warning, at least in in newer versions of Pandas. We can confirm that the feature was not added to the DataFrame:

df.head()
str_feature int_feature
0 snake 0
1 dog 1
2 cat 2
3 cat 3
4 dog 4

Instead, let's add a new feature like so:

df['new_feature'] = df.loc[:, 'str_feature'] == 'snake'

Now we can see that the new feature was added properly:

df.head()
str_feature int_feature new_feature
0 snake 0 True
1 dog 1 False
2 cat 2 False
3 cat 3 False
4 dog 4 False

Manipulation

Let's perform some common tasks to manipulate our data. Namely:

  1. Filter values
  2. Handle missing or invalid values
  3. SQL-like join two DataFrames
  4. SQL-like groupby

1. Filtering values

To filter values, we can simply overwrite our df variable like so:

# We're cat people now.
df = df[df.loc[:, 'str_feature'] == 'cat']
df.head()
str_feature int_feature new_feature
2 cat 2 False
3 cat 3 False
8 cat 8 False
10 cat 10 False
11 cat 11 False

Since we've just filtered our DataFrame to use only use cat rows, our snakes-only new_feature will be full of False values and str_feature will be all cats. Let's filter our columns.

df.drop(["new_feature", "str_feature"], axis=1, inplace=True)
df.head()
int_feature
2 2
3 3
8 8
10 10
11 11

As discussed here, the axis argument tells pandas to drop on the basis of the features axis, and the inplace argument makes this equivalent to df = df.drop(...).

As a final note, don't make the mistake of assuming that any of these methods will automatically free up memory. Instead, use tools like Python del and pandas I/O features like low_memory or memory_map.

2. Handle missing or invalid values

Another common data munging task is to handle cases where the data is invalid in some way. Let's make a new, messy DataFrame:

# Create a DataFrame with some randomized columns
df = pd.DataFrame({
    "str_feature": [np.random.choice(["dog", "cat", "snake"]) for _ in range(10000)],
    "int_feature": np.arange(10000),
    "messy_feature": [np.random.choice([1, None, np.nan, False, True, -1]) for _ in range(10000)]
})

This simplest thing we could do is clean our data by filtering all observations with np.nan, using .isna:

# Note the rarely used ~ negation operator below
clean_df = df[~df['messy_feature'].isna()]
clean_df.sample(10)
str_feature int_feature messy_feature
4134 cat 4134 -1
2127 snake 2127 -1
7014 dog 7014 False
922 dog 922 False
5451 dog 5451 True
5059 cat 5059 -1
1289 snake 1289 False
5016 dog 5016 -1
9765 snake 9765 True
7094 cat 7094 1

As you can see, this method will clear just the NaN values. That's great, but what if we only want values that are truthy in Python?

df[df['messy_feature']] won't work, because it's a ValueError to filter using a column with NaN values. Instead, we have to filter the NaN, then check for truthiness. Just like with a plain Python conditional and statement, the conditions are checked left to right.

clean_df = df[~df['messy_feature'].isna() & df['messy_feature']]
clean_df.sample(10)
str_feature int_feature messy_feature
1049 snake 1049 1
4769 snake 4769 -1
4924 cat 4924 -1
6907 snake 6907 -1
5914 dog 5914 True
6291 cat 6291 1
5872 dog 5872 1
1794 cat 1794 1
8987 cat 8987 1
3848 dog 3848 True

Imputing values is easy too:

clean_df = df.copy()
clean_df['messy_feature'].fillna(clean_df['str_feature'], inplace=True)
clean_df.sample(10)
str_feature int_feature messy_feature
8063 snake 8063 snake
3901 cat 3901 cat
3702 dog 3702 -1
906 dog 906 1
8039 dog 8039 dog
2433 dog 2433 -1
4996 snake 4996 False
3015 snake 3015 snake
8307 cat 8307 1
1904 cat 1904 cat

And we can go beyond simple .fillna:

clean_df = df.copy()
clean_df['messy_feature'] = clean_df.apply(
    lambda row: row['messy_feature'] if row['messy_feature'] == -1 else row['str_feature'],
    axis=1
)
clean_df.sample(10)
str_feature int_feature messy_feature
9217 snake 9217 snake
5018 snake 5018 snake
6383 dog 6383 dog
1297 cat 1297 -1
9701 dog 9701 dog
5192 cat 5192 cat
4018 dog 4018 dog
7619 snake 7619 snake
9890 dog 9890 dog
5102 snake 5102 snake

Just be careful with .apply as it has the potential to significantly degrade performance. See: enhancing performance in the Pandas docs.

3. SQL-like joins

I would definitely recommend that you learn about SQL joins and then just map that knowledge to Pandas joins. The arguments like on, how, left, and right all make sense if you just keep in mind how the joins would work in SQL.

merged_df = df.copy().join(df.copy(), how='left', lsuffix="_l", rsuffix="_r")
merged_df
str_feature_l int_feature_l messy_feature_l str_feature_r int_feature_r messy_feature_r
0 snake 0 False snake 0 False
1 cat 1 NaN cat 1 NaN
2 snake 2 False snake 2 False
3 dog 3 False dog 3 False
4 snake 4 None snake 4 None
... ... ... ... ... ... ...
9995 snake 9995 False snake 9995 False
9996 dog 9996 True dog 9996 True
9997 snake 9997 NaN snake 9997 NaN
9998 dog 9998 NaN dog 9998 NaN
9999 dog 9999 1 dog 9999 1

10000 rows × 6 columns

4. SQL-like groupby

Like the SQL JOIN advice, I strongly recommend learning about SQL GROUP BY and letting that be the basis of your Pandas .groupby knowledge. Here's a simple GROUP BY SUM in Pandas

df.groupby('str_feature').sum()
int_feature
str_feature
cat 16476822
dog 16729175
snake 16789003

Note that .groupby must be called by a method that aggregates the grouping. Let's see what happens if we neglect that part...

df.groupby('str_feature')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0A6F91D0>

.groupby is particularly powerful when combined with the .apply method...but my earlier warning about the performance of .apply still holds.

# Simple sum/size calculation
df.groupby(['str_feature']).apply(lambda x: x.sum()/x.size)
int_feature messy_feature
str_feature
cat 2485.191855 0.088688
dog 2509.627213 0.086709
snake 2504.326223 0.089499

Moving Windows

Often we want to calculate rolling values (like rolling sums):

df['rolling'] = df.rolling(3).sum()
df.head(10)
str_feature int_feature messy_feature rolling
0 snake 0 False NaN
1 cat 1 NaN NaN
2 snake 2 False 3.0
3 dog 3 False 6.0
4 snake 4 None 9.0
5 snake 5 NaN 12.0
6 snake 6 None 15.0
7 cat 7 -1 18.0
8 cat 8 NaN 21.0
9 snake 9 NaN 24.0

Indexing by Time

A related task is to use datetime, which we accomplish using pd.to_datetime. There are helper methods like .astype for casting to datetime types, as well as .set_index if we want to use a datetime column as our index.

import time
df['Datetime'] = pd.to_datetime([time.time() - (86400*x) for x in range(df.shape[0])], unit='s')

# Round to nearest day
# see https://stackoverflow.com/a/13788301/10553976
df['Datetime'] = np.round(df['Datetime'].astype(np.int64), -9).astype('datetime64[ns]')

# Set as index
df = df.set_index('Datetime')

I/O

One of the most attractive features of Pandas is the robust I/O capabilities. The whole list is here but let's take a look at two important ones, CSV and SQL.

CSV

Writing a CSV in Pandas couldn't be easier:

df.to_csv('./data.csv')

Reading it back in is just as simple:

df = pd.read_csv('./data.csv')

But wait: let's look at our df now...

df.head()
Datetime str_feature int_feature messy_feature rolling
0 2019-10-08 21:51:50 snake 0 False NaN
1 2019-10-07 21:51:50 cat 1 NaN NaN
2 2019-10-06 21:51:50 snake 2 False 3.0
3 2019-10-05 21:51:50 dog 3 False 6.0
4 2019-10-04 21:51:50 snake 4 NaN 9.0

We can't infer that any particular column is the index, so .read_csv created a DataFrame with a simple index instead of using the DatetimeIndex we set before writing to a file. Of course we could always use .set_index after the fact, but in resource constrained scenarios we want control of how large the created DataFrame will be. Searching for a solution, we come to the .read_csv documentation and the power of this tool immediately becomes apparent:

  • Read CSVs from URL (!!!!!), path, or file-like
  • Specify the exact window in a CSV file to read from with kwargs like header, index_col, usecols, nrows, skiprows, and skipfooter
  • Manage NaN with kwargs like na_filter and na_values
  • sep for handling tsv and other delimiter situations
  • Kwargs to help save memory like low_memory, memory_map, and dtype

Armed with tons of new knowledge about how to read csv files with Pandas, we can confidently read our data.csv file with the DatetimeIndex specified.

df = pd.read_csv('./data.csv', index_col='Datetime')
df.head()
str_feature int_feature messy_feature rolling
Datetime
2019-10-08 21:51:50 snake 0 False NaN
2019-10-07 21:51:50 cat 1 NaN NaN
2019-10-06 21:51:50 snake 2 False 3.0
2019-10-05 21:51:50 dog 3 False 6.0
2019-10-04 21:51:50 snake 4 NaN 9.0

Just the way it originally was! Let's clean that file up:

import os
os.remove('./data.csv')

Note: if you aren't using Pandas features for data manipulation, consider using the stdlib csv library for I/O instead of Pandas. Pandas objects create some overhead in memory, and Pandas itself is a nontrivial dependency.

SQL

If you want to read and write data that lives in a SQL database, Pandas uses the pd.read_sql and .to_sql methods to interface with databases. In this example we'll use the stdlib sqlite3 dbAPI, but Pandas can integrate with any kind of DB that is supported by sqlalchemy. Also note that we're going to use a try...finally block to enforce .close always being called.

import sqlite3

# DB operations
try:
    # Connect to db
    conn = sqlite3.connect('data.db')
    # Write dataframe to table 'df'
    df.to_sql('df', conn, if_exists='replace')

    # Read a SELECT * query of the same table, specifying index column
    sqldf = pd.read_sql('SELECT * FROM df', conn, index_col='Datetime')

# Ensure the connection closes
finally:
    conn.close()

# Cleanup file
os.remove('data.db')
sqldf.head()
str_feature int_feature messy_feature rolling
Datetime
2019-10-08 21:51:50 snake 0 False NaN
2019-10-07 21:51:50 cat 1 None NaN
2019-10-06 21:51:50 snake 2 False 3.0
2019-10-05 21:51:50 dog 3 False 6.0
2019-10-04 21:51:50 snake 4 None 9.0

Once again our data has made a round trip from memory to disk and back again. Well done Pandas and well done to you, dear reader!

Wrapping Up

In this article you've learned how Pandas implements some basic data munging tasks, specifically:

  1. Extracting metadata and descriptive statistics
  2. Filtering, imputing, joining and grouping data
  3. Windowed functions (like rolling sum)
  4. Using Datetime data and indexing by time
  5. Reading and writing data to files and databases

Happy wrangling!

Image by Cimberley on Pixabay

Posted on by:

charlesdlandau profile

Charles Landau

@charlesdlandau

Data Scientist | Sr. Consultant at Guidehouse

Discussion

markdown guide
 

There are indeed lots of data munging tuts out there but this one is a great one. Super concise and helpful