
wrighter

Originally published at wrighters.io

Basic Pandas: Renaming a DataFrame column

A very common need in working with pandas DataFrames is to rename a column. Maybe the columns were supplied by a data source like a CSV file and they need cleanup. Or maybe you just changed your mind during an interactive session. Let’s look at how you can do this, because there’s more than one way.

Let’s say we have a pandas DataFrame with several columns.

[ins] In [1]: import pandas as pd 
         ...: import numpy as np
         ...:
         ...: df = pd.DataFrame(np.random.rand(5,5), columns=['A', 'B', 'C', 'D', 'E'])
         ...: df
Out[1]:
          A         B         C         D         E
0  0.811204  0.022184  0.179873  0.705248  0.098429
1  0.905231  0.447630  0.970045  0.744982  0.566889
2  0.805913  0.569044  0.760091  0.833827  0.148091
3  0.285781  0.262952  0.250169  0.496548  0.604798
4  0.420414  0.463825  0.025779  0.287122  0.880970

What if we want to rename the columns? There is more than one way to do this, and I’ll start with an indirect answer that’s not really a rename. Sometimes your desire to rename a column comes with a change to the data, so you may just end up adding a column instead. If you can spare the memory and aren’t dealing with too many columns, adding another column works well for ad-hoc exploration: the original data is still there, so you can always step back and repeat a step. You can complete the rename by dropping the old column. This isn’t very efficient, but it’s quite common in ad-hoc data exploration.

df['e'] = np.maximum(df['E'], .5)
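
To finish a rename done this way, you drop the original column once the new one looks right. Here is a minimal sketch of both steps. It uses a small stand-in frame called demo with made-up values (not the df above) so that it runs on its own:

```python
import numpy as np
import pandas as pd

# Stand-in for the article's df, with fixed values instead of random ones
demo = pd.DataFrame({'A': [0.1, 0.9], 'E': [0.2, 0.8]})

# Step 1: add the new column, derived from the old one
demo['e'] = np.maximum(demo['E'], .5)

# Step 2: drop the old column to complete the "rename"
demo = demo.drop(columns=['E'])

print(list(demo.columns))
```

The df used in the rest of this post keeps both 'E' and 'e', which is why it has six columns in the next examples.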

But let’s say you really do want to just rename the column in place. Here’s an easy way, but it requires you to update all the columns at once.

[ins] In [4]: print(type(df.columns))
         ...:
         ...: df.columns = ['A', 'B', 'C', 'D', 'EEEE', 'e']
<class 'pandas.core.indexes.base.Index'>

Note that the columns are not just a list of strings but an Index, so under the hood the DataFrame does some work to make sure you do the right thing here, such as checking that the new list has the right length.

[ins] In [5]: try:
         ...:     df.columns = ['a', 'b']
         ...: except ValueError as ve:
         ...:     print(ve)
         ...:
Length mismatch: Expected axis has 6 elements, new values have 2 elements

Now, having to set the full column list to rename just one column is not convenient, so there are other ways. First, you can use the rename method. It takes a mapping of old to new column names, so you can rename as many as you wish. Remember, axis 0 or “index” is the primary index of the DataFrame (aka the rows), and axis 1 or “columns” is for the columns. When you pass the mapping as the first argument, rename applies it to the index by default, so you need to pass axis to target the columns.

df.rename({'A': 'aaa', 'B': 'bbb', 'EEE': 'EE'}, axis="columns")

Note that by default rename doesn’t complain about mappings without a match (‘EEE’ is not a column in this example, but ‘EEEE’ is). You can force it to raise an error by passing errors='raise'. Also, this method returns a new, modified DataFrame and leaves the original untouched, so like many DataFrame methods, you need to pass inplace=True if you want the change to persist in your DataFrame. Or you can reassign the result to the same variable.

df.rename({'A': 'aaa', 'B': 'bbb', 'EEE': 'EE'}, axis=1, inplace=True)
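
Here is a small sketch of those three behaviors side by side: the silent skip, errors='raise', and reassignment instead of inplace=True. The demo frame and its values are made up for illustration, and the errors argument assumes a reasonably recent pandas (it was added in 0.25):

```python
import pandas as pd

# Stand-in frame; note the column is 'EEEE', not 'EEE'
demo = pd.DataFrame([[1, 2, 3]], columns=['A', 'B', 'EEEE'])

# Default behavior: the unmatched 'EEE' key is silently skipped
renamed = demo.rename({'A': 'aaa', 'EEE': 'EE'}, axis="columns")
print(list(renamed.columns))

# errors='raise' turns the unmatched key into a KeyError
raised = False
try:
    demo.rename({'A': 'aaa', 'EEE': 'EE'}, axis="columns", errors='raise')
except KeyError as ke:
    raised = True
    print(ke)

# Reassigning the result is the alternative to inplace=True
demo = demo.rename(columns={'A': 'aaa'})
print(list(demo.columns))
```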

You can also change the columns using the set_axis method, with the axis set to 1 or “columns”. Like rename, it returns a new DataFrame, so reassign the result if you want to keep it. Older versions of pandas accepted inplace=True here (and before 1.0 that was effectively the default), but the inplace argument to set_axis was deprecated in pandas 1.5 and removed in 2.0.

df.set_axis(['A', 'B', 'C', 'D', 'E', 'e'], axis="columns")
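
As a quick check of what set_axis does to the original, here is a sketch assuming pandas 1.0 or later, where set_axis returns a new DataFrame by default. The demo frame is a made-up stand-in:

```python
import pandas as pd

demo = pd.DataFrame([[1, 2]], columns=['x', 'y'])

# set_axis replaces every label on the axis and returns a new DataFrame
relabeled = demo.set_axis(['A', 'B'], axis="columns")

print(list(demo.columns))       # the original labels are unchanged
print(list(relabeled.columns))  # the new frame carries the new labels
```

Like assigning to df.columns, set_axis needs the full list of labels, but because it returns a DataFrame it can be used in a method chain.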

The rename method will also take a function. If you pass in the function (or dictionary) as the index or columns parameter, it will apply to that axis. This can allow you to do generic column name cleanup easily, such as removing trailing whitespace like this:

df.columns = ['A ', 'B ', 'C ', 'D ', 'E ', 'e']
df.rename(columns=lambda x: x.strip(), inplace=True)
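
A named function works the same way as the lambda and is easier to reuse when the cleanup has several steps. A sketch follows, where clean is a hypothetical helper and the column names are made up to look like messy CSV headers:

```python
import pandas as pd

# Made-up headers of the kind a CSV export might produce
demo = pd.DataFrame([[1, 2]], columns=[' First Name ', 'AGE '])

def clean(label):
    # Hypothetical helper: trim whitespace, lowercase, use underscores
    return label.strip().lower().replace(' ', '_')

# rename calls clean once per column label
cleaned = demo.rename(columns=clean)
print(list(cleaned.columns))
```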

I’ll also mention that one of the primary reasons for not using inplace=True is method chaining in DataFrame creation and initial setup. Often, you’ll end up doing something like this (contrived, I know).

df = pd.DataFrame(np.random.rand(2, 5), columns=np.random.rand(5)).rename(columns=lambda x: str(x)[0:5])

Which you’ll hopefully agree is much better than this.

df = pd.DataFrame(np.random.rand(2, 5), columns=np.random.rand(5))
df.columns = [str(x)[0:5] for x in df.columns]
