Pandas is one of my favorite Python libraries, and I use it every day. A very common, and pretty basic, task is adding a column to a DataFrame. I’m going to walk through a few examples to show what happens when we add a column, and how we need to think about the index of our data when we do.
Let’s start with a very simple DataFrame. It has 4 columns of random floating point values, and its index is the default: a RangeIndex the size of the DataFrame. I’ll assume this Python code is run in either a Jupyter notebook or an IPython session with pandas installed. I used version 1.1.0 when I wrote this.
    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.rand(6,4), columns=['a', 'b', 'c', 'd'])
    display(df)

              a         b         c         d
    0  0.028948  0.613221  0.122755  0.754660
    1  0.880772  0.581651  0.968752  0.551583
    2  0.107115  0.511918  0.574167  0.871300
    3  0.830062  0.622413  0.118231  0.444581
    4  0.264822  0.370572  0.001680  0.394488
    5  0.749247  0.412359  0.092063  0.350451
Let’s start with the simplest way to add a column: assigning a single value. The value is applied to every row in the DataFrame.
    df['e'] = .5
    display(df['e'])

    0    0.5
    1    0.5
    2    0.5
    3    0.5
    4    0.5
    5    0.5
    Name: e, dtype: float64
Now, under the hood, pandas is making life easier for you: it takes your scalar value (the 0.5), turns it into an array, and uses it to build a Series with the index (in this case a RangeIndex) of your DataFrame.
This is sort of the equivalent:
df['e_prime'] = pd.Series(.5, index=pd.RangeIndex(6))
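As a small sketch of the same idea, the explicit Series can reuse the DataFrame's own index rather than hard-coding `RangeIndex(6)`. This assumes the same six-row DataFrame as above and simply checks that both spellings produce the same column; it illustrates the behavior and does not show what pandas literally runs internally.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(6, 4), columns=["a", "b", "c", "d"])

# Scalar assignment: pandas broadcasts 0.5 to every row.
df["e"] = 0.5

# Roughly the same thing spelled out, reusing the frame's own index
# instead of hard-coding RangeIndex(6).
df["e_prime"] = pd.Series(0.5, index=df.index)

# Both columns hold identical values.
same = (df["e"] == df["e_prime"]).all()
print(same)
```

Using `df.index` keeps the example working if the DataFrame has a different number of rows or a non-default index.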
You can also pass in an array yourself without an index, but its length must match the number of rows in your DataFrame:
df['f'] = np.random.rand(6,1)
If you try to do this with a non-matching length, it won’t work, because pandas has no way to know which rows the values belong to. You can try it and see the exception that pandas raises (a ValueError).
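Here is a minimal sketch of that failure, assuming the same six-row DataFrame. The column name `bad` and the variable `error_message` are made up for this illustration, and the exact wording of the message may differ between pandas versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(6, 4), columns=["a", "b", "c", "d"])

error_message = None
try:
    # Five values for a six-row DataFrame: the lengths do not match.
    df["bad"] = np.random.rand(5)
except ValueError as err:
    error_message = str(err)
    print(f"ValueError: {error_message}")

# The failed assignment does not add the column.
print("bad" in df.columns)
```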
Now what happens when the data you want to add doesn’t match your current DataFrame, but it does have an index? Specifically, what if the index is different on the right hand side?
    df['g'] = pd.Series(np.random.rand(50), index=pd.RangeIndex(2,52))
    display(df[['e', 'e_prime', 'f', 'g']])

         e  e_prime         f         g
    0  0.5      0.5  0.777879       NaN
    1  0.5      0.5  0.621390       NaN
    2  0.5      0.5  0.294869  0.283777
    3  0.5      0.5  0.024411  0.695215
    4  0.5      0.5  0.173954  0.585524
    5  0.5      0.5  0.276633  0.751469
So what happened here? Our column g only has values at rows 2 through 5, even though we assigned a Series with 50 values. Those were the only rows whose labels matched our index; the other 46 values were dropped. For the rows that had no matching value, a NaN was inserted. You can try doing this where none of the data matches on the index and see what happens. You’ll end up with a full column of NaNs.
Another way to think of this is that we could use the loc indexer to select the rows we want to update. But unless the right hand side carries its own index, it still has to match the shape of the selection (here, the four rows 2 through 5, since loc slices include both endpoints).
    df.loc[2:5, 'g_prime'] = np.random.rand(4)
    display(df['g_prime'])

    0         NaN
    1         NaN
    2    0.130246
    3    0.419122
    4    0.312587
    5    0.101704
    Name: g_prime, dtype: float64
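One way to picture the alignment is that assigning a Series behaves like reindexing it to the DataFrame's index first. The sketch below, which assumes the same six-row setup, compares the two and also tries a Series with no labels in common. The names `s`, `aligned`, and `h` are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(6, 4), columns=["a", "b", "c", "d"])

# A Series whose labels (2..51) only partly overlap the DataFrame's (0..5).
s = pd.Series(np.random.rand(50), index=pd.RangeIndex(2, 52))
df["g"] = s

# Aligning by hand with reindex keeps labels 0..5 and fills gaps with NaN.
aligned = s.reindex(df.index)
print(df["g"].equals(aligned))

# A Series with no labels in common produces a column made entirely of NaN.
df["h"] = pd.Series(np.random.rand(6), index=pd.RangeIndex(100, 106))
print(df["h"].isna().all())
```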
The main lesson here is that assigning a column to a DataFrame can lead to some surprising results if you don’t know whether what you are assigning has an index, and whether that index matches the DataFrame’s.
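If you are unsure, one possible approach is to check the overlap between the two indexes before assigning, and to strip the index when you really do want the values placed by position. This is only a sketch under assumed data: the Series `s` and the column names `by_label` and `by_position` are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(6, 4), columns=["a", "b", "c", "d"])

# Hypothetical data: six values labelled 10..15, none matching df's 0..5.
s = pd.Series(np.random.rand(6), index=pd.RangeIndex(10, 16))

# Check how many labels the two indexes share before assigning.
overlap = df.index.intersection(s.index)
print(len(overlap))

# Assigned as a Series, the values are aligned by label: all NaN here.
df["by_label"] = s

# Assigned as a plain array, the values are placed by position instead,
# so the length has to match the number of rows.
df["by_position"] = s.to_numpy()

print(df[["by_label", "by_position"]])
```

Positional assignment is only safe when you know the rows are already in the same order, since it discards the labels that pandas would otherwise use to line the data up.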