In this post we will talk about writing better pandas code just by some simple set of rules and using inbuild function provided by pandas. Your code might be working before but these changes will make your code more readable and more manageable and some time run faster. Don't let the reader know that you are new.
- Writing into csv with unnecessary index.
When you save a DataFrame to a csv format, by default it will be saves with a index column. This will unnecessarily increase the size of our csv file. We can change this behaviour by doing index = False
in the to_csv()
function.
df.to_csv("output_index.csv", index=False)
- Using column names which include spaces.
It is some time okay if you are sending the file to some one non technical but for actual work it sometime creates many problem. use _
instead of spaces, after doing this it will be easy to access them by .
notation in pandas.
df.columns
# Output: Index(['Series_reference', 'Period', 'Data_value', 'Suppressed', 'STATUS', 'UNITS', 'MAGNTUDE', 'Subject', 'Group', 'Series_title_1'], dtype='object')
df.Series_reference
- Filter dataset like a PRO with QUERY method
Use .query()
function provide by pandas instead of using manual conditions.
ex:
# This
df.loc[ (df.Period >= 1971) & (df.Period <= 1973) ]
# Can be simplified as this...
df.query(' 1971 <= Period <= 1973 ')
- query strings with(@ symbol) to easily reach variables
This might be bit controversial as some people like have references of their variables and to make it more readable, but just to inform you about this feature that you can use @
to reference variable in a query string. like this...
min_year = 1971
max_year = 1973
# This:
df.query(f' {min_year} <= Period <= {max_year} ')
# Can be simplified as this...
df.query(' @min_year <= Period <= @max_year ')
- "inplace" method could be removed in future versions, better explicitly overwrite modifications
inplace
method could be removed in future versions so better explicitly modify the DataFrame by reassigning.
# This
df.fillna(inplace=True)
# Should be changed to this...
df = df.fillna()
- Better Vectorization instead of iteration
As I mentioned about the performance of you function the loops are no top of that discussion. You can iterate over each row of you DataFrame but that is not a good idea if you just want to create a new column by applying conditions on you DataFrame.
%%time
for i, row in df.iterrows():
if row['Period'] < 2000:
df.loc[i, 'is_recent'] = False
else:
df.loc[i, 'is_recent'] = True
df.is_recent
"""
CPU times: total: 34.3 s
Wall time: 37.9 s
"""
But by using Vectorization we can significantly reduce the time
%%time
df['is_not_recent'] = df['Period'] >= 2000
"""
CPU times: total: 15.6 ms
Wall time: 993 µs
"""
- Vectorization method are preferable than Apply
The apply method does take less time then iteration it is still not as good as the vectorization.
%%time
df['recent'] = df.apply(lambda row : row['Period'] < 2000, axis=1)
df.recent
"""
CPU times: total: 2.59 s
Wall time: 2.92 s
"""
- df.copy() method
If you want to use a single column as a new DataFrame then you shoud use .copy()
otherwise if you change the newly created dataframe it will also change the same in original dataframe.
new_df = df.copy()
- Chaining formulas is better than creating many intermediate dataframes
Chaining operations will be a lot more useful than you think just look at this example...
# This
df1 = df.query('Period > 2000')
df2 = df1.groupby(["STATUS"])[['Period']].min()
df_out = df2.sort_values('Period')
# Can be simplified as this...
df_out = (df.query('Period > 2000')
.groupby(["STATUS"])[['Period']].min()
.sort_values('Period'))
df_out
Now, you decide by your self which one you want to use.
- Properly set column dtypes
When it comes to memory pandas can be a 'memory hog' but there are ways to keep the memory usage low. You can try changing dtypes
of a column. Sometime you have a column that stores 'Y' for True and 'N' for 'False' you can change the type of that column which will speedup the operations on that column as well as reduce the size of DataFrame. Try applying these conversions...
string -> category or bool or number
float63 -> Smaller float values like float32 or float16
int64 -> smaller int values like int32 or int16
These are the 10 ways in which you can improve your pandas code, in future article we will talk about more ways to improve your datascience projects. If you have anything to say comment down below I'm new to blog writing so any type of feedback is appreciated Thanks.
Top comments (0)