Azad Kshitij

Posted on Dec 21, 2022

Data Science: 10 Ways to improve your pandas code

#datascience #pandas #python #beginners

In this post we will talk about writing better pandas code just by some simple set of rules and using inbuild function provided by pandas. Your code might be working before but these changes will make your code more readable and more manageable and some time run faster. Don't let the reader know that you are new.

Writing into csv with unnecessary index.

When you save a DataFrame to a csv format, by default it will be saves with a index column. This will unnecessarily increase the size of our csv file. We can change this behaviour by doing index = False in the to_csv() function.

df.to_csv("output_index.csv", index=False)

Using column names which include spaces.

It is some time okay if you are sending the file to some one non technical but for actual work it sometime creates many problem. use _ instead of spaces, after doing this it will be easy to access them by . notation in pandas.

df.columns 
# Output: Index(['Series_reference', 'Period', 'Data_value', 'Suppressed', 'STATUS', 'UNITS', 'MAGNTUDE', 'Subject', 'Group', 'Series_title_1'], dtype='object')
df.Series_reference

Filter dataset like a PRO with QUERY method

Use .query() function provide by pandas instead of using manual conditions.
ex:

# This
df.loc[ (df.Period >= 1971) & (df.Period <=  1973) ]
# Can be simplified as this...
df.query(' 1971 <= Period <= 1973 ')

query strings with(@ symbol) to easily reach variables

This might be bit controversial as some people like have references of their variables and to make it more readable, but just to inform you about this feature that you can use @ to reference variable in a query string. like this...

min_year = 1971
max_year = 1973
# This:
df.query(f' {min_year} <= Period <= {max_year} ')
# Can be simplified as this...
df.query(' @min_year <= Period <= @max_year ')

"inplace" method could be removed in future versions, better explicitly overwrite modifications

inplace method could be removed in future versions so better explicitly modify the DataFrame by reassigning.

# This 
df.fillna(inplace=True)
# Should be changed to this... 
df = df.fillna()

Better Vectorization instead of iteration

As I mentioned about the performance of you function the loops are no top of that discussion. You can iterate over each row of you DataFrame but that is not a good idea if you just want to create a new column by applying conditions on you DataFrame.

%%time
for i, row in df.iterrows():
    if row['Period'] < 2000:
        df.loc[i, 'is_recent'] = False
    else:
        df.loc[i, 'is_recent'] = True

df.is_recent
"""
CPU times: total: 34.3 s 
Wall time: 37.9 s
"""

But by using Vectorization we can significantly reduce the time

%%time
df['is_not_recent'] = df['Period'] >= 2000
"""
CPU times: total: 15.6 ms 
Wall time: 993 µs
"""

Vectorization method are preferable than Apply

The apply method does take less time then iteration it is still not as good as the vectorization.

%%time
df['recent'] = df.apply(lambda row : row['Period'] < 2000, axis=1)
df.recent
"""
CPU times: total: 2.59 s 
Wall time: 2.92 s
"""

df.copy() method

If you want to use a single column as a new DataFrame then you shoud use .copy() otherwise if you change the newly created dataframe it will also change the same in original dataframe.

new_df = df.copy()

Chaining formulas is better than creating many intermediate dataframes

Chaining operations will be a lot more useful than you think just look at this example...

# This 
df1 = df.query('Period > 2000')
df2 = df1.groupby(["STATUS"])[['Period']].min()
df_out = df2.sort_values('Period')

# Can be simplified as this...
df_out = (df.query('Period > 2000')
    .groupby(["STATUS"])[['Period']].min()
    .sort_values('Period'))
df_out

Now, you decide by your self which one you want to use.

Properly set column dtypes

When it comes to memory pandas can be a 'memory hog' but there are ways to keep the memory usage low. You can try changing dtypes of a column. Sometime you have a column that stores 'Y' for True and 'N' for 'False' you can change the type of that column which will speedup the operations on that column as well as reduce the size of DataFrame. Try applying these conversions...

string -> category or bool or number
float63 -> Smaller float values like float32 or float16
int64 -> smaller int values like int32 or int16

These are the 10 ways in which you can improve your pandas code, in future article we will talk about more ways to improve your datascience projects. If you have anything to say comment down below I'm new to blog writing so any type of feedback is appreciated Thanks.

References

https://www.youtube.com/watch?v=_gaAoJBMJ_Q

DEV Community

Data Science: 10 Ways to improve your pandas code

References

Top comments (0)

Read next

6 Powerful Python Techniques for Efficient Graph Processing and Analysis

How AI Systems Think: New Framework Reveals Machine Reasoning Through 'Thought Logging'

Automate Your Job Search: Scraping 400+ LinkedIn Jobs with Python

Building REST API Endpoints with Django REST Framework: A Step-by-Step Guide