DEV Community

Cover image for Data Science: Saving Pandas Dataframe the Efficient Way
Azad Kshitij
Azad Kshitij

Posted on • Updated on

Data Science: Saving Pandas Dataframe the Efficient Way

So you are done processing your data, now what? You might want to send the processed data to someone else or save the result for later. In such case you you will use pandas method to_format(file_name). let say you want to save to .csv you will write .to_csv("csv.csv"). The most common way is to save it as .csv form because is the most versatile and anyone can open and understand the content of file.

But But But saving it to .csv is not the best way to save a DataFrame, as there are many flaws of saving it to a .csv file. Let say our original dataframe had some columns with object as a datatype and durign the processing we converted it into a category datatype now if we save the dataframe to a .csv, it will loos the changed datatype and we have to start all over again. Now what is the solution?

Pandas support saving to many other file formats such as parquet, feather, pkl, hd5 etc. All these formats saves the information about the datatype. Lets do some comparison between all these formats.

  • I used this dataset to perform the benchmarking.
  • I read the CSV file and saved them into different files.
  • Used seaborn to plot the graphs.

Write Speed

Write Speed

From the above graph we can see that saving to csv takes the most time and all other formats takes significantly less time to save. Feather is the clear winner followed by h5 and pkl then parquet.

Read Time

Read Time

From the above graph we can see that reading a csv takes the most time and all other formats takes significantly less time to read. PKL is the clear winner followed by feather and h5 then parquet.

File Size

File Size

Now this is something interesting as you can see parquet is the clear winner and the runner up feather takes almost 3 times more than that.

From the above results you can decide by yourself which format you want to use. You might stick to csv if you are working with non technical people, but if you are working in tech and collaborating you might find parquet the perfect fit for sharing online because of its low file size.

If you have anything to say comment down below I'm new to blog writing so any type of feedback is appreciated Thanks.

Top comments (0)