Category type in pandas

The pandas library for Python supports a data type called Category. When working with a pandas DataFrame, using Category can help in several ways. Let's take a closer look at the Category data type.

What is the Category data type in pandas?

  • Category is a data type that can be used when a column holds a fixed, limited set of string values, such as
    • Months (Jan, Feb)
    • Country names (India, Singapore)
    • Sizes (Small, Medium, Large)
  • Put simply, each distinct string is stored only once, and every row holds a small integer code in its place (for example, Jan - 1, Feb - 2).
  • Categories are similar to ENUM data types in other programming languages such as C/C++ and Java.
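
To make the integer-code idea concrete, here is a minimal sketch using a small made-up Series of month names (the data is hypothetical, not taken from the accidents file). It shows the distinct labels pandas stores and the integer code kept for each row. Note that pandas assigns the codes itself, starting from 0 and following the sorted order of the labels, so they will not necessarily match the Jan - 1, Feb - 2 numbering used as an illustration above.

```python
import pandas as pd

# Hypothetical sample data: a handful of repeated month names.
months = pd.Series(["Jan", "Feb", "Jan", "Mar", "Feb", "Jan"])

# Convert the string column to the category data type.
months_cat = months.astype("category")

# The distinct labels are stored once (sorted alphabetically here).
print(list(months_cat.cat.categories))   # ['Feb', 'Jan', 'Mar']

# Each row only holds a small integer code pointing at one of the labels.
print(months_cat.cat.codes.tolist())     # [1, 0, 1, 2, 0, 1]
```

The `.cat` accessor is only available on category columns; calling it on a plain object column raises an error.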

Advantages of using Category:

  1. Saves a lot of memory by shrinking the storage size of the column
  2. Can increase processing speed

How to use Category in a pandas DataFrame:

- While reading the CSV file:

We can convert a column from object to category while reading the file, as shown below:

import pandas as pd

filename = "~/Downloads/US_Accidents_Dec20.csv"
# Convert the State and City columns to the category data type while reading the CSV file
us_accidents_dec20_cat = pd.read_csv(filename, dtype={'State': 'category', 'City': 'category'})
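
If you do not have the accidents file at hand, the same `dtype` argument can be tried on a tiny in-memory CSV. The sketch below is only an illustration: the three rows and the `ID` column are made up, and `io.StringIO` stands in for the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file: three rows with made-up values.
csv_text = "ID,State,City\n1,CA,Fresno\n2,TX,Austin\n3,CA,Fresno\n"

# Same idea as above: request the category dtype for selected columns.
df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"State": "category", "City": "category"},
)

# State and City come back as category; ID keeps its inferred integer type.
print(df.dtypes)
```

Columns that are not listed in `dtype` are left to pandas' normal type inference.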

- Converting a column into category type:

We can also convert a column after the data has been loaded, as shown below:

# Load the CSV file into a DataFrame
filename = "~/Downloads/US_Accidents_Dec20.csv"
us_accidents_dec20 = pd.read_csv(filename)

# Normal column access
us_accidents_dec20['State']

# Converting to category data type
us_accidents_dec20['State'].astype('category')

Memory comparison between Object vs Category data types:

  • Normal object column:
us_accidents_dec20['State'].memory_usage(deep=True) / 1e6
  • Category column:
us_accidents_dec20['State'].astype('category').memory_usage(deep=True) / 1e6
memory_usage() returns bytes, so dividing by 1e6 gives megabytes. We can clearly see that the storage space for the State column drops from 249 MB to 4 MB, which is a huge difference.
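
The same comparison can be reproduced without downloading the dataset. The following is a rough sketch on synthetic data (four made-up state codes repeated to 100,000 rows); the exact numbers will differ from the ones above and depend on your pandas version, but the category column should come out much smaller.

```python
import pandas as pd

# Hypothetical data: 100,000 rows drawn from only four state codes.
states = pd.Series(["CA", "TX", "NY", "FL"] * 25_000, dtype="object")

# Memory of the plain object column, in megabytes.
object_mb = states.memory_usage(deep=True) / 1e6

# Memory of the same data stored as category, in megabytes.
category_mb = states.astype("category").memory_usage(deep=True) / 1e6

print(f"object:   {object_mb:.2f} MB")
print(f"category: {category_mb:.2f} MB")
```

The saving comes from the small number of distinct values. A column where almost every value is unique (an ID column, for instance) gains little or nothing from being converted to category.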

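Converting to the Category data type can noticeably reduce memory use, and often improves processing speed, on large data sets, especially when a column has only a few distinct values compared to its number of rows.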

Happy Learning!!

P.S.: I used the US accidents data of December 2020 for this post. You can get this data from Kaggle.

