The pandas library for Python supports a data type called category. When working with a pandas DataFrame, using category for suitable columns can help in several ways. Let's look at the category data type.
What is the category data type in pandas?
- Category is a data type that can be used when a column holds a fixed, limited set of string values, such as:
- Months(Jan, Feb)
- Country Names(India, Singapore)
- Size(Small, Medium, Large)
- Put simply, each distinct string is stored once and every row holds a small integer that stands for it (conceptually, Jan becomes 1, Feb becomes 2, and so on).
- Categories are similar to ENUM data types in other programming languages such as C/C++ and Java.
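The integer mapping described above can be inspected directly. Below is a minimal sketch using a small made-up Series (the values are hypothetical and not taken from the accidents data). Note that pandas numbers its codes from 0 and, for plain strings, orders the categories alphabetically unless told otherwise.

```python
import pandas as pd

# Hypothetical sample values, used only for illustration
sizes = pd.Series(["Small", "Large", "Medium", "Small", "Large"], dtype="category")

# The distinct labels are stored once
print(list(sizes.cat.categories))  # ['Large', 'Medium', 'Small']

# Each row stores only an integer code pointing at one of those labels
print(sizes.cat.codes.tolist())    # [2, 0, 1, 2, 0]
```

The `.cat` accessor is only available on Series with the category dtype.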
Advantages of using Category:
- Saves a lot of memory when a column has few distinct values compared with its number of rows
- Can speed up processing, for example when grouping by that column
How to use category in a pandas DataFrame:
- While reading the CSV file:
We can convert columns from object to category while reading the file, as shown below:
import pandas as pd

filename = "~/Downloads/US_Accidents_Dec20.csv"
# Converting into category data type while reading CSV file
us_accidents_dec20_cat = pd.read_csv(filename, dtype={'State': 'category', 'City': 'category'})
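If you do not have the file at hand, the same idea can be tried on a tiny in-memory CSV. This is only a sketch: the rows and the `ID` column below are made up for illustration, and only the `dtype` argument mirrors the call above.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real file
csv_text = "ID,State,City\n1,OH,Dayton\n2,CA,Sacramento\n3,OH,Columbus\n"

# Only the listed columns become category; the rest are inferred as usual
sample = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"State": "category", "City": "category"},
)

print(sample.dtypes)
```

Columns not named in the `dtype` dictionary keep whatever type pandas infers for them.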
- Converting column into category type:
We can also convert a column after the file has been loaded, as shown below:
# Loading csv file into data frame
filename = "~/Downloads/US_Accidents_Dec20.csv"
us_accidents_dec20 = pd.read_csv(filename)
# Normal column access
us_accidents_dec20['State']
# Converting to category data type (astype returns a new Series; the DataFrame column itself is unchanged)
us_accidents_dec20['State'].astype('category')
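Keep in mind that `astype` returns a new Series instead of changing the column in place, so the result has to be assigned back if the DataFrame should keep it. A minimal sketch on a made-up DataFrame (the data is hypothetical):

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({"State": ["OH", "CA", "OH", "TX"]})

# Calling astype alone leaves the DataFrame column as it was
converted = df["State"].astype("category")
dtype_before = str(df["State"].dtype)

# Assigning the result back makes the change stick
df["State"] = converted
dtype_after = str(df["State"].dtype)

print(dtype_before, "->", dtype_after)
```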
Memory comparison between object and category data types:
- Normal object column (dividing the byte count by 1e6 gives megabytes):
us_accidents_dec20['State'].memory_usage(deep=True) / 1e6
Result:
249.720047
- Category column:
us_accidents_dec20['State'].astype('category').memory_usage(deep=True) / 1e6
Result:
4.23684
We can clearly see that the memory used by this column drops from about 249.7 MB to about 4.2 MB, a reduction of roughly 98%.
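The same comparison can be reproduced without downloading the data set. The sketch below builds a made-up column of 100,000 rows with only five distinct values. The exact numbers will differ from the ones above and depend on the pandas version, so none are quoted here.

```python
import pandas as pd

# Hypothetical column: 100,000 rows but only 5 distinct values
states = pd.Series(["OH", "CA", "TX", "NY", "FL"] * 20_000)
states_cat = states.astype("category")

# memory_usage(deep=True) also counts the string data itself
string_mb = states.memory_usage(deep=True) / 1e6
category_mb = states_cat.memory_usage(deep=True) / 1e6

print(f"strings:  {string_mb:.3f} MB")
print(f"category: {category_mb:.3f} MB")
```

The saving comes from storing each distinct string once plus one small integer per row, so it shrinks as the number of distinct values approaches the number of rows.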
Converting suitable columns to the category data type can reduce memory use and improve processing speed on large data sets, especially when a column has few distinct values compared with its number of rows.
Happy Learning!!
P.S.: I used the US accidents data for December 2020. You can get this data from Kaggle.