DEV Community

Categorical variables

Thales Bruno ・ Originally published at thalesbr.uno ・ 5 min read

A categorical variable (sometimes called a nominal variable) can take one of a limited number of possible values, called categories, and there is no intrinsic ordering among them. It uses labels, names, or other descriptors (even numbers) to identify mutually exclusive categories or types of things.

As an example of a categorical variable, consider Nationality, with values like Brazilian, Canadian, French, etc. There is no ordering between the values: we cannot say that Brazilian is higher than Canadian. In short, there is no way to order these categories from highest to lowest or from best to worst.

Other examples of categorical variables could be Region (North, South, East, West), Blood Type (A, B, AB, O) or Smartphone Brand (Apple, Samsung, LG, Xiaomi).

However, if there is a clear order between the categories, we are dealing with an ordinal variable. It is very similar to a categorical variable and is often considered a special kind of one, placed between categorical and quantitative variables. An example of an ordinal variable could be Educational Level (Elementary school education, High school graduate, Some college, College graduate, Graduate degree).
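To make the nominal/ordinal distinction concrete, here is a minimal sketch using pandas' Categorical type. The sample values are made up for illustration; only the category names come from the examples above.

```python
import pandas as pd

# Nominal: Blood Type has categories but no order between them.
blood_type = pd.Categorical(
    ["A", "O", "AB", "A"],  # hypothetical observations
    categories=["A", "B", "AB", "O"],
)

# Ordinal: Educational Level, listed from lowest to highest.
levels = [
    "Elementary school education",
    "High school graduate",
    "Some college",
    "College graduate",
    "Graduate degree",
]
education = pd.Categorical(
    ["Some college", "Graduate degree", "High school graduate"],  # hypothetical
    categories=levels,
    ordered=True,
)

print(blood_type.ordered)  # False
print(education.ordered)   # True
print(education.max())     # Graduate degree

# Asking for the "highest" blood type is meaningless, and pandas refuses.
try:
    blood_type.max()
except TypeError as err:
    print(f"No ordering for blood types: {err}")
```

With ordered=True, comparisons and min/max follow the order in which the categories were listed; without it, pandas treats the values as plain labels.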

In this article, however, we focus on purely categorical (nominal) variables, so let's see what we can do with some categorical data.

Frequency distribution

Once we have a dataset with some categorical variables, the most common thing to do is count the occurrences of each category across the whole dataset. This gives us a frequency distribution.
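Before moving to real data, the idea can be sketched in plain Python with a handful of made-up Blood Type observations (the values below are hypothetical, chosen only to illustrate the counting):

```python
from collections import Counter

# Hypothetical observations of a Blood Type variable
observations = ["A", "O", "O", "B", "A", "O", "AB", "A", "O"]

# A frequency distribution maps each category to its number of occurrences
frequency = Counter(observations)

# most_common() lists the categories from most to least frequent
for category, count in frequency.most_common():
    print(f"{category}: {count}")
```

The counts always add up to the number of observations, which is a handy sanity check.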

Let's take a look at some real data to demonstrate a frequency distribution. We will use the Kaggle Google Play Store Apps dataset from Lavanya Gupta. This dataset has more than 10,000 rows, each representing an app from the Google Play Store, and its features (columns) include the App name, Category, Rating, and others.

We will use pandas to handle the data. First, we import pandas and read only the Category column from the CSV file downloaded from Kaggle. Then we use the unique method to list all values observed in the data. As the output shows, there are 34 distinct values, such as FINANCE, SPORTS, and WEATHER, and there is no order between them (the Events category is not better or higher than the Shopping category, for instance). Note that the last value, '1.9', does not look like a category name; it most likely comes from a malformed row rather than a real app category, which would leave 33 actual categories.

import pandas as pd

df = pd.read_csv("./data/googleplaystore.csv", usecols=['Category'])
categories = df['Category'].unique()

print(f"{len(categories)} categories:")
print(categories)
Output:
34 categories:
['ART_AND_DESIGN' 'AUTO_AND_VEHICLES' 'BEAUTY' 'BOOKS_AND_REFERENCE'
 'BUSINESS' 'COMICS' 'COMMUNICATION' 'DATING' 'EDUCATION' 'ENTERTAINMENT'
 'EVENTS' 'FINANCE' 'FOOD_AND_DRINK' 'HEALTH_AND_FITNESS' 'HOUSE_AND_HOME'
 'LIBRARIES_AND_DEMO' 'LIFESTYLE' 'GAME' 'FAMILY' 'MEDICAL' 'SOCIAL'
 'SHOPPING' 'PHOTOGRAPHY' 'SPORTS' 'TRAVEL_AND_LOCAL' 'TOOLS'
 'PERSONALIZATION' 'PRODUCTIVITY' 'PARENTING' 'WEATHER' 'VIDEO_PLAYERS'
 'NEWS_AND_MAGAZINES' 'MAPS_AND_NAVIGATION' '1.9']
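The stray '1.9' in the output above is worth a closer look. One possible way to flag values like it is to check each value against the naming pattern the other categories follow. This is only a sketch: the small DataFrame is a hypothetical stand-in for the real Category column, and the upper-case-with-underscores pattern is an assumption based on the values printed above.

```python
import pandas as pd

# Hypothetical stand-in for the Category column read from the CSV
df = pd.DataFrame({"Category": ["FAMILY", "GAME", "1.9", "TOOLS", "FAMILY"]})

# Assumption: valid category names are upper-case words joined by underscores
looks_valid = df["Category"].str.match(r"^[A-Z_]+$")

suspicious = df.loc[~looks_valid, "Category"]  # values that break the pattern
clean = df[looks_valid]                        # rows we would keep

print(suspicious.tolist())
print(clean["Category"].nunique())
```

Whether to drop or repair such rows depends on the analysis; the counts in the rest of this article are computed on the data as loaded.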

Now that we know all the category values we can have, let's count how many times each category occurs in our data using the value_counts method.

frequency = df['Category'].value_counts()

# frequency is a pandas Series, so we convert it into a DataFrame just for presentation purposes
frequency_dist = frequency.to_frame(name='Frequency')
frequency_dist.index.name = 'Category'

# Using head(10) to show only the first 10 lines
frequency_dist.head(10)
Output:
Category         Frequency
FAMILY                1972
GAME                  1144
TOOLS                  843
MEDICAL                463
BUSINESS               460
PRODUCTIVITY           424
PERSONALIZATION        392
COMMUNICATION          387
SPORTS                 384
LIFESTYLE              382

So, we can see above that the most common category is Family, with 1,972 occurrences. Game and Tools are also common categories. On the other hand, there are few apps from the Beauty category (which does not appear in the first 10 rows).

Relative Frequency

At this point we already know how many apps we have in each category. But what if we wanted to know what percentage of all apps are Medical apps? For that we need the relative frequency of each category, obtained by dividing its frequency by the total number of apps (the sample size, n).

Relative frequency of something = Frequency of something / n

Again, we will use the marvelous pandas. A relative frequency takes a value from 0 to 1, but here we will multiply it by 100 and show the values as percentages instead. So, as you can see below, Medical apps represent approximately 4.27% of all apps in the Google Play Store according to our dataset.

frequency_dist['Relative Frequency (%)'] = frequency_dist['Frequency'] / frequency_dist['Frequency'].sum() * 100

# Using head(10) to show only the first 10 lines
frequency_dist.head(10)
Output:
Category         Frequency  Relative Frequency (%)
FAMILY                1972               18.190204
GAME                  1144               10.552532
TOOLS                  843                7.776035
MEDICAL                463                4.270824
BUSINESS               460                4.243151
PRODUCTIVITY           424                3.911078
PERSONALIZATION        392                3.615903
COMMUNICATION          387                3.569781
SPORTS                 384                3.542109
LIFESTYLE              382                3.523660
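As an aside, pandas can also compute relative frequencies in one step with the normalize=True option of value_counts. The sketch below uses a small made-up sample rather than the Kaggle file, and checks the result against the manual division.

```python
import pandas as pd

# Hypothetical sample of app categories (not the real dataset)
categories = pd.Series(
    ["FAMILY", "GAME", "FAMILY", "TOOLS", "FAMILY", "GAME", "MEDICAL", "FAMILY"],
    name="Category",
)

# normalize=True returns proportions (0 to 1) instead of raw counts
relative_pct = categories.value_counts(normalize=True) * 100

# Same calculation done by hand: frequency / n
manual_pct = categories.value_counts() / len(categories) * 100

print(relative_pct.round(2))
```

Both approaches agree here because the sample has no missing values; value_counts skips missing values by default, so the two can differ when the column contains them.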

Frequency Bar Chart

Finally, we will plot the frequency variable in a bar chart, which is a pretty common way to visualize categorical data.

import plotly.express as px

fig = px.bar(frequency)
fig.update_layout(title='Frequency Distribution of Google Play Store app categories',
                  xaxis_title='Category',
                  yaxis_title='Frequency')
fig.show()

Figure: bar chart of the frequency distribution of Google Play Store app categories.
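If plotly is not at hand, a rough text rendering conveys the same idea. The sketch below hard-codes the five largest frequencies from the table earlier in the article; the text_bar helper and the scale of 50 apps per mark are arbitrary choices made for illustration.

```python
# Top five frequencies copied from the frequency table above
top_categories = {
    "FAMILY": 1972,
    "GAME": 1144,
    "TOOLS": 843,
    "MEDICAL": 463,
    "BUSINESS": 460,
}


def text_bar(count, per_mark=50):
    """Return a bar of '#' marks, one mark per `per_mark` apps (rounded)."""
    return "#" * round(count / per_mark)


for name, count in top_categories.items():
    print(f"{name:<10}{text_bar(count)} {count}")
```

The bar lengths are proportional to the frequencies, which is all a bar chart of a categorical variable needs to show.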

So, in this article we have seen a bit about categorical (or nominal) variables, a pretty common data type in Statistics, Data Analysis, Machine Learning, and so on. This was just an introduction, but we may cover the topic in more depth in upcoming posts.

References

Wikipedia | Categorical variable
UCLA | What is the difference between categorical, ordinal and numerical variables?
Brandon Foltz | Statistics 101: Describing a Categorical Variable
web.ma.utexas.edu | Ordinal Variables
