Marcos

# Understand the difference between quantitative and categorical features

Learn about the different feature types that can be part of a dataset.

In data analysis with pandas DataFrames in Python, understanding the difference between quantitative and categorical features is crucial. Let's break down these concepts with clear explanations and short examples.

## Quantitative vs. Categorical

The columns of a DataFrame are known as the features of the dataset it represents. Each feature can be either quantitative or categorical.

Quantitative features, like height or weight, are measured as numbers whose magnitude is meaningful. These are the features for which we can compute sums, averages, and other numerical summaries. They fall into two types:

1. Continuous: Can take on any value within a range. Example: height, weight, temperature.
2. Discrete: Can only take on specific and distinct values. Example: number of children, number of cars.
```python
import pandas as pd

# Every column holds numbers we can do arithmetic on
df_quant = pd.DataFrame({
    'Height': [1.70, 1.75, 1.60, 1.80],  # continuous
    'Weight': [70, 80, 60, 90],
    'Age': [25, 30, 22, 28],
})

print(df_quant)
```
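To make "sums, averages, and other numerical values" concrete, here is a minimal sketch that rebuilds the same DataFrame and computes a few summaries. The variable names (`total_weight`, `mean_height`, `mean_age`, `summary`) are illustrative choices, not part of the original example.

```python
import pandas as pd

df_quant = pd.DataFrame({
    'Height': [1.70, 1.75, 1.60, 1.80],
    'Weight': [70, 80, 60, 90],
    'Age': [25, 30, 22, 28],
})

# Arithmetic summaries only make sense for quantitative features
total_weight = df_quant['Weight'].sum()
mean_height = df_quant['Height'].mean()
mean_age = df_quant['Age'].mean()

# describe() reports count, mean, std, min, quartiles, and max per numeric column
summary = df_quant.describe()

print("Total weight:", total_weight)
print("Mean height:", round(mean_height, 4))
print("Mean age:", mean_age)
print(summary)
```

Calling `describe()` is often a quick first look at every quantitative column at once.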

Categorical features, such as gender or place of birth, take values that sort the observations into groups. These are the columns we would typically pass to the `groupby` method. They fall into two types:

1. Nominal: They have no intrinsic order. Example: colors (red, blue, green), genders (male, female).
2. Ordinal: They have an intrinsic order. Example: clothing sizes (P, M, G, the Brazilian labels for small, medium, and large), classifications (low, medium, high).
```python
import pandas as pd

# Every column holds labels rather than measurements
df_cat = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Yellow'],    # nominal
    'Size': ['M', 'G', 'P', 'M'],                   # ordinal
    'Gender': ['Female', 'Male', 'Female', 'Male']  # nominal
})

print(df_cat)
```
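The two ideas above, grouping by a categorical feature and the order carried by an ordinal one, can be sketched as follows. The combined DataFrame and the `size_order` list are assumptions made for illustration, and the order P < M < G assumes the labels mean small, medium, and large.

```python
import pandas as pd

df = pd.DataFrame({
    'Gender': ['Female', 'Male', 'Female', 'Male'],
    'Size': ['M', 'G', 'P', 'M'],
    'Height': [1.70, 1.75, 1.60, 1.80],
})

# Nominal feature: use it as the grouping key for a quantitative summary
mean_height_by_gender = df.groupby('Gender')['Height'].mean()
print(mean_height_by_gender)

# Ordinal feature: declare the order explicitly (assumed here: P < M < G)
size_order = ['P', 'M', 'G']
df['Size'] = pd.Categorical(df['Size'], categories=size_order, ordered=True)

# Sorting and min/max now follow the declared order, not the alphabet
sorted_sizes = df.sort_values('Size')['Size'].tolist()
largest = df['Size'].max()
print(sorted_sizes, largest)
```

Without the explicit order, sorting the strings alphabetically would place `G` before `M` and `P`, which is not the intended size order.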

Some features can be interpreted as either quantitative or categorical, depending on the context. For instance, year of birth is quantitative when you compute the average birth year of a group, and categorical when you group records by birth year.
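As a small illustration of this dual role, the sketch below uses one column both ways. The `people` DataFrame, its names, and its birth years are made-up sample data.

```python
import pandas as pd

# Made-up sample data for illustration only
people = pd.DataFrame({
    'Name': ['Ana', 'Bruno', 'Carla', 'Diego'],
    'BirthYear': [1990, 1995, 1990, 2001],
})

# Quantitative use: average the values
mean_birth_year = people['BirthYear'].mean()

# Categorical use: count how many people share each birth year
people_per_year = people.groupby('BirthYear').size()

print("Mean birth year:", mean_birth_year)
print(people_per_year)
```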

## Identifying Quantitative and Categorical Features

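In pandas, a column's data type (`dtype`) gives a useful first guess at whether it is quantitative or categorical. Columns with a numeric dtype such as `int64` or `float64` are usually quantitative, while columns of text, which pandas stores with the `object` dtype or a dedicated string dtype depending on the version, are usually categorical. This is only a heuristic: a numeric column that holds codes, such as a postal code, is still categorical. Categorical columns can be converted to the `category` dtype to reduce memory use.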

```python
import pandas as pd

# Creating a mixed DataFrame
df = pd.DataFrame({
    'Height': [1.70, 1.75, 1.60, 1.80],
    'Weight': [70, 80, 60, 90],
    'Color': ['Red', 'Blue', 'Green', 'Yellow'],
    'Size': ['M', 'G', 'P', 'M'],
})

# Identifying quantitative and categorical columns by dtype.
# 'number' matches every integer and float dtype, not only int64/float64.
quant_cols = df.select_dtypes(include='number').columns.tolist()
# Treat the remaining (non-numeric) columns as categorical
cat_cols = df.select_dtypes(exclude='number').columns.tolist()

print("Quantitative columns:", quant_cols)
print("Categorical columns:", cat_cols)
```
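The conversion to the `category` dtype mentioned above can be sketched like this. The lists are repeated 1000 times only so the memory difference is visible, and the exact byte counts will vary with the pandas version, so none are shown here.

```python
import pandas as pd

# Repeat the small example so each label appears many times
df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Yellow'] * 1000,
    'Size': ['M', 'G', 'P', 'M'] * 1000,
})

bytes_before = df.memory_usage(deep=True).sum()

# Each column becomes a small table of unique labels plus integer codes
df_category = df.astype('category')
bytes_after = df_category.memory_usage(deep=True).sum()

print("Bytes as text:", bytes_before)
print("Bytes as category:", bytes_after)

# The unique labels are available through the .cat accessor
color_labels = df_category['Color'].cat.categories.tolist()
print(color_labels)
```

The saving comes from storing each distinct label once, so it is largest for columns with few unique values and many rows.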

To recap, the two feature types are:

1. Quantitative: Numerical values, continuous or discrete.
2. Categorical: Values representing categories or groups, nominal or ordinal.

Each type of feature requires specific treatment and analysis, so it's important to identify them correctly in order to apply the appropriate techniques in your data analysis and predictive modeling.