Victor Isaac Oshimua

Posted on Jun 22, 2023 • Edited on Jul 1, 2023

One-Hot Encoding with DictVectorizer

#tutorial #machinelearning #datascience #data

Introduction

Datasets used for training machine learning models usually contain feature columns of various data types. One of these types is categorical features, which are non-numerical.
Examples include:

Name (with values like "Kelvin", "Jonathan")
Gender (with values like "male", "female")
Country (with values like "Nigeria", "USA")

In many cases, categorical features are represented as strings, and most machine learning algorithms cannot process strings unless we convert them to numerical values.

There are various methods for dealing with categorical variables, one of which is one-hot encoding.
This article will guide you on how to implement one-hot encoding with DictVectorizer.

Prerequisites

Basic understanding of Python
Basic understanding of data science libraries, e.g., pandas, NumPy, and scikit-learn
A Jupyter notebook to try the code yourself
Basic understanding of Machine learning

What is One-Hot Encoding?

One-hot encoding is a method for converting categorical variables to numerical values.

One-hot encoding assigns binary features to unique categorical values. If a value is present in an observation, its corresponding feature is set to 1; otherwise, it is set to 0.
For example:

In the above diagram, the original data has a column called "Country" that contains the following values: "NIGERIA", "USA", "JAPAN", and "TOGO".

One-hot encoding created four new binary columns, one for each unique category, with a value of 1 indicating that the category is present and a value of 0 indicating that it is not.
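To make the idea concrete, here is a minimal hand-rolled sketch in plain Python that encodes the same four country values. It is for illustration only and is not how DictVectorizer works internally. The column order (alphabetical) is an assumption and may differ from the order shown in the diagram.

```python
# Hand-rolled one-hot encoding, for illustration only.
countries = ["NIGERIA", "USA", "JAPAN", "TOGO"]

# One binary column per unique category (alphabetical order is an assumption).
categories = sorted(set(countries))

# For each value, put 1 in the column of its category and 0 everywhere else.
encoded = [
    [1 if value == category else 0 for category in categories]
    for value in countries
]

print(categories)
for value, row in zip(countries, encoded):
    print(value, row)
```

Each row contains exactly one 1, which is where the name "one-hot" comes from.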

What is DictVectorizer?

As its name implies, DictVectorizer is a scikit-learn class that transforms lists of feature-value mappings (Python dict objects) into vectors.

Implementing one-hot encoding

Now that you have a grasp of one-hot encoding and DictVectorizer, let's put them into practice.

To implement one-hot encoding, we need tabular data with categorical features, so this guide uses a Kaggle dataset. Follow this link to download the dataset used in this guide.

The following steps will guide you through using DictVectorizer to implement one-hot encoding.

1. Read and process the data

# import libraries
import pandas as pd
import numpy as np
# read the data
data = pd.read_csv("drug200.csv")
# select the categorical columns
columns = ["Sex", "BP", "Cholesterol"]
# select only the first 10 rows
categorical = data[columns].iloc[:10]
categorical

The above code imports the data science libraries needed to read and process the data.
It also reads the CSV (comma-separated values) file into a pandas DataFrame and selects the columns with categorical features.

Here is the output of the code

2. Convert categorical features to a list of dictionaries

categorical_dict = categorical.to_dict(orient="records")
categorical_dict

Here is the output of the code.
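If you do not have the CSV at hand, the following sketch shows the shape of what `to_dict(orient="records")` returns. The two-row DataFrame is hypothetical. It reuses this guide's column names, but the values are made up for illustration.

```python
import pandas as pd

# Hypothetical two-row frame that reuses the article's column names.
sample = pd.DataFrame(
    {
        "Sex": ["F", "M"],
        "BP": ["HIGH", "LOW"],
        "Cholesterol": ["HIGH", "NORMAL"],
    }
)

# orient="records" yields one dictionary per row, keyed by column name.
sample_records = sample.to_dict(orient="records")
print(sample_records)
```

This list-of-dictionaries shape is exactly the input format DictVectorizer expects.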

3. Create an instance of the DictVectorizer class

from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer(sparse=False)

Here's a breakdown of what the code does:

from sklearn.feature_extraction import DictVectorizer:
This line imports the DictVectorizer class from the sklearn.feature_extraction module.
dv = DictVectorizer(sparse=False):
This line creates an instance of the DictVectorizer class and assigns it to the variable dv.
The sparse=False argument is passed to the DictVectorizer constructor, indicating that the result should be a dense NumPy array rather than a sparse matrix.
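To see the difference the `sparse` argument makes, here is a small sketch comparing the default (sparse) output with the dense one. The three records are hypothetical. Both results hold the same values and only the container type differs.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records with a single categorical feature.
records = [{"BP": "HIGH"}, {"BP": "LOW"}, {"BP": "NORMAL"}]

# sparse=True is the default, so this returns a SciPy sparse matrix.
sparse_result = DictVectorizer().fit_transform(records)
# sparse=False returns a regular NumPy array instead.
dense_result = DictVectorizer(sparse=False).fit_transform(records)

print(type(sparse_result))
print(type(dense_result))
# Converting the sparse matrix to dense shows the same values.
print(sparse_result.toarray())
```

A dense array is easier to inspect in a notebook, while the sparse form tends to save memory when there are many categories, since most entries are zero.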

4. Fit and transform the list of dictionaries

# learn the feature names from the data
dv.fit(categorical_dict)
# encode the data using the learned feature names
transformed_data = dv.transform(categorical_dict)
transformed_data

Here is the result of the above code.

We have successfully one-hot encoded the categorical data with DictVectorizer.
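The fit and transform steps can also be combined with `fit_transform`. The sketch below shows this on hypothetical records, along with one behavior worth knowing: a category that was not seen during fitting is silently ignored at transform time, so the row for it is all zeros. The `dv_demo` name and the `"X"` value are made up for illustration.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical training records and new records.
train = [{"Sex": "F"}, {"Sex": "M"}]
new = [{"Sex": "M"}, {"Sex": "X"}]  # "X" was never seen during fit

dv_demo = DictVectorizer(sparse=False)
# fit_transform learns the feature names and encodes in one call.
train_matrix = dv_demo.fit_transform(train)
# transform reuses the learned feature names; unseen ones are dropped.
new_matrix = dv_demo.transform(new)

print(train_matrix)
print(new_matrix)
```

In practice this means you fit once on the training data and call only `transform` on new data, so the columns stay the same.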

To explore further, let's check how the categorical data are represented in the transformed data.

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
dv.get_feature_names_out()

Here is the output of the code

Furthermore, here is a diagram of how the column/feature names map onto the transformed data.

Conclusion

DictVectorizer is a way of performing one-hot encoding on categorical data. It takes a list of dictionaries and transforms it into a numerical matrix (a NumPy array when sparse=False).

DictVectorizer is easy to implement and makes machine learning model deployment simpler.

I would recommend utilizing DictVectorizer to transform your categorical data into numerical representations for your future machine learning projects.

Thanks for reading this article. If you have any further questions or would like to connect, feel free to reach out to me on Twitter and on LinkedIn. I appreciate your engagement and look forward to staying connected.

DEV Community