Marius Borcan

Naive Bayes Classifier Tutorial in Python and Scikit-Learn

This article was originally published on https://programmerbackpack.com.

Naive Bayes Classifier is a simple probabilistic model that is usually used in classification problems. Despite its simplicity, it has shown very good results, and on some problems it can even outperform far more complicated models.

This is the second article in a series of two about the Naive Bayes Classifier, and it deals with implementing the model in Python with Scikit-Learn. For a detailed overview of the math and the principles behind the model, please check the other article: Naive Bayes Classifier Explained.

Interested in more stories like this? Follow me on Twitter at @b_dmarius and I'll post there every new article.

Data for training the Naive Bayes Classifier

In the previous article linked above, I introduced a table of some data that we can train our classifier on. For convenience, I'll paste it again here.

Naive Bayes Classifier training data:

| Weather | Time of week | Time of day | Traffic jam |
| ------- | ------------ | ----------- | ----------- |
| Clear   | Workday      | Morning     | Yes         |
| Clear   | Workday      | Lunch       | No          |
| Clear   | Workday      | Evening     | Yes         |
| Clear   | Weekend      | Morning     | No          |
| Clear   | Weekend      | Lunch       | No          |
| Clear   | Weekend      | Evening     | No          |
| Rainy   | Workday      | Morning     | Yes         |
| Rainy   | Workday      | Lunch       | Yes         |
| Rainy   | Workday      | Evening     | Yes         |
| Rainy   | Weekend      | Morning     | No          |
| Rainy   | Weekend      | Lunch       | No          |
| Rainy   | Weekend      | Evening     | No          |
| Snowy   | Workday      | Morning     | Yes         |
| Snowy   | Workday      | Lunch       | Yes         |
| Snowy   | Workday      | Evening     | Yes         |
| Snowy   | Weekend      | Morning     | Yes         |
| Snowy   | Weekend      | Lunch       | No          |
| Snowy   | Weekend      | Evening     | Yes         |

The purpose of this data is: given 3 facts about a certain moment (the weather, whether it is a weekend or a workday, and whether it is morning, lunch or evening), can we predict if there's a traffic jam in the city?

Note: the dataset is built by me for the sake of simplicity and the values are based on common-sense situations.

Naive Bayes Classifier implementation in Scikit-Learn

Now let's get to work. We need only one dependency installed for this, and that is the scikit-learn Python library. It is one of the most powerful libraries for machine learning and data science and it is free to use. So let's install it.

pip3 install scikit-learn
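To check that the installation worked, you can print the installed version (just a quick sanity check, not strictly required for the rest of the tutorial):

    python3 -c "import sklearn; print(sklearn.__version__)"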

Now let's import what we need. We only have two imports, and I'll explain why we need them.

from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB

The sklearn library contains more than one Naive Bayes classifier, and each differs in its implementation and in the assumptions it makes about the data. No single implementation is recommended for every type of problem: some classifiers perform better on certain types of data than others do. The types of classifiers that the library contains are:

  • Gaussian Naive Bayes
  • Multinomial Naive Bayes
  • Complement Naive Bayes
  • Bernoulli Naive Bayes
  • Categorical Naive Bayes

For today we are going to choose the Gaussian Naive Bayes. For further details on all the other types of classifiers, please read this.
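All of these classifiers live in the sklearn.naive_bayes module and share the same fit/predict interface, so switching between them is usually a one-line change. Here is a quick sketch of the imports (note that CategoricalNB requires a fairly recent scikit-learn, 0.22 or newer):

    from sklearn.naive_bayes import (GaussianNB, MultinomialNB, ComplementNB,
                                     BernoulliNB, CategoricalNB)

    # All of them expose the same interface:
    # model = MultinomialNB(); model.fit(X, y); model.predict(X_new)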

Moving on, we need to construct our dataset. For speed and simplicity, and because our dataset is quite small, I've created 4 simple functions that hardcode the data:

def getWeather():
    return ['Clear', 'Clear', 'Clear', 'Clear', 'Clear', 'Clear',
            'Rainy', 'Rainy', 'Rainy', 'Rainy', 'Rainy', 'Rainy',
            'Snowy', 'Snowy', 'Snowy', 'Snowy', 'Snowy', 'Snowy']

def getTimeOfWeek():
    return ['Workday', 'Workday', 'Workday',
            'Weekend', 'Weekend', 'Weekend',
            'Workday', 'Workday', 'Workday',
            'Weekend', 'Weekend', 'Weekend',
            'Workday', 'Workday', 'Workday',
            'Weekend', 'Weekend', 'Weekend']

def getTimeOfDay():
    return ['Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            ]

def getTrafficJam():
    return ['Yes', 'No', 'Yes',
            'No', 'No', 'No',
            'Yes', 'Yes', 'Yes',
            'No', 'No', 'No',
            'Yes', 'Yes', 'Yes',
            'Yes', 'No', 'Yes'
            ]
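Since the four lists must line up row by row (index i across all of them forms one row of the table above), a quick optional sanity check can save some debugging later:

    # All four lists should have the same length, one entry per row
    assert len(getWeather()) == len(getTimeOfWeek()) == len(getTimeOfDay()) == len(getTrafficJam())

    # Print the first row of the dataset: ('Clear', 'Workday', 'Morning', 'Yes')
    print(list(zip(getWeather(), getTimeOfWeek(), getTimeOfDay(), getTrafficJam()))[0])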

Now let's try gathering this data and building our model. But first, we need to do a little bit of preprocessing. Computers are generally bad at understanding text but very good with numbers, so we need to transform our text data into numbers that our model can work with.

Label Encoder

This is an encoder provided by scikit-learn that transforms categorical data from text to numbers. If a column has n possible values, the LabelEncoder will transform them into numbers from 0 to n-1, so that each textual value gets a numeric representation. For example, our weather values will be encoded like this.

  weather = ['Clear', 'Clear', 'Clear', 'Clear', 'Clear', 'Clear',
            'Rainy', 'Rainy', 'Rainy', 'Rainy', 'Rainy', 'Rainy',
            'Snowy', 'Snowy', 'Snowy', 'Snowy', 'Snowy', 'Snowy']
  labelEncoder = preprocessing.LabelEncoder()
  print(labelEncoder.fit_transform(weather))

  # Prints [0 0 0 0 0 0 1 1 1 1 1 1 2 2 2 2 2 2]

Let's also encode our traffic jam values.

    trafficJam = ['Yes', 'No', 'Yes',
            'No', 'No', 'No',
            'Yes', 'Yes', 'Yes',
            'No', 'No', 'No',
            'Yes', 'Yes', 'Yes',
            'Yes', 'No', 'Yes'
            ]
    print(labelEncoder.fit_transform(trafficJam))

    # Prints [1 0 1 0 0 0 1 1 1 0 0 0 1 1 1 1 0 1]
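The encoder also remembers the mapping it learned from its last fit, which is handy for decoding values later: classes_ holds the textual labels in encoded order, and inverse_transform turns numbers back into the original strings.

    print(labelEncoder.classes_)
    # Prints ['No' 'Yes'], so 0 means "No" and 1 means "Yes"

    print(labelEncoder.inverse_transform([1, 0]))
    # Prints ['Yes' 'No']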

Training the model

We need to encode all 4 columns (the 3 features and the label) and then we can train the model.

    # Get the data
    weather = getWeather()
    timeOfWeek = getTimeOfWeek()
    timeOfDay = getTimeOfDay()
    trafficJam = getTrafficJam()

    labelEncoder = preprocessing.LabelEncoder()

    # Encode the features and the labels
    encodedWeather = labelEncoder.fit_transform(weather)
    encodedTimeOfWeek = labelEncoder.fit_transform(timeOfWeek)
    encodedTimeOfDay = labelEncoder.fit_transform(timeOfDay)
    encodedTrafficJam = labelEncoder.fit_transform(trafficJam)

    # Build the features
    features = []
    for i in range(len(encodedWeather)):
        features.append([encodedWeather[i], encodedTimeOfWeek[i], encodedTimeOfDay[i]])

    model = GaussianNB()

    # Train the model
    model.fit(features, encodedTrafficJam)
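One caveat about the code above: because we refit the same labelEncoder for every column, after these four calls it only remembers the last mapping (the traffic jam labels). That happens to be exactly what we need to decode predictions below, but if you also wanted to decode the features later, a variant of the encoding step (a sketch, not the original code) would keep a dedicated encoder per column:

    # Sketch: one encoder per column, so every mapping stays available
    weatherEncoder = preprocessing.LabelEncoder()
    timeOfWeekEncoder = preprocessing.LabelEncoder()
    timeOfDayEncoder = preprocessing.LabelEncoder()
    trafficJamEncoder = preprocessing.LabelEncoder()

    encodedWeather = weatherEncoder.fit_transform(getWeather())
    encodedTimeOfWeek = timeOfWeekEncoder.fit_transform(getTimeOfWeek())
    encodedTimeOfDay = timeOfDayEncoder.fit_transform(getTimeOfDay())
    encodedTrafficJam = trafficJamEncoder.fit_transform(getTrafficJam())

    # Now we can always go back, e.g. weatherEncoder.inverse_transform([2]) -> ['Snowy']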

Using our model for predictions

Now we can use this model to make predictions about the traffic jam.

# ["Snowy", "Workday", "Morning"]
print(model.predict([[2, 1, 2]]))
# Prints [1], meaning "Yes"
# ["Clear", "Weekend", "Lunch"]
print(model.predict([[0, 0, 1]]))
# Prints [0], meaning "No"
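If a plain Yes/No answer is not enough, the model can also tell us how confident it is. GaussianNB exposes predict_proba, which returns the probability of each class (the exact numbers will depend on the fitted model):

    # Columns are ordered like model.classes_, here [0 1], i.e. "No" then "Yes"
    print(model.classes_)
    print(model.predict_proba([[2, 1, 2]]))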

That's great! It looks like we managed to build a simple classifier using very little data. For a more realistic model we would need more features and more entries, but for learning purposes, I think we did a really good job.

Don't forget, if you want to dig into the math and principles behind this classifier, you can always check my other article in this mini-series, Naive Bayes Classifier Explained.

Thank you so much for reading this! Interested in more stories like this? Follow me on Twitter at @b_dmarius and I'll post there every new article.
