I will be considering the following supervised learning classification algorithms: logistic regression, support vector machine (SVM), k-nearest neighbors (KNN), naive Bayes, decision tree, random forest, and extremely randomized trees (also called ExtraTrees). I will be implementing them in Python and R.
Using Python
I will be using the scikit-learn package, which is easy to learn and friendly to developers who do not specialize in ML.
First, I import the dataset I want to use - the iris dataset.
from sklearn.datasets import load_iris
Note that if you want to use a dataset that is not built in, then you write
import pandas as pd
data = pd.read_csv('filename.csv', sep='symbol')
sep is usually a comma, ',', but it can be a different delimiter, so it is worth taking a moment to investigate the data.
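For example, two quick ways to investigate a freshly loaded DataFrame (a sketch reusing the data variable from the read_csv call above):
print(data.head())   # first five rows
print(data.info())   # column names, dtypes and non-null counts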
Next, I import the classes that contain the models I want to use.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
Finally, I import the function train_test_split, used to split a dataset into training and testing sets, and the function cross_val_score, used to compute a cross-validated model accuracy.
from sklearn.model_selection import train_test_split, cross_val_score
Now, I need to obtain the data.
data = load_iris()
# You can check the structure of the data
print(data)
Then I separate the data into the features (x) and the target (y).
x = data.data
y = data.target
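As an optional sanity check, you can confirm the shapes of the feature matrix and target vector:
# iris has 150 samples and 4 features
print(x.shape, y.shape)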
Next, I split the data into the testing and training sets for x and y. I am taking 80% of the original dataset to be the training data and the remaining 20% to be the test data.
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8)
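If you want the split to be reproducible and to preserve the class proportions, you can optionally pass random_state and stratify (the values here are just illustrative):
# stratify=y keeps the class balance in both splits; random_state fixes the shuffle
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, stratify=y, random_state=42)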
Now I can fit the models. Note that predicted_y is a one-dimensional NumPy array of the predicted values given x_test. Note also that test_accuracy is the accuracy of the model's prediction on the test data, while cross_validated_accuracy is the accuracy of the model's prediction on a number of subsets of the training data (I use 5 subsets of the training data by setting cv=5 in cross_val_score). For brevity I show cross_val_score only for the logistic regression model, but the same call works for any of the models below.
1. Logistic Regression
# logistic
lr_model = LogisticRegression(max_iter=500)
lr_model.fit(x_train, y_train)
predicted_y = lr_model.predict(x_test)
print(predicted_y)
cross_validated_accuracy = cross_val_score(lr_model, X=x_train, y=y_train, cv=5)
print(cross_validated_accuracy)
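# cross_val_score returns an array of 5 fold scores; their mean is a handy single summary
print(cross_validated_accuracy.mean())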
test_accuracy = lr_model.score(x_test, y_test)
print(round(test_accuracy, 4))
2. SVM
svm_model = SVC()
svm_model.fit(x_train, y_train)
predicted_y = svm_model.predict(x_test)
print(predicted_y)
test_accuracy = svm_model.score(x_test, y_test)
print(round(test_accuracy, 4))
3. KNN
knn_model = KNeighborsClassifier()
knn_model.fit(x_train, y_train)
predicted_y = knn_model.predict(x_test)
print(predicted_y)
test_accuracy = knn_model.score(x_test, y_test)
print(round(test_accuracy, 4))
4. Naive Bayes
Note that MultinomialNB works on iris because all the features are non-negative, but for purely continuous features GaussianNB (also in sklearn.naive_bayes) is usually the more natural choice.
nb_model = MultinomialNB()
nb_model.fit(x_train, y_train)
predicted_y = nb_model.predict(x_test)
print(predicted_y)
test_accuracy = nb_model.score(x_test, y_test)
print(round(test_accuracy, 4))
5. Decision Tree
dt_model = DecisionTreeClassifier()
dt_model.fit(x_train, y_train)
predicted_y = dt_model.predict(x_test)
print(predicted_y)
test_accuracy = dt_model.score(x_test, y_test)
print(round(test_accuracy, 4))
6. Random Forest
rf_model = RandomForestClassifier()
rf_model.fit(x_train, y_train)
predicted_y = rf_model.predict(x_test)
print(predicted_y)
test_accuracy = rf_model.score(x_test, y_test)
print(round(test_accuracy, 4))
7. Extremely Randomized Trees
et_model = ExtraTreesClassifier()
et_model.fit(x_train, y_train)
predicted_y = et_model.predict(x_test)
print(predicted_y)
test_accuracy = et_model.score(x_test, y_test)
print(round(test_accuracy, 4))
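All seven models follow the same fit/predict/score pattern, so the whole Python section can be condensed into one loop if you prefer (a sketch using only the classes already imported above):
# Fit, cross-validate and test each classifier in turn
models = {
    'Logistic Regression': LogisticRegression(max_iter=500),
    'SVM': SVC(),
    'KNN': KNeighborsClassifier(),
    'Naive Bayes': MultinomialNB(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'Extra Trees': ExtraTreesClassifier(),
}
for name, model in models.items():
    model.fit(x_train, y_train)
    cv_accuracy = cross_val_score(model, X=x_train, y=y_train, cv=5).mean()
    test_accuracy = model.score(x_test, y_test)
    print(f'{name}: cross-validated accuracy = {cv_accuracy:.4f}, test accuracy = {test_accuracy:.4f}')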
Using R
The caret package will be used for this purpose. Note that the cross-validation accuracy for each model is displayed when the model is printed.
First, I load the datasets package, which contains the iris dataset
library(datasets)
Metrics is a package containing functions that compute various metrics, among them a function called accuracy that returns the accuracy of a model's predictions.
library(Metrics)
Note that if you want to use a dataset that is not built in, then you write
data = read.csv('filename.csv', sep ='symbol')
sep is usually a comma, ',', but it can be a different delimiter, so it is worth taking a moment to investigate the data.
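For example, two quick ways to investigate a freshly loaded data frame (a sketch reusing the data variable from the read.csv call above):
str(data)    # column names and types
head(data)   # first six rows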
Next, I load the caret package, which uses a number of other packages behind the scenes (you may not necessarily know them) to do the heavy lifting in machine learning.
library(caret)
Now, I need to obtain the data.
data = iris
# You can check the structure of the data
str(data)
Using the function createDataPartition, I am taking 80% of the original dataset to be the training data and the remaining 20% to be the test data.
train_index = createDataPartition(y=data$Species, p=0.8, list=FALSE)
Then, I separate the data into training and testing sets.
train_data = data[train_index,]
test_data = data[-train_index,]
Then, I make sure the target variable is a factor. If the data were read from a CSV it could be a plain character vector, and converting it to a factor turns it into the discrete categorical variable that classification models expect.
train_data$Species = factor(train_data$Species)
test_data$Species = factor(test_data$Species)
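As an optional check, you can confirm the resulting class levels:
# Should list the three iris species
print(levels(train_data$Species))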
Next, I set the computational nuances of the train function, which I will be using shortly.
# trainControl here requests 5-fold cross-validation, hence number = 5
control = trainControl(method = "cv", number=5)
Next, I fit the models. Note that in the train function, preProcess = c("center", "scale") standardizes the data: for each variable, center subtracts the mean from every data point, and scale divides every data point by the standard deviation. Note also that tuneLength is an integer controlling the granularity of the tuning parameter grid; caret tries that many candidate values for each tuning parameter.
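To make the center/scale step concrete, here is what standardization does to a single variable by hand (a sketch outside the caret pipeline, which applies this automatically):
# Standardize Sepal.Length manually: subtract the mean, divide by the standard deviation
x = iris$Sepal.Length
x_standardized = (x - mean(x)) / sd(x)
summary(x_standardized)   # the mean is now (numerically) zero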
Note that predicted_y is a vector of the predicted values given the test data. Note also that test_accuracy is the accuracy of the model's prediction on the test data, while the cross-validated accuracy is based on a number of subsets of the training data.
1. Logistic Regression
logistic_model = train(Species ~.,
data = train_data,
method = "multinom",
trControl=control,
preProcess = c("center", "scale"),
tuneLength = 10)
print(logistic_model)
predicted_y = predict(logistic_model, test_data)
test_accuracy = accuracy(test_data$Species, predicted_y)
print(sprintf('Test accuracy = %f', test_accuracy))
2. Support Vector Machine
svm_model = train(Species ~.,
data = train_data,
method = "svmLinear",
trControl=control,
preProcess = c("center", "scale"),
tuneLength = 10)
print(svm_model)
predicted_y = predict(svm_model, test_data)
test_accuracy = accuracy(test_data$Species, predicted_y)
print(sprintf('Test accuracy = %f', test_accuracy))
3. KNN
knn_model = train(Species ~.,
data = train_data,
method = "knn",
trControl=control,
preProcess = c("center", "scale"),
tuneLength = 10)
print(knn_model)
predicted_y = predict(knn_model, test_data)
test_accuracy = accuracy(test_data$Species, predicted_y)
print(sprintf('Test accuracy = %f', test_accuracy))
4. Naive Bayes
nb_model = train(Species ~.,
data = train_data,
method = "nb",
trControl=control,
preProcess = c("center", "scale"),
tuneLength = 10)
print(nb_model)
predicted_y = predict(nb_model, test_data)
test_accuracy = accuracy(test_data$Species, predicted_y)
print(sprintf('Test accuracy = %f', test_accuracy))
5. Decision Tree
decision_tree_model = train(Species ~.,
data=train_data,
method = "rpart",
trControl=control,
preProcess = c("center", "scale"),
tuneLength = 10)
print(decision_tree_model)
predicted_y = predict(decision_tree_model, test_data)
test_accuracy = accuracy(test_data$Species, predicted_y)
print(sprintf('Test accuracy = %f', test_accuracy))
6. Random Forest
random_forest_model = train(Species ~.,
data=train_data,
method = "rf",
trControl=control,
preProcess = c("center", "scale"),
tuneLength = 10)
print(random_forest_model)
predicted_y = predict(random_forest_model, test_data)
print(predicted_y)
test_accuracy = accuracy(test_data$Species, predicted_y)
print(sprintf('Test accuracy = %f', test_accuracy))
7. Extremely Randomized Trees
Note that caret's "ranger" method fits forests via the ranger package; its tuning grid includes splitrule = "extratrees", which is how the extremely randomized trees variant gets selected during tuning.
et_model = train(Species ~.,
data=train_data,
method = "ranger",
trControl=control,
preProcess = c("center", "scale"),
tuneLength = 10)
print(et_model)
predicted_y = predict(et_model, test_data)
print(predicted_y)
test_accuracy = accuracy(test_data$Species, predicted_y)
print(sprintf('Test accuracy = %f', test_accuracy))
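As in Python, every model here follows one repeating train/predict/accuracy pattern, so the R section can also be condensed into a loop over caret method names (a sketch using only what is already loaded above):
# One train/predict/accuracy pass per caret method name
for (m in c("multinom", "svmLinear", "knn", "nb", "rpart", "rf", "ranger")) {
  model = train(Species ~ ., data = train_data, method = m,
                trControl = control, preProcess = c("center", "scale"),
                tuneLength = 10)
  predicted_y = predict(model, test_data)
  print(sprintf('%s test accuracy = %f', m, accuracy(test_data$Species, predicted_y)))
}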
And that is it. I think this code is easy to write once you notice the repetitive patterns in both the Python and the R versions; each language has its own recurring pattern. Thanks for reading.
New to R? See: R, beyond Statistical Programming
New to Machine Learning? See: Introduction to Machine Learning
Not sure which program to use for your ML? See: Best Programming Language for ML