Oluwafemi Paul Adeyemi
SL Classification using Python and R

I will be considering the following supervised learning classification algorithms: logistic regression, support vector machine (SVM), k-nearest neighbors (KNN), naive Bayes, decision tree, random forest and extremely randomized trees (also called ExtraTrees). I will be implementing them in both Python and R.

Using Python

I will be using the Scikit-Learn package, which is easy to learn and friendly to developers who do not specialize in ML.

First, I import the dataset I want to use - the iris dataset.

from sklearn.datasets import load_iris

Note that if you want to use a dataset that is not built in, then you write

import pandas as pd
data = pd.read_csv('filename.csv', sep='symbol')

sep is usually a comma, ',', but it can also be a different symbol, so it is worth inspecting the file to confirm.
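For example, if the file were semicolon-separated, the call might look like this (filename.csv is a placeholder, and head() just shows the first few rows so you can check that the columns parsed correctly):

import pandas as pd

# hypothetical semicolon-separated file; adjust the name and sep to your data
data = pd.read_csv('filename.csv', sep=';')
print(data.head())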

Next, I import the classes that contain the models I want to use.

from sklearn.linear_model import LogisticRegression 
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

Finally, I import the function train_test_split, used to split a dataset into training and testing sets, and the function cross_val_score, used to compute a cross-validated model accuracy.

from sklearn.model_selection import train_test_split, cross_val_score

Now, I need to obtain the data.

data = load_iris()

# You can check the structure of the data
print(data)

Then I separate the data into the features (x) and the target (y).

x = data.data
y = data.target

Next, I split x and y into training and testing sets, taking 80% of the original dataset as the training data and the remaining 20% as the test data.

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8)
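As an aside, train_test_split also accepts random_state, which makes the split reproducible, and stratify, which keeps the class proportions the same in both sets; a variant of the call above:

x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.8, random_state=42, stratify=y)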

Now I can fit the models. Note that predicted_y is a one-dimensional NumPy array of the predicted values given x_test. Note also that test_accuracy is the accuracy of the model's predictions on the test data, while cross_validated_accuracy contains the accuracy scores computed on subsets of the training data (I use 5 folds by setting cv=5 in the cross_val_score call).

1. Logistic Regression

# logistic regression; max_iter is raised from the default of 100 so the solver converges
lr_model = LogisticRegression(max_iter=500)
lr_model.fit(x_train, y_train)

predicted_y = lr_model.predict(x_test) 
print(predicted_y)

cross_validated_accuracy = cross_val_score(lr_model, X=x_train, y=y_train, cv=5)
print(cross_validated_accuracy)

test_accuracy = lr_model.score(x_test, y_test)
print(round(test_accuracy, 4))
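Since cross_val_score returns one accuracy per fold (five values here), it is often summarized by its mean and spread:

# average and standard deviation of the 5 fold accuracies
print(round(cross_validated_accuracy.mean(), 4))
print(round(cross_validated_accuracy.std(), 4))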

2. SVM

svm_model = SVC()
svm_model.fit(x_train, y_train)

predicted_y = svm_model.predict(x_test)
print(predicted_y)

test_accuracy = svm_model.score(x_test, y_test)
print(round(test_accuracy, 4))

3. KNN

knn_model = KNeighborsClassifier()
knn_model.fit(x_train, y_train)

predicted_y = knn_model.predict(x_test)
print(predicted_y)

test_accuracy = knn_model.score(x_test, y_test)
print(round(test_accuracy, 4))

4. Naive Bayes

nb_model = MultinomialNB()
nb_model.fit(x_train, y_train)

predicted_y = nb_model.predict(x_test)
print(predicted_y)

test_accuracy = nb_model.score(x_test, y_test)
print(round(test_accuracy, 4))
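A side note: MultinomialNB is designed for non-negative, count-like features; it runs on iris only because all the measurements happen to be non-negative. For continuous features, GaussianNB is usually the more natural choice; a minimal sketch:

from sklearn.naive_bayes import GaussianNB

gnb_model = GaussianNB()
gnb_model.fit(x_train, y_train)
print(round(gnb_model.score(x_test, y_test), 4))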

5. Decision Tree

dt_model = DecisionTreeClassifier()
dt_model.fit(x_train, y_train)

predicted_y = dt_model.predict(x_test)
print(predicted_y)

test_accuracy = dt_model.score(x_test, y_test)
print(round(test_accuracy, 4))

6. Random Forest

rf_model = RandomForestClassifier()
rf_model.fit(x_train, y_train)

predicted_y = rf_model.predict(x_test)
print(predicted_y)

test_accuracy = rf_model.score(x_test, y_test)
print(round(test_accuracy, 4))

7. Extremely Randomized Trees

et_model = ExtraTreesClassifier()
et_model.fit(x_train, y_train)

predicted_y = et_model.predict(x_test)
print(predicted_y)

test_accuracy = et_model.score(x_test, y_test)
print(round(test_accuracy, 4))
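Since every model follows the same fit, predict and score pattern, the whole comparison can be collapsed into a loop. This sketch reuses the imports and the train/test split from above and adds nothing new:

models = {
    'logistic regression': LogisticRegression(max_iter=500),
    'svm': SVC(),
    'knn': KNeighborsClassifier(),
    'naive bayes': MultinomialNB(),
    'decision tree': DecisionTreeClassifier(),
    'random forest': RandomForestClassifier(),
    'extra trees': ExtraTreesClassifier(),
}

for name, model in models.items():
    model.fit(x_train, y_train)
    print(name, round(model.score(x_test, y_test), 4))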

Using R

The caret package will be used for this purpose. Note that the cross-validation accuracy for each model is displayed when the model is printed.

First, I import the datasets package, which contains the iris dataset.

library(datasets)

Metrics is a package containing functions that compute various metrics, among which is a function called accuracy that returns the accuracy of a model's predictions.

library(Metrics)
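To see what accuracy does, here is a tiny made-up example; it simply returns the fraction of positions where the two vectors agree:

# accuracy(actual, predicted) is the proportion of matching elements
accuracy(c('a', 'b', 'b'), c('a', 'b', 'a'))   # 2 of 3 match, so 0.6667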


Note that if you want to use a dataset that is not built in, then you write

data = read.csv('filename.csv', sep ='symbol')

sep is usually a comma, ',', but it can also be a different symbol, so it is worth inspecting the file to confirm.

Next, I import the caret package, which builds on a number of other packages (that you may not necessarily know) to carry out machine learning tasks.

library(caret)

Now, I need to obtain the data.

data = iris

# You can check the structure of the data
print(data)

Using the function createDataPartition, I take 80% of the original dataset as the training data and the remaining 20% as the test data. createDataPartition samples within each level of Species, so the class proportions are preserved in both sets.

train_index = createDataPartition(y=data$Species, p=0.8, list=FALSE)

Then, I separate the data into training and testing sets.

train_data = data[train_index,]
test_data = data[-train_index,]

Then, I convert the target variable to a factor so that it is treated as a categorical variable. In the built-in iris data, Species is already a factor, but this step matters when the target arrives as a character vector, for example from a CSV file.

train_data$Species = factor(train_data$Species)
test_data$Species = factor(test_data$Species)
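You can confirm that the conversion worked by inspecting the factor; for iris there should be three levels:

levels(train_data$Species)   # "setosa" "versicolor" "virginica"
str(train_data$Species)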

Next, I set the computational nuances of the train function, which I will be using shortly.

# trainControl sets up k-fold cross validation with 5 folds, hence number = 5
control = trainControl(method = "cv", number=5) 
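trainControl supports other resampling schemes as well; for example, repeated k-fold cross validation only needs a different method string (shown here as an alternative, not used in the rest of the post):

# 5-fold cross validation repeated 3 times
control_repeated = trainControl(method = "repeatedcv", number = 5, repeats = 3)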

Next, I fit the models. Note that in the train function, preProcess = c("center", "scale") standardizes the data: for each variable, "center" subtracts the variable's mean from every data point and "scale" then divides every data point by the variable's standard deviation. Note also that tuneLength is an integer giving the number of values to try for each tuning parameter, that is, the granularity of the tuning grid.
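To make the preprocessing concrete, here is what "center" and "scale" amount to for a single variable, done by hand on the first feature (train does this internally; this is only an illustration):

x1 = train_data$Sepal.Length
standardized_x1 = (x1 - mean(x1)) / sd(x1)   # subtract the mean, divide by the sd
# equivalently: scale(x1)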

Note that predicted_y is a vector of the predicted values given the test data. Note also that test_accuracy is the accuracy of the model's predictions on the test data, while the cross-validated accuracy shown by print is based on subsets (folds) of the training data.

1. Logistic Regression

Since Species has three classes, method = "multinom" fits a multinomial logistic regression (via the nnet package).


logistic_model = train(Species ~.,
                       data = train_data,
                       method = "multinom",
                       trControl=control,
                       preProcess = c("center", "scale"),
                       tuneLength = 10)

print(logistic_model)

predicted_y = predict(logistic_model, test_data)

test_accuracy = accuracy(test_data$Species, predicted_y)
print(sprintf('Test accuracy = %f', test_accuracy))

2. Support Vector Machine

svm_model = train(Species ~., 
                  data = train_data, 
                  method = "svmLinear",
                  trControl=control,
                  preProcess = c("center", "scale"), 
                  tuneLength = 10)

print(svm_model)

predicted_y = predict(svm_model, test_data)

test_accuracy = accuracy(test_data$Species, predicted_y)
print(sprintf('Test accuracy = %f', test_accuracy))

3. KNN

knn_model = train(Species ~.,
                  data = train_data, 
                  method = "knn", 
                  trControl=control, 
                  preProcess = c("center", "scale"), 
                  tuneLength = 10)

print(knn_model)

predicted_y = predict(knn_model, test_data)

test_accuracy = accuracy(test_data$Species, predicted_y)
print(sprintf('Test accuracy = %f', test_accuracy))

4. Naive Bayes

nb_model = train(Species ~.,
                 data = train_data, 
                 method = "nb", 
                 trControl=control, 
                 preProcess = c("center", "scale"),
                 tuneLength = 10)

print(nb_model)

predicted_y = predict(nb_model, test_data)

test_accuracy = accuracy(test_data$Species, predicted_y)
print(sprintf('Test accuracy = %f', test_accuracy))


5. Decision Tree

decision_tree_model = train(Species ~., 
                            data=train_data,  
                            method = "rpart",
                            trControl=control,  
                            preProcess = c("center", "scale"),
                            tuneLength = 10)

print(decision_tree_model)

predicted_y = predict(decision_tree_model, test_data)

test_accuracy = accuracy(test_data$Species, predicted_y)
print(sprintf('Test accuracy = %f', test_accuracy))

6. Random Forest

random_forest_model = train(Species ~., 
                            data=train_data,
                            method = "rf",
                            trControl=control,
                            preProcess = c("center", "scale"),
                            tuneLength = 10)

print(random_forest_model)

predicted_y = predict(random_forest_model, test_data)
print(predicted_y)

test_accuracy = accuracy(test_data$Species, predicted_y)
print(sprintf('Test accuracy = %f', test_accuracy))

7. Extremely Randomized Trees

Note: the caret method "ranger" fits forests with the ranger package, whose tuning grid includes splitrule = "extratrees"; with tuneLength = 10, the extremely randomized variant is among the candidates tried during tuning.

et_model = train(Species ~., 
                 data=train_data,
                 method = "ranger",
                 trControl=control,
                 preProcess = c("center", "scale"),
                 tuneLength = 10)

print(et_model)

predicted_y = predict(et_model, test_data)
print(predicted_y)

test_accuracy = accuracy(test_data$Species, predicted_y)
print(sprintf('Test accuracy = %f', test_accuracy))
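As in the Python half, the per-model code differs only in the method string, so the whole comparison can be written as a loop; a sketch reusing the objects defined above:

methods = c("multinom", "svmLinear", "knn", "nb", "rpart", "rf", "ranger")

for (m in methods) {
  model = train(Species ~ ., data = train_data, method = m,
                trControl = control,
                preProcess = c("center", "scale"), tuneLength = 10)
  predicted_y = predict(model, test_data)
  print(sprintf('%s test accuracy = %f', m, accuracy(test_data$Species, predicted_y)))
}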

And that is it. These scripts are easy to write once you notice the repetitive pattern in the code for both Python and R - each language has its own recurring template. Thanks for reading.
New to R? See: R, beyond Statistical Programming

New to Machine Learning? See:
Introduction to Machine Learning

Not sure which language to use for your ML? See: Best Programming Language for ML
