I will be considering the following supervised learning classification algorithms: logistic regression, support vector machine (SVM), k-nearest neighbors (KNN), naive Bayes, decision tree, random forest, and extremely randomized trees (also called ExtraTrees). I will be implementing them in Python and R.
Using Python
I will be using the scikit-learn package, which is easy to learn and friendly to developers who do not specialize in ML.
First, I import the dataset I want to use - the iris dataset.
from sklearn.datasets import load_iris
Note that if you want to use a dataset that is not built in, then you write
import pandas as pd
data = pd.read_csv('filename.csv', sep='symbol')
sep is usually a comma, ',', but it can be a different delimiter, so it is worth taking a moment to investigate the data.
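For example, two quick ways to investigate a freshly loaded DataFrame (a sketch reusing the data variable from the read_csv call above):
print(data.head())   # first five rows
print(data.info())   # column names, dtypes and non-null counts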
Next, I import the classes that contain the models I want to use.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
Finally, I import the function train_test_split, used to split a dataset into training and testing sets, and the function cross_val_score, used to compute a cross-validated model accuracy.
from sklearn.model_selection import train_test_split, cross_val_score
Now, I need to obtain the data.
data = load_iris()
# You can check the structure of the data
print(data)
Then I separate the data into the features (x) and the target (y).
x = data.data
y = data.target
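As an optional sanity check, you can confirm the shapes of the feature matrix and target vector:
# iris has 150 samples and 4 features
print(x.shape, y.shape)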
Next, I split the data into the testing and training sets for x and y. I am taking 80% of the original dataset to be the training data and the remaining 20% to be the test data.
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8)
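If you want the split to be reproducible and to preserve the class proportions, you can optionally pass random_state and stratify (the values here are just illustrative):
# stratify=y keeps the class balance in both splits; random_state fixes the shuffle
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, stratify=y, random_state=42)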
Now I can fit the models. Note that predicted_y is a one-dimensional NumPy array of the predicted values given x_test. Note also that test_accuracy is the accuracy of the model's prediction on the test data, while cross_validated_accuracy is the accuracy of the model's prediction on a number of subsets of the training data (I use 5 subsets of the training data by setting cv=5 in cross_val_score). For brevity I show cross_val_score only for the logistic regression model, but the same call works for any of the models below.
1. Logistic Regression
# logistic
lr_model = LogisticRegression(max_iter=500)
lr_model.fit(x_train, y_train)
predicted_y = lr_model.predict(x_test)
print(predicted_y)
cross_validated_accuracy = cross_val_score(lr_model, X=x_train, y=y_train, cv=5)
print(cross_validated_accuracy)
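# cross_val_score returns an array of 5 fold scores; their mean is a handy single summary
print(cross_validated_accuracy.mean())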
test_accuracy = lr_model.score(x_test, y_test)
print(round(test_accuracy, 4))
2. SVM
svm_model = SVC()
svm_model.fit(x_train, y_train)
predicted_y = svm_model.predict(x_test)
print(predicted_y)
test_accuracy = svm_model.score(x_test, y_test)
print(round(test_accuracy, 4))
3. KNN
knn_model = KNeighborsClassifier()
knn_model.fit(x_train, y_train)
predicted_y = knn_model.predict(x_test)
print(predicted_y)
test_accuracy = knn_model.score(x_test, y_test)
print(round(test_accuracy, 4))
4. Naive Bayes
Note that MultinomialNB works on iris because all the features are non-negative, but for purely continuous features GaussianNB (also in sklearn.naive_bayes) is usually the more natural choice.
nb_model = MultinomialNB()
nb_model.fit(x_train, y_train)
predicted_y = nb_model.predict(x_test)
print(predicted_y)
test_accuracy = nb_model.score(x_test, y_test)
print(round(test_accuracy, 4))
5. Decision Tree
dt_model = DecisionTreeClassifier()
dt_model.fit(x_train, y_train)
predicted_y = dt_model.predict(x_test)
print(predicted_y)
test_accuracy = dt_model.score(x_test, y_test)
print(round(test_accuracy, 4))
6. Random Forest
rf_model = RandomForestClassifier()
rf_model.fit(x_train, y_train)
predicted_y = rf_model.predict(x_test)
print(predicted_y)
test_accuracy = rf_model.score(x_test, y_test)
print(round(test_accuracy, 4))
7. Extremely Randomized Trees
et_model = ExtraTreesClassifier()
et_model.fit(x_train, y_train)
predicted_y = et_model.predict(x_test)
print(predicted_y)
test_accuracy = et_model.score(x_test, y_test)
print(round(test_accuracy, 4))
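All seven models follow the same fit/predict/score pattern, so the whole Python section can be condensed into one loop if you prefer (a sketch using only the classes already imported above):
# Fit, cross-validate and test each classifier in turn
models = {
    'Logistic Regression': LogisticRegression(max_iter=500),
    'SVM': SVC(),
    'KNN': KNeighborsClassifier(),
    'Naive Bayes': MultinomialNB(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'Extra Trees': ExtraTreesClassifier(),
}
for name, model in models.items():
    model.fit(x_train, y_train)
    cv_accuracy = cross_val_score(model, X=x_train, y=y_train, cv=5).mean()
    test_accuracy = model.score(x_test, y_test)
    print(f'{name}: cross-validated accuracy = {cv_accuracy:.4f}, test accuracy = {test_accuracy:.4f}')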
Using R
The caret package will be used for this purpose. Note that the cross-validation accuracy for each model is displayed when the model is printed.
First, I load the datasets package, which contains the iris dataset
library(datasets)
Metrics is a package containing functions that compute various metrics, among them a function called accuracy that returns the accuracy of a model's predictions.
library(Metrics)
Note that if you want to use a dataset that is not built in, then you write
data = read.csv('filename.csv', sep ='symbol')
sep is usually a comma, ',', but it can be a different delimiter, so it is worth taking a moment to investigate the data.
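For example, two quick ways to investigate a freshly loaded data frame (a sketch reusing the data variable from the read.csv call above):
str(data)    # column names and types
head(data)   # first six rows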
Next, I load the caret package, which uses a number of other packages behind the scenes (you may not necessarily know them) to do the heavy lifting in machine learning.
library(caret)
Now, I need to obtain the data.
data = iris
# You can check the structure of the data
str(data)
Using the function createDataPartition, I am taking 80% of the original dataset to be the training data and the remaining 20% to be the test data.
train_index = createDataPartition(y=data$Species, p=0.8, list=FALSE)
Then, I separate the data into training and testing sets.
train_data = data[train_index,]
test_data = data[-train_index,]
Then, I make sure the target variable is a factor. If the data were read from a CSV it could be a plain character vector, and converting it to a factor turns it into the discrete categorical variable that classification models expect.
train_data$Species = factor(train_data$Species)
test_data$Species = factor(test_data$Species)
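As an optional check, you can confirm the resulting class levels:
# Should list the three iris species
print(levels(train_data$Species))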
Next, I set the computational nuances of the train function, which I will be using shortly.
# trainControl here requests 5-fold cross-validation, hence number = 5
control = trainControl(method = "cv", number=5)
Next, I fit the models. Note that in the train function, preProcess = c("center", "scale") standardizes the data: for each variable, center subtracts the mean from every data point, and scale divides every data point by the standard deviation. Note also that tuneLength is an integer controlling the granularity of the tuning parameter grid; caret tries that many candidate values for each tuning parameter.
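To make the center/scale step concrete, here is what standardization does to a single variable by hand (a sketch outside the caret pipeline, which applies this automatically):
# Standardize Sepal.Length manually: subtract the mean, divide by the standard deviation
x = iris$Sepal.Length
x_standardized = (x - mean(x)) / sd(x)
summary(x_standardized)   # the mean is now (numerically) zero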
Note that predicted_y is a vector of the predicted values given the test data. Note also that test_accuracy is the accuracy of the model's prediction on the test data, while the cross-validated accuracy is based on a number of subsets of the training data.
1. Logistic Regression
logistic_model = train(Species ~.,
data = train_data,
method = "multinom",
trControl=control,
preProcess = c("center", "scale"),
tuneLength = 10)
print(logistic_model)
predicted_y = predict(logistic_model, test_data)
test_accuracy = accuracy(test_data$Species, predicted_y)
print(sprintf('Test accuracy = %f', test_accuracy))
2. Support Vector Machine
svm_model = train(Species ~.,
data = train_data,
method = "svmLinear",
trControl=control,
preProcess = c("center", "scale"),
tuneLength = 10)
print(svm_model)
predicted_y = predict(svm_model, test_data)
test_accuracy = accuracy(test_data$Species, predicted_y)
print(sprintf('Test accuracy = %f', test_accuracy))
3. KNN
knn_model = train(Species ~.,
data = train_data,
method = "knn",
trControl=control,
preProcess = c("center", "scale"),
tuneLength = 10)
print(knn_model)
predicted_y = predict(knn_model, test_data)
test_accuracy = accuracy(test_data$Species, predicted_y)
print(sprintf('Test accuracy = %f', test_accuracy))
4. Naive Bayes
nb_model = train(Species ~.,
data = train_data,
method = "nb",
trControl=control,
preProcess = c("center", "scale"),
tuneLength = 10)
print(nb_model)
predicted_y = predict(nb_model, test_data)
test_accuracy = accuracy(test_data$Species, predicted_y)
print(sprintf('Test accuracy = %f', test_accuracy))
5. Decision Tree
decision_tree_model = train(Species ~.,
data=train_data,
method = "rpart",
trControl=control,
preProcess = c("center", "scale"),
tuneLength = 10)
print(decision_tree_model)
predicted_y = predict(decision_tree_model, test_data)
test_accuracy = accuracy(test_data$Species, predicted_y)
print(sprintf('Test accuracy = %f', test_accuracy))
6. Random Forest
random_forest_model = train(Species ~.,
data=train_data,
method = "rf",
trControl=control,
preProcess = c("center", "scale"),
tuneLength = 10)
print(random_forest_model)
predicted_y = predict(random_forest_model, test_data)
print(predicted_y)
test_accuracy = accuracy(test_data$Species, predicted_y)
print(sprintf('Test accuracy = %f', test_accuracy))
7. Extremely Randomized Trees
Note that caret's "ranger" method fits forests via the ranger package; its tuning grid includes splitrule = "extratrees", which is how the extremely randomized trees variant gets selected during tuning.
et_model = train(Species ~.,
data=train_data,
method = "ranger",
trControl=control,
preProcess = c("center", "scale"),
tuneLength = 10)
print(et_model)
predicted_y = predict(et_model, test_data)
print(predicted_y)
test_accuracy = accuracy(test_data$Species, predicted_y)
print(sprintf('Test accuracy = %f', test_accuracy))
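As in Python, every model here follows one repeating train/predict/accuracy pattern, so the R section can also be condensed into a loop over caret method names (a sketch using only what is already loaded above):
# One train/predict/accuracy pass per caret method name
for (m in c("multinom", "svmLinear", "knn", "nb", "rpart", "rf", "ranger")) {
  model = train(Species ~ ., data = train_data, method = m,
                trControl = control, preProcess = c("center", "scale"),
                tuneLength = 10)
  predicted_y = predict(model, test_data)
  print(sprintf('%s test accuracy = %f', m, accuracy(test_data$Species, predicted_y)))
}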
And that is it. I think this code is easy to write once you notice the repetitive patterns in both the Python and the R versions; each language has its own recurring pattern. Thanks for reading.
New to R? See: R, beyond Statistical Programming
New to Machine Learning? See: Introduction to Machine Learning
Not sure which program to use for your ML? See: Best Programming Language for ML