I will be considering the following supervised learning classification algorithms: logistic regression, support vector machine (SVM), k-nearest neighbors (KNN), naive Bayes, decision tree, random forest, and extremely randomized trees (also called Extra Trees). I will implement them in Python and R.

## Using Python

I will be using the scikit-learn package, which is easy to learn and approachable for developers who do not specialize in ML.

First, I import the dataset I want to use - the iris dataset.

```
from sklearn.datasets import load_iris
```

Note that if you want to use a dataset that is not built in, you can load it with pandas:

```
import pandas as pd
data = pd.read_csv('filename.csv', sep='symbol')
```

**sep** is usually a comma (**','**), but it can be a different symbol, such as a semicolon or a tab. It is worth inspecting the file first to see which separator it uses.
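If you are unsure which separator a file uses, Python's standard library can guess it for you. The sketch below is only an illustration: the sample text is made up, and the list of candidate delimiters is my own assumption about what is likely to appear.

```python
import csv
import io

# Made-up sample standing in for the first lines of an unknown file.
sample = (
    "sepal_length;sepal_width;species\n"
    "5.1;3.5;setosa\n"
    "4.9;3.0;setosa\n"
)

# Ask the sniffer to choose among a few likely separators (an assumed list).
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
print(repr(dialect.delimiter))

# Parse the sample with the detected separator.
rows = list(csv.reader(io.StringIO(sample), delimiter=dialect.delimiter))
print(rows[0])
```

The detected character can then be passed as **sep** to `pd.read_csv`.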

Next, I import the classes that contain the models I want to use.

```
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
```

Finally, I import the function **train_test_split**, used to split a dataset into training and testing sets, and the function **cross_val_score**, used to compute a cross-validated model accuracy.

```
from sklearn.model_selection import train_test_split, cross_val_score
```

Now, I need to obtain the data

```
data = load_iris()
# You can check the structure of the data
print(data)
```
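Printing the whole object dumps every array at once. As a lighter-weight sketch, the object returned by `load_iris` also exposes its parts as attributes, so you can look at just the pieces you need (the choice of which attributes to print here is mine):

```python
from sklearn.datasets import load_iris

data = load_iris()

# Number of rows and feature columns.
print(data.data.shape)
# Names of the feature columns and of the classes encoded in the target.
print(data.feature_names)
print(data.target_names)
# First few feature rows and their class labels.
print(data.data[:3])
print(data.target[:3])
```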

Then I separate the data into the features (x) and the target (y):

```
x = data.data
y = data.target
```

Next, I split the data into the testing and training sets for x and y. I am taking 80% of the original dataset to be the training data and the remaining 20% to be the test data.

```
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8)
```
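The split above is random, so the accuracies change from run to run. A possible variation, shown here as a sketch, fixes the shuffle with `random_state` and keeps the class proportions equal in both sets with `stratify`. Both arguments and the value 42 are my additions, not part of the original split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

x, y = load_iris(return_X_y=True)

# random_state makes the shuffle repeatable; stratify=y preserves class balance.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.8, random_state=42, stratify=y
)

print(x_train.shape, x_test.shape)
# Count how many samples of each class ended up in each set.
print(np.bincount(y_train), np.bincount(y_test))
```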

Now I can fit the models. Note that **predicted_y** is a one-dimensional NumPy array of the predicted values for x_test. Note also that **test_accuracy** is the accuracy of the model's predictions on the test data, while **cross_validated_accuracy** holds the accuracies obtained by repeatedly training on part of the training data and scoring on the held-out part (I use 5 folds by setting **cv** = 5 in **cross_val_score**). Only the logistic regression block computes it, but the same call works for the other models.
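To make the idea of 5 folds concrete, here is a plain-Python sketch of how a training set of 120 rows can be cut into folds, with each fold held out once. The helper name `k_fold_indices` is hypothetical, and the sketch is a simplification: it does not shuffle, and for classifiers scikit-learn uses stratified folds by default when **cv** is an integer.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        # Everything outside [start, stop) trains; the slice itself validates.
        yield indices[:start] + indices[stop:], indices[start:stop]
        start = stop


folds = list(k_fold_indices(n_samples=120, k=5))
for fold_number, (train_idx, val_idx) in enumerate(folds, start=1):
    print(fold_number, len(train_idx), len(val_idx))
```

Each of the five accuracies returned by `cross_val_score` comes from one such train/validation pair.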

### 1. Logistic Regression

```
# logistic
lr_model = LogisticRegression(max_iter=500)
lr_model.fit(x_train, y_train)
predicted_y = lr_model.predict(x_test)
print(predicted_y)
cross_validated_accuracy = cross_val_score(lr_model, X=x_train, y=y_train, cv=5)
print(cross_validated_accuracy)
test_accuracy = lr_model.score(x_test, y_test)
print(round(test_accuracy, 4))
```

### 2. SVM

```
svm_model = SVC()
svm_model.fit(x_train, y_train)
predicted_y = svm_model.predict(x_test)
print(predicted_y)
test_accuracy = svm_model.score(x_test, y_test)
print(round(test_accuracy, 4))
```

### 3. KNN

```
knn_model = KNeighborsClassifier()
knn_model.fit(x_train, y_train)
predicted_y = knn_model.predict(x_test)
print(predicted_y)
test_accuracy = knn_model.score(x_test, y_test)
print(round(test_accuracy, 4))
```

### 4. Naive Bayes

```
nb_model = MultinomialNB()
nb_model.fit(x_train, y_train)
predicted_y = nb_model.predict(x_test)
print(predicted_y)
test_accuracy = nb_model.score(x_test, y_test)
print(round(test_accuracy, 4))
```
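The iris features are continuous measurements, and scikit-learn also provides `GaussianNB`, which models each feature with a per-class normal distribution. As a side-by-side sketch, the two variants can be compared on the same split. The `random_state` value and `stratify` argument are my own choices, and the scores will depend on the split.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.8, random_state=0, stratify=y
)

# Fit both naive Bayes variants on the same data and record test accuracy.
nb_scores = {}
for name, model in [("MultinomialNB", MultinomialNB()), ("GaussianNB", GaussianNB())]:
    model.fit(x_train, y_train)
    nb_scores[name] = model.score(x_test, y_test)
    print(name, round(nb_scores[name], 4))
```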

### 5. Decision Tree

```
dt_model = DecisionTreeClassifier()
dt_model.fit(x_train, y_train)
predicted_y = dt_model.predict(x_test)
print(predicted_y)
test_accuracy = dt_model.score(x_test, y_test)
print(round(test_accuracy, 4))
```

### 6. Random Forest

```
rf_model = RandomForestClassifier()
rf_model.fit(x_train, y_train)
predicted_y = rf_model.predict(x_test)
print(predicted_y)
test_accuracy = rf_model.score(x_test, y_test)
print(round(test_accuracy, 4))
```

### 7. Extremely Randomized Trees

```
et_model = ExtraTreesClassifier()
et_model.fit(x_train, y_train)
predicted_y = et_model.predict(x_test)
print(predicted_y)
test_accuracy = et_model.score(x_test, y_test)
print(round(test_accuracy, 4))
```
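Since all seven blocks follow the same fit, predict, score pattern, one way to shorten the code is to keep the models in a dictionary and loop over them. This is only a sketch: the dictionary keys, the `random_state` values, and the `stratify` argument are my own additions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.8, random_state=42, stratify=y
)

# Same estimators as in the sections above; seeds added for repeatability.
models = {
    "logistic regression": LogisticRegression(max_iter=500),
    "svm": SVC(),
    "knn": KNeighborsClassifier(),
    "naive bayes": MultinomialNB(),
    "decision tree": DecisionTreeClassifier(random_state=42),
    "random forest": RandomForestClassifier(random_state=42),
    "extra trees": ExtraTreesClassifier(random_state=42),
}

results = {}
for name, model in models.items():
    # Mean accuracy over 5 folds of the training data.
    cv_mean = cross_val_score(model, X=x_train, y=y_train, cv=5).mean()
    # Accuracy on the held-out test set after fitting on all training data.
    model.fit(x_train, y_train)
    test_accuracy = model.score(x_test, y_test)
    results[name] = (cv_mean, test_accuracy)
    print(f"{name}: cv={cv_mean:.4f} test={test_accuracy:.4f}")
```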

## Using R

The caret package will be used for this purpose. Note that the cross-validation accuracy for each model is displayed when the model is printed.

First, I import the datasets package, which contains the iris dataset.

```
library(datasets)
```

Metrics is a package of evaluation functions, among them **accuracy**, which returns the proportion of predictions that match the actual values.

```
library(Metrics)
```

Note that if you want to use a dataset that is not built in, you can load it with **read.csv**:

```
data = read.csv('filename.csv', sep ='symbol')
```

**sep** is usually a comma (**','**), but it can be a different symbol, such as a semicolon or a tab. It is worth inspecting the file first to see which separator it uses.

Next, I import the caret package, which wraps a number of other modeling packages (you do not necessarily need to know them) behind a single interface.

```
library(caret)
```

Now, I need to obtain the data

```
data = iris
# You can check the structure of the data
print(data)
```

Using the function **createDataPartition**, I am taking 80% of the original dataset to be the training data and the remaining 20% to be the test data.

```
train_index = createDataPartition(y=data$Species, p=0.8, list=FALSE)
```

Then, I separate the data into training and testing sets.

```
train_data = data[train_index,]
test_data = data[-train_index,]
```

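Then, I make sure the target variable is a factor, since caret treats a factor outcome as a classification problem. In the built-in iris data, Species is already a factor, so this step changes nothing here; it matters when the target has been read from a file as character strings.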

```
train_data$Species = factor(train_data$Species)
test_data$Species = factor(test_data$Species)
```

Next, I set up how the **train** function, which I will be using soon, resamples the training data.

```
# Use 5-fold cross-validation, hence method = "cv" and number = 5
control = trainControl(method = "cv", number=5)
```

Next, I fit the models. Note that in the **train** function, *preProcess = c("center", "scale")* standardizes the data: for each variable, **center** subtracts the mean from every data point and **scale** divides every data point by the standard deviation. Note also that **tuneLength** is an integer that sets how many candidate values caret tries for each tuning parameter.
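To show what centering and scaling do to a single variable, here is a plain-Python sketch. The helper names `fit_scaler` and `apply_scaler` and the numbers are made up for illustration. The mean and standard deviation are estimated from the training values only and then reused for new values.

```python
from statistics import mean, stdev


def fit_scaler(train_values):
    """Estimate the mean and sample standard deviation from training data."""
    return mean(train_values), stdev(train_values)


def apply_scaler(values, mu, sigma):
    """Center by subtracting mu, then scale by dividing by sigma."""
    return [(v - mu) / sigma for v in values]


# Illustrative measurements, not taken from the real dataset.
train_values = [5.1, 4.9, 6.3, 5.8, 7.1]
test_values = [5.0, 6.5]

mu, sigma = fit_scaler(train_values)
scaled_train = apply_scaler(train_values, mu, sigma)
scaled_test = apply_scaler(test_values, mu, sigma)

print([round(v, 3) for v in scaled_train])
print([round(v, 3) for v in scaled_test])
```

After this transformation the training values have mean 0 and standard deviation 1, which puts variables measured on different scales on an equal footing.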

Note that *predicted_y* is a vector of the predicted values for test_data. Note also that *test_accuracy* is the accuracy of the model's predictions on the test data, while the cross-validated accuracy, reported when the model is printed, is estimated from held-out subsets of the training data.
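Accuracy itself is simply the share of predictions that match the actual labels. A minimal Python sketch of that calculation, with a hypothetical helper name and made-up labels, looks like this:

```python
def accuracy(actual, predicted):
    """Return the fraction of positions where the two label lists agree."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)


# Made-up labels: three of the four predictions are correct.
actual = ["setosa", "versicolor", "virginica", "virginica"]
predicted = ["setosa", "versicolor", "versicolor", "virginica"]

result = accuracy(actual, predicted)
print(result)  # 0.75
```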

### 1. Logistic Regression

```
logistic_model = train(Species ~.,
data = train_data,
method = "multinom",
trControl=control,
preProcess = c("center", "scale"),
tuneLength = 10)
print(logistic_model)
predicted_y = predict(logistic_model, test_data)
test_accuracy = accuracy(test_data$Species, predicted_y)
print(sprintf('Test accuracy = %f', test_accuracy))
```

### 2. Support Vector Machine

```
svm_model = train(Species ~.,
data = train_data,
method = "svmLinear",
trControl=control,
preProcess = c("center", "scale"),
tuneLength = 10)
print(svm_model)
predicted_y = predict(svm_model, test_data)
test_accuracy = accuracy(test_data$Species, predicted_y)
print(sprintf('Test accuracy = %f', test_accuracy))
```

### 3. KNN

```
knn_model = train(Species ~.,
data = train_data,
method = "knn",
trControl=control,
preProcess = c("center", "scale"),
tuneLength = 10)
print(knn_model)
predicted_y = predict(knn_model, test_data)
test_accuracy = accuracy(test_data$Species, predicted_y)
print(sprintf('Test accuracy = %f', test_accuracy))
```

### 4. Naive Bayes

```
nb_model = train(Species ~.,
data = train_data,
method = "nb",
trControl=control,
preProcess = c("center", "scale"),
tuneLength = 10)
print(nb_model)
predicted_y = predict(nb_model, test_data)
test_accuracy = accuracy(test_data$Species, predicted_y)
print(sprintf('Test accuracy = %f', test_accuracy))
```

### 5. Decision Tree

```
decision_tree_model = train(Species ~.,
data=train_data,
method = "rpart",
trControl=control,
preProcess = c("center", "scale"),
tuneLength = 10)
print(decision_tree_model)
predicted_y = predict(decision_tree_model, test_data)
test_accuracy = accuracy(test_data$Species, predicted_y)
print(sprintf('Test accuracy = %f', test_accuracy))
```

### 6. Random Forest

```
random_forest_model = train(Species ~.,
data=train_data,
method = "rf",
trControl=control,
preProcess = c("center", "scale"),
tuneLength = 10)
print(random_forest_model)
predicted_y = predict(random_forest_model, test_data)
print(predicted_y)
test_accuracy = accuracy(test_data$Species, predicted_y)
print(sprintf('Test accuracy = %f', test_accuracy))
```

### 7. Extremely Randomized Trees

```
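# ranger grows an ordinary random forest by default; fixing
# splitrule = "extratrees" in the grid selects extremely randomized trees
et_model = train(Species ~.,
data=train_data,
method = "ranger",
trControl=control,
preProcess = c("center", "scale"),
tuneGrid = expand.grid(mtry = 2:4, splitrule = "extratrees", min.node.size = 1))
print(et_model)
predicted_y = predict(et_model, test_data)
print(predicted_y)
test_accuracy = accuracy(test_data$Species, predicted_y)
print(sprintf('Test accuracy = %f', test_accuracy))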
```

And that is it. I think this code is easy to write once you see the repetitive pattern in it for both Python and R; each language has its own pattern. Thanks for reading.

New to R? See: R, beyond Statistical Programming

New to Machine Learning? See: Introduction to Machine Learning

Not sure which program to use for your ML? See: Best Programming Language for ML
