keshavs759

Posted on • Originally published at vidyasheela.com

# Implementation of KNN (From Scratch in Python)

KNN (K-nearest neighbors) is one of the simplest yet most effective supervised machine learning algorithms. It can be used for both classification and regression problems. Several Python libraries implement KNN, which lets a programmer build a KNN model easily without working through the underlying mathematics. Implementing KNN from scratch, however, is a bit trickier.

Before getting into the program, let's recall the KNN algorithm:

Algorithm for K-NN:

1.  Load the data file.
2.  Initialize the number of neighbors to consider, 'K' (an odd value is commonly chosen to reduce the chance of tied votes).
3.  For each data point (tuple) in the data file:
    1.  Calculate the distance between the data point to be classified and this data point.
    2.  Store the distance alongside the data point (for example, in an extra distance column).
4.  Sort the data in the data file by distance, from smallest to largest (ascending order).
5.  Pick the first K entries from the sorted data.
6.  Observe the labels of the selected K entries.
7.  For classification, return the mode of the K labels; for regression, return the mean of the K labels.
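
The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the program developed later in this post: the helper name `knn_predict` and both toy data sets are made up for the example, and it assumes Python 3.8+ for `math.dist`.

```python
import math
from collections import Counter
from statistics import mean

def knn_predict(train, query, k, task="classification"):
    """train: list of (features, label) pairs; query: list of features."""
    # compute the distance to every training point and sort ascending
    by_distance = sorted(train, key=lambda row: math.dist(row[0], query))
    # keep the labels of the first k entries
    k_labels = [label for _, label in by_distance[:k]]
    # mode of the labels for classification, mean for regression
    if task == "classification":
        return Counter(k_labels).most_common(1)[0][0]
    return mean(k_labels)

# toy data, invented for illustration only
points = [([1.0, 1.0], "a"), ([1.2, 0.8], "a"), ([5.0, 5.0], "b"),
          ([5.2, 4.9], "b"), ([0.9, 1.1], "a")]
values = [([1.0], 10.0), ([2.0], 20.0), ([3.0], 30.0), ([10.0], 100.0)]

print(knn_predict(points, [1.1, 1.0], k=3))                     # classification
print(knn_predict(values, [2.1], k=3, task="regression"))       # regression
```

The same function covers both uses of KNN mentioned earlier: only the final aggregation over the K labels changes.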

Now we are ready to dive into the code. We are going to classify the iris data into its species using four features: sepal length, sepal width, petal length, and petal width. We have 150 observations (tuples) in total, and we will build a KNN classification model from them.

```python
import pandas as pd
import numpy as np
import operator

# loading the data file; the file name "iris.csv" is an assumption,
# replace it with the path to your own copy of the iris CSV
dataset = pd.read_csv("iris.csv")
print(dataset.head())  # first five tuples of the data

# making function for calculating euclidean distance
def E_Distance(x1, x2, length):
    distance = 0
    for x in range(length):
        distance += np.square(x1[x] - x2[x])
    return np.sqrt(distance)

# making function for defining K-NN model
def knn(trainingSet, testInstance, k):
    distances = {}
    length = testInstance.shape[1]  # number of features
    print(length)
    # the single test row as a plain array, so features are indexed by position
    point = testInstance.iloc[0].to_numpy(dtype=float)
    # distance from the test point to every training row
    for x in range(len(trainingSet)):
        row = trainingSet.iloc[x, :length].to_numpy(dtype=float)
        distances[x] = E_Distance(point, row, length)
    # sort row indices by distance, nearest first
    sortdist = sorted(distances.items(), key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(sortdist[x][0])
    Count = {}  # to get most frequent class of rows
    for x in range(len(neighbors)):
        # the class label is the last column of the training set
        response = trainingSet.iloc[neighbors[x], -1]
        if response in Count:
            Count[response] += 1
        else:
            Count[response] = 1
    sortcount = sorted(Count.items(), key=operator.itemgetter(1), reverse=True)
    return (sortcount[0][0], neighbors)

# making test data set
testSet = [[6.8, 3.4, 4.8, 2.4]]
test = pd.DataFrame(testSet)

# assigning different values to k
k = 1
k1 = 3
k2 = 11

# supplying test data to the model
result, neigh = knn(dataset, test, k)
result1, neigh1 = knn(dataset, test, k1)
result2, neigh2 = knn(dataset, test, k2)

# printing output prediction
print(result)
print(neigh)
print(result1)
print(neigh1)
print(result2)
print(neigh2)
```

The output of the above program is:

       sepal.length  sepal.width  petal.length  petal.width variety
    0           5.1          3.5           1.4          0.2  Setosa
    1           4.9          3.0           1.4          0.2  Setosa
    2           4.7          3.2           1.3          0.2  Setosa
    3           4.6          3.1           1.5          0.2  Setosa
    4           5.0          3.6           1.4          0.2  Setosa
    4
    4
    4
    Virginica
    [141]
    Virginica
    [141, 145, 110]
    Virginica
    [141, 145, 110, 115, 139, 147, 77, 148, 140, 112, 144]
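
The `knn` function above sorts every distance and then keeps the first K. One possible alternative, sketched here with only the standard library, is to ask `heapq.nsmallest` for the K closest rows directly and to count the labels with `collections.Counter`. The function name `k_nearest_vote` and the four-row sample are hypothetical and exist only for this illustration; they are not part of the original program or its data file.

```python
import heapq
import math
from collections import Counter

def k_nearest_vote(rows, labels, query, k):
    """Return (majority label, indices of the k nearest rows)."""
    # pair each row's distance to the query with the row index
    dists = ((math.dist(row, query), i) for i, row in enumerate(rows))
    # nsmallest keeps only the k closest pairs instead of sorting everything
    nearest = [i for _, i in heapq.nsmallest(k, dists)]
    # majority vote over the labels of those rows
    winner = Counter(labels[i] for i in nearest).most_common(1)[0][0]
    return winner, nearest

# made-up sample in the column order sepal length, sepal width,
# petal length, petal width
rows = [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2],
        [6.9, 3.1, 5.1, 2.3], [6.7, 3.0, 5.2, 2.3]]
labels = ["Setosa", "Setosa", "Virginica", "Virginica"]

print(k_nearest_vote(rows, labels, [6.8, 3.4, 4.8, 2.4], k=3))
```

For 150 rows the difference in speed is negligible; the sketch is mainly a more compact way to express the "pick the K nearest, then vote" part of the algorithm.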