DEV Community is a community of 602,908 amazing developers

We're a place where coders share, stay up-to-date and grow their careers.

Implementing the K Nearest Neighbors algorithm from scratch in Python

Machine learning to most is a black box technique. This means that the inputs and outputs are known but the process is not fully understood. However, I think that for most classical algorithms this should not be the case. This piece aims to help you learn to implement the K Nearest Neighbor algorithm in Python.

The principle behind nearest neighbor methods, in general, is to find a predefined number of training samples closest in distance to the new point, and predict the label from these. The number of samples, in this case, is a user-defined constant.

Despite its simplicity, nearest neighbor methods have been successful in a large number of classification and regression problems, including handwritten digits and satellite image scenes.

The optimal choice of the value "k" is dependent on the data but in general, a larger value suppresses the effects of noise but makes the classification boundaries less distinct. Enough of the talking, lets begin...

``````import numpy as np
from scipy.spatial.distance import euclidean
from sklearn.base import BaseEstimator

class KNNBase(BaseEstimator):
...
``````

Firstly, we import the relevant modules which are `numpy`, `euclidean` and `BaseEstimator`. Euclidean determines how distance is calculated while `BaseEstimator` is the base class for all estimators in sklearn.

The `KNNBase` class thus takes this class as its parent and inherits its methods. This then becomes the base class for the K Nearest Neighbors classifier and regressor.

``````class KNNBase(BaseEstimator):
def __init__(self, n_neighbors = 5, distance_func = euclidean):
self.n_neighbours = n_neighbors
self.distance_func = euclidean

def _fit(self, X, Y):
self.X = X
self.Y = Y
return self
``````

The KNNBase class is initialized in the `__init__` method with 2 parameters namely the number of neighbors and the distance function. The former denotes the number of neighbors and determines the prediction for a particular data point while the latter determines how the distance is calculated.

The default values are taken as 5 for `n_neighbors` and `euclidean` for `distance_func` (any function from `scipy.spatial.distance` will do for `distance_func`).

The `_fit` method takes in the training features `X` and targets `Y` and stores them as instance variables. It then returns the now updated instance via `self`.

``````    def _vote(self, neighbors_targets):
raise NotImplementedError()

def _predict(self, X_test):
Y_pred = np.empty(X_test.shape[0])
for index, entry in enumerate(X_test):
distances = [self.distance_func(i, entry) for i in self.X]
neighbors = np.argsort(distances)[: self.n_neighbors]

neighbors_prediction = [self.Y[j] for j in neighbors]

Y_pred[index] = self._vote(neighbors_prediction)
return Y_pred
``````

The `_vote` method is a helper function that is not implemented in `KNNBase` class since it will be different for the classifier and regressor. However, its aim is to return the single prediction for a data point given its neighbor's targets.

The `_predict` method is the powerhouse of the algorithm and it takes in the data points `X_test` who's targets are to be predicted. The first thing we do is instantiate an empty array with the same number of elements as there are data point in `X_test`.

Next up, we loop through `X_test` and for each data point, the distances from every training example are calculated with the euclidean function.

These distances are then sorted with the `np.argsort` function and this returns the sorted array but instead of the values themselves as elements, it returns their indexes. This is useful to us because we need the indexes to get the corresponding predicted values for the nearest neighbors.

The first k neighbors are sliced out of the sorted array and the targets at those points stored in `neighbors_prediction`. A vote is made using these values by calling on the `self._vote` method and the result is stored by filling up the once empty array `Y_pred`.

The complete `KNNBase` class is thus:

``````class KNNBase(BaseEstimator):
def __init__(self, n_neighbors = 5, distance_func = euclidean):
self.n_neighbors = n_neighbors
self.distance_func = euclidean

def fit(self, X, Y):
self.X = X
self.Y = Y
return self

def _vote(self, neighbors_targets):
raise NotImplementedError()

def predict(self, X_test):
"""
Predict the targets of the test data.
"""

Y_pred = np.empty(X_test.shape[0])
for index, entry in enumerate(X_test):
distances = [self.distance_func(i,entry) for i in self.X]
neighbors = np.argsort(distances)[:self.n_neighbors]

neighbors_prediction = [self.Y[j] for j in neighbors]

Y_pred[index] = self._vote(neighbors_prediction)
return Y_pred
``````

The Classifier

``````class KNNClassifier(KNNBase):

def _vote(self, neighbors_target):
count_ = np.bincount(neighbors_target)
return np.argmax(count_)

``````

This class inherits from the `KNNBase` class and implements the `_vote` method. In the `_vote` method, we take in the `neighbors_target` list and use the `np.bincount` method to generate the frequency of each class in the `neighbors_target` list. Then we return the argument with the highest value with `np.argmax`. If there is a tie for the most common label among the neighbors, then the predicted label is arbitrary.

The Regressor

``````class KNNRegressor(KNNBase):
def _vote(self, neighbors_target):
return np.mean(neighbors_target)
``````

This class also inherits from the KNNBase class and implements the `_vote` method. However, in the `_vote` method, we take in the `neighbors_target` list and use the `np.mean` function to find and return the mean of all its values.
To be sure that our algorithm works, we will compare its result on the Boston housing dataset with that of the sklearn implementation.

``````from sklearn.datasets import load_boston
from sklearn.neighbors import KNeighborsRegressor
X, y = load_boston(return_X_y = True)
model = KNNRegressor().fit(X,y)
pred= model.predict(X[:10])
print("For the first 10 data points,")
print(f"Our implementation gives: {pred}")

model1= KNeighborsRegressor().fit(X,y)
pred1= model1.predict(X[:10])
print(f"The Sklearn Implementation gives: {pred1}")
``````
``````For the first 10 data points,
Our implementation gives: [21.78 22.9  25.36 26.06 27.1  27.1  20.88 19.1  18.4  19.48]
The Sklearn Implementation gives: [21.78 22.9  25.36 26.06 27.1  27.1  20.88 19.1  18.4  19.48]```

``````

We have now successfully implemented the K-Nearest Neighbour algorithm. Feel free to test it on any other dataset.

I hope you enjoyed reading this. If you have any questions, you can contact me on Twitter @phortz.