## DEV Community

Harsh Mishra

# K Nearest Neighbors Regression, Regression: Supervised Machine Learning

### k-Nearest Neighbors Regression

#### Definition and Purpose

k-Nearest Neighbors (k-NN) regression is a non-parametric, instance-based learning algorithm used in machine learning to predict continuous output values based on the values of the nearest neighbors in the feature space. It estimates the output for a new data point by averaging the outputs of its `k` closest neighbors. The main purpose of k-NN regression is to predict continuous values by leveraging the similarity to existing labeled data.

#### Key Objectives:

• Regression: Predicting continuous output values based on the average or weighted average of the nearest neighbors' values.
• Estimation: Determining the likely value of a new data point by considering its neighbors.
• Understanding Relationships: Identifying similar data points in the feature space and using their values to make predictions.

### How k-NN Regression Works

1. Distance Metric: The algorithm uses a distance metric (commonly Euclidean distance) to determine the "closeness" of data points.

• Euclidean Distance:
  • `d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)`
  • Measures the straight-line distance between two points `p` and `q` in n-dimensional space.
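
The formula can be checked with a few lines of NumPy. This is a minimal sketch; the helper name `euclidean_distance` and the sample points are made up for illustration.

```python
import numpy as np


def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Square the per-coordinate differences, sum them, take the square root.
    return np.sqrt(np.sum((p - q) ** 2))


# Hypothetical 2-D points: the differences are 3 and 4, so the distance is 5.
d = euclidean_distance([1.0, 2.0], [4.0, 6.0])
print(d)  # 5.0
```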

2. Choosing k: The parameter `k` specifies the number of nearest neighbors to consider for making the regression prediction.

• Small k: Can lead to overfitting, where the model is too sensitive to the training data.
• Large k: Can lead to underfitting, where the model is too generalized and may miss finer patterns in the data.
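
The two extremes can be seen on a small synthetic dataset. The sketch below assumes scikit-learn's `KNeighborsRegressor`; the data and variable names are invented for illustration. With `k = 1` every training point is its own nearest neighbor, so the training error is zero. With `k` equal to the number of training points, every prediction collapses to the mean of the targets.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical noisy data with distinct x values.
rng = np.random.default_rng(0)
X_demo = np.linspace(0, 10, 50).reshape(-1, 1)
y_demo = np.sin(X_demo.ravel()) + rng.normal(0, 0.3, 50)

# k = 1: the model memorizes the training set (overfitting extreme).
k_one = KNeighborsRegressor(n_neighbors=1).fit(X_demo, y_demo)
train_mse_k_one = np.mean((k_one.predict(X_demo) - y_demo) ** 2)

# k = n: every query averages all targets (underfitting extreme).
k_all = KNeighborsRegressor(n_neighbors=len(X_demo)).fit(X_demo, y_demo)
pred_k_all = k_all.predict(X_demo)

print("Training MSE with k=1:", train_mse_k_one)
print("Distinct predictions with k=n:", np.unique(np.round(pred_k_all, 8)).size)
```

A zero training error says nothing about accuracy on unseen points, which is why `k` is usually chosen on held-out data.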

3. Prediction: The predicted value for a new data point is the average of the values of its `k` nearest neighbors.

• Simple Average:
  • Sum the values of the `k` neighbors.
  • Divide by `k` to get the average.
• Weighted Average:
  • Weigh each neighbor's value by the inverse of its distance.
  • Sum the weighted values.
  • Divide by the sum of the weights to get the weighted average.
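
The prediction step can be sketched from scratch as follows. This is an illustrative implementation, not the one scikit-learn uses internally; the function name `knn_predict`, its parameters, and the toy data are assumptions. Returning the exact match when a neighbor sits at distance zero is one possible convention for avoiding division by zero.

```python
import numpy as np


def knn_predict(X_train, y_train, x_query, k=3, weighted=False):
    """Predict one value as the (optionally distance-weighted) mean of k neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)

    # Euclidean distance from the query to every training point.
    distances = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))
    nearest = np.argsort(distances)[:k]
    neighbor_y = y_train[nearest]
    neighbor_d = distances[nearest]

    if not weighted:
        return neighbor_y.mean()  # simple average

    # Assumed convention: an exact match decides the prediction on its own.
    if np.any(neighbor_d == 0):
        return neighbor_y[neighbor_d == 0].mean()

    weights = 1.0 / neighbor_d  # inverse-distance weights
    return np.sum(weights * neighbor_y) / np.sum(weights)


# Hypothetical toy data: y = 10 * x.
X_toy = [[1.0], [2.0], [4.0], [7.0]]
y_toy = [10.0, 20.0, 40.0, 70.0]

simple = knn_predict(X_toy, y_toy, [3.5], k=2)
weighted = knn_predict(X_toy, y_toy, [3.5], k=2, weighted=True)
print(simple)    # 30.0, the plain mean of 20 and 40
print(weighted)  # 35.0, pulled toward the closer neighbor at x = 4
```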

### Key Concepts

1. Non-Parametric: k-NN is a non-parametric method, meaning it makes no assumptions about the underlying distribution of the data. This makes it flexible in handling various types of data.

2. Instance-Based Learning: The algorithm stores the entire training dataset and makes predictions based on the local patterns in the data. It is also known as a "lazy" learning algorithm because it delays processing until a query is made.

3. Distance Calculation: The choice of distance metric can significantly affect the model's performance. Common metrics include Euclidean, Manhattan, and Minkowski distances.

4. Choice of k: The value of `k` is a critical hyperparameter. Cross-validation is often used to determine the optimal value of `k` for a given dataset.
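
As a rough sketch of how `k` and the distance metric might be tuned together, the block below runs a cross-validated grid search with scikit-learn's `GridSearchCV`. The two-feature dataset, the candidate ranges, and the variable names are assumptions chosen for illustration. In `KNeighborsRegressor`, the default Minkowski metric with `p=1` is Manhattan distance and with `p=2` is Euclidean distance.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical two-feature dataset, so that p=1 and p=2 can rank neighbors differently.
rng = np.random.default_rng(42)
X_cv = rng.uniform(0, 10, size=(200, 2))
y_cv = np.sin(X_cv[:, 0]) + 0.5 * X_cv[:, 1] + rng.normal(0, 0.2, 200)

search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={
        "n_neighbors": list(range(1, 21)),  # candidate values of k
        "p": [1, 2],                        # Manhattan vs. Euclidean
    },
    scoring="neg_mean_squared_error",       # higher (closer to 0) is better
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_cv, y_cv)

best_k = search.best_params_["n_neighbors"]
best_p = search.best_params_["p"]
print(f"Best k: {best_k}, best p: {best_p}")
print(f"Cross-validated MSE: {-search.best_score_:.3f}")
```

Shuffling the folds matters when the rows are ordered, for example data generated with `np.linspace`; otherwise each fold would test the model on a range of inputs it never saw during fitting.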

### k-Nearest Neighbors Regression Example

This example demonstrates how to use k-NN regression with polynomial features to model complex relationships while leveraging the non-parametric nature of k-NN.

#### Python Code Example

1. Import Libraries

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score
```

This block imports the necessary libraries for data manipulation, plotting, and machine learning.

2. Generate Sample Data

```python
np.random.seed(42)  # For reproducibility
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 3 * X.ravel() + np.sin(2 * X.ravel()) * 5 + np.random.normal(0, 1, 100)
```

This block generates 100 sample points that follow a linear trend (`3x`) plus a sinusoidal component (`5 sin(2x)`), with Gaussian noise of standard deviation 1 added to simulate real-world variation.

3. Split the Dataset

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

This block splits the dataset into a training set (80%) and a testing set (20%) for model evaluation.

4. Create Polynomial Features

```python
degree = 3  # Change this value for different polynomial degrees
poly = PolynomialFeatures(degree=degree)
X_poly_train = poly.fit_transform(X_train)
X_poly_test = poly.transform(X_test)
```

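This block expands the single input feature into polynomial terms (`1`, `x`, `x^2`, `x^3`) for both the training and testing datasets. k-NN can already follow non-linear relationships without this step; here the expansion changes how distances between points are measured. Because the terms are not scaled, the higher powers dominate the distance for large `x`.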

5. Create and Train the k-NN Regression Model

```python
k = 5  # Number of neighbors
knn_model = KNeighborsRegressor(n_neighbors=k)
knn_model.fit(X_poly_train, y_train)
```

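This block initializes the k-NN regression model and fits it on the polynomial features derived from the training dataset. Since k-NN is a lazy learner, fitting mainly stores the training data; the neighbor search happens at prediction time.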

6. Make Predictions

```python
y_pred = knn_model.predict(X_poly_test)
```

This block uses the trained model to make predictions on the test set.
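
The import step brings in `mean_squared_error` and `r2_score`, but the walkthrough never calls them. A minimal sketch of how the test-set predictions might be scored is shown below. It rebuilds the data and model from the earlier steps so that it runs on its own; the variable names `mse` and `r2` are assumptions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import PolynomialFeatures

# Rebuild the data and model from the steps above.
np.random.seed(42)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 3 * X.ravel() + np.sin(2 * X.ravel()) * 5 + np.random.normal(0, 1, 100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

poly = PolynomialFeatures(degree=3)
knn_model = KNeighborsRegressor(n_neighbors=5)
knn_model.fit(poly.fit_transform(X_train), y_train)
y_pred = knn_model.predict(poly.transform(X_test))

# Lower MSE is better; R^2 closer to 1 is better.
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Test MSE: {mse:.3f}")
print(f"Test R^2: {r2:.3f}")
```

Rerunning this with different values of `n_neighbors` or `degree` gives a numeric comparison to go alongside the plots.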

7. Plot the Results

```python
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='blue', alpha=0.5, label='Data Points')
X_grid = np.linspace(0, 10, 1000).reshape(-1, 1)
X_poly_grid = poly.transform(X_grid)
y_grid = knn_model.predict(X_poly_grid)
plt.plot(X_grid, y_grid, color='red', linewidth=2, label=f'k-NN Regression (k={k}, Degree {degree})')
plt.title(f'k-NN Regression (Polynomial Degree {degree})')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.grid(True)
plt.show()
```

This block draws the original data points as a scatter plot and overlays the k-NN predictions evaluated on a dense grid of 1,000 `X` values, visualizing the fitted curve.

[Figure: output with k = 1]

[Figure: output with k = 10]

This walkthrough shows how to implement and visualize k-Nearest Neighbors regression with polynomial features. k-NN regression captures local patterns by averaging the responses of nearby neighbors, which lets it follow complex relationships with a straightforward implementation. The choice of `k` and the polynomial degree both shape the fit: a small `k` such as 1 tracks individual training points and produces a jagged curve, while a larger `k` such as 10 averages over more neighbors and produces a smoother one.