k-Nearest Neighbors Regression
Definition and Purpose
k-Nearest Neighbors (kNN) regression is a non-parametric, instance-based learning algorithm used in machine learning to predict continuous output values based on the values of the nearest neighbors in the feature space. It estimates the output for a new data point by averaging the outputs of its k closest neighbors. The main purpose of kNN regression is to predict continuous values by leveraging the similarity to existing labeled data.
Key Objectives:
- Regression: Predicting continuous output values based on the average or weighted average of the nearest neighbors' values.
- Estimation: Determining the likely value of a new data point by considering its neighbors.
- Understanding Relationships: Identifying similar data points in the feature space and using their values to make predictions.
How kNN Regression Works
1. Distance Metric: The algorithm uses a distance metric (commonly Euclidean distance) to determine the "closeness" of data points.

Euclidean Distance:
d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)
- Measures the straight-line distance between two points p and q in n-dimensional space.
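As a quick illustration, here is this computation in NumPy (the points p and q are made up for the example):

import numpy as np

# Two hypothetical points in 3-dimensional space
p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

# Straight-line (Euclidean) distance between p and q
d = np.sqrt(np.sum((p - q) ** 2))
print(d)  # 5.0 -- equivalent to np.linalg.norm(p - q)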
2. Choosing k: The parameter k specifies the number of nearest neighbors to consider for making the regression prediction.
- Small k: Can lead to overfitting, where the model is too sensitive to the training data.
- Large k: Can lead to underfitting, where the model is too generalized and may miss finer patterns in the data.
3. Prediction: The predicted value for a new data point is the average of the values of its k nearest neighbors (a short sketch follows this list).

Simple Average:
- Sum the values of the k neighbors.
- Divide by k to get the average.

Weighted Average:
- Weigh each neighbor's value by the inverse of its distance.
- Sum the weighted values.
- Divide by the sum of the weights to get the weighted average.
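A minimal sketch of both averaging schemes in NumPy, using hypothetical neighbor values and distances:

import numpy as np

# Hypothetical targets and distances of the k = 3 nearest neighbors
neighbor_values = np.array([4.0, 5.0, 7.0])
neighbor_dists = np.array([0.5, 1.0, 2.0])

# Simple average: sum the k values and divide by k
simple_pred = neighbor_values.mean()

# Weighted average: weight each value by the inverse of its distance,
# then divide the weighted sum by the sum of the weights
weights = 1.0 / neighbor_dists
weighted_pred = np.sum(weights * neighbor_values) / np.sum(weights)

print(simple_pred)    # 5.33... (all neighbors count equally)
print(weighted_pred)  # 4.71... (closer neighbors count for more)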
Key Concepts
Non-Parametric: kNN is a non-parametric method, meaning it makes no assumptions about the underlying distribution of the data. This makes it flexible in handling various types of data.
Instance-Based Learning: The algorithm stores the entire training dataset and makes predictions based on the local patterns in the data. It is also known as a "lazy" learning algorithm because it delays processing until a query is made.
Distance Calculation: The choice of distance metric can significantly affect the model's performance. Common metrics include Euclidean, Manhattan, and Minkowski distances.
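In scikit-learn, for example, the metric is selected through the metric and p parameters of KNeighborsRegressor; a brief sketch:

from sklearn.neighbors import KNeighborsRegressor

# Minkowski distance with p = 2 is Euclidean (the default); p = 1 is Manhattan
knn_euclidean = KNeighborsRegressor(n_neighbors=5, metric='minkowski', p=2)
knn_manhattan = KNeighborsRegressor(n_neighbors=5, metric='minkowski', p=1)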
Choice of k: The value of k is a critical hyperparameter. Cross-validation is often used to determine the optimal value of k for a given dataset.
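One common way to do this is a grid search with cross-validation; here is a minimal sketch using scikit-learn's GridSearchCV (it assumes training arrays X_train and y_train such as those produced in the example below):

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Try k = 1..20 and score each with 5-fold cross-validation
param_grid = {'n_neighbors': list(range(1, 21))}
grid = GridSearchCV(KNeighborsRegressor(), param_grid, cv=5,
                    scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)  # X_train, y_train as in the example below
print(grid.best_params_)    # best value of k found by cross-validation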
k-Nearest Neighbors Regression Example
This example demonstrates how to use kNN regression with polynomial features to model complex relationships while leveraging the non-parametric nature of kNN.
Python Code Example
1. Import Libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score
This block imports the necessary libraries for data manipulation, plotting, and machine learning.
2. Generate Sample Data
np.random.seed(42) # For reproducibility
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 3 * X.ravel() + np.sin(2 * X.ravel()) * 5 + np.random.normal(0, 1, 100)
This block generates sample data with a linear trend, a sinusoidal component, and Gaussian noise, simulating real-world data variations.
3. Split the Dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
This block splits the dataset into training and testing sets for model evaluation.
4. Create Polynomial Features
degree = 3 # Change this value for different polynomial degrees
poly = PolynomialFeatures(degree=degree)
X_poly_train = poly.fit_transform(X_train)
X_poly_test = poly.transform(X_test)
This block generates polynomial features from the training and testing datasets, allowing the model to capture nonlinear relationships.
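As an optional sanity check, you can inspect the transformed training matrix; with degree = 3, each single-feature row x expands to [1, x, x^2, x^3]:

print(X_poly_train.shape)  # (80, 4): bias column plus x, x^2, x^3
print(X_poly_train[0])     # the first training row in expanded form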
5. Create and Train the kNN Regression Model
k = 5 # Number of neighbors
knn_model = KNeighborsRegressor(n_neighbors=k)
knn_model.fit(X_poly_train, y_train)
This block initializes the kNN regression model and trains it using the polynomial features derived from the training dataset.
6. Make Predictions
y_pred = knn_model.predict(X_poly_test)
This block uses the trained model to make predictions on the test set.
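The metrics imported in step 1 can now be used to quantify the fit on the held-out test set; a short sketch:

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse:.3f}')
print(f'R^2 Score: {r2:.3f}')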
7. Plot the Results
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='blue', alpha=0.5, label='Data Points')
X_grid = np.linspace(0, 10, 1000).reshape(-1, 1)
X_poly_grid = poly.transform(X_grid)
y_grid = knn_model.predict(X_poly_grid)
plt.plot(X_grid, y_grid, color='red', linewidth=2, label=f'kNN Regression (k={k}, Degree {degree})')
plt.title(f'kNN Regression (Polynomial Degree {degree})')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.grid(True)
plt.show()
This block creates a scatter plot of the actual data points and overlays the predictions of the kNN regression model on a dense grid, visualizing the fitted curve.
[Figure: output plot with k = 1]
[Figure: output plot with k = 10]
This structured approach demonstrates how to implement and evaluate k-Nearest Neighbors regression with polynomial features. By capturing local patterns through averaging the responses of nearby neighbors, kNN regression effectively models complex relationships in data while remaining straightforward to implement. The choice of k and the polynomial degree significantly influence the model's performance and its flexibility in capturing underlying trends.