Table of Contents
- Welcome to Day 4
- Review of Day 3
- Introduction to Model Evaluation and Selection
- Model Validation Techniques
- Hyperparameter Tuning
- Implementing Model Evaluation and Selection with Scikit-Learn
- Comparing Models
- Example Project: Comparing Models with Cross-Validation and Hyperparameter Tuning
- Conclusion and Next Steps
- Summary of Day 4
1. Welcome to Day 4
Welcome to Day 4 of "Becoming a Scikit-Learn Boss in 90 Days"! Today, we'll focus on Model Evaluation and Selection, essential steps to ensure your machine learning models are both accurate and generalizable. You'll learn about various evaluation techniques, validation strategies, and hyperparameter tuning methods to optimize your models effectively.
2. Review of Day 3
Before diving into today's topics, let's briefly recap what we covered yesterday:
- Supervised Learning: Regression Algorithms: Explored Linear Regression, Ridge Regression, Lasso Regression, and Elastic Net.
- Implementing Regression Algorithms with Scikit-Learn: Built, trained, and evaluated different regression models.
- Model Evaluation for Regression: Learned about Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared (R²).
- Example Project: Housing Price Prediction: Developed a regression pipeline to predict housing prices and compared model performances.
With this foundation, we're ready to enhance our models through robust evaluation and selection techniques.
3. Introduction to Model Evaluation and Selection
Why Model Evaluation Matters
Model evaluation is crucial to determine how well your machine learning model performs on unseen data. It helps in:
- Assessing Performance: Understanding the strengths and weaknesses of your model.
- Preventing Overfitting: Ensuring the model generalizes well to new data.
- Comparing Models: Selecting the best model among various candidates.
- Optimizing Hyperparameters: Fine-tuning model parameters for optimal performance.
Bias-Variance Tradeoff
Understanding the bias-variance tradeoff is fundamental in model evaluation:
- Bias: Error due to overly simplistic assumptions in the learning algorithm. High bias can cause underfitting.
- Variance: Error due to too much complexity in the learning algorithm. High variance can cause overfitting.
- Tradeoff: Striking a balance between bias and variance leads to better generalization.
[Figure: bias-variance tradeoff diagram. Image source: Medium]
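To make the tradeoff concrete, here is a minimal sketch (assuming a feature matrix X and target y are already defined) that uses validation_curve to track training and cross-validation R² as a decision tree is allowed to grow deeper: shallow trees underfit (high bias), very deep trees overfit (high variance).
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor

# Vary tree depth: shallow trees underfit (high bias), deep trees overfit (high variance)
depths = np.arange(1, 16)
train_scores, val_scores = validation_curve(
    DecisionTreeRegressor(random_state=42), X, y,
    param_name='max_depth', param_range=depths,
    cv=5, scoring='r2', n_jobs=-1
)

# A widening gap between train and cross-validation scores signals growing variance
for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"max_depth={d:2d}  train R²={tr:.2f}  cv R²={va:.2f}")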
4. Model Validation Techniques
Train-Test Split
The simplest form of model validation where the dataset is split into training and testing sets.
from sklearn.model_selection import train_test_split
# Assume X and y are already defined
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Pros:
- Simple and fast.
Cons:
- Can lead to high variance in performance metrics depending on the split (the sketch below illustrates this).
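As a quick illustration of that variance, the following sketch (assuming X and y are defined) scores the same model on five different random splits; the spread between splits is exactly the instability that cross-validation smooths out.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Score the same model on five different random splits
scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    scores.append(r2_score(y_te, model.predict(X_te)))

print(f"R² per split: {np.round(scores, 3)}")
print(f"Spread (max - min): {max(scores) - min(scores):.3f}")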
Cross-Validation
A more robust method that involves partitioning the data into multiple subsets to ensure the model's performance is consistent across different data splits.
K-Fold Cross-Validation
Divides the dataset into K equal-sized folds. The model is trained on K-1 folds and tested on the remaining fold. This process is repeated K times, each time with a different fold as the test set.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
model = LinearRegression()
scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(f"Cross-Validation Rยฒ Scores: {scores}")
print(f"Average Rยฒ Score: {scores.mean():.2f}")
Stratified K-Fold
Ensures that each fold preserves the class proportions of the full dataset. It applies to classification targets and is especially useful for imbalanced datasets.
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5)
for train_index, test_index in skf.split(X, y):
    X_train_fold, X_test_fold = X[train_index], X[test_index]
    y_train_fold, y_test_fold = y[train_index], y[test_index]
    # Train and evaluate your model
Leave-One-Out Cross-Validation (LOOCV)
Each observation is used once as the test set while all remaining observations form the training set. Because it requires one model fit per observation, it is practical only for small datasets.
from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
for train_index, test_index in loo.split(X):
    X_train_fold, X_test_fold = X[train_index], X[test_index]
    y_train_fold, y_test_fold = y[train_index], y[test_index]
    # Train and evaluate your model
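Instead of writing the loop yourself, the LeaveOneOut splitter can be passed directly to cross_val_score. A minimal sketch (note that R² is undefined for a single-sample test fold, so an error-based metric is used here):
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression

# Each of the n fits predicts exactly one held-out sample
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(),
                         scoring='neg_mean_squared_error')
print(f"LOOCV MSE: {-scores.mean():.3f}")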
5. Hyperparameter Tuning
Importance of Hyperparameters
Hyperparameters are parameters set before the learning process begins. They control the behavior of the training algorithm and can significantly impact model performance.
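For example, every Scikit-Learn estimator exposes its hyperparameters through get_params() and set_params(); a short sketch with a random forest:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(random_state=42)
print(model.get_params())  # all hyperparameters and their current values

# Hyperparameters are chosen before training, e.g. more trees and a depth limit
model.set_params(n_estimators=300, max_depth=20)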
Grid Search
An exhaustive search over a specified parameter grid: every combination of hyperparameters is evaluated with cross-validation and the best-scoring combination is kept.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
# Assume X_train_scaled and y_train are already defined
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5]
}

grid_search = GridSearchCV(estimator=RandomForestRegressor(random_state=42),
                           param_grid=param_grid,
                           cv=5,
                           scoring='r2',
                           n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best R² Score: {grid_search.best_score_:.2f}")
Randomized Search
Samples a fixed number of random combinations from the hyperparameter space, making it much cheaper than Grid Search when the grid is large or when parameters are drawn from continuous distributions.
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import randint
# Assume X_train_scaled and y_train are already defined
param_dist = {
    'n_estimators': randint(100, 500),
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': randint(2, 11)
}

random_search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=42),
                                   param_distributions=param_dist,
                                   n_iter=50,
                                   cv=5,
                                   scoring='r2',
                                   random_state=42,
                                   n_jobs=-1)
random_search.fit(X_train_scaled, y_train)

print(f"Best Parameters: {random_search.best_params_}")
print(f"Best R² Score: {random_search.best_score_:.2f}")
Bayesian Optimization
A more efficient method that builds a probabilistic model of the objective function and uses it to select the most promising hyperparameters to evaluate.
Note: Requires an additional library such as scikit-optimize (pip install scikit-optimize).
from skopt import BayesSearchCV
from sklearn.ensemble import RandomForestRegressor
bayes_search = BayesSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    search_spaces={
        'n_estimators': (100, 500),
        'max_depth': (10, 50),
        'min_samples_split': (2, 20)
    },
    n_iter=32,
    cv=5,
    scoring='r2',
    random_state=42,
    n_jobs=-1
)
bayes_search.fit(X_train_scaled, y_train)

print(f"Best Parameters: {bayes_search.best_params_}")
print(f"Best R² Score: {bayes_search.best_score_:.2f}")
6. Implementing Model Evaluation and Selection with Scikit-Learn
Performing K-Fold Cross-Validation
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(random_state=42)
scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(f"Cross-Validation Rยฒ Scores: {scores}")
print(f"Average Rยฒ Score: {scores.mean():.2f}")
Hyperparameter Tuning with GridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(estimator=RandomForestRegressor(random_state=42),
                           param_grid=param_grid,
                           cv=5,
                           scoring='r2',
                           n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best R² Score: {grid_search.best_score_:.2f}")
Hyperparameter Tuning with RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import randint
param_dist = {
    'n_estimators': randint(100, 500),
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': randint(2, 11)
}

random_search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=42),
                                   param_distributions=param_dist,
                                   n_iter=50,
                                   cv=5,
                                   scoring='r2',
                                   random_state=42,
                                   n_jobs=-1)
random_search.fit(X_train_scaled, y_train)

print(f"Best Parameters: {random_search.best_params_}")
print(f"Best R² Score: {random_search.best_score_:.2f}")
7. Comparing Models
Performance Metrics Comparison
After training multiple models, it's essential to compare their performance metrics to select the best one.
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Assume these regression models have already been fitted (see the example project below)
models = {
    'Linear Regression': linear_reg,
    'Ridge Regression': ridge_reg,
    'Lasso Regression': lasso_reg,
    'Elastic Net': elastic_net
}

results = []
for name, model in models.items():
    y_pred = model.predict(X_test_scaled)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    results.append({'Model': name, 'MSE': mse, 'RMSE': rmse, 'MAE': mae, 'R²': r2})

# Build the results table from a list of dicts (DataFrame.append was removed in pandas 2.0)
evaluation_df = pd.DataFrame(results)
print(evaluation_df)
Visualizing Model Performance
Visual representations can help in comparing the performance of different models effectively.
import seaborn as sns
import matplotlib.pyplot as plt
# Plot R² Scores
plt.figure(figsize=(10, 6))
sns.barplot(x='R²', y='Model', data=evaluation_df, palette='viridis')
plt.title('R² Score Comparison of Regression Models')
plt.xlabel('R² Score')
plt.ylabel('Model')
plt.xlim(0, 1)
plt.show()
# Plot RMSE
plt.figure(figsize=(10, 6))
sns.barplot(x='RMSE', y='Model', data=evaluation_df, palette='magma')
plt.title('RMSE Comparison of Regression Models')
plt.xlabel('RMSE')
plt.ylabel('Model')
plt.show()
8. Example Project: Comparing Models with Cross-Validation and Hyperparameter Tuning
Let's apply today's concepts by developing a comprehensive regression pipeline to predict housing prices using the California Housing Dataset. We'll compare multiple regression algorithms, perform cross-validation, and tune hyperparameters to optimize model performance.
Project Overview
Objective: Develop and compare different regression models to predict median housing prices based on various features. Implement cross-validation and hyperparameter tuning to enhance model accuracy and generalizability.
Tools: Python, Scikit-Learn, pandas, NumPy, Matplotlib, Seaborn
Step-by-Step Guide
1. Load and Explore the Dataset
from sklearn.datasets import fetch_california_housing
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load California Housing dataset
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target, name='MedHouseVal')
# Combine features and target
df = pd.concat([X, y], axis=1)
print(df.head())
# Visualize distribution of target variable
sns.histplot(df['MedHouseVal'], bins=50, kde=True)
plt.title('Distribution of Median House Values')
plt.xlabel('Median House Value')
plt.ylabel('Frequency')
plt.show()
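To round out the exploration, it can help to look at summary statistics and at how each feature correlates with the target; a short optional sketch using the same df:
# Summary statistics for all features
print(df.describe())

# Correlation of each feature with the target
print(df.corr()['MedHouseVal'].sort_values(ascending=False))

# Full correlation matrix as a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Feature Correlation Matrix')
plt.show()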
2. Data Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
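One caveat: when you later cross-validate or grid-search, it is safer to wrap the scaler and the model in a single Pipeline so the scaler is re-fit inside every training fold and no information leaks from the validation fold. A minimal sketch with Ridge as an example model:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Scaling is fit on each training fold only, avoiding data leakage
pipe = Pipeline([('scaler', StandardScaler()), ('ridge', Ridge(alpha=1.0))])
scores = cross_val_score(pipe, X, y, cv=5, scoring='r2')
print(f"Pipeline CV R²: {scores.mean():.2f}")
This matters most for scale-sensitive models such as Ridge, Lasso, Elastic Net, and SVR; tree ensembles are largely unaffected by feature scaling.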
3. Building and Training the Models
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
# Initialize models
linear_reg = LinearRegression()
ridge_reg = Ridge(alpha=1.0)
lasso_reg = Lasso(alpha=0.1)
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
random_forest = RandomForestRegressor(random_state=42)
svm_reg = SVR(kernel='linear')  # note: SVR can be slow on ~16,000 training samples
# Train models
linear_reg.fit(X_train_scaled, y_train)
ridge_reg.fit(X_train_scaled, y_train)
lasso_reg.fit(X_train_scaled, y_train)
elastic_net.fit(X_train_scaled, y_train)
random_forest.fit(X_train_scaled, y_train)
svm_reg.fit(X_train_scaled, y_train)
4. Making Predictions and Evaluating the Models
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

models = {
    'Linear Regression': linear_reg,
    'Ridge Regression': ridge_reg,
    'Lasso Regression': lasso_reg,
    'Elastic Net': elastic_net,
    'Random Forest': random_forest,
    'Support Vector Machine': svm_reg
}

results = []
for name, model in models.items():
    y_pred = model.predict(X_test_scaled)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    results.append({'Model': name, 'MSE': mse, 'RMSE': rmse, 'MAE': mae, 'R²': r2})

# Build the results table from a list of dicts (DataFrame.append was removed in pandas 2.0)
evaluation_df = pd.DataFrame(results)
print(evaluation_df)
5. Performing K-Fold Cross-Validation
from sklearn.model_selection import cross_val_score
# Perform cross-validation for Random Forest
cv_scores = cross_val_score(random_forest, X, y, cv=5, scoring='r2')
print(f"Random Forest Cross-Validation Rยฒ Scores: {cv_scores}")
print(f"Average Cross-Validation Rยฒ Score: {cv_scores.mean():.2f}")
6. Hyperparameter Tuning with GridSearchCV
from sklearn.model_selection import GridSearchCV
# Define parameter grid for Random Forest
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(estimator=RandomForestRegressor(random_state=42),
                           param_grid=param_grid,
                           cv=5,
                           scoring='r2',
                           n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Cross-Validation R² Score: {grid_search.best_score_:.2f}")
7. Hyperparameter Tuning with RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
param_dist = {
    'n_estimators': randint(100, 500),
    'max_depth': [None, 10, 20, 30, 40],
    'min_samples_split': randint(2, 21)
}

random_search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=42),
                                   param_distributions=param_dist,
                                   n_iter=50,
                                   cv=5,
                                   scoring='r2',
                                   random_state=42,
                                   n_jobs=-1)
random_search.fit(X_train_scaled, y_train)

print(f"Best Parameters: {random_search.best_params_}")
print(f"Best Cross-Validation R² Score: {random_search.best_score_:.2f}")
8. Comparing Model Performance
import seaborn as sns
import matplotlib.pyplot as plt
# Plot R² Scores
plt.figure(figsize=(10, 6))
sns.barplot(x='R²', y='Model', data=evaluation_df, palette='viridis')
plt.title('R² Score Comparison of Regression Models')
plt.xlabel('R² Score')
plt.ylabel('Model')
plt.xlim(0, 1)
plt.show()
# Plot RMSE
plt.figure(figsize=(10, 6))
sns.barplot(x='RMSE', y='Model', data=evaluation_df, palette='magma')
plt.title('RMSE Comparison of Regression Models')
plt.xlabel('RMSE')
plt.ylabel('Model')
plt.show()
9. Conclusion and Next Steps
Congratulations on completing Day 4 of "Becoming a Scikit-Learn Boss in 90 Days"! Today, you mastered Model Evaluation and Selection, learning how to validate your models effectively through cross-validation, perform hyperparameter tuning with Grid Search and Randomized Search, and compare multiple regression models to select the best performer.
What's Next?
- Day 5: Unsupervised Learning โ Clustering and Dimensionality Reduction: Understand clustering algorithms like K-Means and techniques like PCA.
- Day 6: Advanced Feature Engineering: Master techniques to create and select features that enhance model performance.
- Day 7: Ensemble Methods: Explore ensemble techniques like Bagging, Boosting, and Stacking.
- Day 8: Model Deployment with Scikit-Learn: Learn how to deploy your models into production environments.
- Days 9-90: Specialized Topics and Projects: Engage in specialized topics and comprehensive projects to solidify your expertise.
Tips for Success
- Practice Regularly: Apply the concepts through exercises and real-world projects.
- Engage with the Community: Join forums, attend webinars, and collaborate with peers.
- Stay Curious: Continuously explore new features and updates in Scikit-Learn.
- Document Your Work: Keep a detailed journal of your learning progress and projects.
Keep up the great work, and stay motivated as you continue your journey to mastering Scikit-Learn and machine learning!
Summary of Day 4
- Introduction to Model Evaluation and Selection: Learned the importance of model evaluation and the bias-variance tradeoff.
- Model Validation Techniques: Explored Train-Test Split, K-Fold Cross-Validation, Stratified K-Fold, and Leave-One-Out Cross-Validation.
- Hyperparameter Tuning: Mastered Grid Search, Randomized Search, and Bayesian Optimization for tuning model parameters.
- Implementing Model Evaluation and Selection with Scikit-Learn: Practiced cross-validation and hyperparameter tuning using GridSearchCV and RandomizedSearchCV.
- Comparing Models: Compared different regression models using performance metrics and visualizations.
- Example Project: Comparing Models with Cross-Validation and Hyperparameter Tuning: Developed a comprehensive regression pipeline to predict housing prices, evaluated multiple models, and optimized their performance through cross-validation and hyperparameter tuning.