Beginner's Guide to Scikit-Learn (sklearn) πŸ“š

What is Scikit-Learn? πŸ€”

Scikit-Learn is a popular Python library that provides simple and efficient tools for data mining, data analysis, and machine learning. It’s built on top of other libraries like NumPy, SciPy, and Matplotlib, making it a great choice for building both simple and complex models.


Key Features of Scikit-Learn 🌟

  1. Simple and Consistent Interface: All machine learning models in sklearn follow the same basic interface. Once you learn one, you can use them all (there's a quick sketch right after this list)!
  2. Wide Range of Algorithms: It includes algorithms for classification, regression, clustering, and more.
  3. Preprocessing Tools: Easily clean and prepare your data with tools for scaling, normalization, and encoding.
  4. Model Evaluation: Multiple metrics and tools for validating your models.
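To preview what that consistent interface looks like (we'll install the library in the next section), here's a tiny sketch of my own: two very different models, a logistic regression and a decision tree, are trained and used with exactly the same fit() and predict() calls. The specific models and dataset here are just illustrative choices.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Load a small built-in dataset
X, y = load_iris(return_X_y=True)

# Two very different models, one shared interface: fit() to train, predict() to use
for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=0)):
    model.fit(X, y)
    print(type(model).__name__, "predicts:", model.predict(X[:3]))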

Installing Scikit-Learn πŸ“₯

Before we begin, let's make sure you have Scikit-Learn installed. If you don't have it installed yet, you can easily get it using pip:

pip install scikit-learn

Now, let's get started with some basic examples! πŸŽ‰


Example 1: Loading and Understanding Data πŸ—‚οΈ

First things first, let’s load some data! Scikit-Learn comes with a bunch of built-in datasets. We’ll use the famous Iris dataset for our example.

from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()

# Check out the features and labels
print("Features:", iris.feature_names)
print("Labels:", iris.target_names)

# Display the first 5 records
print("First 5 records:\n", iris.data[:5])

Output:

Features: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Labels: ['setosa' 'versicolor' 'virginica']
First 5 records:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.0 1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.0 3.6 1.4 0.2]]
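If you prefer to explore data as a table, the same dataset can also be loaded as a pandas DataFrame. This is an optional extra (it assumes you have pandas installed and a reasonably recent version of Scikit-Learn that supports as_frame):

from sklearn.datasets import load_iris

# Load the same dataset, but with features and target packed into a pandas DataFrame
iris_bunch = load_iris(as_frame=True)

# The 'frame' attribute holds the features and the target column together
print(iris_bunch.frame.head())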

Example 2: Splitting the Data 🎲

Before training a model, it’s important to split your data into training and testing sets. This helps you evaluate the performance of your model on unseen data.

from sklearn.model_selection import train_test_split

# Split the data (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

print("Training set size:", len(X_train))
print("Testing set size:", len(X_test))

Output:

Training set size: 120
Testing set size: 30

Example 3: Building a Simple Classifier 🧠

Let’s build a simple classification model using the k-Nearest Neighbors (k-NN) algorithm. It's a great algorithm for beginners!

from sklearn.neighbors import KNeighborsClassifier

# Initialize the model
knn = KNeighborsClassifier(n_neighbors=3)

# Train the model
knn.fit(X_train, y_train)

# Make predictions on the test set
predictions = knn.predict(X_test)

# Display predictions
print("Predictions:", predictions)
print("Actual Labels:", y_test)

Output:

Predictions: [0 1 2 1 1 0 2 1 1 2 2 1 0 0 2 1 1 1 2 0 0 0 2 2 0 1 2 0 0 1]
Actual Labels: [0 1 2 1 1 0 2 1 1 2 2 1 0 0 2 1 1 1 2 0 0 0 2 2 0 1 2 0 0 1]

Example 4: Evaluating the Model πŸ“Š

Now, let's evaluate our model's performance using accuracy, one of the simplest metrics.

from sklearn.metrics import accuracy_score

# Calculate the accuracy
accuracy = accuracy_score(y_test, predictions)
print("Model Accuracy:", accuracy)

Output:

Model Accuracy: 1.0
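Accuracy is just one way to look at performance. If you'd like a more detailed view, Scikit-Learn also offers a confusion matrix and a per-class report. This optional snippet builds on the variables (y_test, predictions, iris) from the examples above:

from sklearn.metrics import confusion_matrix, classification_report

# Rows are the actual classes, columns are the predicted classes
print("Confusion Matrix:\n", confusion_matrix(y_test, predictions))

# Precision, recall, and F1-score for each flower species
print(classification_report(y_test, predictions, target_names=iris.target_names))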

Machine Learning with Scikit-Learn 🌟

Let’s explore some more machine learning techniques using Scikit-Learn. We’ll look at examples of Regression, Clustering, and Dimensionality Reduction. These are key concepts in machine learning, and Scikit-Learn makes it super easy to implement them. Let’s dive in! πŸŠβ€β™‚οΈ


Example 1: Linear Regression πŸ“ˆ

Linear Regression predicts a continuous value, such as house prices or temperature. It's one of the simplest and most widely used regression techniques.

Problem Statement:

Let’s predict a person's weight from their BMI.

from sklearn.linear_model import LinearRegression
import numpy as np

# Sample data: BMI and corresponding weights
X = np.array([[18.5], [24.9], [30.0], [35.0], [40.0]])  # BMI
y = np.array([60, 70, 80, 90, 100])  # Weight in kg

# Initialize and train the model
model = LinearRegression()
model.fit(X, y)

# Predict weight for a BMI of 28.0
predicted_weight = model.predict([[28.0]])
print("Predicted weight for BMI 28.0:", predicted_weight[0], "kg")

Output:

Predicted weight for BMI 28.0: 76.845 kg
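If you're curious about the line the model actually learned, the fitted slope and intercept are stored on the model after fit(). This optional snippet inspects them; multiplying the slope by 28.0 and adding the intercept reproduces the prediction above.

# The learned line is: weight = coef_[0] * BMI + intercept_
print("Slope (kg per BMI unit):", model.coef_[0])
print("Intercept (kg):", model.intercept_)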

Example 2: K-Means Clustering 🎨

K-Means Clustering is an unsupervised learning algorithm used to group similar data points into clusters. It’s useful when you want to identify patterns or groupings in your data.

Problem Statement:

Group customers based on their spending habits.

from sklearn.cluster import KMeans

# Sample data: Annual Income and Spending Score
X = np.array([[15, 39], [16, 81], [17, 6], [18, 77], [19, 40], [20, 76]])

# Initialize the model with 2 clusters
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X)

# Predict the cluster for a new customer with income 18 and spending score 50
cluster = kmeans.predict([[18, 50]])
print("Cluster for new customer:", cluster[0])

Output:

Cluster for new customer: 1
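To see what the two groups actually look like, you can inspect the fitted model. This optional snippet prints the coordinates of the two cluster centers and the cluster assigned to each of the sample customers:

# Coordinates of the two cluster centers (annual income, spending score)
print("Cluster centers:\n", kmeans.cluster_centers_)

# Which cluster each of the original customers was assigned to
print("Labels for the sample customers:", kmeans.labels_)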

Example 3: Principal Component Analysis (PCA) 🌐

Principal Component Analysis (PCA) is a dimensionality reduction technique. It’s often used to reduce the number of features in a dataset while retaining most of the variance (information).

Problem Statement:

Reduce the dimensionality of the Iris dataset to 2 components.

from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
X = iris.data

# Initialize PCA with 2 components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Display the first 5 rows of the reduced feature set
print("Reduced feature set:\n", X_reduced[:5])

Output:

Reduced feature set:
 [[-2.68412563  0.31939725]
 [-2.71414169 -0.17700123]
 [-2.88899057 -0.14494943]
 [-2.74534286 -0.31829898]
 [-2.72871654  0.32675451]]
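How much information did we keep after going from 4 features down to 2? The fitted PCA object can tell you through its explained variance ratio; for the Iris dataset the two components together retain the large majority of the variance. This optional snippet shows how to check:

# Fraction of the original variance captured by each of the 2 components
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance retained:", pca.explained_variance_ratio_.sum())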

Conclusion πŸŽ‰

Congrats! You've just built and evaluated your first machine learning model using Scikit-Learn. πŸ’ͺ As you can see, Scikit-Learn makes it easy to get started with machine learning, thanks to its simple and consistent interface.

These examples are just the tip of the iceberg! The more you practice, the better you'll get. Keep exploring, try out different datasets and algorithms, and most importantly, have fun! Machine learning is a vast and exciting field, and with tools like Scikit-Learn you can explore it one step at a time.

Happy coding! πŸ˜„

NOTE: If you’re excited to learn more, don’t hesitate to experiment with other algorithms in sklearn. The possibilities are endless! πŸš€


About Me:
πŸ–‡οΈLinkedIn
πŸ§‘β€πŸ’»GitHub
