Data quality plays an important role in model performance. While sophisticated algorithms continue to emerge, the axiom "garbage in, garbage out" remains true. CleanLab emerges as a groundbreaking solution to this persistent challenge, offering a systematic approach to identifying and correcting label errors in datasets.
CleanLab represents a shift in data quality management by implementing confident learning algorithms that automatically identify potential label errors. In this post, I compare several models as the base classifier for CleanLab to figure out which one best enhances performance.
Linear Models
Linear models play a crucial role in CleanLab's data cleaning framework. Their effectiveness stems from two key characteristics that make them valuable for identifying label errors: they provide a stable, robust baseline, and they produce reliable probability estimates.
from sklearn.linear_model import LogisticRegression
from cleanlab.classification import CleanLearning
# Creating an interpretable linear model
linear_model = LogisticRegression(
    C=1.0,
    class_weight='balanced',
    random_state=42
)
# Integrating with CleanLab
cl = CleanLearning(linear_model)
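CleanLab's confident learning works from out-of-sample predicted probabilities, so the base model must expose predict_proba. A minimal sklearn-only sketch of the probability estimates the linear model supplies (make_classification is a stand-in for a real dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Toy dataset standing in for real training data
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

model = LogisticRegression(C=1.0, class_weight='balanced', random_state=42)

# Out-of-sample class probabilities -- the raw material confident learning
# uses to flag examples whose given label disagrees with the model's belief
pred_probs = cross_val_predict(model, X, y, cv=5, method='predict_proba')

print(pred_probs.shape)  # one probability per class, per example
```

Out-of-sample (cross-validated) probabilities matter here: probabilities from a model scored on its own training data would be overconfident and hide label errors.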
SVM
Support vector machine (SVM) is a powerful supervised learning algorithm used for classification and regression tasks. It is particularly effective in high-dimensional spaces and is widely used across machine learning applications.
The key concepts of SVM are the margin, support vectors, and the kernel trick. The margin is the distance between the decision boundary and the nearest data points; SVM aims to maximize this margin, and larger margins generally lead to better generalization. Support vectors are the points closest to the decision boundary: these critical points define the margin, and only they affect the decision boundary. Finally, the kernel trick transforms non-linear problems into linear ones by mapping the data into a higher-dimensional space.
# Common kernel options in sklearn
from sklearn.svm import SVC
# Linear kernel
linear_svm = SVC(kernel='linear')
# RBF (Gaussian) kernel
rbf_svm = SVC(kernel='rbf')
# Polynomial kernel
poly_svm = SVC(kernel='poly', degree=3)
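One caveat when pairing SVMs with CleanLab: SVC does not provide predict_proba by default, so probability=True (which enables probability calibration via internal cross-validation) should be set. A quick sketch on toy data:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy dataset standing in for real training data
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# probability=True enables predict_proba, which CleanLearning requires;
# without it, SVC only exposes decision_function scores
svm = SVC(kernel='rbf', probability=True, random_state=0)
svm.fit(X, y)

probs = svm.predict_proba(X)
print(probs.shape)  # (100, 2)
```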
Random Forest Classifier
Random Forest Classifier is particularly effective in CleanLab due to its ensemble nature and robust probability estimates: multiple decision trees provide robust predictions that naturally handle outliers and noise.
from sklearn.ensemble import RandomForestClassifier
from cleanlab.classification import CleanLearning
# Basic setup
rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    bootstrap=True,
    random_state=42
)
# Integration with CleanLab
cl = CleanLearning(rf_model)
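The "robust probability estimates" come from averaging: the forest's predict_proba is the mean of the individual trees' probabilities, so no single noisy tree dominates. A small sketch on toy data illustrating this:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy dataset standing in for real training data
X, y = make_classification(n_samples=150, n_features=8, random_state=1)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# The forest's probability estimate is the average of the individual
# trees' estimates -- this averaging is what smooths out noise
tree_probs = np.mean([t.predict_proba(X) for t in rf.estimators_], axis=0)
forest_probs = rf.predict_proba(X)
print(np.allclose(tree_probs, forest_probs))  # True
```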
XGBoost
XGBoost is an ensemble learning technique that builds models sequentially, with each new model trying to correct the errors made by the previous ones. Gradient boosting uses gradient descent to minimize errors and combines weak learners into a strong predictor.
import xgboost as xgb
from cleanlab.classification import CleanLearning
# Basic XGBoost setup
xgb_model = xgb.XGBClassifier(
    n_estimators=100,           # Number of boosting rounds
    learning_rate=0.1,          # Step size shrinkage
    max_depth=5,                # Maximum tree depth
    min_child_weight=1,         # Minimum sum of instance weight
    subsample=0.8,              # Subsample ratio of training instances
    colsample_bytree=0.8,       # Subsample ratio of columns
    objective='binary:logistic' # Objective function
)
# Integration with CleanLab
cl = CleanLearning(xgb_model)
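The sequential-correction idea can be seen directly by tracking the training loss round by round. A sketch using sklearn's GradientBoostingClassifier (chosen here only because it implements the same gradient boosting principle and exposes per-stage predictions; XGBoost behaves analogously):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss

# Toy dataset standing in for real training data
X, y = make_classification(n_samples=300, n_features=10, random_state=7)

gb = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1,
                                max_depth=3, random_state=7).fit(X, y)

# Training loss after each boosting round: each new tree is fit to
# reduce the residual error left by the ensemble so far
losses = [log_loss(y, p) for p in gb.staged_predict_proba(X)]
print(losses[0] > losses[-1])  # True: later rounds fit better
```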
Soft Voting Ensemble
Finally, we can build a soft voting ensemble to make CleanLab even more effective. Soft voting combines the predictions from multiple models by averaging their predicted probabilities, rather than simply taking a majority vote of predicted classes.
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from cleanlab.classification import CleanLearning
# Initialize base models
svm_model = SVC(probability=True, kernel='rbf')
rf_model = RandomForestClassifier(n_estimators=100)
xgb_model = xgb.XGBClassifier(n_estimators=100)
lr_model = LogisticRegression()
# Create voting classifier
ensemble = VotingClassifier(
    estimators=[
        ('svm', svm_model),
        ('rf', rf_model),
        ('xgb', xgb_model),
        ('lr', lr_model)
    ],
    voting='soft'
)
# Integrate with CleanLab
cl_ensemble = CleanLearning(ensemble)
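To see why soft voting differs from a hard majority vote, consider a miniature example with made-up probabilities from three hypothetical models over two classes. One very confident model can outvote two lukewarm ones:

```python
import numpy as np

# Hypothetical predicted probabilities for one example, classes [A, B]
p_model1 = np.array([0.90, 0.10])  # strongly favors A
p_model2 = np.array([0.45, 0.55])  # slightly favors B
p_model3 = np.array([0.45, 0.55])  # slightly favors B

# Hard voting: predicted classes are [A, B, B] -> majority picks B.
# Soft voting averages the probabilities instead:
avg = np.mean([p_model1, p_model2, p_model3], axis=0)
print(avg.argmax())  # 0 -> class A wins, because model 1's confidence
                     # outweighs the two weak votes for B
```

This sensitivity to confidence is exactly why soft voting pairs well with CleanLab, which consumes the averaged probabilities rather than the final class labels.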