I made scikit-multilearn-ng, a follow-up to the widely used scikit-multilearn package for multilabel classification

While using scikit-multilearn I ran into some bugs, so I opened a PR and waited. After double-checking, I saw that there hadn't been any commits in 7 months (now 9 months) and no release since 2018. Digging in, I found out that no one had access to the PyPI credentials anymore, among other issues. So I opened a discussion about creating a fork, and many people were eager for it.

So after some development, I'm here to introduce scikit-multilearn-ng (GitHub: https://github.com/scikit-multilearn-ng/scikit-multilearn-ng), an open-source library for multi-label classification in Python. It's a direct successor to scikit-multilearn and brings a host of improvements and new features.

What Makes scikit-multilearn-ng Stand Out?

  • Enhanced Integration with scikit-learn: This package not only integrates with the scikit-learn ecosystem but also extends its capabilities, making it a natural fit for anyone already familiar with scikit-learn (see the sketch after this list).
  • Expanded Algorithm Collection: Among its new offerings are StructuredGridSearchCV and the SMiLE algorithm, specifically designed for more complex multi-label classification tasks, including handling missing labels and heterogeneous features.
  • Open Source Philosophy: As a community-driven project, it's free to use and open for contributions, perfect for collaborative development.
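
Because the estimators follow the scikit-learn interface, you can plug them straight into scikit-learn tooling such as GridSearchCV. Here's a minimal sketch, with an illustrative toy dataset and parameter grid, that tunes the base classifier used inside a Binary Relevance transformation:

from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from skmultilearn.problem_transform import BinaryRelevance

# Toy multilabel dataset: 100 samples, 20 count features, 5 labels
X, y = make_multilabel_classification(n_samples=100, n_features=20,
                                      n_classes=5, random_state=42)

# Tune both the base classifier and its hyperparameters through the wrapper
parameters = {
    'classifier': [MultinomialNB()],
    'classifier__alpha': [0.5, 1.0],
}

clf = GridSearchCV(BinaryRelevance(), parameters, scoring='f1_macro', cv=3)
clf.fit(X, y)

print(clf.best_params_, clf.best_score_)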

Why Should You Consider Upgrading?

  • Ease of Transition: For those already using scikit-multilearn, upgrading is as simple as switching the dependency to scikit-multilearn-ng (see the snippet after this list). Your existing code will work without any changes.
  • Active Development and Support: scikit-multilearn-ng offers bug fixes and new features, ensuring your projects stay current and robust.
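
If you install from PyPI (the package name is scikit-multilearn-ng, per the project README), switching really is just a dependency swap, and the import namespace stays skmultilearn:

# Assuming the PyPI package name scikit-multilearn-ng:
#   pip uninstall scikit-multilearn
#   pip install scikit-multilearn-ng

# Imports keep the same skmultilearn namespace, so existing code runs unchanged
from skmultilearn.problem_transform import BinaryRelevance
from skmultilearn.model_selection import iterative_train_test_split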

Whether you're a seasoned Python developer or just starting out in machine learning, scikit-multilearn-ng is worth exploring.

Some Example Use Cases:

A simple example use case is iteratively splitting multilabel data into train and test sets while trying to maintain the distribution of each label across the two sets. This is particularly useful for datasets where certain label combinations are rare.

from skmultilearn.model_selection import iterative_train_test_split

# Assuming X is your feature matrix and y is your label matrix:
# X should be a numpy array or a sparse matrix
# y should be a binary indicator matrix (each label is either 0 or 1)

# Define the size of your test set
test_size = 0.2

# Perform the split; note the return order: X_train, y_train, X_test, y_test
# The returned matrices keep the original number of feature/label columns,
# so no reshaping is needed
X_train, y_train, X_test, y_test = iterative_train_test_split(X, y, test_size=test_size)

It also supports problem transformation approaches that turn a multi-label problem into one or more single-label problems, such as Binary Relevance:

from skmultilearn.problem_transform import BinaryRelevance
from sklearn.svm import SVC

# Binary Relevance trains one independent binary SVC per label
# require_dense=[False, True] keeps X in sparse form but hands SVC dense label columns
classifier = BinaryRelevance(classifier=SVC(), require_dense=[False, True])
classifier.fit(X_train, y_train)

# Predict the label matrix for the test set (returned as a sparse matrix)
predictions = classifier.predict(X_test)
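
Since the predicted label matrix is a binary indicator matrix, you can score it with standard scikit-learn metrics. A minimal sketch, assuming y_test and predictions come from the snippets above:

from sklearn.metrics import accuracy_score, hamming_loss

# predictions is returned as a sparse matrix; a dense array keeps the metrics simple
y_pred = predictions.toarray()

# Subset accuracy: fraction of samples whose full label set is predicted exactly
print("Subset accuracy:", accuracy_score(y_test, y_pred))

# Hamming loss: fraction of individual label assignments that are wrong
print("Hamming loss:", hamming_loss(y_test, y_pred))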

Please contribute to the project and give it a star!

I'm looking forward to your feedback, questions, and how you might use it in your projects!
