Discover packages and tools that have been pivotal in my coding journey as an ML engineer. They not only enhance efficiency but also introduce innovative solutions, reshaping how I tackle problems using Python.
In this series, we will explore five or fewer packages from each of several categories: ML, Data Engineering Pipelines, Frameworks & DL, Visualization, API & Deployment, Developer Tools, and other Packages I Adore.
This installment is centered on Machine Learning packages. Each package comes with a succinct description, its main advantages, and a sample use-case to highlight its code design. Where relevant, I'll also provide alternatives or complementary packages, giving you a holistic perspective on the tools available.
Machine Learning
1. scikit-learn
Description: A comprehensive library for machine learning algorithms.
Advantage: User-friendly with a consistent API and thorough documentation.
When to use: It's the package of choice for standard machine learning tasks, including classification, regression, and clustering.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# numeric columns: fill gaps with the median, then standardise
numeric_features = ['Salary']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# text column: passed as a single column name so the vectorizer receives 1-D input
text_feature = 'SelfDescription'
text_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# categorical columns: label missing values, then one-hot encode
categorical_features = ['Age', 'Country']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('txt', text_transformer, text_feature),
        ('cat', categorical_transformer, categorical_features),
    ])

predictor = Pipeline(steps=[('preprocessor', preprocessor),
                            ('classifier', LogisticRegression(solver='lbfgs'))])
# train
predictor.fit(X_train, y_train)
# evaluate and predict
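The evaluate-and-predict step is only hinted at in the snippet above. Here is a minimal sketch of what it could look like, assuming a tiny made-up DataFrame: all values, the labels, and the choice to store Age as age-band strings are my own assumptions for illustration, not real data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data (values and labels are invented for this sketch).
data = pd.DataFrame({
    "Salary": [52000, 61000, float("nan"), 48000, 75000, 58000, 67000, 45000],
    "SelfDescription": [
        "python developer who enjoys data pipelines",
        "marketing analyst focused on campaigns",
        "machine learning engineer building models",
        "sales representative meeting clients",
        "data scientist training neural networks",
        "account manager handling customers",
        "backend engineer writing python services",
        "recruiter sourcing candidates",
    ],
    "Age": ["25-34", "35-44", "25-34", "45-54", "35-44", "45-54", "25-34", "35-44"],
    "Country": ["DK", "SE", "DK", "NO", "SE", "NO", "DK", "SE"],
})
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Same preprocessing layout as in the article's pipeline.
preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]), ["Salary"]),
    ("txt", TfidfVectorizer(stop_words="english"), "SelfDescription"),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["Age", "Country"]),
])
predictor = Pipeline([("preprocessor", preprocessor),
                      ("classifier", LogisticRegression(solver="lbfgs"))])

# Hold out the last two rows as a (far too small) test set.
X_train, X_test = data.iloc[:6], data.iloc[6:]
y_train, y_test = labels[:6], labels[6:]

predictor.fit(X_train, y_train)

# evaluate and predict
y_pred = predictor.predict(X_test)
y_proba = predictor.predict_proba(X_test)  # class probabilities per row
accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy on held-out rows: {accuracy:.2f}")
```

Because the preprocessing lives inside the pipeline, the same `predict` call works on raw rows without any manual transformation. On real data you would of course use a proper train/test split or cross-validation rather than two held-out rows.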
river complements scikit-learn by offering tools specifically designed for online learning, ideal for scenarios where data streams in real time. While scikit-learn is optimized for batch learning, river provides a solution for incrementally updating models with new data points as they arrive.
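To make the batch-versus-online distinction concrete, here is a small pure-Python sketch of incremental learning: a logistic regression updated one observation at a time with stochastic gradient descent. It deliberately does not use river itself. The class, the synthetic stream, and the learning rate are hypothetical, and the method names only echo river's one-sample-at-a-time style.

```python
import math
import random


class OnlineLogisticRegression:
    """Toy logistic regression trained one sample at a time (not river's implementation)."""

    def __init__(self, n_features, lr=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.lr = lr

    def predict_proba_one(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        # Gradient of the log loss for a single observation.
        error = self.predict_proba_one(x) - y
        self.weights = [w - self.lr * error * xi for w, xi in zip(self.weights, x)]
        self.bias -= self.lr * error


rng = random.Random(0)
model = OnlineLogisticRegression(n_features=2)

n_samples = 2000
correct = 0
for _ in range(n_samples):
    # Simulated stream: the label depends on the sign of x0 + x1.
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    y = 1 if x[0] + x[1] > 0 else 0

    # Test-then-train: score the prediction first, then update the model.
    prediction = 1 if model.predict_proba_one(x) >= 0.5 else 0
    correct += int(prediction == y)
    model.learn_one(x, y)

progressive_accuracy = correct / n_samples
print(f"progressive accuracy: {progressive_accuracy:.3f}")
```

The model never sees the full dataset at once and never needs to be refit from scratch, which is the property that matters when data arrives as a stream.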
skorch seamlessly integrates the deep learning capabilities of PyTorch into the scikit-learn ecosystem. It allows developers to use PyTorch-based neural networks as if they were scikit-learn estimators, making it easier to incorporate deep learning models into workflows that already leverage scikit-learn tools, such as grid search and pipelines.
2. PyMC
Description: Specialized in Bayesian modeling and probabilistic machine learning.
Advantage: Equips you with the tools to define probabilistic models in code.
When to use: Ideal for white-box ML using probabilistic programming.
import pymc as pm

with pm.Model() as model:
    # Priors (the scale parameter is `sigma` in current PyMC, not `sd`)
    alpha = pm.Normal('alpha', mu=0, sigma=10)
    beta = pm.Normal('beta', mu=0, sigma=10, shape=X_train.shape[1])
    # Linear combination
    mu = alpha + pm.math.dot(X_train, beta)
    # Logistic link function
    p = pm.math.invlogit(mu)
    # Likelihood
    y_obs = pm.Bernoulli('y_obs', p=p, observed=y_train)
    # Sample/train
    trace = pm.sample(3000)
# Evaluation with posterior predictive checks
# Prediction by drawing samples from the posterior predictive distribution
While Stan offers its own modeling language and provides MCMC sampling, and Edward integrates with TensorFlow/Keras to offer Variational Inference, PyMC stands out for its ease of use within the Python environment, user-friendly API, and active community.
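If MCMC sampling feels like a black box, the following library-free sketch may help. It runs a random-walk Metropolis sampler on an intercept-only version of the logistic model above, with the same Normal(0, 10) prior on alpha. This is a much simpler algorithm than the samplers PyMC actually uses, and the counts (70 positives, 30 negatives), step size, and burn-in length are assumptions chosen purely for illustration.

```python
import math
import random


def log_posterior(alpha, n_pos, n_neg, prior_sd=10.0):
    """Unnormalised log posterior of an intercept-only logistic model."""
    log_prior = -0.5 * (alpha / prior_sd) ** 2
    # Bernoulli log-likelihood with p = invlogit(alpha)
    log_lik = (-n_pos * math.log1p(math.exp(-alpha))
               - n_neg * math.log1p(math.exp(alpha)))
    return log_prior + log_lik


def metropolis(n_draws, n_pos, n_neg, step=0.5, seed=42):
    """Random-walk Metropolis: propose a nearby value, accept by posterior ratio."""
    rng = random.Random(seed)
    alpha = 0.0
    current = log_posterior(alpha, n_pos, n_neg)
    draws = []
    for _ in range(n_draws):
        proposal = alpha + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, n_pos, n_neg)
        # Accept with probability min(1, posterior(proposal) / posterior(alpha)).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            alpha, current = proposal, candidate
        draws.append(alpha)
    return draws


# Hypothetical data: 70 positive and 30 negative labels.
draws = metropolis(6000, n_pos=70, n_neg=30)
kept = draws[1000:]  # discard the first draws as burn-in

# Map each draw of alpha to a probability to summarise the posterior of p.
posterior_p = [1.0 / (1.0 + math.exp(-a)) for a in kept]
mean_p = sum(posterior_p) / len(posterior_p)
print(f"posterior mean of p: {mean_p:.3f}")
```

The output is a collection of draws rather than a single fitted value, which is what makes the uncertainty of every parameter directly inspectable in a probabilistic model.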
3. darts
Description: My preferred package for time series forecasting and anomaly detection.
Advantage: It offers comprehensive tools for time series analysis and a unified interface for various forecasting models.
When to use: Essential when dealing with time series data and you need forecasting, anomaly detection, or other analyses using classical, deep learning, prophet models, and beyond.
from darts.models import RNNModel

model_config = {
    "model_name": "Sales_LSTM",
    "hidden_dim": 20,
    "dropout": 0,
    "batch_size": 16,
    "n_epochs": 200,
    "random_state": 42,
    "training_length": 20,
    "input_chunk_length": 14,
    "force_reset": True,
    "save_checkpoints": True,
}
model = RNNModel(
    model="LSTM",
    optimizer_kwargs={"lr": 1e-3},
    **model_config
)
# train (TimeSeriesData is a placeholder for a darts TimeSeries holding the training data)
model.fit(TimeSeriesData)
# forecast the next 3 time steps
forecast = model.predict(3)
Alternatives: Merlion and Kats
While Merlion and Kats offer their own sets of capabilities in time series analysis, Darts shines as a comprehensive choice for time series forecasting and processing, catering to a wide range of requirements with its extensive toolkit. Both Merlion and Kats can serve as potential alternatives, but Darts' holistic offerings make it a standout choice for me.
4. FLAML
Description: A swift and efficient automated machine learning library.
Advantage: Achieve optimal ML results with minimal coding and time investment.
When to use: Perfect when you desire swift outcomes without the intricacies of model fine-tuning.
from flaml import AutoML

automl = AutoML()
automl_config = {
    "time_budget": 120,  # time in seconds
    "metric": 'accuracy',
    "task": 'classification',
    "estimator_list": ['lgbm', 'xgboost', 'catboost', 'extra_tree'],
    "seed": 42,
    "log_file_name": "churn.log",
    "log_training_metric": True,
}
# train
automl.fit(X_train, y_train, **automl_config)
# evaluate and predict
Complementary: AutoGluon and mljar-supervised
FLAML specializes in automating machine learning tasks for tabular data. In contrast, AutoGluon amplifies the automation game by accommodating a wider spectrum, including text, images, and multi-modal data, making it a more versatile toolkit. Meanwhile, mljar-supervised goes beyond FLAML by adding model explanation, ensembling, and visualization, presenting itself as a viable alternative with comparable capabilities.
5. CVXPY
Description: The go-to library for convex optimization.
Advantage: It provides an intuitive method to define and solve convex optimisation problems.
When to use: Essential for solving optimisation challenges across domains like finance, control, signal processing, and more.
'''
Task: Operations Research
PYIKEA wants to maximize its profit from selling armchairs, wingchairs, and Lovet-tables. The profit per armchair is 150 DKK, per wingchair 100 DKK, and per Lovet-table 250 DKK.
It takes:
15 planks of wood and 5 hours of labour to make one armchair
12 planks of wood and 2 hours of labour to make one wingchair
18 planks of wood and 8 hours of labour to make one Lovet-table
The store needs at least 4 of each chair and at least 1 table. There are 450 planks of wood in storage, and the labour budget is only 120 hours.
What combination of chairs and table(s) yields the maximum profit?
'''
import cvxpy as cp
# Variables
A = cp.Variable(integer=True, name="Armchair")
W = cp.Variable(integer=True, name="Wingchair")
L = cp.Variable(integer=True, name="Lovet-table")
# Objective
profit = 150*A + 100*W + 250*L
objective = cp.Maximize(profit)
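# Constraints
constraints = [
    15*A + 12*W + 18*L <= 450,  # planks of wood
    5*A + 2*W + 8*L <= 120,     # hours of labour
    A >= 4,
    W >= 4,
    L >= 1,
]
# Problem to Solve (integer variables need a mixed-integer-capable solver to be installed)
problem = cp.Problem(objective, constraints=constraints)
result = problem.solve()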
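Since the PYIKEA problem is tiny, the solver's answer can be sanity-checked with a plain exhaustive search. The sketch below uses no optimisation library at all; it simply enumerates every integer combination allowed by the wood limit and keeps the most profitable feasible one. The variable names are my own.

```python
# Exhaustive search over the PYIKEA problem. This is only practical because
# the search space is small; the loop bounds follow from the 450-plank limit.
best = None
for armchairs in range(4, 450 // 15 + 1):
    for wingchairs in range(4, 450 // 12 + 1):
        for tables in range(1, 450 // 18 + 1):
            wood = 15 * armchairs + 12 * wingchairs + 18 * tables
            labour = 5 * armchairs + 2 * wingchairs + 8 * tables
            if wood > 450 or labour > 120:
                continue  # violates a resource constraint
            profit = 150 * armchairs + 100 * wingchairs + 250 * tables
            if best is None or profit > best[0]:
                best = (profit, armchairs, wingchairs, tables)

best_profit, best_armchairs, best_wingchairs, best_tables = best
print(best)  # → (4550, 4, 22, 7)
```

The best plan is 4 armchairs, 22 wingchairs, and 7 Lovet-tables for a profit of 4550 DKK, using exactly 450 planks and 120 labour hours. A correctly configured solver run of the CVXPY model should report the same optimum.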
Alternative: pyomo
Pyomo and cvxpy are both stellar choices of mine for optimisation in Python, each with its own set of strengths. While cvxpy excels with its intuitive approach to convex problems, Pyomo flaunts versatility in tackling a variety of optimisation challenges, be it linear, nonlinear, or mixed-integer. Essentially, picking between the two boils down to the mood I am in that day!
We navigated through my favourite Python packages that redefine machine learning workflows. I picked scikit-learn for general ML tasks, complemented by tools like river and skorch. PyMC guided us through the intricacies of Bayesian modeling with alternatives like Stan and Edward. darts emerged as a comprehensive choice for time series analysis, though Merlion and Kats offer their unique capabilities.
For rapid results, FLAML streamlines automated ML, with AutoGluon and mljar-supervised expanding on similar terrains. Lastly, CVXPY showcased its prowess in optimization, with pyomo as another contender. These packages, collectively, illuminate the expansive and evolving landscape of Python-based machine learning that can elevate your skills, as they did mine.
Stay tuned for the next segment on Data Engineering Pipelines, featuring Dagster, Apache Airflow, Prefect and Argo.
Until then, stay curious and keep on coding.