In data science, where things evolve fast and code can quickly become obsolete, the importance of knowing how to structure projects cannot be overstated. As data projects become more complex, adopting a modular code approach emerges as a strategic imperative. Unfortunately, many new learners in the field are never taught this modular approach to structuring data science scripts, partly because many data boot camps and data schools do not teach an end-to-end method that treats modularization of the different stages as the industry standard.
This guide examines the substantial advantages of structuring data science projects using modular code approaches, highlights the inherent limitations of conventional tools such as Jupyter Notebooks, presents a recommended project structure, and provides a comprehensive demonstration utilizing the iconic Iris dataset.
Outline
Code Modularity
Advantages of Modular Code in Data Science Projects
Limitations of Jupyter Notebook
Typical Stages of a Machine Learning Project
Demonstration Using The Iris Dataset
Conclusion
Code Modularity
The practice of breaking down a program into separate components (more appropriately, modules, in programming parlance), where each component or module is responsible for a specific piece of functionality, is known as code modularity. In essence, when applied to a data science problem, code modularity is all about breaking down the different stages of a data project into separate modules or scripts.
Advantages of Modular Code
Code Organization and Readability
The strategic compartmentalization of code into modules fosters a sense of clarity, disentangling intricate project structures. This approach heightens the readability of the codebase, enabling a more lucid understanding of individual components.
Code Reusability
Modules serve as reusable building blocks that transcend project boundaries. Thoughtfully designed modules can be repurposed, eliminating redundancy and expediting the development lifecycle.
Collaboration and Teamwork
A modular structure paves the way for seamless collaboration among team members. Each module can be independently developed, tested, and maintained, allowing concurrent progress on different aspects of the project.
Maintainability and Debugging
Debugging becomes a more straightforward endeavor as issues are confined within specific modules. Changes made to one module can be contained and tested, reducing unintended side effects throughout the codebase.
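To make the testability point concrete, here is a minimal sketch of a hypothetical test file (test_data_loading.py, not part of the project built below) that exercises, in isolation, the load_data function we will write later in this article, assuming pytest is installed:
# test_data_loading.py (hypothetical sketch, assumes pytest is installed)
from data_loading import load_data  # the module we build later in this article

def test_load_data_shapes():
    data, target = load_data()
    # the Iris dataset contains 150 samples with 4 features each
    assert data.shape == (150, 4)
    assert target.shape == (150,)
Because the module exposes a single, well-defined function, it can be verified on its own without running the rest of the workflow.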
Limitations of Jupyter Notebook
Generally, data scientists love the Jupyter Notebook, myself included. The Jupyter Notebook arguably inspired tools such as Google Colab and Deepnote, and it is so popular that several IDEs (VS Code, JetBrains DataSpell, etc.) support it.
While the Jupyter Notebook offers interactivity and is widely accepted and adopted, it carries certain limitations that hinder efficient project organization, among which are:
Monolithic Structure
This is best explained as a situation where code from data ingestion all the way through to model deployment (or some other subset of the data science process) lives in one codebase, for instance a single Jupyter Notebook.
One major disadvantage is that this makes the code more difficult to maintain and debug. Since all of the code is in a single notebook, it can be challenging to find and fix errors, or to update the code without affecting other parts of the notebook.
Version Control and Collaboration
The stages of a data science project are repetitive and iterative, so it is necessary to have a version control system in place to keep track of changes over time and, when something goes wrong, to revert to previous versions. While version control systems such as Git can track changes to the code inside a Jupyter Notebook, doing so is more challenging than tracking changes to plain scripts. In addition, separating code into modules not only encourages maintainability and reuse, it also fosters collaboration, because different engineers can work on different modules (and versions of them) concurrently.
Therefore, Jupyter Notebooks are at their best when used for Exploratory Data Analysis (EDA) and prototyping.
Typical Stages of a Machine Learning Project
Depending on their complexity, most data science/ML projects include some or all of these stages: data ingestion and loading, data preprocessing, data exploration, feature engineering and feature selection, model training and model evaluation, model deployment, etc.
Data Ingestion and Loading: The process of importing data from one or more sources into a target location, either for storage or immediate use. Common data sources include APIs, databases/data warehouses, etc.
Data Preprocessing: A critical step in every data science project that involves cleaning and transforming data into formats suitable for analysis.
Data Exploration and Visualization: The process of examining data to understand it by summarizing it, and identifying patterns and relationships, sometimes with the aid of visualizations. It is a critical step that reveals potential concerns such as missing values, outliers, etc.
Feature Engineering and Selection: Feature engineering involves crafting or transforming features or attributes (e.g. log transforms, binning, etc.) with the aim of improving the performance of machine learning models; feature selection, on the other hand, aims to identify and remove irrelevant and redundant features that do not contribute to model accuracy, based on certain criteria.
Model Training and Evaluation: Model training involves having the machine learning algorithm learn patterns from the training data in order to make predictions, while (model) evaluation is the process of assessing the performance of a model on test data using a particular evaluation setup. The goal of the former is to minimize errors between predictions and the actual outcomes, while the latter aims to measure the performance of the model on unseen data.
Model Deployment: This involves the process of making a trained model available for end users in production, to use for prediction on new data.
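When each of these stages lives in its own script, the resulting project layout can look something like the sketch below, which mirrors the files we will build in the demonstration (the folder name is illustrative, and a feature_engineering.py module could be added if a project needs one):
iris_project/
    data_loading.py           - data ingestion and loading
    data_preprocessing.py     - train/test split
    model_training.py         - model training
    model_evaluation.py       - model evaluation
    hyperparameter_tuning.py  - hyperparameter search
    main.py                   - orchestrates the whole workflow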
Demonstration Using The Iris Dataset
To elucidate the potency of modularity, let's apply this approach to the popular Iris dataset.
1. Load data
We will leverage Scikit-learn's built-in functionality to effortlessly load the Iris dataset. The version of the Iris dataset is tied to the version of Scikit-learn installed on your machine; the Scikit-learn version used in this project is 1.3.0. To check your own Scikit-learn version, you can execute the code below in either a Python terminal or a Jupyter Notebook:
import sklearn
print(sklearn.__version__)
To load the dataset, use the code snippet below and save it as data_loading.py in a directory of your choice. We will save the rest of the scripts in this article in the same directory.
# data_loading.py
from sklearn.datasets import load_iris

# create the function to load the data
def load_data():
    iris = load_iris()
    data = iris.data
    target = iris.target
    return data, target
To see the output, add the following code at the end of data_loading.py and execute the file.
data, target = load_data()
print("Data:")
print(data)
print("Target:")
print(target)
To run data_loading.py, open your command line or terminal, navigate to the directory where you saved the script, and type "python data_loading.py". Running it prints the data array followed by the target array of class labels.
2. Data preprocessing
The only preprocessing step here is to split the data into training and test sets. Save the code below as data_preprocessing.py.
# data_preprocessing.py
from data_loading import load_data  # to access the data and target from data_loading.py
from sklearn.model_selection import train_test_split

def preprocess_data(data, target):
    X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=42)
    return X_train, X_test, y_train, y_test
To see the output for X_train, X_test, y_train and y_test, run the code below as part of data_preprocessing.py; it will print each of them in turn:
data, target = load_data()
X_train, X_test, y_train, y_test = preprocess_data(data, target)
print("X_train:")
print(X_train)
print("X_test:")
print(X_test)
print("y_train:")
print(y_train)
print("y_test:")
print(y_test)
The y_train and y_test outputs are arrays of class labels (0, 1 and 2).
3. Feature engineering/selection
These modules/stages can be extended based on specific project requirements. Because our demo data (the Iris dataset) is small and clean, they are not strictly necessary here.
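For completeness, if scaling or other transformations were needed, they could live in their own module too. The sketch below is a hypothetical feature_engineering.py (not used in the rest of this demonstration) that standardizes the features with Scikit-learn's StandardScaler:
# feature_engineering.py (hypothetical sketch, not used later in this article)
from sklearn.preprocessing import StandardScaler

def scale_features(X_train, X_test):
    # fit the scaler on the training data only, then apply it to both splits
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    return X_train_scaled, X_test_scaled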
4. Model training
Next, we will train a Random Forest classifier on the Iris dataset. Save the code below as model_training.py in the same directory as before.
# model_training.py
from data_loading import load_data  # to access the data and target from data_loading.py
from data_preprocessing import preprocess_data  # to access the function from data_preprocessing.py
from sklearn.ensemble import RandomForestClassifier

# create a function to fit the algorithm to learn from the data
def train_model(X_train, y_train):
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train, y_train)
    return model

data, target = load_data()
X_train, X_test, y_train, y_test = preprocess_data(data, target)
model = train_model(X_train, y_train)
print(model)
The output here is nothing but the model itself, i.e.:
RandomForestClassifier(random_state=42)
5. Model evaluation
In this step, we will evaluate our model. Save the file as model_evaluation.py:
# model_evaluation.py
from data_loading import load_data  # from data_loading.py
from data_preprocessing import preprocess_data  # from data_preprocessing.py
from model_training import train_model  # from model_training.py
from sklearn.metrics import accuracy_score

def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    return accuracy

data, target = load_data()
X_train, X_test, y_train, y_test = preprocess_data(data, target)
model = train_model(X_train, y_train)
accuracy = evaluate_model(model, X_test, y_test)
print(f"The accuracy of the model is {accuracy:.2f}")
The output of the code above displays the accuracy of the model as 1.00.
In a real-world situation, an accuracy of 100% (1.00) would call for concern; most of the time this suggests "overfitting": a situation where the model has learned the training data so well that it does not perform well on new data. In our case, however, the small and clean Iris dataset is used purely to demonstrate code modularity; overfitting and accuracy are not the subject here.
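One quick way to sanity-check such a perfect score is to look at cross-validated accuracy rather than a single train/test split. The snippet below is an illustrative aside (not one of the project modules) that uses Scikit-learn's cross_val_score on the full dataset:
# illustrative sanity check, not part of the modular project files
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

data, target = load_iris(return_X_y=True)
scores = cross_val_score(RandomForestClassifier(random_state=42), data, target, cv=5)
print(f"Cross-validated accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")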
Having said that, let's consider what the hyperparameter tuning code would look like, even though our model has already produced a perfect accuracy score.
6. Hyperparameter tuning
We will fine-tune the model's hyperparameters using GridSearchCV. Save the code below as hyperparameter_tuning.py:
# hyperparameter_tuning.py
from data_loading import load_data  # from data_loading.py
from data_preprocessing import preprocess_data  # from data_preprocessing.py
from model_training import train_model  # from model_training.py
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

def tune_hyperparameters(X_train, y_train):
    # tuning just 3 hyperparameters:
    param_grid = {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20],
        'min_samples_split': [2, 5, 10]
    }
    grid_search = GridSearchCV(RandomForestClassifier(random_state=42),
                               param_grid, cv=3, verbose=1)
    grid_search.fit(X_train, y_train)
    return grid_search.best_params_

data, target = load_data()
X_train, X_test, y_train, y_test = preprocess_data(data, target)
best_params = tune_hyperparameters(X_train, y_train)
print(f"The best hyperparameters are {best_params}")
The output of hyperparameter tuning is the dictionary of best hyperparameters found by the grid search.
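Note that the script above only reports the best parameters; how they are used afterwards is up to the project. One possible follow-up, appended to the end of hyperparameter_tuning.py (a small sketch, not part of the original workflow in this article), is to retrain the model with those parameters:
# optional follow-up (sketch): retrain using the best parameters found
tuned_model = RandomForestClassifier(random_state=42, **best_params)
tuned_model.fit(X_train, y_train)
print(tuned_model)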
7. The main script
Now that we have separate .py files for all the stages of building our model, the main.py script below orchestrates the entire workflow, from data loading to hyperparameter tuning, using the modular components:
# main.py
import data_loading
import data_preprocessing
import hyperparameter_tuning
import model_training
import model_evaluation

if __name__ == "__main__":
    # Load Data
    data, target = data_loading.load_data()

    # Preprocess Data
    X_train, X_test, y_train, y_test = data_preprocessing.preprocess_data(data, target)

    # Hyperparameter Tuning
    best_params = hyperparameter_tuning.tune_hyperparameters(X_train, y_train)
    print("Best hyperparameters:", best_params)

    # Train Model
    model = model_training.train_model(X_train, y_train)

    # Evaluate Model
    accuracy = model_evaluation.evaluate_model(model, X_test, y_test)
    print("Model accuracy:", accuracy)
The output will include the individual output of each of the separate module files, culminating in the display of the model accuracy.
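This happens because each module also runs its demonstration code at the top level, so importing it from main.py re-executes those lines. If that is not desired, one option (a small sketch, not applied in this article) is to wrap each module's demo code in its own guard, for example at the bottom of data_loading.py:
# at the bottom of data_loading.py (optional sketch)
if __name__ == "__main__":
    # this block runs only when the file is executed directly,
    # not when it is imported by main.py
    data, target = load_data()
    print("Data:")
    print(data)
    print("Target:")
    print(target)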
Conclusion
In the expansive space of data science projects, embracing a modular code approach unlocks an array of benefits, propelling project organization, reusability, collaboration, and maintainability to new heights. The Iris dataset used for illustration here may be small and far from messy, but the idea is clear: as code (.py or .R files) is methodically separated into well-defined modules, data scientists lay a solid foundation for efficient development and streamlined project management. Although Jupyter Notebooks have their own merits, they grapple with limitations around code organization, version control, and modularity.
As a promising extension, a follow-up post incorporates the Scikit-learn Pipeline module to further enhance the project's modular development experience. That next article examines the advantages of Pipelines, spotlighting their ability to automate and standardize data processing and modeling steps.