Victor Isaac Oshimua

Posted on Jul 8, 2023 • Originally published at cyberholics.hashnode.dev on Jun 21, 2023

How to organise machine learning project using CRISP-DM methodology

#datascience #machinelearning #project #ai

Building a machine learning system is an iterative process that involves a series of distinct stages, each playing a crucial role in the overall success of the project.

Experts in the field have developed various frameworks to effectively structure and organize machine learning projects. These frameworks serve as valuable guides, providing a systematic approach to project planning, execution, and evaluation.

One such framework is the CRISP-DM methodology, a structured approach to organising a machine learning project.

Let's delve into it and see how this methodology can help you organise your next machine learning project.

N.B. This article assumes you understand machine learning and know how to build a machine learning model. Its aim is to guide you in structuring your next machine learning project.

What is CRISP-DM?

CRISP-DM stands for the Cross-Industry Standard Process for Data Mining. It was conceived in 1996 and is still used to organise machine learning projects today.

CRISP-DM breaks the machine learning process into six parts:

Business understanding: This involves identifying the problem you're trying to solve with machine learning and also deciding if machine learning will be a proper solution to the problem.
Data understanding: This entails performing data analysis and exploration of the available dataset. Here, you verify the quality of the data and decide if you need to collect more.
Data preparation: In this step, the data is cleaned and transformed for modeling. This step is very important because its output determines the quality of the model trained in the next step: flawed or inadequate data leads to subpar models. A significant amount of time is usually spent on this phase.
Modeling: This is the step most machine learning enthusiasts look forward to. Here you train a machine learning model with the prepared data.
Evaluation: This is where you evaluate the performance of the machine learning model to see if it solves the business problem and measure its success at doing that.
Deployment: Finally, you deploy the model to production for end users to consume. You could build the best model in the world, but the model is useless if it's not available for end users.

The execution of these steps does not follow a strict sequential order; the steps can be fluid and iterative, as indicated by the diagram above.

In the typical process, once you complete the final step, you are expected to revisit the first step to refine the problem being addressed and make adjustments based on the insights gained.

The last step does not signify the conclusion of the process but rather an opportunity to revise the problem and identify areas for improvement in the other steps.

Implementation of CRISP-DM: A Step-by-Step Guide for Project Success

Now that you've understood CRISP-DM, let us look at a problem example where we follow the steps of CRISP-DM to build a machine learning project.

Let us create an example scenario:

Suppose you want to build a churn prediction system. The goal of this system is to detect customers who are likely to churn, that is, to stop using your product or service.

To learn more about customer churn, follow this link:

Let us delve into how we can arrange this project with the CRISP-DM methodology.

Business understanding phase

Suppose you notice that the number of users of your product has dropped by some percentage over a period of time. This may lead to a loss for your business, so you decide to look for a solution.

In this step, you analyze possible solutions and consider whether machine learning is necessary or whether another approach would solve the problem.

Here, you define your goals. In this case, your goal is to "reduce the churn rate on your product" or "retain customers that haven't churned."

After considering every possible solution, you might decide to go with machine learning. You could train a machine learning model that identifies customers who are likely to churn, then send those customers promotional emails that contain discounts on your product.

Data understanding phase

After you've analysed the problem and decided to go with machine learning as a solution, the next step is to gather the available data. Here you have to describe, explore, and visualise the data.

In this step, you answer questions like:

Do I have enough data?
Is my data source reliable?
Do I need to get more data?

If these questions are not adequately answered, you need to go back to the previous step and redefine your goals.
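As a rough illustration of the questions above, here is a minimal sketch that counts rows, missing values, and the balance of the target. The records and column names (`tenure_months`, `monthly_charge`, `churned`) are hypothetical, and the sketch uses only the Python standard library; in a real project you would probably reach for a data analysis library instead.

```python
from collections import Counter

# Hypothetical customer records; None marks a missing value.
customers = [
    {"tenure_months": 2, "monthly_charge": 70.0, "churned": True},
    {"tenure_months": 40, "monthly_charge": 25.5, "churned": False},
    {"tenure_months": None, "monthly_charge": 55.0, "churned": False},
    {"tenure_months": 5, "monthly_charge": None, "churned": True},
    {"tenure_months": 31, "monthly_charge": 30.0, "churned": False},
]


def summarise(rows, target="churned"):
    """Return the row count, missing values per column, and target balance."""
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None:
                missing[column] += 1
    balance = Counter(row[target] for row in rows)
    return {
        "n_rows": len(rows),
        "missing": dict(missing),
        "target_balance": dict(balance),
    }


summary = summarise(customers)
print(summary)
```

A very small row count, many missing values, or a target with almost no churned customers would each be a hint that you need more data or a revised goal.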

Data preparation phase

Raw data is usually messy. This step involves building a pipeline for data preprocessing. The pipeline is a piece of code that follows a sequence of steps.

The steps are:

Raw data collection
Data transformation
Data cleaning

Next, you have to transform the data so that it can be used as input for a machine learning model. For churn prediction, this means transforming the data into a tabular form that contains the features and the target variable.
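A preprocessing pipeline of this kind can be sketched as a few small functions chained together. The example below is only an assumption about what such a pipeline might look like: the raw records, the column names, and the choice to fill missing tenure with the median are all hypothetical, and the sketch cleans the records before turning them into the final table.

```python
from statistics import median

# Hypothetical raw records, e.g. exported from a billing system as strings.
raw_rows = [
    {"tenure_months": "2", "plan": "basic", "churned": "yes"},
    {"tenure_months": "", "plan": "premium", "churned": "no"},
    {"tenure_months": "40", "plan": "premium", "churned": "no"},
    {"tenure_months": "6", "plan": "basic", "churned": "yes"},
]


def clean(rows):
    """Parse tenure to int and fill missing tenure with the median."""
    parsed = [int(r["tenure_months"]) if r["tenure_months"] else None for r in rows]
    fill = median(v for v in parsed if v is not None)
    return [
        dict(r, tenure_months=v if v is not None else fill)
        for r, v in zip(rows, parsed)
    ]


def transform(rows):
    """Turn cleaned records into a feature table X and a target list y."""
    X = [[r["tenure_months"], 1 if r["plan"] == "premium" else 0] for r in rows]
    y = [1 if r["churned"] == "yes" else 0 for r in rows]
    return X, y


def prepare(rows):
    """Run the whole pipeline: clean first, then transform."""
    return transform(clean(rows))


X, y = prepare(raw_rows)
print(X)
print(y)
```

Keeping each step in its own function makes it easier to come back later and add a feature or fix a data issue without rewriting everything.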

Modeling phase

This is where the actual machine learning happens. Here you try different machine learning algorithms depending on the kind of problem. Churn prediction is a classification problem, so you can build a single classification model, an ensemble model, etc.

To improve the accuracy of your model, you often need to go back to the data preparation step to add more features or fix data issues, then retrain the model to see if it improves.

Note that you don't need to achieve perfection. What you want is a good baseline model to work with. Don't stress about building a 99% accurate model; future iterations of the CRISP-DM lifecycle will help you improve your baseline model.
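To make the idea of a baseline concrete, here is a minimal sketch of one of the simplest possible classifiers, a decision stump: a single rule of the form "feature ≤ threshold". It is not a recommended model for churn prediction, just an illustration of a cheap starting point. The features (`tenure_months`, `support_tickets`) and the data are hypothetical.

```python
def fit_stump(X, y):
    """Find the single-feature threshold rule with the best training accuracy."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for label_below in (0, 1):
                predictions = [
                    label_below if row[feature] <= threshold else 1 - label_below
                    for row in X
                ]
                accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
                if best is None or accuracy > best["accuracy"]:
                    best = {
                        "feature": feature,
                        "threshold": threshold,
                        "label_below": label_below,
                        "accuracy": accuracy,
                    }
    return best


def predict(stump, row):
    """Apply the learned rule to one customer."""
    below = row[stump["feature"]] <= stump["threshold"]
    return stump["label_below"] if below else 1 - stump["label_below"]


# Hypothetical features: [tenure_months, support_tickets]; target 1 = churned.
X = [[2, 5], [3, 4], [5, 1], [30, 0], [36, 2], [48, 1]]
y = [1, 1, 1, 0, 0, 0]

stump = fit_stump(X, y)
print(stump)
print(predict(stump, [4, 3]))
```

The accuracy stored in the stump is measured on the training data only, so it says little about new customers. That is fine for a first baseline; later iterations can swap in a stronger model and a held-out test set.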

Evaluation phase

This is where you need to go back to the goals you defined in the business understanding step. Here, you evaluate the model and determine if you reached the defined goals. This is done by looking at an important business metric and making sure that the model moves the metric in the right direction.

In the churn prediction system, your goal could be to retain customers that haven't churned. In this case, you have to evaluate whether the model accurately identifies the customers who are about to churn.
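One common way to check how well a model identifies churning customers is to compute precision and recall for the churn class. The sketch below assumes that these two metrics are what you care about, which the business goal may or may not justify, and the labels and predictions are made up for illustration.

```python
def churn_metrics(y_true, y_pred):
    """Precision and recall for the churn class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # churners correctly flagged
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # loyal customers flagged
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # churners missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}


# Hypothetical held-out labels (1 = churned) and model predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1]

metrics = churn_metrics(y_true, y_pred)
print(metrics)
```

Low recall means churners are slipping through unnoticed, while low precision means discounts are being sent to customers who would have stayed anyway. Which one matters more depends on the goals defined in the business understanding step.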

Deployment phase

Deployment is all about making the model available for end users; a machine learning model is more or less useless if it stays in your Jupyter notebook.

Here, you focus more on engineering practices like monitoring and maintainability. This is because a deployed machine learning model has to be reliable.

It is common to combine the deployment and evaluation stages, which provides a method for testing the effectiveness of your model.

In this case of churn prediction, you can use the model on a subset of your customers and analyse the result based on your business metrics, such as the reduction in churning customers. You can expect to notice a decreased churn rate within this subset in comparison to the remaining customers.
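The comparison described above can be sketched as follows. The two groups, their sizes, and their outcomes are hypothetical, and with groups this small the difference would not be statistically meaningful; a real trial would need far more customers and a proper significance test.

```python
def churn_rate(customers):
    """Fraction of customers in the group who churned."""
    return sum(c["churned"] for c in customers) / len(customers)


# Hypothetical outcomes after a trial period.
# "treated" customers were scored by the model and offered discounts;
# "control" customers were left alone.
treated = [{"customer_id": i, "churned": i in {3}} for i in range(8)]
control = [{"customer_id": i, "churned": i in {8, 11, 14}} for i in range(8, 16)]

treated_rate = churn_rate(treated)
control_rate = churn_rate(control)
reduction = control_rate - treated_rate

print(f"treated: {treated_rate:.3f}, control: {control_rate:.3f}")
print(f"absolute reduction in churn rate: {reduction:.3f}")
```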

Conclusion

It is important to debunk the common misconception that building a machine learning solution is all about training and tuning machine learning models.
The CRISP-DM methodology clearly demonstrates that there are several crucial steps that precede and follow the modeling step.
These pre- and post-modeling steps play a significant role in ensuring the success of a machine learning project. Understanding and embracing the comprehensive nature of these steps is key to achieving optimal outcomes when building a machine learning solution.

Thank you for taking the time to read this article. I sincerely hope that you find it valuable and informative, providing you with enough guidance to confidently apply the CRISP-DM methodology to your upcoming machine learning projects. Should you choose to adopt this framework, I believe it will contribute significantly to the success and organization of your projects.

Do well to follow me on Twitter or LinkedIn to stay updated on the content I share. I regularly provide insightful information and resources related to machine learning and other relevant topics. Your support and engagement on these platforms are greatly appreciated.
