Class imbalance issues

#classimbalance #oversampling #undersampling #smote

This post is based on the article "The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression".

Ruben van den Goorbergh, Maarten van Smeden, Dirk Timmerman, and Ben Van Calster

Intro

Imagine that you are teaching a smart computer to tell dogs from cats using pictures. The process by which the computer learns to perform this task is called training a prediction model. How it works depends on the type of program we are using (here, logistic regression) and the kind of problem we want to solve. In our case, we just want to teach the computer to identify dogs and cats.
While teaching the computer this task, we find out that we have many more pictures of dogs than of cats. This is what we call class imbalance. To fix it, we can remove pictures of dogs (random undersampling) or add copies of the cat pictures we already have (random oversampling). There is also a technique called SMOTE, which creates new, synthetic cat pictures to address the imbalance.
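To make the three corrections concrete, here is a minimal sketch in plain Python. The function names, the toy `dogs` and `cats` points, and the seeds are all assumptions made for illustration. The `smote_like` function is a simplification: real SMOTE interpolates toward one of the k nearest minority neighbours, while this version picks any other minority point.

```python
import random

def random_undersample(majority, minority, seed=0):
    """Drop majority examples at random until both classes have the same size."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), list(minority)

def random_oversample(majority, minority, seed=0):
    """Duplicate minority examples at random until both classes have the same size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra

def smote_like(minority, n_new, seed=0):
    """Create synthetic points on the segment between two random minority points.

    Simplified assumption: real SMOTE restricts the partner point to the
    k nearest neighbours; here any other minority point may be chosen.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()  # position along the segment from a to b
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic

# Toy data: each "picture" is reduced to two made-up numeric features.
dogs = [(float(i), 1.0) for i in range(8)]   # majority class
cats = [(0.0, 0.0), (1.0, 2.0), (2.0, 1.0)]  # minority class

under_dogs, under_cats = random_undersample(dogs, cats)
over_dogs, over_cats = random_oversample(dogs, cats)
new_cats = smote_like(cats, n_new=len(dogs) - len(cats))

print(len(under_dogs), len(under_cats))     # balanced by removing dogs
print(len(over_dogs), len(over_cats))       # balanced by repeating cats
print(len(dogs), len(cats) + len(new_cats)) # balanced by synthetic cats
```

All three approaches end with equally sized classes, but none of them adds new real information about cats: undersampling throws data away, and the other two reuse or interpolate what is already there.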

What was the original context of this paper when it was written?

What is the impact of correcting the class imbalance described above, such as not having a balanced number of dog and cat pictures to train our smart computer?

This article investigates the impact of applying the methods commonly used to correct class imbalance. It suggests that these approaches might compromise prediction accuracy.
The researchers assess the models' performance in terms of how well they distinguish between the two classes (discrimination), how accurate their predicted probabilities are (calibration), and how effectively they categorize cases (classification).
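The three kinds of performance can be sketched with a few lines of plain Python. This is an illustration only: the function names and the eight toy predictions are assumptions, and the measures chosen here (area under the ROC curve, mean predicted risk versus observed event rate, sensitivity and specificity at a 0.5 threshold) are common examples of each kind, not necessarily the exact set used in the paper.

```python
def auroc(y_true, y_prob):
    """Discrimination: share of (event, non-event) pairs ranked correctly."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    # A tie between an event and a non-event counts as half a correct ranking.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_calibration(y_true, y_prob):
    """Calibration (coarse check): mean predicted risk vs. observed event rate."""
    return sum(y_prob) / len(y_prob), sum(y_true) / len(y_true)

def sensitivity_specificity(y_true, y_prob, threshold=0.5):
    """Classification: turn probabilities into yes/no at a threshold."""
    pairs = list(zip(y_true, y_prob))
    tp = sum(1 for y, p in pairs if y == 1 and p >= threshold)
    fn = sum(1 for y, p in pairs if y == 1 and p < threshold)
    tn = sum(1 for y, p in pairs if y == 0 and p < threshold)
    fp = sum(1 for y, p in pairs if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

# Made-up labels (1 = event) and predicted probabilities.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.8, 0.4, 0.5, 0.3, 0.2, 0.2, 0.1, 0.1]

auc = auroc(y_true, y_prob)
mean_pred, event_rate = mean_calibration(y_true, y_prob)
sens, spec = sensitivity_specificity(y_true, y_prob)

print(f"AUROC: {auc:.3f}")
print(f"mean predicted risk: {mean_pred:.3f}, observed event rate: {event_rate:.3f}")
print(f"sensitivity: {sens:.3f}, specificity: {spec:.3f}")
```

In this toy example the model ranks cases well, yet its average predicted risk is higher than the observed event rate. A model can do well on one of the three measures and poorly on another, which is why they are assessed separately.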

The paper uses a real-world case, predicting ovarian cancer, to illustrate these findings.

Summary of the paper findings/outcomes

The training dataset included information from 2695 women, 518 of whom had ovarian cancer, and the test set had data from 674 women, including 140 with ovarian cancer.
The big surprise: fixing the imbalance did not make the model any better at telling patients with and without cancer apart.
The researchers found that no matter which method they used, the accuracy stayed around 79% to 80%. Even worse, trying to fix the imbalance made the model flag more people as likely to have the event of interest than actually had it. Using SMOTE, for example, led the model to overestimate the probability of ovarian cancer. Such outcomes can lead patients to unnecessary treatments or procedures, which can harm them for no benefit and create unnecessary expenses for the medical system.
The study indicates that the common methods we use to fix imbalance might not be as helpful as we think. They can even lead to less accurate predictions.
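Why does balancing the data push predicted risks upward? A toy sketch makes the mechanism visible. It uses only the training-set counts quoted above and the simplest possible risk model, one with no predictors, which predicts the event rate of whatever data it was trained on. This is an assumption made for illustration and not the paper's analysis; the function names are hypothetical.

```python
import random

# Training-set counts quoted above: 2695 women, 518 with ovarian cancer.
n_total, n_events = 2695, 518
labels = [1] * n_events + [0] * (n_total - n_events)

def fit_intercept_only(y):
    """A model without predictors predicts the event rate of its training data."""
    return sum(y) / len(y)

def random_oversample(y, seed=0):
    """Repeat randomly chosen events until both classes have the same size."""
    rng = random.Random(seed)
    events = [v for v in y if v == 1]
    non_events = [v for v in y if v == 0]
    extra = [rng.choice(events) for _ in range(len(non_events) - len(events))]
    return non_events + events + extra

risk_original = fit_intercept_only(labels)
risk_balanced = fit_intercept_only(random_oversample(labels))

print(f"predicted risk, original data:   {risk_original:.3f}")
print(f"predicted risk, after balancing: {risk_balanced:.3f}")
```

The model trained on the original data predicts a risk of about 19%, which matches the real share of cancer cases. The model trained on the balanced data predicts 50% for everyone, because it has been shown a world in which half of all women have cancer. Models with real predictors are affected in a similar direction: their predicted risks drift upward unless they are recalibrated afterwards.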

How can this paper inform your work as a junior?

As juniors, we are often excited about the outcomes, regardless of the methodology we are using. This article shows how a simple and common methodological choice can affect our results. It emphasizes the need to interpret results in light of the context and of the method used to address class imbalance, and to understand the implications of using a model's predictions to make decisions, especially in settings like healthcare.

Why is this paper important/why does it matter to a non-technical business stakeholder?

The paper highlights the importance of understanding the implications of using predictive modelling for decision-making. Non-technical business stakeholders should not blindly rely on predictions from models that were trained with imbalance corrections, given the negative impacts those corrections can have. Instead, these predictions should be assessed with the associated risks and impacts in mind.
