TY - JOUR
T1 - Performance of Classification Algorithms Under Class Imbalance
T2 - Simulation and Real-World Evidence
AU - Arshad, Iqra
AU - Umair, Muhammad
AU - Jan, Faheem
AU - Iftikhar, Hasnain
AU - Canas Rodrigues, Paulo
AU - Ivan Gonzales Medina, Ronny
AU - Linkolk Lopez-Gonzales, Javier
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
AB - Class imbalance is a persistent challenge in machine learning, particularly in high-stakes applications such as medical diagnostics, bioinformatics, and fraud detection, where the minority class often represents the critical cases that require special attention. While prior research has examined the effect of imbalance on classifier performance, little attention has been paid to establishing practical guidelines for the minimum proportion of minority samples required to achieve reliable sensitivity. In this study, we conduct extensive simulations on synthetic datasets and evaluate five widely used classification algorithms: Logistic Regression (Logit), Support Vector Machines (SVM), Random Forest, XGBoost, and Neural Networks (NNs). Our analysis reveals that logistic regression is more effective than the other classifiers at identifying minority-class instances under an imbalanced class distribution, as measured by F1 score and sensitivity, whereas neural networks perform slightly better than logistic regression under a balanced class distribution. Importantly, we identify a practical threshold for minority-class representation: classifier sensitivity declines sharply when the proportion of positive samples falls below approximately 25–30%. This finding is validated on eight real-world datasets, including large-scale applications, where Neural Networks and XGBoost demonstrate superior sensitivity. By establishing an actionable threshold, this study contributes practical guidance for dataset design and model selection in imbalanced classification problems.
KW - Binary classification
KW - data mining
KW - imbalanced data sets
KW - logistic regression
KW - machine learning
KW - neural networks
UR - https://www.scopus.com/pages/publications/105019571366
U2 - 10.1109/ACCESS.2025.3620264
DO - 10.1109/ACCESS.2025.3620264
M3 - Article
AN - SCOPUS:105019571366
SN - 2169-3536
VL - 13
SP - 179672
EP - 179685
JO - IEEE Access
JF - IEEE Access
ER -