Handling Data Imbalance Problem in Hybrid Resampling Approach to Improve Accuracy of K-Nearest Neighbors Algorithm

Novriadi Antonius Siagian; Sardo Pardingotan Sipayung

doi:10.54209/jurnalinstall.v16i02.207

Published: Jun 27, 2024

Keywords:

Data Imbalance, Oversampling, Undersampling, SMOTE, NearMiss, K-Nearest Neighbors

Issue

Vol. 16 No. 02 (2024): Instal : Jurnal Komputer

Section

Articles

Novriadi Antonius Siagian

Universitas Katolik Santo Thomas

Sardo Pardingotan Sipayung

Universitas Katolik Santo Thomas

Abstract

Class imbalance is a central challenge in building classification models, especially for medical data such as stroke detection. This study proposes a hybrid resampling approach that combines SMOTE (Synthetic Minority Over-sampling Technique) and NearMiss to improve the accuracy of the K-Nearest Neighbors (KNN) algorithm on a stroke dataset. The hybrid approach aims to offset the shortcomings of each technique used alone: SMOTE generates synthetic minority-class samples, while NearMiss removes samples from the majority class. We test the approach on a class-imbalanced stroke dataset and evaluate it with KNN. The experimental results show that the hybrid approach improves KNN accuracy in predicting the minority class compared with the conventional approach, and that the choice of K significantly affects its performance. With SMOTE, accuracy was highest at K=1 (100%), followed by 97% at K=2 and 93% at K=3. With NearMiss undersampling, accuracy was 100% at K=1, 74% at K=2, and 76% at K=3. SMOTE was therefore more consistent in maintaining accuracy as K increased. These results show the importance of considering such parameters when choosing a resampling technique to handle data imbalance.
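The abstract does not state how the resampling was implemented, so the following is only a minimal sketch of the general idea (SMOTE oversampling of the minority class, NearMiss undersampling of the majority class, then KNN classification), written in plain Python. The function names, the toy data, the `target` class size, the use of the NearMiss-1 variant, and the neighbour counts are all assumptions made for illustration; none of them come from the paper.

```python
import math
import random
from collections import Counter

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def smote(minority, n_new, k=2, rng=None):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours.
    Assumes at least two minority samples."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

def nearmiss1(majority, minority, n_keep, k=2):
    """NearMiss-1 (assumed variant): keep the n_keep majority samples whose
    mean distance to their k nearest minority samples is smallest."""
    def score(p):
        nearest = sorted(dist(p, m) for m in minority)[:k]
        return sum(nearest) / len(nearest)
    return sorted(majority, key=score)[:n_keep]

def hybrid_resample(X, y, minority_label, target, k=2, seed=0):
    """Binary case only: grow the minority class to `target` samples with
    SMOTE, then shrink the majority class to `target` with NearMiss-1."""
    minority = [x for x, lab in zip(X, y) if lab == minority_label]
    majority = [x for x, lab in zip(X, y) if lab != minority_label]
    majority_label = next(lab for lab in y if lab != minority_label)
    n_new = max(0, target - len(minority))
    new_min = minority + smote(minority, n_new, k, random.Random(seed))
    new_maj = nearmiss1(majority, minority, min(target, len(majority)), k)
    labels = [minority_label] * len(new_min) + [majority_label] * len(new_maj)
    return new_min + new_maj, labels

def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k training points closest to x."""
    order = sorted(range(len(X_train)), key=lambda i: dist(X_train[i], x))[:k]
    return Counter(y_train[i] for i in order).most_common(1)[0][0]

# Hypothetical toy data: 8 majority samples (label 0), 3 minority samples (label 1).
X = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.4, 0.2),
     (0.3, 0.4), (0.5, 0.5), (0.6, 0.1), (0.2, 0.6),
     (2.0, 2.0), (2.2, 2.1), (2.1, 1.8)]
y = [0] * 8 + [1] * 3

X_res, y_res = hybrid_resample(X, y, minority_label=1, target=5)
print(Counter(y_res))                                # class sizes after resampling
print(knn_predict(X_res, y_res, (2.05, 1.95), k=3))  # query near the minority cluster
print(knn_predict(X_res, y_res, (0.1, 0.1), k=3))    # query near the majority cluster
```

In a real experiment the resampling would normally be applied to the training split only, with accuracy measured on untouched test data; library implementations such as those in imbalanced-learn would usually replace these hand-written helpers.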

How to Cite

Antonius Siagian, N., & Sipayung, S. P. (2024). Handling Data Imbalance Problem in Hybrid Resampling Approach to Improve Accuracy of K-Nearest Neighbors Algorithm. Instal : Jurnal Komputer, 16(02), 78–87. https://doi.org/10.54209/jurnalinstall.v16i02.207

Copyright and Licensing

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
