ANALYZING THE IMPACT OF RESAMPLING METHOD FOR IMBALANCED DATA TEXT IN INDONESIAN SCIENTIFIC ARTICLES CATEGORIZATION

Ariani Indrawati, Hendro Subagyo, Andre Sihombing, Wagiyah Wagiyah, Sjaeful Afandi

Abstract


Extremely skewed data in artificial intelligence, machine learning, and data mining often give misleading results, because machine learning algorithms are designed to work best with balanced data. In real situations, however, imbalanced data are common. The most popular technique for handling imbalanced data is to resample the dataset, modifying the number of instances in the majority and minority classes to obtain a balanced dataset. Many resampling techniques (oversampling, undersampling, or a combination of both) have been proposed, and new ones continue to appear. Resampling may increase or decrease classifier performance. Comparative research on resampling methods for structured data has been widely carried out, but studies that compare resampling methods on unstructured data are very rare. That raises many questions, one of which is whether these methods can be applied to unstructured data such as text, which has high dimensionality and very diverse characteristics. To understand how different resampling techniques affect classifier learning on imbalanced text data, we perform an experimental analysis using various resampling methods with several classification algorithms to classify articles in the Indonesian Scientific Journal Database (ISJD). The experiment shows that resampling techniques generally improve classifier performance on imbalanced text data, but the improvement is not significant because text data are very diverse and high-dimensional.
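As an illustration of what resampling does to class counts, the sketch below applies plain random oversampling and random undersampling to a tiny labeled text collection. It is a minimal example under stated assumptions, not the procedure used in the study: the function names, the toy titles, and the two category labels are all hypothetical, and the abstract does not state which specific resampling methods or classifiers were evaluated.

```python
import random
from collections import Counter


def _group_by_label(texts, labels):
    """Collect the texts belonging to each class label."""
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    return by_class


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen items until every class is as large as the largest class."""
    rng = random.Random(seed)
    by_class = _group_by_label(texts, labels)
    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        # Keep all originals, then draw duplicates with replacement.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_texts.extend(items + extra)
        out_labels.extend([label] * target)
    return out_texts, out_labels


def random_undersample(texts, labels, seed=0):
    """Discard randomly chosen items until every class is as small as the smallest class."""
    rng = random.Random(seed)
    by_class = _group_by_label(texts, labels)
    target = min(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        # Sample without replacement so no text is repeated.
        out_texts.extend(rng.sample(items, target))
        out_labels.extend([label] * target)
    return out_texts, out_labels


# Hypothetical toy data: six majority-class titles and two minority-class titles.
texts = [
    "dengue fever cases in urban areas",
    "nutrition status of school children",
    "hospital waiting time analysis",
    "malaria vector control study",
    "maternal health service coverage",
    "tuberculosis treatment adherence",
    "neutron scattering in thin films",
    "laser cooling of trapped ions",
]
labels = ["health"] * 6 + ["physics"] * 2

over_texts, over_labels = random_oversample(texts, labels)
under_texts, under_labels = random_undersample(texts, labels)

print("original:    ", Counter(labels))
print("oversampled: ", Counter(over_labels))
print("undersampled:", Counter(under_labels))
```

In an actual experiment, resampling would normally be applied to the training split only, after the text has been converted to feature vectors, so that duplicated or discarded items do not leak into the evaluation data.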


Keywords


Imbalanced data; Resampling techniques; Machine learning; Classification; Journal; ISJD





DOI: https://doi.org/10.14203/j.baca.v41i2.702



Copyright (c) 2020 BACA: JURNAL DOKUMENTASI DAN INFORMASI

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
