Research Article

A Comparative Evaluation of the Outlier Detection Methods

Year 2024, Volume: 7 Issue: 2, 155 - 159, 15.03.2024
https://doi.org/10.34248/bsengineering.1387431

Abstract

In data mining, outliers should be identified and excluded from the data set before analysis begins so that descriptive statistics and other statistical model parameters are calculated correctly. This paper compared the performance of model-based, density-based, clustering-based, angle-based, and isolation-based outlier detection methods used in data mining. ROC curves and AUC values were used to compare the methods. A data set that follows a standard normal distribution and fits a logistic regression was simulated, and 30 outliers were then randomly added to it for the comparison. The iForest algorithm was found to have higher predictive power than Mahalanobis, LOF, k-means, and ABOD. In addition, outliers in a real data set were identified with the iForest algorithm and deleted. The data sets with and without outliers were then compared, and the model fitted without outliers showed higher predictive ability.
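The evaluation idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the sample size of 500, the use of two variables, the way the 30 outliers are placed, and all function names are assumptions made for illustration. Only the model-based (Mahalanobis) detector is shown, scored by AUC; the other methods would be scored in the same way.

```python
import math
import random

def mahalanobis_sq(points):
    """Squared Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance matrix entries.
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    scores = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^T S^{-1} d written out with the closed-form 2x2 inverse.
        scores.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return scores

def auc(scores, labels):
    """AUC as the share of (outlier, inlier) pairs ranked correctly; ties count half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Assumed setup: 500 standard normal inliers in two variables.
inliers = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]

# Assumed outlier mechanism: 30 points placed 4 to 6 units from the origin.
outliers = []
for _ in range(30):
    radius = rng.uniform(4, 6)
    angle = rng.uniform(0, 2 * math.pi)
    outliers.append((radius * math.cos(angle), radius * math.sin(angle)))

data = inliers + outliers
labels = [0] * len(inliers) + [1] * len(outliers)

scores = mahalanobis_sq(data)
auc_value = auc(scores, labels)
print(f"Mahalanobis AUC: {auc_value:.3f}")
```

Because the outlier labels are known in a simulation, each detector's scores can be turned into an AUC value and compared directly, which is the role the ROC and AUC comparison plays in the study.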

Project Number

FDK-2018 10287

Ethical Statement

Ethical Consideration: Ethics committee approval was not required for this study because it did not involve animals or humans. The authors confirm that the ethical policies of the journal, as noted on the journal's author guidelines page, have been adhered to.

Supporting Institution

Cukurova University

Thanks

We gratefully thank Prof. Dr. Zeynel CEBECİ of Cukurova University for his contributions to this study. We also thank the Cukurova University Scientific Research Coordinatorship for supporting this study under project number FDK-2018 10287. This article was produced from the thesis titled “Comparative Examination of Outlier Detection Methods in Binary Logistics Regression Analysis” at Cukurova University (Thesis no: 794371). https://tez.yok.gov.tr/UlusalTezMerkezi/tezSorguSonucYeni.jsp.


Details

Primary Language English
Subjects Agricultural Engineering (Other)
Journal Section Research Articles
Authors

Melis Çelik Güney 0000-0002-6825-6884

Gökhan Tamer Kayaalp 0000-0003-2193-848X

Project Number FDK-2018 10287
Early Pub Date February 1, 2024
Publication Date March 15, 2024
Submission Date November 22, 2023
Acceptance Date January 8, 2024
Published in Issue Year 2024 Volume: 7 Issue: 2

Cite

APA Çelik Güney, M., & Kayaalp, G. T. (2024). A Comparative Evaluation of the Outlier Detection Methods. Black Sea Journal of Engineering and Science, 7(2), 155-159. https://doi.org/10.34248/bsengineering.1387431
AMA Çelik Güney M, Kayaalp GT. A Comparative Evaluation of the Outlier Detection Methods. BSJ Eng. Sci. March 2024;7(2):155-159. doi:10.34248/bsengineering.1387431
Chicago Çelik Güney, Melis, and Gökhan Tamer Kayaalp. “A Comparative Evaluation of the Outlier Detection Methods”. Black Sea Journal of Engineering and Science 7, no. 2 (March 2024): 155-59. https://doi.org/10.34248/bsengineering.1387431.
EndNote Çelik Güney M, Kayaalp GT (March 1, 2024) A Comparative Evaluation of the Outlier Detection Methods. Black Sea Journal of Engineering and Science 7 2 155–159.
IEEE M. Çelik Güney and G. T. Kayaalp, “A Comparative Evaluation of the Outlier Detection Methods”, BSJ Eng. Sci., vol. 7, no. 2, pp. 155–159, 2024, doi: 10.34248/bsengineering.1387431.
ISNAD Çelik Güney, Melis - Kayaalp, Gökhan Tamer. “A Comparative Evaluation of the Outlier Detection Methods”. Black Sea Journal of Engineering and Science 7/2 (March 2024), 155-159. https://doi.org/10.34248/bsengineering.1387431.
JAMA Çelik Güney M, Kayaalp GT. A Comparative Evaluation of the Outlier Detection Methods. BSJ Eng. Sci. 2024;7:155–159.
MLA Çelik Güney, Melis and Gökhan Tamer Kayaalp. “A Comparative Evaluation of the Outlier Detection Methods”. Black Sea Journal of Engineering and Science, vol. 7, no. 2, 2024, pp. 155-9, doi:10.34248/bsengineering.1387431.
Vancouver Çelik Güney M, Kayaalp GT. A Comparative Evaluation of the Outlier Detection Methods. BSJ Eng. Sci. 2024;7(2):155-9.
