Using and Comparing Machine Learning Techniques for Automatic Detection of Spam Website URLs

Muhammed Yıldırım

doi:10.46572/naturengs.1097970

Research Article

Year 2022, Volume: 3 Issue: 1, 33 - 41, 30.06.2022

Abstract

As technology has advanced, cyber security has become one of the most pressing issues of recent years. Spam URLs are among the most common and dangerous cyber security threats and one of the most widely used means of defrauding users. These attacks cause monetary losses, steal private information, and install malicious software on users' devices. It is therefore important to detect such threats promptly and to take precautions against them. Malicious URLs are mostly detected using blacklists, but these lists cannot detect newly created URLs. In recent years, machine learning techniques have been developed to overcome this deficiency. In this study, URLs were classified using nine different machine learning classifiers, and the performance of the classifiers was compared. In addition, similar studies in the literature are comprehensively examined and discussed. Because the preparation of data sets in natural language processing has a great effect on model training, the data preparation steps are also discussed in detail.
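
The contrast between blacklist lookup and learned classification can be illustrated with a minimal sketch. It is not the study's implementation. The six training URLs are invented for illustration and are not drawn from the study's data set, the `tokenize` helper and the `NaiveBayesURLClassifier` class are hypothetical names, and Naive Bayes is used only as one simple, well-known classifier, since the abstract does not name the nine classifiers that were compared.

```python
import math
import re
from collections import Counter


def tokenize(url):
    """Lowercase a URL and split it on runs of non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


class NaiveBayesURLClassifier:
    """Multinomial Naive Bayes over URL tokens with add-one smoothing."""

    def fit(self, urls, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.token_counts = {c: Counter() for c in self.classes}
        for url, label in zip(urls, labels):
            self.token_counts[label].update(tokenize(url))
        self.vocab = set()
        for counts in self.token_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, url):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for c in self.classes:
            total_tokens = sum(self.token_counts[c].values())
            # Start from the log prior of the class.
            score = math.log(self.doc_counts[c] / total_docs)
            for tok in tokenize(url):
                if tok not in self.vocab:
                    continue  # ignore tokens never seen in training
                count = self.token_counts[c][tok]
                score += math.log((count + 1) / (total_tokens + len(self.vocab)))
            scores[c] = score
        return max(scores, key=scores.get)


# Invented toy data (assumption): three spam and three benign URLs.
TRAIN = [
    ("http://free-prize-winner.example/claim-now", "spam"),
    ("http://win-free-money.example/prize", "spam"),
    ("http://cheap-pills.example/free-offer", "spam"),
    ("https://news.example/world/article", "benign"),
    ("https://university.example/research/article", "benign"),
    ("https://docs.example/python/tutorial", "benign"),
]

urls = [u for u, _ in TRAIN]
labels = [y for _, y in TRAIN]

# A blacklist only knows the exact spam URLs it has already seen.
blacklist = {u for u, y in TRAIN if y == "spam"}

clf = NaiveBayesURLClassifier().fit(urls, labels)

# A newly created URL that reuses spam-like tokens but is not on the list.
new_url = "http://claim-free-prize.example/winner"
print("on blacklist:", new_url in blacklist)
print("classifier says:", clf.predict(new_url))
```

On this toy data the exact-match blacklist misses the new URL, while the token-based classifier labels it as spam because it shares tokens such as `free` and `prize` with the known spam examples. A realistic comparison would need a held-out test set and several classifiers evaluated on the same features.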

Keywords

Cyber Security, Machine Learning, NLP, URL Detection, Classifiers

Details

Primary Language	English
Journal Section	Research Articles
Authors	Muhammed Yıldırım 0000-0003-1866-4721
Publication Date	June 30, 2022
Submission Date	April 3, 2022
Acceptance Date	May 18, 2022
Published in Issue	Year 2022 Volume: 3 Issue: 1

Cite

APA	Yıldırım, M. (2022). Using and Comparing Machine Learning Techniques for Automatic Detection of Spam Website URLs. NATURENGS, 3(1), 33-41. https://doi.org/10.46572/naturengs.1097970