Research Article

Using and Comparing Machine Learning Techniques for Automatic Detection of Spam Website URLs

Year 2022, Volume: 3 Issue: 1, 33 - 41, 30.06.2022
https://doi.org/10.46572/naturengs.1097970

Abstract

As technology advances, cybersecurity has become one of the most prominent concerns of recent years. Spam URLs are among the most common and dangerous cybersecurity threats, and among the attacks most widely used to defraud users. These attacks cause monetary losses, steal private information, and install malicious software on users' devices. It is therefore important to detect such threats promptly and to take precautions against them. Malicious URLs are mostly detected with blacklists; however, these lists cannot detect newly created URLs. In recent years, machine learning techniques have been developed to overcome this deficiency. In this study, URLs were classified with 9 different machine learning classifiers, and the performance of these classifiers was compared. Similar studies in the literature are also examined and discussed comprehensively. Finally, because data set preparation in natural language processing strongly affects model training, the preparation steps are discussed in detail.
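The limitation of blacklists noted in the abstract can be illustrated with a minimal sketch. It is not the method used in the study: the blacklist entry, the example URLs, and the helper names (`is_blacklisted`, `tokenize_url`, `lexical_features`) are all hypothetical, and the features are only examples of the kind of lexical input a classifier might be trained on.

```python
import re
from urllib.parse import urlparse

# Hypothetical blacklist; the entry is made up for illustration.
BLACKLIST = {"http://spam.example/win-prize"}


def is_blacklisted(url):
    """Exact-match lookup: only URLs already on the list are caught."""
    return url in BLACKLIST


def tokenize_url(url):
    """Split a URL into lowercase alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def lexical_features(url):
    """Illustrative lexical features computed from the URL string alone."""
    parsed = urlparse(url)
    return {
        "length": len(url),
        "num_digits": sum(c.isdigit() for c in url),
        "num_dots": parsed.netloc.count("."),
        "num_tokens": len(tokenize_url(url)),
    }


# A slightly altered, previously unseen URL slips past the blacklist,
# but its tokens and features can still be computed and fed to a model.
new_url = "http://spam2.example/win-prize"
print(is_blacklisted(new_url))
print(tokenize_url(new_url))
print(lexical_features(new_url))
```

Because the features are derived from the URL text itself, they are available for URLs that no list has seen yet, which is the gap that a trained classifier is meant to fill.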


Details

Primary Language: English
Journal Section: Research Articles
Authors: Muhammed Yıldırım (ORCID 0000-0003-1866-4721)

Publication Date: June 30, 2022
Submission Date: April 3, 2022
Acceptance Date: May 18, 2022
Published in Issue: Year 2022, Volume: 3, Issue: 1

Cite

APA Yıldırım, M. (2022). Using and Comparing Machine Learning Techniques for Automatic Detection of Spam Website URLs. NATURENGS, 3(1), 33-41. https://doi.org/10.46572/naturengs.1097970