Affiliations 

  • 1 Department of Business Administration, Westcliff University, 17877 Von Karman Ave 4th Floor, Irvine, CA, 92614, USA. r.hasan.179@westcliff.edu
  • 2 Department of Business Administration, Westcliff University, 17877 Von Karman Ave 4th Floor, Irvine, CA, 92614, USA
  • 3 Department of Business Administration, International American University, 3440 Wilshire Blvd STE 1000, Los Angeles, CA, 90010, USA
  • 4 Department of Biomedical Engineering, Biosensor and Embedded System Lab, Universiti Malaya, Kuala Lumpur, Malaysia
  • 5 Department of Vehicles Engineering, Faculty of Engineering, University of Debrecen, Ótemető Str. 2-4, Debrecen, 4028, Hungary. masuk@eng.unideb.hu
Sci Rep, 2025 Mar 17;15(1):9122.
PMID: 40097688 DOI: 10.1038/s41598-025-93447-x

Abstract

The increasing prevalence of malware presents a critical challenge to cybersecurity and calls for robust detection methods. This study uses a binary tabular classification dataset to evaluate how feature scaling, feature selection, and the choice of machine learning (ML) model affect malware detection. The methodology tests three feature scaling options (no scaling, normalization, and min-max scaling), three feature selection options (no selection, Linear Discriminant Analysis (LDA), and Principal Component Analysis (PCA)), and twelve ML models, including traditional algorithms and ensemble methods. A publicly available dataset with 11,598 samples and 139 features is used, and model performance is assessed with accuracy, precision, recall, F1-score, and AUC-ROC. The Light Gradient Boosting Machine (LGBM) achieves the highest accuracy, 97.16%, when PCA is combined with either min-max scaling or normalization. Ensemble models also consistently outperform traditional ML models. These findings can guide the choice of preprocessing and model when building reliable and efficient malware detection systems.
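As an illustration of two building blocks named in the abstract, the following is a minimal sketch of min-max scaling for a single feature column and of accuracy, precision, recall, and F1-score for binary labels. It is not the authors' code: the function names `min_max_scale` and `binary_metrics`, the toy inputs, and the convention that label 1 means "malware" are all assumptions made for this example. AUC-ROC is left out because it needs predicted scores rather than hard labels.

```python
from typing import Dict, List, Sequence


def min_max_scale(column: Sequence[float]) -> List[float]:
    """Rescale one feature column to [0, 1] via (x - min) / (max - min).

    A constant column has no spread, so it is mapped to all zeros
    (a convention chosen for this sketch).
    """
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy data (hypothetical, not taken from the study's dataset).
scaled = min_max_scale([2.0, 4.0, 6.0, 10.0])
metrics = binary_metrics(
    y_true=[1, 1, 1, 0, 0, 0, 1, 0],
    y_pred=[1, 1, 0, 0, 0, 1, 1, 0],
)
print(scaled)
print(metrics)
```

In a full experiment of the kind the abstract describes, the scaling step would be fitted on the training split only and then applied to the test split, so that test statistics do not leak into preprocessing.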
