Short Message Service (SMS) spam poses significant risks, including financial scams and phishing attempts. Although numerous datasets from online repositories have been used to address this problem, little attention has been given to evaluating their effectiveness and their impact on SMS spam detection models. This study fills that gap by assessing ten SMS spam detection datasets using Decision Tree and Multinomial Naïve Bayes models. The datasets were evaluated on accuracy and on qualitative factors such as authenticity, class imbalance, feature diversity, metadata availability, and preprocessing needs. Because the datasets are multilingual, experiments were conducted with two stopword removal groups: one using English stopwords and another using stopwords in the respective non-English languages. Based on the findings, Dataset 5 is recommended for future SMS spam detection research. This recommendation rests on two results: the dataset's high qualitative assessment score of 3.8 out of 5.0, which reflects its high feature diversity, real-world complexity, and balanced class distribution, and its low detection rate of 86.10% with Multinomial Naïve Bayes. Recommending a dataset on which high model performance is hard to achieve fosters the development of more robust and adaptable spam detection models capable of handling diverse forms of noise and ambiguity. Furthermore, selecting the dataset with the highest qualitative score enhances research quality, improves model generalizability, and mitigates risks related to bias and inconsistencies.
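The evaluation loop described above (remove stopwords, fit a Multinomial Naïve Bayes model, score accuracy) can be illustrated with a minimal pure-Python sketch. It is not the study's implementation, which the abstract does not specify: the stopword list, the toy messages, and the function names `tokenize`, `train_mnb`, `predict`, and `accuracy` are all hypothetical, and Laplace smoothing with `alpha=1.0` is an assumed default.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny English stopword list (an assumption, not the study's list).
STOPWORDS = {"a", "an", "the", "to", "you", "your", "is", "at",
             "for", "now", "me", "i", "are", "we"}

def tokenize(text, stopwords=STOPWORDS):
    """Lowercase, keep alphabetic runs, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in stopwords]

def train_mnb(messages, labels):
    """Collect per-class token counts, the vocabulary, and class frequencies."""
    counts = {label: Counter() for label in set(labels)}
    for text, label in zip(messages, labels):
        counts[label].update(tokenize(text))
    vocab = set().union(*counts.values())
    priors = Counter(labels)
    return counts, vocab, priors

def predict(model, text, alpha=1.0):
    """Return the class with the highest log-posterior (add-alpha smoothing)."""
    counts, vocab, priors = model
    total_docs = sum(priors.values())
    scores = {}
    for label, word_counts in counts.items():
        denom = sum(word_counts.values()) + alpha * len(vocab)
        score = math.log(priors[label] / total_docs)
        for tok in tokenize(text):
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log((word_counts[tok] + alpha) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

def accuracy(model, messages, labels):
    """Fraction of messages whose predicted label matches the true label."""
    correct = sum(predict(model, m) == y for m, y in zip(messages, labels))
    return correct / len(labels)

# Toy data invented for illustration only; not drawn from any of the ten datasets.
train_x = ["Win a free prize now, claim your cash",
           "Urgent: claim free cash reward today",
           "Are we still meeting for lunch today",
           "Call me when you get home tonight"]
train_y = ["spam", "spam", "ham", "ham"]
test_x = ["Claim your free prize", "Lunch at home tonight"]
test_y = ["spam", "ham"]

model = train_mnb(train_x, train_y)
print(accuracy(model, test_x, test_y))
```

Swapping `STOPWORDS` for a list in another language would correspond to the second stopword removal group; comparing `accuracy` across datasets then mirrors the quantitative part of the comparison, under the assumptions stated above.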