Affiliations 

  • 1 New Emerging Technologies and 5G Network and Beyond Research Chair, Department of Computer Engineering, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
  • 2 Business and Management Sciences Department, Purdue University, West Lafayette, IN, USA
  • 3 Department of Computer Science, Allama Iqbal Open University, Islamabad, Pakistan
  • 4 Institute of Oceanography and Environment (INOS), Universiti Malaysia Terengganu, Kuala Nerus, Terengganu, 21030, Malaysia
  • 5 Department of Computer Science, Khurasan University Jalalabad, Jalalabad, Afghanistan. Nijad@khurasan.edu.af
BioData Min, 2025 Feb 03;18(1):12.
PMID: 39901279 DOI: 10.1186/s13040-024-00415-8

Abstract

Posttranslational modifications (PTMs) are essential for regulating protein localization and stability, and they significantly affect gene expression, biological functions, and genome replication. Among these, sumoylation, a PTM that attaches a chemical group to protein sequences, plays a critical role in protein function. Identifying sumoylation sites is particularly important because of their links to Parkinson's and Alzheimer's diseases. This study introduces XGBoost-Sumo, a model that predicts sumoylation sites by integrating protein structure and sequence data. The model uses a transformer-based attention mechanism to encode peptides and extracts evolutionary features through the PsePSSM-DWT approach. Word embeddings are fused with the evolutionary descriptors, the SHapley Additive exPlanations (SHAP) algorithm is applied for feature selection, and eXtreme Gradient Boosting (XGBoost) is used for classification. XGBoost-Sumo achieved an accuracy of 99.68% on benchmark datasets using 10-fold cross-validation and 96.08% on independent samples, outperforming existing models by 10.31% on training data and 2.74% on independent tests. The model's reliability and high performance make it a valuable resource for researchers, with potential applications in pharmaceutical development.
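The fusion and 10-fold cross-validation steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names (fuse, stratified_folds, cross_validate) are hypothetical, the SHAP selection step is omitted, and a simple nearest-centroid classifier stands in for XGBoost so that the example needs only the standard library.

```python
import random


def fuse(embedding, evolutionary):
    """Concatenate two per-peptide feature vectors into one fused vector."""
    return list(embedding) + list(evolutionary)


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with roughly equal class ratios."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin across folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


class NearestCentroid:
    """Placeholder classifier (assumption): stands in for XGBoost here."""

    def fit(self, X, y):
        self.centroids = {}
        for cls in set(y):
            rows = [x for x, label in zip(X, y) if label == cls]
            self.centroids[cls] = [sum(col) / len(col) for col in zip(*rows)]
        return self

    def predict(self, X):
        def sq_dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [
            min(self.centroids, key=lambda c: sq_dist(x, self.centroids[c]))
            for x in X
        ]


def cross_validate(X, y, k=10, seed=0):
    """Return the mean held-out accuracy over k stratified folds."""
    scores = []
    for held_out in stratified_folds(y, k, seed):
        if not held_out:  # skip empty folds when a class has fewer than k samples
            continue
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = NearestCentroid().fit([X[i] for i in train], [y[i] for i in train])
        preds = model.predict([X[i] for i in held_out])
        correct = sum(1 for p, i in zip(preds, held_out) if p == y[i])
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)
```

In a fuller pipeline, each fused vector would come from a real embedding and a real PsePSSM-DWT descriptor, and the placeholder classifier would be replaced by a gradient-boosted model trained on the selected features.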

* Title and MeSH Headings from MEDLINE®/PubMed®, a database of the U.S. National Library of Medicine.