Affiliations 

  • 1 Faculty of Computer Science & Information Technology, Universiti Malaya, Kuala Lumpur, 50603 Malaysia; School of Energy and Intelligence Engineering, Henan University of Animal Husbandry and Economy, #6 North Longzihu Rd, Zhengzhou 450000, China. Electronic address: mingzhe.xu64@gmail.com
  • 2 Faculty of Computer Science & Information Technology, Universiti Malaya, Kuala Lumpur, 50603 Malaysia. Electronic address: noraniza@um.edu.my
  • 3 Faculty of Computer Science & Information Technology, Universiti Malaya, Kuala Lumpur, 50603 Malaysia. Electronic address: aznulqalid@um.edu.my
Comput Biol Chem, 2024 Feb;108:107997.
PMID: 38154318 DOI: 10.1016/j.compbiolchem.2023.107997

Abstract

This work addresses data sampling in cancer-gene association prediction. Researchers currently use machine learning methods to predict which genes are more likely to carry cancer-causing mutations. Several approaches have been proposed to improve the performance of these models, one of which is to improve the quality of the training data. Existing methods focus mainly on screening the positive data, i.e. cancer driver genes. This paper proposes a method, based on gene networks and graph theory algorithms, for screening genes with low cancer relevance in order to improve the selection of negative samples. Genes with low cancer correlation are used as negative training samples. Experiments show that training a cancer gene classification model on negative samples screened by this method improves prediction performance. The main advantage of the method is that it can easily be combined with other methods that focus on enhancing the quality of positive training samples. Combining it with three state-of-the-art cancer gene prediction methods yielded a significant improvement.
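
The abstract does not name the graph algorithm used, so the following is only an illustrative sketch of the general idea of network-based negative sampling, not the authors' method. It assumes, hypothetically, that "low cancer relevance" can be approximated by hop distance from known driver genes in an undirected gene network, computed with a multi-source breadth-first search. The function names, the `min_distance` threshold, the treatment of unreachable genes, and the toy gene identifiers are all assumptions made for illustration.

```python
from collections import deque


def distances_from_drivers(network, drivers):
    """Multi-source BFS: hop distance from each gene to its nearest driver gene.

    `network` maps a gene name to the set of its neighbours (undirected graph).
    Genes that cannot be reached from any driver are absent from the result.
    """
    dist = {gene: 0 for gene in drivers if gene in network}
    queue = deque(dist)
    while queue:
        gene = queue.popleft()
        for neighbour in network.get(gene, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[gene] + 1
                queue.append(neighbour)
    return dist


def select_negative_samples(network, drivers, min_distance=3):
    """Return genes at least `min_distance` hops away from every driver gene.

    Assumption: genes unreachable from all drivers count as infinitely far
    and are therefore kept as negative candidates.
    """
    dist = distances_from_drivers(network, drivers)
    return sorted(
        gene for gene in network
        if dist.get(gene, float("inf")) >= min_distance
    )


# Toy network with placeholder gene names (not real data).
toy_network = {
    "DRIVER_1": {"G1"},
    "G1": {"DRIVER_1", "G2"},
    "G2": {"G1", "G3"},
    "G3": {"G2", "G4"},
    "G4": {"G3"},
    "G5": set(),  # isolated gene, unreachable from any driver
}
negatives = select_negative_samples(toy_network, {"DRIVER_1"}, min_distance=3)
```

In this toy example the candidates are the genes three or more hops from `DRIVER_1`, plus the isolated gene. In practice the threshold and the handling of isolated genes would need to be validated against the classifier's performance, as the abstract describes doing experimentally.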
