Affiliations 

  • 1 Department of Statistics, Faculty of Mathematical Sciences, Ferdowsi University of Mashhad, Mashhad, Iran
  • 2 Department of Statistics, Faculty of Mathematics, Statistics and Computer Sciences, Semnan University, Semnan, Iran
  • 3 UM Centre of Data Analytics, Institute of Mathematical Sciences, University of Malaya, Kuala Lumpur, Malaysia
  • 4 Faculty of Mathematics, Polytechnic of Torino University, Torino, Italy
PLoS One, 2021;16(4):e0245376.
PMID: 33831027 DOI: 10.1371/journal.pone.0245376

Abstract

Advances in technology have made the analysis of large-scale gene expression data feasible, and such analysis has become very popular in the era of machine learning. This paper develops an improved ridge approach for genome regression modeling. When the data set exhibits multicollinearity and contains outliers, we consider a robust ridge estimator, namely the rank ridge regression estimator, for parameter estimation and prediction. The efficiency of the rank ridge regression estimator, however, depends heavily on the ridge parameter, and in general it is difficult to give a satisfactory answer about how to select it. Because of the good properties of generalized cross validation (GCV) and its simplicity, we use it to choose the optimum value of the ridge parameter. The GCV function balances the precision of the estimators against the bias caused by ridge estimation. It behaves like an improved estimator of risk and can be used in high-dimensional problems, where the number of explanatory variables is larger than the sample size. Finally, some numerical illustrations are given to support our findings.
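
To illustrate how GCV can pick a ridge parameter when the number of explanatory variables exceeds the sample size, here is a minimal sketch. It assumes an ordinary least-squares ridge fit, not the rank-based estimator studied in the paper, and uses the standard criterion GCV(lambda) = (RSS/n) / (1 - tr(H)/n)^2, where H is the ridge hat matrix. The function name `gcv_ridge`, the grid of candidate values, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np


def gcv_ridge(X, y, lambdas):
    """Return (best_lambda, gcv_scores) for ordinary ridge regression.

    Assumption: least-squares ridge, not the paper's rank ridge estimator.
    """
    lambdas = np.asarray(lambdas, dtype=float)
    n = X.shape[0]
    # Thin SVD of X; valid whether p < n or p > n.
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    Uty = U.T @ y
    scores = np.empty(len(lambdas))
    for i, lam in enumerate(lambdas):
        # Nonzero eigenvalues of the hat matrix H = X (X'X + lam I)^{-1} X'.
        shrink = s**2 / (s**2 + lam)
        fitted = U @ (shrink * Uty)
        rss = np.sum((y - fitted) ** 2)
        edf = np.sum(shrink)  # trace of H (effective degrees of freedom)
        scores[i] = (rss / n) / (1.0 - edf / n) ** 2
    return float(lambdas[int(np.argmin(scores))]), scores


# Synthetic high-dimensional example (p > n); the data are made up.
rng = np.random.default_rng(0)
n, p = 30, 50
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 2.0
y = X @ beta + rng.standard_normal(n)

grid = np.logspace(-2, 3, 30)  # hypothetical grid of candidate values
best_lambda, scores = gcv_ridge(X, y, grid)
print("ridge parameter chosen by GCV:", best_lambda)
```

Working through the singular value decomposition avoids forming and inverting the p-by-p matrix for every candidate value, which matters when p is large. A robust variant along the lines of the paper would replace the residual sum of squares with a rank-based dispersion measure; that step is not shown here.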
