Cancer classification and gene selection in high-dimensional data have been popular research topics in genetics and molecular biology. Recently, adaptive regularized logistic regression using elastic net regularization, called the adaptive elastic net, has been successfully applied in high-dimensional cancer classification to estimate the gene coefficients and perform gene selection simultaneously. The adaptive elastic net originally used elastic net estimates as the initial weight; however, this weight may not be preferable for two reasons. First, the elastic net estimator is biased in selecting genes. Second, it does not perform well when the pairwise correlations between variables are not high. Adjusted adaptive regularized logistic regression (AAElastic) is proposed to address these issues and encourage grouping effects simultaneously. The real data results indicate that AAElastic is significantly more consistent in selecting genes than the other three competing regularization methods. Additionally, the classification performance of AAElastic is comparable to that of the adaptive elastic net and better than that of the other regularization methods. Thus, we can conclude that AAElastic is a reliable adaptive regularized logistic regression method in the field of high-dimensional cancer classification.
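For orientation, the adaptive elastic net penalized logistic regression referred to above can be written in the following general form. This is a sketch of the standard formulation, not the AAElastic estimator itself: the intercept is omitted, scaling conventions for the two tuning parameters vary between authors, and the adjusted initial weight used by AAElastic is not specified in this abstract. The symbols $\lambda_1$, $\lambda_2$, $\gamma$ and $\hat{\beta}^{\mathrm{init}}$ are notation assumed here.

```latex
\hat{\beta} = \arg\min_{\beta} \left\{
  -\sum_{i=1}^{n} \left[ y_i x_i^{\top}\beta - \log\left(1 + e^{x_i^{\top}\beta}\right) \right]
  + \lambda_1 \sum_{j=1}^{p} w_j \lvert \beta_j \rvert
  + \lambda_2 \sum_{j=1}^{p} \beta_j^{2}
\right\},
\qquad
w_j = \left( \lvert \hat{\beta}_j^{\mathrm{init}} \rvert \right)^{-\gamma}, \quad \gamma > 0 .
```

In the original adaptive elastic net, $\hat{\beta}^{\mathrm{init}}$ is the elastic net estimate; the weighted L1 term performs gene selection while the squared term encourages the grouping effect.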
A high-dimensional quantitative structure-activity relationship (QSAR) classification model typically contains a large number of irrelevant and redundant descriptors. In this paper, a new descriptor selection method for QSAR classification model estimation is proposed by adding a new weight inside the L1-norm. The experimental results of classifying the anti-hepatitis C virus activity of thiourea derivatives demonstrate that the proposed descriptor selection method performs effectively and competitively compared with other existing penalized methods in terms of classification performance on both the training and the testing datasets. Moreover, the results of the stability test and the applicability domain indicate that the QSAR classification model is robust. The results suggest that the developed QSAR classification model could be employed for further high-dimensional QSAR classification studies.
One of the most challenging issues in building a quantitative structure-activity relationship (QSAR) classification model is descriptor selection. Penalized methods have been adapted and have gained popularity as a way of simultaneously performing descriptor selection and QSAR classification model estimation. However, penalized methods have drawbacks, such as bias and inconsistency, that cause them to lack the oracle properties. This paper proposes an adaptive penalized logistic regression (APLR) to overcome these drawbacks. This is done by employing the ratio (BWR) of the between-groups sum of squares (BSS) to the within-groups sum of squares (WSS) for each descriptor as a weight inside the L1-norm. The proposed method was applied to a dataset that consists of a diverse series of antimicrobial agents with their respective bioactivities against Candida albicans. The experimental study shows that the proposed method (APLR) performed better in descriptor selection and classification accuracy than the other competing methods that could be used in developing QSAR classification models. The method was also applied successfully to another dataset. Therefore, it can be concluded that the APLR method has a significant impact on QSAR analysis and studies.
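The BSS/WSS ratio for each descriptor can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `bwr_weights` and the list-of-rows data layout are hypothetical, and the abstract does not say how the ratio is transformed before it enters the L1-norm, so the sketch only computes the raw ratio.

```python
from statistics import mean


def bwr_weights(X, y):
    """Return the BSS/WSS ratio of every descriptor (column of X).

    X is a list of rows (one list of descriptor values per compound) and
    y is the list of class labels. Both names are illustrative assumptions.
    """
    classes = sorted(set(y))
    ratios = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        overall_mean = mean(column)
        bss = 0.0  # between-groups sum of squares
        wss = 0.0  # within-groups sum of squares
        for c in classes:
            values = [v for v, label in zip(column, y) if label == c]
            class_mean = mean(values)
            bss += len(values) * (class_mean - overall_mean) ** 2
            wss += sum((v - class_mean) ** 2 for v in values)
        # A descriptor that is constant within every class has WSS = 0.
        ratios.append(bss / wss if wss > 0 else float("inf"))
    return ratios


# Toy data: descriptor 0 separates the classes, descriptor 1 does not.
X_demo = [[1.0, 5.0], [2.0, 1.0], [9.0, 4.0], [10.0, 2.0]]
y_demo = [0, 0, 1, 1]
demo_ratios = bwr_weights(X_demo, y_demo)
```

A large ratio marks a descriptor whose class means are far apart relative to its within-class spread, which is the kind of descriptor an adaptive weight would be expected to penalize less.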
In high-dimensional quantitative structure-activity relationship (QSAR) modelling, penalization methods have been a popular choice to simultaneously address molecular descriptor selection and QSAR model estimation. In this study, a penalized linear regression model with L1/2-norm is proposed. Furthermore, the local linear approximation algorithm is utilized to avoid the non-convexity of the proposed method. The potential applicability of the proposed method is tested on several benchmark data sets. Compared with other commonly used penalized methods, the proposed method can not only obtain the best predictive ability, but also provide an easily interpretable QSAR model. In addition, it is noteworthy that the results obtained in terms of applicability domain and Y-randomization test provide an efficient and a robust QSAR model. It is evident from the results that the proposed method may possibly be a promising penalized method in the field of computational chemistry research, especially when the number of molecular descriptors exceeds the number of compounds.
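The role of the local linear approximation (LLA) can be made explicit with a short derivation. It is a sketch of the generic LLA step applied to the L1/2 penalty, with $\beta^{(0)}$ denoting the current estimate (notation assumed here); the paper's exact algorithmic details are not given in this abstract.

```latex
\lambda \sum_{j=1}^{p} \lvert \beta_j \rvert^{1/2}
\;\approx\;
\lambda \sum_{j=1}^{p} \left[
  \lvert \beta_j^{(0)} \rvert^{1/2}
  + \frac{1}{2\,\lvert \beta_j^{(0)} \rvert^{1/2}}
    \left( \lvert \beta_j \rvert - \lvert \beta_j^{(0)} \rvert \right)
\right]
```

Dropping the terms that do not depend on $\beta$, each LLA step reduces to a convex weighted L1 problem with weights $w_j = 1 / \bigl( 2 \lvert \beta_j^{(0)} \rvert^{1/2} \bigr)$. The weight is undefined when $\beta_j^{(0)} = 0$; adding a small positive constant to the denominator is one common workaround, stated here as an assumption.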
High dimensionality is one of the major problems affecting the quality of quantitative structure-activity relationship (QSAR) modelling. Obtaining a reliable QSAR model with few descriptors is an essential procedure in chemometrics. The binary grasshopper optimization algorithm (BGOA) is a new meta-heuristic optimization algorithm that has been used successfully to perform feature selection. In this paper, four new transfer functions were adapted to improve the exploration and exploitation capability of the BGOA in QSAR modelling of influenza A viruses (H1N1). The QSAR model with these new quadratic transfer functions was internally and externally validated based on MSEtrain, the Y-randomization test, MSEtest, and the applicability domain (AD). The validation results indicate that the model is robust and not due to chance correlation. In addition, the results indicate that, for the training dataset, the descriptor selection and prediction performance of the QSAR model outperform those obtained with the S-shaped and V-shaped transfer functions. The QSAR model using the quadratic transfer function shows the lowest MSEtrain. For the test dataset, the proposed QSAR model shows a lower MSEtest than the other methods, indicating its higher predictive ability. In conclusion, the results reveal that the proposed QSAR model is an efficient approach for modelling high-dimensional QSAR data and is useful for the estimation of IC50 values of neuraminidase inhibitors that have not been experimentally tested.
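As a generic illustration of how a transfer function turns a continuous search-agent position into a binary descriptor mask, the sketch below uses the common sigmoid (S-shaped) and |tanh| (V-shaped) forms. The quadratic transfer functions proposed in the paper are not given in this abstract and are not reproduced here; all function names are illustrative assumptions.

```python
import math
import random


def s_shaped(x):
    """Sigmoid transfer function: probability that the bit is set to 1."""
    return 1.0 / (1.0 + math.exp(-x))


def v_shaped(x):
    """|tanh| transfer function: probability that the current bit is flipped."""
    return abs(math.tanh(x))


def binarize_s(x, rng):
    """S-shaped rule: draw the new bit directly from the probability."""
    return 1 if rng.random() < s_shaped(x) else 0


def binarize_v(x, current_bit, rng):
    """V-shaped rule: flip the current bit with the given probability."""
    return 1 - current_bit if rng.random() < v_shaped(x) else current_bit


# Convert a continuous position vector into a descriptor mask (1 = selected).
rng = random.Random(0)
position = [-2.0, 0.0, 3.5]
mask = [binarize_s(x, rng) for x in position]
```

The two families differ in what the probability means: S-shaped functions set the bit regardless of its previous value, whereas V-shaped functions keep the bit unless the continuous value is large in magnitude.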
A robust screening approach and a sparse quantitative structure-retention relationship (QSRR) model for predicting retention indices (RIs) of 169 constituents of essential oils are proposed. The proposed approach consists of two steps. First, dimension reduction was performed using the proposed modified robust sure independence screening (MR-SIS) method. Second, prediction of RIs was made using the proposed robust sparse QSRR with smoothly clipped absolute deviation (SCAD) penalty (RSQSRR). The RSQSRR model was internally and externally validated based on [Formula: see text], [Formula: see text], [Formula: see text], [Formula: see text], Y-randomization test, [Formula: see text], [Formula: see text], and the applicability domain. The validation results indicate that the model is robust and not due to chance correlation. For the training dataset, the descriptor selection and prediction performance of the RSQSRR outperform those of the other two modelling methods used. The RSQSRR shows the highest [Formula: see text], [Formula: see text], and [Formula: see text], and the lowest [Formula: see text]. For the test dataset, the RSQSRR shows a high external validation value ([Formula: see text]) and a low value of [Formula: see text] compared with the other methods, indicating its higher predictive ability. In conclusion, the results reveal that the proposed RSQSRR is an efficient approach for modelling high-dimensional QSRRs and that the method is useful for the estimation of RIs of essential oils that have not been experimentally tested.
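The SCAD penalty mentioned above has a standard piecewise form, sketched below for a single coefficient. This shows only the penalty function with the conventional default a = 3.7; the robust loss and the MR-SIS screening step of the paper are not reproduced, and the function name is an assumption.

```python
def scad_penalty(beta, lam, a=3.7):
    """Smoothly clipped absolute deviation penalty for one coefficient.

    lam is the tuning parameter (> 0) and a (> 2) controls where the
    penalty levels off; a = 3.7 is the conventional default.
    """
    t = abs(beta)
    if t <= lam:
        # Lasso-like linear region for small coefficients.
        return lam * t
    if t <= a * lam:
        # Quadratic transition region.
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    # Constant region: large coefficients receive no additional penalty.
    return lam * lam * (a + 1) / 2
```

Because the penalty is flat beyond a·lam, large coefficients are not shrunk further, which is the property that reduces the estimation bias associated with the plain L1 penalty.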
A penalized quantitative structure-property relationship (QSPR) model with adaptive bridge penalty for predicting the melting points of 92 energetic carbocyclic nitroaromatic compounds is proposed. To ensure the consistency of the descriptor selection of the proposed penalized adaptive bridge (PBridge), we propose a ridge estimator ([Formula: see text]) as an initial weight in the adaptive bridge penalty. The Bayesian information criterion was applied to ensure the accurate selection of the tuning parameter ([Formula: see text]). The PBridge-based model was internally and externally validated based on [Formula: see text], [Formula: see text], [Formula: see text], [Formula: see text], [Formula: see text], [Formula: see text], the Y-randomization test, [Formula: see text], [Formula: see text], [Formula: see text], [Formula: see text] and the applicability domain. The validation results indicate that the model is robust and not due to chance correlation. For the training dataset, the descriptor selection and prediction performance of PBridge outperform those of the other methods used. PBridge shows the highest [Formula: see text] of 0.959, [Formula: see text] of 0.953, [Formula: see text] of 0.949 and [Formula: see text] of 0.959, and the lowest [Formula: see text] and [Formula: see text]. For the test dataset, PBridge shows a higher [Formula: see text] of 0.945 and [Formula: see text] of 0.948, and a lower [Formula: see text] and [Formula: see text], indicating its better prediction performance. The results clearly reveal that the proposed PBridge is useful for constructing reliable and robust QSPRs for predicting melting points prior to synthesizing new organic compounds.
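Selecting a tuning parameter by the Bayesian information criterion can be sketched as follows. The sketch assumes the Gaussian-likelihood form of BIC for a linear model, n·log(RSS/n) + df·log(n); the abstract does not state which BIC variant was used, and the candidate values below are made up purely for illustration.

```python
import math


def bic(rss, n, df):
    """Gaussian-likelihood BIC for a fit with residual sum of squares rss,
    n observations and df nonzero coefficients (assumed form)."""
    return n * math.log(rss / n) + df * math.log(n)


def select_by_bic(candidates, n):
    """Return the tuning-parameter value with the smallest BIC.

    candidates is a list of (tuning_value, rss, df) tuples, one per fitted model.
    """
    return min(candidates, key=lambda c: bic(c[1], n, c[2]))[0]


# Hypothetical fits for n = 92 compounds: (tuning value, RSS, nonzero descriptors).
demo_candidates = [(0.01, 10.0, 20), (0.1, 12.0, 5), (1.0, 40.0, 1)]
best_value = select_by_bic(demo_candidates, n=92)
```

The log(n) term charges every retained descriptor, so a slightly worse fit with far fewer descriptors can still achieve the lowest criterion value.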
Time-varying binary gravitational search algorithm (TVBGSA) is proposed for predicting antidiabetic activity of 134 dipeptidyl peptidase-IV (DPP-IV) inhibitors. To improve the performance of the binary gravitational search algorithm (BGSA) method, we propose a dynamic time-varying transfer function. A new control parameter, μ, is added in the original transfer function as a time-varying variable. The TVBGSA-based model was internally and externally validated based on $Q^2_{\mathrm{int}}$, $Q^2_{\mathrm{LGO}}$, $Q^2_{\mathrm{Boot}}$, $\mathrm{MSE}_{\mathrm{train}}$, $Q^2_{\mathrm{ext}}$, $\mathrm{MSE}_{\mathrm{test}}$, Y-randomization test, and applicability domain evaluation. The validation results indicate that the proposed TVBGSA model is robust and not due to chance correlation. The descriptor selection and prediction performance of TVBGSA outperform BGSA method. TVBGSA shows higher $Q^2_{\mathrm{int}}$ of 0.957, $Q^2_{\mathrm{LGO}}$ of 0.951, $Q^2_{\mathrm{Boot}}$ of 0.954, $Q^2_{\mathrm{ext}}$ of 0.938, and lower $\mathrm{MSE}_{\mathrm{train}}$ and $\mathrm{MSE}_{\mathrm{test}}$ compared to obtained results by BGSA, indicating the best prediction performance of the proposed TVBGSA model. The results clearly reveal that the proposed TVBGSA method is useful for constructing reliable and robust QSARs for predicting antidiabetic activity of DPP-IV inhibitors prior to designing and experimental synthesizing of new DPP-IV inhibitors.
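The Q² and MSE validation statistics named above can be computed as in the sketch below. It assumes the common definition Q² = 1 − PRESS / Σ(y − ȳ)², where the predictions come from cross-validation (internal) or from the test set (external) and ȳ is a reference mean, often the training-set mean. Several variants of the external Q² exist and the abstract does not say which one was used; the function names are illustrative.

```python
def mse(y_true, y_pred):
    """Mean squared error between observed and predicted activities."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def q2(y_true, y_pred, reference_mean):
    """Q^2 = 1 - PRESS / total sum of squares around reference_mean.

    For an internal Q^2, y_pred holds cross-validated predictions; for an
    external Q^2, y_true and y_pred refer to the test set and
    reference_mean is typically the training-set mean (assumed convention).
    """
    press = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    total = sum((a - reference_mean) ** 2 for a in y_true)
    return 1.0 - press / total


# Toy check: perfect predictions give Q^2 = 1, predicting the mean gives 0.
y_obs = [1.0, 2.0, 3.0, 4.0]
q2_perfect = q2(y_obs, y_obs, reference_mean=2.5)
q2_mean_only = q2(y_obs, [2.5, 2.5, 2.5, 2.5], reference_mean=2.5)
```

A Q² near 1 with a low MSE indicates good predictive ability, while a model that predicts no better than the reference mean scores 0 or below.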
An improved binary differential search (improved BDS) algorithm is proposed for QSAR classification of a diverse series of antimicrobial compounds against Candida albicans. The transfer function is the most important component of the BDS algorithm; it converts the continuous values of the donor into discrete values. In this paper, eight types of transfer functions are investigated to verify their efficiency in improving BDS algorithm performance in QSAR classification. The performance was evaluated using three metrics: classification accuracy (CA), geometric mean of sensitivity and specificity (G-mean), and area under the curve (AUC). The Kruskal-Wallis test was also applied to show the statistical differences between the functions. Two functions, S1 and V4, show the best classification performance, with V4 performing slightly better than S1. The V4 function requires the fewest iterations and selects the fewest descriptors. In addition, the V4 function yields the best CA and G-mean of 98.07% and 0.977%, respectively. The results show that the V4 transfer function significantly improves the performance of the original BDS.
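The classification accuracy and G-mean used as evaluation metrics can be computed from the confusion-matrix counts as sketched below. The function name and the label encoding (1 = active, 0 = inactive) are assumptions made for illustration.

```python
import math


def accuracy_and_gmean(y_true, y_pred, positive=1):
    """Return (classification accuracy, G-mean) for binary labels.

    G-mean is the geometric mean of sensitivity and specificity.
    Both classes must be present in y_true, otherwise a rate is undefined.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    accuracy = (tp + tn) / len(pairs)
    return accuracy, math.sqrt(sensitivity * specificity)


# Toy example: one active compound is misclassified as inactive.
demo_acc, demo_gmean = accuracy_and_gmean(
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
)
```

Computed this way, the G-mean is a proportion between 0 and 1, and it drops to 0 whenever one class is entirely misclassified, which makes it more informative than accuracy on imbalanced datasets.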
One of the recently developed metaheuristic algorithms, the coyote optimization algorithm (COA), has been shown to perform well in a number of difficult optimization tasks. The binary form, BCOA, is used in this study as a solution to the descriptor selection issue in classifying a diverse antifungal series. Z-shape transfer functions (ZTF) are evaluated to verify their efficiency in improving BCOA performance in QSAR classification based on classification accuracy (CA), the geometric mean of sensitivity and specificity (G-mean), and the area under the curve (AUC). The Kruskal-Wallis test is also applied to show the statistical differences between the functions. The efficacy of the best suggested transfer function, ZTF4, is further assessed by comparing it to the most recent binary algorithms. The results show that ZTF, especially ZTF4, significantly improves the performance of the original BCOA. The ZTF4 function yields the best CA and G-mean of 99.03% and 0.992%, respectively. It shows the fastest convergence behaviour compared to other binary algorithms. It takes the fewest iterations to reach high classification performance and selects the fewest descriptors. In conclusion, the obtained results indicate the ability of the ZTF4-based BCOA to find the smallest subset of descriptors while maintaining the best classification accuracy.
The horse herd optimization algorithm (HOA), one of the more recent metaheuristic algorithms, has demonstrated superior performance in a number of challenging optimization tasks. In the present work, its binary form, BHOA, is used to address the descriptor selection problem in predicting the retention indices of different essential oils. Based on internal and external prediction criteria, Z-shape transfer functions (ZTF) were tested to verify their efficiency in improving BHOA performance in QSPR modelling for predicting retention indices of essential oils. The evaluation criteria were the mean-squared error (MSE) of the training and testing datasets, and leave-one-out internal and external validation (Q2). The degree of convergence of the proposed Z-shaped transfer functions was compared. In addition, K-fold cross validation with k = 5 was applied. The results show that ZTF, especially ZTF1, greatly improves the performance of the original BHOA. Among the binary algorithms compared, ZTF, especially ZTF1, exhibits the fastest convergence behaviour. It selects the fewest descriptors and requires the fewest iterations to achieve excellent prediction performance.
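The K-fold cross validation with k = 5 mentioned above partitions the compounds as in the following sketch. It shows only the index splitting, using the Python standard library; the function name, the shuffling seed and the dataset size are illustrative assumptions, and the model fitting inside each fold is left out.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds.

    Indices are shuffled once, then dealt into k folds whose sizes
    differ by at most one.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# Hypothetical dataset of 23 compounds split into 5 folds.
demo_splits = list(k_fold_indices(23, k=5))
```

Each compound appears in exactly one test fold, so averaging the per-fold MSE or Q2 gives an estimate of predictive performance that uses every compound for validation once.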