Integration of machine learning-based prediction for enhanced Model's generalization: Application in photocatalytic polishing of palm oil mill effluent (POME)

Ng KH; Gan YS; Cheng CK; Liu KH; Liong ST

doi:10.1016/j.envpol.2020.115500

Ng KH ¹ , Gan YS ² , Cheng CK ³ , Liu KH ⁴ , Liong ST ⁵

Affiliations

¹ College of Chemical Engineering, Fuzhou University, Fuzhou, 350116, PR China; School of Energy and Chemical Engineering, Xiamen University Malaysia, Selangor Darul Ehsan, 43900, Malaysia
² School of Architecture, Feng Chia University, Taichung, 407, Taiwan
³ Department of Chemical Engineering, College of Engineering, Khalifa University, P. O. Box 127788, Abu Dhabi, United Arab Emirates
⁴ School of Informatics, Xiamen University, Xiamen, 361005, China
⁵ Department of Electronic Engineering, Feng Chia University, Taichung, 407, Taiwan. Electronic address: stliong@fcu.edu.tw

Environ Pollut, 2020 Dec;267:115500.

PMID: 33254722 DOI: 10.1016/j.envpol.2020.115500

Abstract

A previously developed quadratic model for predicting palm oil mill effluent (POME) degradation efficiency quantified the effects of O₂ flowrate, TiO₂ loading and initial POME concentration in a lab-scale photocatalytic system, but it generalized poorly because of overfitting. Its high RMSE (131.61) and low R² (-630.49) show that it cannot describe POME degradation at unseen factor ranges, which confirms its poor generalization. To overcome this issue, several models were developed with machine learning techniques, namely Gaussian Process Regression (GPR), Linear Regression (LR), Decision Tree (DT), Support Vector Machine (SVM) and Regression Tree Ensemble (RTE), and were then assessed systematically. To achieve high generalization, all models were subjected to a 'train-all-test-all' strategy and to 5-fold and 10-fold cross validation. The GPR model was highly accurate under the 'train-all-test-all' strategy, judging from its low RMSE (1.0394) and high R² (0.9962), but this result is exposed to the risk of overfitting. In contrast, despite its relatively poorer RMSE and R² (1.7964 and 0.9886) in 5-fold cross validation, the GPR model showed the highest generalization while sufficiently preserving the accuracy attained during model development. The SVM and RTE models also showed promising R² (0.9372 and 0.9208), which were however overshadowed by their high RMSEs (4.2174 and 4.7366). The strong generalization of the GPR model was likewise confirmed in 10-fold cross validation: the lowest RMSE (2.1624) and highest R² (0.9835), obtained with 36 features, indicate that it is adequate in both generalization and accuracy. The other models all gave slightly lower R² (> 0.9), plausibly because of their higher RMSE (> 4.0). According to the GPR model, optimized POME degradation (52.52%) is obtained at 70 mL/min of O₂, 70.0 g/L of TiO₂ and 250 ppm of POME, with only ∼3% error relative to the actual data.
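
The contrast the abstract draws between 'train-all-test-all' scores and k-fold cross-validation scores, and the possibility of a negative R², can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic (not the POME measurements), the three inputs merely stand in for the three process factors, and a 1-nearest-neighbour regressor is swapped in for GPR because it needs no external library and memorizes its training set, which makes the optimism of 'train-all-test-all' easy to see.

```python
import math
import random

def rmse(y_true, y_pred):
    # Root mean squared error between observed and predicted values.
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot.
    # It turns negative when the model is worse than predicting the mean.
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def predict_1nn(train_x, train_y, x):
    # Stand-in model (NOT the paper's GPR): return the response of the
    # closest training point in squared Euclidean distance.
    best = min(range(len(train_x)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)))
    return train_y[best]

def k_fold_predictions(xs, ys, k, seed=0):
    # Each sample is predicted by a model that never saw it during fitting.
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    preds = [None] * len(xs)
    for fold in range(k):
        test = idx[fold::k]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        tx = [xs[i] for i in train]
        ty = [ys[i] for i in train]
        for i in test:
            preds[i] = predict_1nn(tx, ty, xs[i])
    return preds

# Hypothetical data: three scaled factors and a noisy response.
rng = random.Random(42)
xs = [(rng.uniform(0, 1), rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(40)]
ys = [30 + 20 * a - 10 * b * b + 5 * c + rng.gauss(0, 1) for a, b, c in xs]

# 'Train-all-test-all': fit and score on the same samples.
train_all = [predict_1nn(xs, ys, x) for x in xs]
rmse_all, r2_all = rmse(ys, train_all), r2(ys, train_all)

# 5-fold cross validation: score on held-out samples only.
cv5 = k_fold_predictions(xs, ys, 5)
rmse_cv, r2_cv = rmse(ys, cv5), r2(ys, cv5)

# A prediction far from the data gives a strongly negative R².
bad_r2 = r2([1.0, 2.0, 3.0], [10.0, 10.0, 10.0])

print(f"train-all-test-all: RMSE={rmse_all:.4f}, R2={r2_all:.4f}")
print(f"5-fold CV:          RMSE={rmse_cv:.4f}, R2={r2_cv:.4f}")
print(f"R2 of a poor predictor: {bad_r2}")
```

In this sketch the memorizing model scores perfectly on its own training data, while the held-out score is worse and is the more honest estimate of behaviour at unseen factor settings. The same reading applies to the abstract's comparison of the 'train-all-test-all' and cross-validated GPR results, although the numbers here bear no relation to those reported.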
