Accurate identification of distant, large, and frequent sources of emission in cities is a complex procedure due to the presence of large-sized pollutants and the existence of many land use types. This study aims to simplify and optimize the visualization of long time series of air pollution data, particularly for urban areas, where the data are naturally correlated in time and spatially complicated to analyze. We also elaborate different sources of pollution that were hitherto undetectable using ordinary plot models by leveraging recent advances in ensemble statistical approaches. The high-performing conditional bivariate probability function (CBPF) and time series signature were integrated within the R programming environment to facilitate the analysis. Hourly air pollution data for the period from 2007 to 2016 were collected from four air quality stations (ca0016, ca0058, ca0054, and ca0025) situated in highly urbanized locations that are characterized by complex land use and high pollution-emitting activities. The CBPF was used to analyze the data, with concentration values of pollutants such as sulfur dioxide (SO2), nitrogen dioxide (NO2), carbon monoxide (CO), and particulate matter (PM10) as a third variable plotted on the radial axis, together with the wind direction and wind speed variables. A generalized linear model (GLM) and sensitivity analysis were applied to verify and visualize the relationship between the Air Pollution Index (API) of PM10 and the other significant pollutants in the GLM outputs, based on quantile values. To address potential future challenges, we forecast PM10 values three months ahead using a time series signature statistical algorithm with time functions and validated the outcome at the four stations. Analysis of the results reveals that sources emitting PM10 involve similar activities that also produce other pollutants (SO2, CO, and NO2).
Therefore, these pollutants can be detected by cross selection between the pollution sources in the affected city. The directional results of the CBPF plots indicate that ca0058 and ca0054 enable easier detection of pollutant sources than ca0016 and ca0025, owing to their location on the edge of industrial areas. The outcomes of the CBPF technique and the time series signature analysis are promising: they successfully elaborate different sources of pollution that were hitherto undetectable using ordinary plot models and thus contribute to existing air quality assessment and enhancement mechanisms.
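To make the CBPF idea concrete, the following is a minimal sketch in Python rather than the R environment used in the study. It follows the commonly cited definition, in which the CBPF for a wind direction and wind speed cell is m/n: n is the number of observations in that cell, and m is the number of those observations whose concentration falls within a chosen interval. The function name `cbpf`, the bin widths, and the sample records are illustrative assumptions and are not taken from the study.

```python
from collections import defaultdict


def cbpf(records, low, high, dir_bin_deg=30.0, speed_bin=1.0):
    """Return {(direction_bin, speed_bin): probability}.

    records: iterable of (wind_dir_deg, wind_speed, concentration).
    For each wind direction/speed cell, the probability is the share of
    observations in that cell whose concentration lies in [low, high].
    Bin widths are illustrative defaults, not values from the study.
    """
    total = defaultdict(int)  # n: all observations per cell
    hits = defaultdict(int)   # m: observations per cell within [low, high]
    for wind_dir, wind_speed, conc in records:
        cell = (
            int((wind_dir % 360.0) // dir_bin_deg),
            int(wind_speed // speed_bin),
        )
        total[cell] += 1
        if low <= conc <= high:
            hits[cell] += 1
    return {cell: hits[cell] / n for cell, n in total.items()}


# Hypothetical hourly records: (wind direction in degrees, wind speed, PM10).
sample = [
    (10, 0.5, 80),
    (20, 0.8, 20),
    (100, 2.5, 90),
    (95, 2.1, 95),
    (200, 1.2, 10),
]
result = cbpf(sample, low=75, high=float("inf"))
for cell, prob in sorted(result.items()):
    print(cell, prob)
```

Cells with a probability close to 1 point toward wind conditions under which high concentrations dominate, which is what makes the directional plots useful for source detection. In practice, cells with very few observations would usually be masked or smoothed before plotting.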
This study investigates the uncertainty in machine learning that can occur when the prediction importance of the independent variables varies significantly, especially when the receiver operating characteristic (ROC) curve fails to reflect the unbalanced effect of the prediction variables. A variable drop-off loop function, based on the concept of early termination for reduction of model capacity, regularization, and generalization control, was tested. A susceptibility index for airborne particulate matter of less than 10 μm in diameter (PM10) was modeled using monthly maximum values; spectral bands and indices from Landsat 8 imagery, together with OpenStreetMap data, were used to prepare a range of independent variables. Probability and classification index maps were prepared using extreme gradient boosting (XGBoost) and random forest (RF) algorithms. These were assessed against utility criteria such as overall accuracy from the confusion matrix, number of variables, processing delay, degree of overfitting, importance distribution, and area under the ROC curve.
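The abstract does not specify how the variable drop-off loop is implemented, so the following Python sketch is only one plausible reading: backward elimination that removes the least important variable on each pass and terminates early once the validation score drops by more than a tolerance. The function names, the `tolerance` parameter, the toy scorer, and the feature names are all hypothetical; in a real workflow `fit_and_score` would train an XGBoost or RF model and return its validation score and variable importances.

```python
def variable_dropoff(features, fit_and_score, tolerance=0.01, min_features=1):
    """Backward elimination with early termination (illustrative sketch).

    fit_and_score(features) must return
    (validation_score, {feature: importance}).
    """
    current = list(features)
    best_score, importances = fit_and_score(current)
    history = [(tuple(current), best_score)]
    while len(current) > min_features:
        # Drop the variable the current model relies on least.
        weakest = min(current, key=lambda f: importances[f])
        candidate = [f for f in current if f != weakest]
        score, cand_importances = fit_and_score(candidate)
        if score < best_score - tolerance:
            break  # early termination: removing more hurts validation score
        current, importances = candidate, cand_importances
        best_score = max(best_score, score)
        history.append((tuple(current), score))
    return current, history


# Toy stand-in for model training; names and weights are made up.
USEFUL = {"ndvi": 0.2, "road_density": 0.15}


def toy_fit_and_score(feats):
    score = 0.5 + sum(USEFUL.get(f, 0.0) for f in feats)
    importances = {f: USEFUL.get(f, 0.01) for f in feats}
    return score, importances


selected, history = variable_dropoff(
    ["ndvi", "road_density", "band_2", "band_7"], toy_fit_and_score
)
print(selected)
```

Stopping on a validation score rather than running to a fixed number of variables is what ties the loop to early termination: model capacity shrinks only while generalization is not measurably harmed.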