Arsenic contamination in groundwater occurs naturally and affects millions of people around the world.
Machine learning algorithms were utilized to predict As contamination in groundwater.
Four machine learning algorithms were included to build the models.
The GBM and RF were the best-performing models for predicting As contamination in groundwater.
Excessive arsenic (As) in groundwater, often a product of natural processes, is a major health problem that affects several hundred million people around the world (Winkel et al., 2011) (Ravenscroft et al., 2011). For most affected populations, the problem arises mainly from the daily consumption of groundwater containing elevated levels of As. Long-term exposure to As through drinking water may lead to arsenicosis, whose symptoms include muscular weakness, skin pigmentation, and painful skin lesions, among others (Ravenscroft et al., 2011). As exposure also causes certain cancers and diseases of the liver, kidney, heart, brain, and lungs (Ravenscroft et al., 2011). Therefore, treatment of As in water is an emerging issue (Lee et al., 2020) (Kwon et al., 2020) (Choi et al., 2021).
Vietnam is a developing country in Southeast Asia and the 15th most populous country in the world, with a population of 96.5 million in 2019 (UN, 2019). According to Carrard et al. (2019), more than 55 million people in Vietnam rely on groundwater for drinking and other daily purposes. Three million people in the Red River Delta are exposed to high levels of As (>10 µgL−1, the WHO standard), and a further one million consume water with As levels above 50 µgL−1 daily by drinking untreated groundwater from private tube wells (Winkel et al., 2011) (Berg et al., 2001). The Red River Delta is one of the river basins in Vietnam that drain the Himalayas, leading to a downstream deposition of sediments containing As-bearing Fe oxides (Fendorf et al., 2010). Under anoxic conditions with a large supply of organic carbon, the microbial reduction of Fe(III) oxides releases As into the groundwater (Fendorf et al., 2010). Areas of the Red River Delta with high As concentrations also show elevated PO43−, NH4+, and dissolved organic carbon (DOC) concentrations and negative redox potentials (Eh) (Winkel et al., 2011). A predictive model is therefore essential as a preliminary step to identify contaminated tube wells in the Red River Delta, where groundwater is the main drinking water source (Winkel et al., 2011) (Berg et al., 2007).
Machine learning (ML) is a computational method that builds models (for classification, regression, and clustering) using inference and pattern recognition instead of a set of rules defined by the user (Dramsch, 2020). A model is built using the training set, and its performance is then evaluated on the test set, which contains data unseen during training. Machine learning has been used to predict As risk in Bangladesh (Tan et al., 2020), India (Podgorski et al., 2020), and globally (Podgorski and Berg, 2020).
Smith et al. (2018) stated that random forest was the best fit for their research on the groundwater arsenic threat in California: “The random forest models account for nonlinear relationships between multiple variables and handle outliers well”. The random forest algorithm produced models with an accuracy of up to 0.86, on a scale where 0.5 corresponds to a random model and 1 to a perfect model. In recent studies, models built with random forest have generally given better results than models built with other methods. They achieved an accuracy of 0.84 in a fluoride contamination study in India (Ayotte et al., 2017) and 0.82 in an arsenic contamination study in the USA (Podgorski et al., 2018). The cited studies also built models with other methods, such as gradient boosting machine, extreme gradient boosting, and logistic regression.
The purpose of this study is to provide information on As contamination in groundwater that is currently unavailable owing to technical limitations and the cost of the required technology. To this end, the study provides prediction models of As contamination risk in the Red River Delta wells using four machine learning algorithms: random forest, gradient boosting machine, extreme gradient boosting, and logistic regression.
The study site is situated in the Red River Delta in Vietnam, which stretches from 19°54′ to 21°36′ North Latitude and from 105°00′ to 107°12′ East Longitude. The soil profile is composed of Pleistocene and older sediments overlain by late Pleistocene-Holocene sediments through deposition (Winkel et al., 2008) (Ravenscroft et al., 2005). High levels of As are commonly found in shallower, Holocene aquifers, while deeper Pliocene-Pleistocene aquifers contain significantly lower levels of As (Wallis et al., 2020) (Postma et al., 2012). Further details regarding the study site are presented in Figure 1, where the circle colour corresponds to the As concentration at the sampling site.
Groundwater hydrochemical data were obtained from the literature (Winkel et al., 2011). A total of 512 samples with 38 hydrochemical parameters, collected between 2005 and 2007, were adopted from Winkel et al. (2011) to develop the predictive models in this study. The As concentration ranged from 0 to 810 µg/L (median: 2 µg/L), and its distribution is skewed to the right. The As concentration was binarized at a threshold of 10 µg/L, the water-quality standard set by the WHO: samples with an As concentration > 10 µg/L were labelled 1 and all others 0. Twenty-seven percent of the samples exceeded 10 µg/L and therefore belong to the positive class (1). This binarized variable was used as the label, or dependent variable, for the machine learning algorithms.
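The labelling rule can be sketched in a few lines of Python. The concentrations below are made-up values for illustration, not taken from the Winkel et al. (2011) data, and the names `WHO_LIMIT_UG_L` and `binarize_as` are hypothetical.

```python
WHO_LIMIT_UG_L = 10.0  # WHO standard for As in drinking water (µg/L)

def binarize_as(concentrations, threshold=WHO_LIMIT_UG_L):
    """Return 1 for samples strictly above the threshold, else 0."""
    return [1 if c > threshold else 0 for c in concentrations]

# Illustrative (made-up) As concentrations in µg/L
as_ug_l = [0.0, 2.0, 10.0, 10.5, 810.0]
labels = binarize_as(as_ug_l)

# Share of samples in the positive class
positive_fraction = sum(labels) / len(labels)
print(labels, positive_fraction)
```

Note that a sample at exactly 10 µg/L falls in the negative class, because the rule uses a strict inequality.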
In this study, the features used include well depth, which varies from 5 m to 135 m (median: 30 m), temperature (T) from 20 °C to 29 °C (median: 26 °C), and Fe from 0 mg/L to 140 mg/L (median: 2 mg/L). Eh ranged from -203 mV to 504 mV (median: -64 mV), HCO3- from 0 mg/L to 1,540 mg/L (median: 170 mg/L), and SO4 from 0 mg/L to 890 mg/L (median: 1 mg/L). DOC ranged from 0 mg/L to 58 mg/L (median: 2 mg/L), total I from 2 µg/L to 480 µg/L (median: 28 µg/L), and PO43− from 0 mg/L to 6 mg/L (median: 0 mg/L). The features also include pH from 3 to 8 (median: 7), dissolved oxygen (O2-diss) from 0 mg/L to 5 mg/L (median: 0 mg/L), and PO4-P from 0 mg/L to 6 mg/L (median: 0 mg/L). Other features include Mn from 0 mg/L to 16 mg/L (median: 0 mg/L), Sr from 0 mg/L to 620 mg/L (median: 68 mg/L), and Li from 0 µg/L to 29 µg/L (median: 3 µg/L). Several features have missing values, namely depth, T, HCO3-, I, O2-diss, Li, and Sr. Further details on these features are presented in the Supplementary data.
According to Liu et al. (2020), the K-Nearest Neighbor (KNN) imputation method estimates a missing value by computing the distances between each incomplete sample and the complete samples. This method is used in this study to fill in the missing values of the hydrochemical data so that better prediction models can be built.
The distances identify the K nearest neighbors of an incomplete sample, and the mean of their values is used to impute the missing value. This method is widely used because of its relatively high accuracy. The Euclidean distance between two samples $x$ and $y$ with $n$ features is given by

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
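The imputation idea described above can be illustrated with a minimal pure-Python sketch. It is not the implementation used in the study: the function name `knn_impute`, the choice of k = 2, and the sample values are assumptions made for illustration, and distances are computed only over the features that are observed in the incomplete sample.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of the k nearest complete rows."""
    complete = [r for r in rows if None not in r]
    imputed = []
    for row in rows:
        if None not in row:
            imputed.append(list(row))
            continue
        observed = [j for j, v in enumerate(row) if v is not None]
        # Euclidean distance to every complete row, over observed features only
        neighbors = sorted(
            complete,
            key=lambda c: math.sqrt(sum((row[j] - c[j]) ** 2 for j in observed)),
        )[:k]
        # Replace each missing entry by the neighbor mean for that feature
        filled = [
            v if v is not None else sum(n[j] for n in neighbors) / len(neighbors)
            for j, v in enumerate(row)
        ]
        imputed.append(filled)
    return imputed

# Hypothetical samples: [depth (m), temperature (°C), Fe (mg/L)]
samples = [
    [30.0, 26.0, 2.0],
    [32.0, 25.0, 4.0],
    [120.0, 22.0, 0.5],
    [31.0, 26.0, None],  # Fe is missing
]
result = knn_impute(samples, k=2)
print(result[3])
```

In this toy example the two nearest complete samples are the first two rows, so the missing Fe value becomes their mean. Library implementations such as scikit-learn's `KNNImputer` differ in detail; for example, it uses a NaN-aware Euclidean distance and may draw neighbors from rows that are themselves incomplete.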
Logistic Regression (LR) is a generalized linear model used as a classification algorithm. The LR model uses a functional form $f$ and a parameter vector $\alpha$ to express the class membership probability as

$$P(c \mid x) = f(x, \alpha)$$
Specifically, it calculates the class membership probability for one of the two categories in the data set,

$$P(1 \mid x, \alpha) = \frac{1}{1 + e^{-\alpha \cdot x}}$$

with $P(0 \mid x, \alpha) = 1 - P(1 \mid x, \alpha)$.
The hyperplane of all points $x$ satisfying the equation $\alpha \cdot x = 0$, for which $P(1 \mid x, \alpha) = P(0 \mid x, \alpha) = 0.5$, forms the decision boundary between the two classes (Dreiseitl and Ohno-Machado, 2002).
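The decision boundary property can be checked numerically with a short sketch, assuming the standard logistic (sigmoid) form of the class membership probability. The parameter vector and feature vectors below are hypothetical, and `predict_proba_positive` is an illustrative name.

```python
import math

def predict_proba_positive(x, alpha):
    """P(1 | x, alpha) for a logistic model: 1 / (1 + exp(-alpha . x))."""
    score = sum(a * xi for a, xi in zip(alpha, x))
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical parameter vector; the first entry of each x is a constant 1 (bias)
alpha = [0.5, -2.0, 1.0]
on_boundary = [1.0, 0.5, 0.5]     # alpha . x = 0.5 - 1.0 + 0.5 = 0
positive_side = [1.0, -1.0, 0.5]  # alpha . x = 0.5 + 2.0 + 0.5 = 3

p_boundary = predict_proba_positive(on_boundary, alpha)
p_positive = predict_proba_positive(positive_side, alpha)
print(p_boundary, p_positive)
```

A point with $\alpha \cdot x = 0$ receives probability 0.5 for both classes, while points with a positive score are assigned to class 1 with a probability above 0.5.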
The Random Forest (RF) algorithm is a popular machine learning algorithm in which an ensemble of decision trees votes for the most popular class. Methods used to grow each tree in the ensemble include bagging (bootstrap aggregating), random split selection, and the random subspace method (Breiman, 2001). The subspace randomization scheme is combined with bagging, which resamples the training dataset with replacement every time a single tree is grown. The RF provides variable importance measures, which show how the predictor variables contribute to the model (Breiman, 2001). According to Breiman (2001), each individual tree is grown using the CART method without pruning.
Gradient Boosting Machine (GBM) is a tree boosting method used for classification and regression (Friedman, 2001). The GBM iteratively combines several “weak learners” to reduce the loss function, creating a “strong learner” with improved predictive performance. Four hyper-parameters need to be tuned in the GBM: the depth of decision trees, the number of iterations or decision trees, the learning rate, and the fraction of data that is used at each iterative step (Touzani et al., 2018).
Extreme Gradient Boosting, or XGBoost (XGB), is an open-source scalable machine learning system for tree boosting (Chen and Guestrin, 2016). The method is popular for its scalability: it was reported to run more than ten times faster than the then-popular solutions on a single machine. This scalability is due to innovations such as a highly scalable end-to-end tree boosting system, a weighted quantile sketch for efficient proposal calculation, a sparsity-aware algorithm for parallel tree learning, and a cache-aware block structure for out-of-core tree learning (Chen and Guestrin, 2016).
The models were developed in Python using scikit-learn (Pedregosa et al., 2011), a machine learning module, on a MacBook Air (13-inch, Early 2015) with a 1.6 GHz Intel Core i5 processor. The missing values described in Section 2.2 were imputed using the KNN imputer (see Section 2.3). In addition, the features were scaled using StandardScaler for the Logistic Regression model. The dataset was randomly split into training and test sets at a 7:3 ratio. The split was stratified to maintain the ratio of positive to negative outcomes in both sets. Further details of the model development plan are presented as a flowchart in Fig. 2.
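A workflow of this kind might look like the following scikit-learn sketch. It uses synthetic data rather than the Winkel et al. (2011) dataset, and several details are assumptions not stated in the text: the number of neighbors, the random seeds, and the choice to fit the imputer and scaler inside a pipeline on the training set only.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the hydrochemical table: 512 samples, 5 features
X = rng.normal(size=(512, 5))
# Synthetic binary label with a minority positive class (0.7 is arbitrary)
y = (X[:, 0] + 0.5 * rng.normal(size=512) > 0.7).astype(int)
# Remove about 5% of the values to mimic missing measurements
X[rng.random(X.shape) < 0.05] = np.nan

# Stratified 7:3 split keeps the class ratio similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Imputer and scaler are fitted on the training set only, inside the pipeline
model = make_pipeline(
    KNNImputer(n_neighbors=5), StandardScaler(), LogisticRegression()
)
model.fit(X_train, y_train)
print(X_train.shape, X_test.shape, y_train.mean(), y_test.mean())
```

With 512 samples, a 7:3 split yields 358 training and 154 test samples, which is consistent with the test-set confusion matrix counts reported for the GBM (112 + 0 + 2 + 40 = 154).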
Hyper-parameter tuning is an optimization problem whose goal is to find the hyper-parameter setting that maximizes the predictive performance of a model for a particular machine learning algorithm (Mantovani et al., 2015). Mantovani et al. (2015) stated that the predictive performance of models tuned using Random Search is comparable to that of models tuned using meta-heuristics and Grid Search, at a lower computational cost. A random search with 10-fold cross-validation (CV) was performed 10 times for each split to tune the hyper-parameters of each machine learning model. The area under the receiver operating characteristic curve (AUC-ROC) was the metric for choosing the best model, which was then used to predict the test set to evaluate the model’s performance. The model with the best performance was used to predict As contamination in groundwater.
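The tuning procedure can be sketched with scikit-learn's `RandomizedSearchCV`, here applied to the four GBM hyper-parameters described earlier. The data are synthetic, the search ranges and the number of sampled settings are assumptions chosen to keep the example fast, and the search is run once rather than repeated 10 times.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import (
    RandomizedSearchCV,
    StratifiedKFold,
    train_test_split,
)

# Synthetic, imbalanced stand-in data (not the Red River Delta dataset)
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.73, 0.27], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Assumed search ranges for the four GBM hyper-parameters
param_distributions = {
    "max_depth": randint(2, 6),            # depth of each tree: 2 to 5
    "n_estimators": randint(50, 151),      # boosting iterations: 50 to 150
    "learning_rate": uniform(0.01, 0.29),  # sampled from [0.01, 0.30]
    "subsample": uniform(0.5, 0.5),        # data fraction per step: [0.5, 1.0]
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,           # number of sampled settings (kept small here)
    scoring="roc_auc",  # AUC-ROC selects the best setting
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X_train, y_train)

# The refitted best estimator is evaluated once on the held-out test set
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(search.best_params_, round(search.best_score_, 3), round(test_auc, 3))
```

Keeping the test set out of the search means that the reported test AUC is not influenced by the hyper-parameter selection itself.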
The variable importance measures the impact or causal effect of the predictor variables (features) in predicting the dependent variable (Strobl et al., 2008). The variable importance was obtained for the RF and GBM algorithms to find out which features dominated the prediction of the model. The partial dependence plot (PDP) shows the dependence between the dependent variable and a (set of) feature(s) (Pedregosa et al., 2011) (Hastie et al., 2009). The PDPs of the relevant features were plotted to study how each feature affects the As contamination risk.
The confusion matrix results, i.e. the numbers of True Negatives (TN), False Positives (FP), False Negatives (FN), and True Positives (TP), are listed in Table 1. These values were used to determine the performance of each classifier on the test set. The GBM had the highest TP and the lowest FN, 40 and 2, respectively. It also predicted the negative outcomes perfectly, with TN and FP equal to 112 and 0, respectively. This indicates that the GBM is the best-performing model among the four methods.
Based on the model assessment, the GBM and the RF performed best and second best, respectively. The models were evaluated by their accuracy, precision, sensitivity, and specificity, which are presented for the GBM and RF in Table 2.
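The four metrics follow directly from the confusion matrix counts. As a worked check, the sketch below applies the standard definitions to the test-set counts reported above for the GBM; the variable names are illustrative.

```python
# Test-set counts reported for the GBM in the text
tn, fp, fn, tp = 112, 0, 2, 40

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of correct predictions
precision = tp / (tp + fp)                  # correct among predicted positives
sensitivity = tp / (tp + fn)                # recall, true positive rate
specificity = tn / (tn + fp)                # true negative rate

for name, value in [
    ("accuracy", accuracy),
    ("precision", precision),
    ("sensitivity", sensitivity),
    ("specificity", specificity),
]:
    print(f"{name}: {100 * value:.1f}%")
```

The results are 98.7%, 100.0%, 95.2%, and 100.0%, which agree with the GBM performance values stated in the conclusions.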
Fig. 3 shows (3a) the Precision-Recall Curve (PRC) and (3b) the Receiver Operating Characteristic (ROC) curve for all models. The PRCs are located at the upper right, with area under the curve (AUC) values ranging from 88% to 92%. The ROC curves are located at the upper left and show good model performance, with AUC values ranging from 94% to 96%. The GBM model had the highest AUC for both curves, 92% and 96% for the PRC and ROC curves, respectively.
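For readers unfamiliar with the ROC AUC, one way to interpret it is as the probability that a randomly chosen positive sample receives a higher predicted score than a randomly chosen negative sample. The sketch below computes it that way in pure Python; the labels and scores are made up, and `roc_auc` is an illustrative helper rather than the routine used in the study.

```python
def roc_auc(y_true, scores):
    """AUC-ROC as the share of (positive, negative) pairs ranked correctly.

    Tied pairs count as half a correct ranking.
    """
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities for eight wells
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.80, 0.95]

auc = roc_auc(y_true, scores)
print(round(auc, 3))  # 14 of the 15 pairs are ranked correctly
```

A value of 0.5 corresponds to random ranking and 1 to a perfect ranking, so AUC values of 94% to 96% indicate that contaminated wells are almost always scored above uncontaminated ones.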
The variable importance projection was plotted in Fig. 4 for the RF (4a) and GBM (4b) models. Both models have Eh as the most important variable, followed by Fe. Low values of Eh indicate reducing conditions, which trigger the reductive dissolution of Fe oxides and release As from the sediments into the groundwater.
High As concentrations are usually found in groundwater in contact with shallower, younger sediments (Wallis et al., 2020). The partial dependence of As probability for the RF model is shown in Fig. 5. Panel (5a) shows a higher probability of As contamination in shallower aquifers with depth < 50 m; the probability decreases sharply at depth > 65 m.
Panel (5b) shows that the probability of As contamination is higher at negative Eh values (reducing conditions). Panel (5c) shows a higher probability at low SO4 concentrations. High levels of SO4 trigger SO4 reduction, which produces sulfide that binds As, inhibiting the release of As to the groundwater (Fendorf et al., 2010) (Buschmann and Berg, 2009).
In contrast, (5d) and (5e) show a high probability of As contamination at high levels of PO43− and DOC. PO43− may compete with As for the adsorption sites on Fe oxides, keeping the As dissolved in groundwater (Kinniburgh, 2001) (Fendorf et al., 2010). A source of organic carbon is one of the requirements for the increase in As levels in groundwater because it drives the reductive dissolution of Fe oxides (Guo et al., 2019).
Machine learning models are a good initial step to detect As contamination in groundwater. This study predicted the As contamination risk of wells in the Red River Delta, Vietnam, using four machine learning methods on publicly available data. The best model was the GBM, with an accuracy, precision, sensitivity, and specificity of 98.7%, 100%, 95.2%, and 100%, respectively. The GBM also had the highest AUC, 92% and 96% for the PRC and ROC curves, respectively. The GBM therefore produced the best prediction model in this study. This performance is attributed to the method’s flexibility: it can optimize different loss functions and offers several hyper-parameter tuning options that make the function fit very flexible. Beyond groundwater contamination, the GBM has repeatedly proven to be one of the most powerful techniques for building prediction models in both classification and regression.
The two most important variables for the model were Eh and Fe. The PDPs of the relevant features showed that the probability of As contamination is likely to be high at shallow well depth, high PO4, negative Eh (reducing conditions), low SO4 concentration, and high DOC, conditions that trigger the reductive dissolution of Fe oxides and release As into the groundwater.
The main limitation of this study is the amount of data. Larger datasets are required to build models that generalize better. In addition, the presentation of the prediction results should be improved. A hazard map or open-access website could not be produced owing to time limitations. These limitations could be addressed in future work.
This study shows that machine learning offers an alternative method to predict As contamination in groundwater that is fast, easy, and effective. The outcome of the prediction models is important as a basis for estimating As contamination in groundwater in areas without measurements. It could be used in decision making by local governments and to raise awareness among residents of the Red River Delta.
The authors highly appreciate Lenny H. E. Winkel, Pham Thi Kim Trang, Vi Mai Lan, Caroline Stengel, Manouchehr Amini, Nguyen Thi Ha, Pham Hung Viet, and Michael Berg for the publicly available hydrochemical As data of the Red River Delta.
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2021R1A2C1094272).
Descriptive statistics of the features used for the model are provided separately in Appendix A: Supplementary data.