Research Paper

Econ. Environ. Geol. 2022; 55(2): 127-135

Published online April 30, 2022

https://doi.org/10.9719/EEG.2022.55.2.127

© THE KOREAN SOCIETY OF ECONOMIC AND ENVIRONMENTAL GEOLOGY

Predicting As Contamination Risk in Red River Delta using Machine Learning Algorithms

Zheina J. Ottong1, Reta L. Puspasari1, Daeung Yoon2, Kyoung-Woong Kim1,*

1School of Earth Sciences and Environmental Engineering, Gwangju Institute of Science and Technology (GIST), Gwangju 61005, South Korea
2Chonnam National University, Gwangju 61186, South Korea

*Corresponding author: kwkim@gist.ac.kr

Received: March 3, 2022; Revised: March 31, 2022; Accepted: April 3, 2022

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided original work is properly cited.

Abstract

Excessive arsenic (As) in groundwater is a major health problem worldwide. In the Red River Delta in Vietnam, several million residents face a high risk of chronic As poisoning. As is released into groundwater by a natural process, the microbially driven reductive dissolution of Fe(III) oxides. Red River Delta residents extract this groundwater through private tube wells for drinking and daily use because they are unaware of the contamination. Long-term consumption of As-contaminated groundwater can lead to various health problems. A predictive model would therefore be useful for exposing the contamination risk of wells in the Red River Delta. This study used four machine learning algorithms to predict the probability of As contamination at study sites in the Red River Delta, Vietnam. The gradient boosting machine (GBM) was the best-performing model, with accuracy, precision, sensitivity, and specificity of 98.7%, 100%, 95.2%, and 100%, respectively. It also gave the highest area under the curve (AUC), 92% for the precision-recall curve (PRC) and 96% for the receiver operating characteristic (ROC) curve, with Eh and Fe as the most important variables. The partial dependence plots of As probability on the model features showed that a high probability of elevated As is associated with shallow well depth, low Eh, and low SO₄, along with high PO₄³⁻ and NH₄⁺. These conditions trigger the reductive dissolution of iron phases, which releases As into groundwater.

Keywords: groundwater arsenic, machine learning, predictive model, random forest, gradient boosting

Research Highlights

  • Arsenic contamination of groundwater occurs naturally and affects millions of people around the world.

  • Machine learning algorithms were utilized to predict As contamination in groundwater.

  • Four machine learning algorithms were included to build the models.

  • The GBM and RF are the best performing models to predict As groundwater contamination.

1. Introduction

Excessive arsenic (As) in groundwater, often a product of natural processes, is a major health problem that affects several hundred million people around the world (Winkel et al., 2011; Ravenscroft et al., 2011). For most affected populations, the main cause is the daily consumption of groundwater that contains elevated levels of As. Long-term exposure to As through drinking water may lead to arsenicosis, whose symptoms include muscular weakness, skin pigmentation, and painful skin lesions, among others (Ravenscroft et al., 2011). As exposure also causes certain cancers and diseases of the liver, kidney, heart, brain, and lungs (Ravenscroft et al., 2011). Therefore, the treatment of As in water is an emerging issue (Lee et al., 2020; Kwon et al., 2020; Choi et al., 2021).

Vietnam is a developing country in Southeast Asia and the 15th most populous country in the world, with a population of 96.5 million in 2019 (UN, 2019). According to Carrard et al. (2019), more than 55 million people in Vietnam rely on groundwater for drinking and daily use. Three million people in the Red River Delta are exposed to high levels of As (>10 µg L⁻¹, the WHO standard), and a further one million consume As levels above about 50 µg L⁻¹ daily by drinking untreated groundwater from private tube wells (Winkel et al., 2011; Berg et al., 2001). The Red River Delta is one of the river basins in Vietnam that drain the Himalayas, which leads to the downstream deposition of sediments containing As-bearing Fe oxides (Fendorf et al., 2010). Under anoxic conditions with a large supply of organic carbon, the microbial reduction of Fe(III) oxides releases As into the groundwater (Fendorf et al., 2010). Areas of the Red River Delta with high As concentrations also show elevated PO₄³⁻, NH₄⁺, and dissolved organic carbon (DOC) concentrations and negative redox potentials (Eh) (Winkel et al., 2011). A predictive model is therefore an essential preliminary step for identifying contaminated tube wells in the Red River Delta, where groundwater is the main source of drinking water (Winkel et al., 2011; Berg et al., 2007).

Machine learning (ML) is a computational method that builds models (classification, regression, and clustering) using inference and pattern recognition instead of a set of rules defined by the user (Dramsch, 2020). A model is built on a training set, and its performance is then evaluated on a test set of unseen data. Machine learning has been used to predict As risk in Bangladesh (Tan et al., 2020), India (Podgorski et al., 2020), and globally (Podgorski and Berg, 2020).

Smith et al. (2018) stated that random forest was the best fit for their research on the California groundwater arsenic threat: “The random forest models account for nonlinear relationships between multiple variables and handle outliers well”. The random forest algorithm produced models with an accuracy of up to 0.86, where values generally range from 0.5 (random model) to 1 (perfect model). In recent studies, random forest models have generally given better results than models built with other methods. They reached an accuracy of 0.84 in a fluoride contamination study in India (Podgorski et al., 2018) and 0.82 in an arsenic contamination study in the USA (Ayotte et al., 2017). The cited studies also included other methods, such as gradient boosting machine, extreme gradient boosting, and logistic regression, to build models.

The purpose of this study is to provide information on As contamination in groundwater that is currently unavailable because of practical limitations and unaffordable technology. To that end, this study provides prediction models of As contamination risk in Red River Delta wells using four machine learning algorithms: random forest, gradient boosting machine, extreme gradient boosting, and logistic regression.

2.1. Study Site

The study site is situated in the Red River Delta in Vietnam, which stretches from 19°54′ to 21°36′ North Latitude and from 105°00′ to 107°12′ East Longitude. The soil profile is composed of Pleistocene and older sediments overlain by deposited late Pleistocene-Holocene sediments (Winkel et al., 2008; Ravenscroft et al., 2005). High levels of As are commonly found in shallower Holocene aquifers, while deeper Pliocene-Pleistocene aquifers contain significantly lower levels of As (Wallis et al., 2020; Postma et al., 2012). Further details regarding the study site are presented in Fig. 1, where the circle colour corresponds to the As concentration at each sampling site.

Fig. 1. Map of the Red River Delta (adapted from Winkel et al., 2011).

2.2. Data Acquisition and Description

Groundwater hydrochemical data were obtained from the literature (Winkel et al., 2011). A total of 512 samples with 38 hydrochemical parameters, collected from 2005 to 2007, were adopted to develop the predictive models in this study. The As concentration ranged from 0 to 810 µg/L (median: 2 µg/L), and its distribution is skewed to the right. The As concentration was binarized at a threshold of 10 µg/L, the water-quality standard set by the WHO: 1 for samples with As concentration > 10 µg/L and 0 otherwise. Twenty-seven percent (27%) of the samples had As values exceeding 10 µg/L and belong to the positive class (1’s). This binarized variable was used as the label, or dependent variable, for the machine learning algorithms.
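The binarization rule can be sketched in a few lines. This is an illustration only: the function name `binarize_as` and the sample concentrations are hypothetical and are not taken from the dataset.

```python
WHO_LIMIT_UG_L = 10.0  # threshold in µg/L, as stated in the text


def binarize_as(concentrations_ug_l, threshold=WHO_LIMIT_UG_L):
    """Return 1 for samples strictly above the threshold, otherwise 0."""
    return [1 if c > threshold else 0 for c in concentrations_ug_l]


# Made-up concentrations in µg/L, for illustration only.
sample = [0.0, 2.0, 10.0, 10.5, 810.0]
labels = binarize_as(sample)

# Share of the positive class in this toy sample.
positive_share = sum(labels) / len(labels)
```

Note that a sample at exactly 10 µg/L falls in the negative class, because the rule uses a strict inequality.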

The features used in this study include well depth, which varies from 5 m to 135 m (median: 30 m), temperature (T) from 20°C to 29°C (median: 26°C), and Fe from 0 mg/L to 140 mg/L (median: 2 mg/L). Eh ranged from −203 mV to 504 mV (median: −64 mV), HCO₃⁻ from 0 mg/L to 1,540 mg/L (median: 170 mg/L), and SO₄ from 0 mg/L to 890 mg/L (median: 1 mg/L). DOC ranged from 0 mg/L to 58 mg/L (median: 2 mg/L), total I from 2 µg/L to 480 µg/L (median: 28 µg/L), and PO₄³⁻ from 0 mg/L to 6 mg/L (median: 0 mg/L). The features also include pH from 3 to 8 (median: 7), dissolved oxygen (O2-diss) from 0 mg/L to 5 mg/L (median: 0 mg/L), and PO4-P from 0 mg/L to 6 mg/L (median: 0 mg/L). Other features are Mn from 0 mg/L to 16 mg/L (median: 0 mg/L), Sr from 0 mg/L to 620 mg/L (median: 68 mg/L), and Li from 0 µg/L to 29 µg/L (median: 3 µg/L). Several features had missing values: depth, T, HCO₃⁻, I, O2-diss, Li, and Sr. Further details on these features are presented in the Supplementary data.

2.3. K-Nearest Neighbor

According to Liu et al. (2020), the K-Nearest Neighbor (KNN) imputation method estimates a missing value by computing the distances between each incomplete sample and the complete samples. This method is used in this study to fill the missing values of the hydrochemical data so that better prediction models can be built.

The distances identify the K nearest neighbors, and the mean of their values is used to impute the missing value in the dataset. This method is widely used because of its relatively high accuracy. The Euclidean distance is given by

$$d_{ij} = \sqrt{(x_i - w_j)^{T}(x_i - w_j)},$$

where $d_{ij}$ is the Euclidean distance between the two points $x_i$ and $w_j$, and the superscript $T$ denotes the transpose.
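A minimal NumPy sketch of the idea follows, assuming the distance is computed only over the features observed in the incomplete sample. The helper names `euclidean` and `knn_impute_row` and the toy values are hypothetical. This is a simplified illustration, not the scikit-learn KNN imputer referred to in Section 2.5.

```python
import numpy as np


def euclidean(x, w):
    """Euclidean distance sqrt((x - w)^T (x - w)) between two 1-D arrays."""
    diff = np.asarray(x, dtype=float) - np.asarray(w, dtype=float)
    return float(np.sqrt(diff @ diff))


def knn_impute_row(row, complete_rows, k=2):
    """Fill NaNs in `row` with the mean of its k nearest complete rows."""
    row = np.asarray(row, dtype=float)
    complete_rows = np.asarray(complete_rows, dtype=float)
    observed = ~np.isnan(row)
    # Distances use only the features that are present in `row`.
    dists = [euclidean(row[observed], r[observed]) for r in complete_rows]
    nearest = np.argsort(dists)[:k]
    filled = row.copy()
    filled[~observed] = complete_rows[nearest][:, ~observed].mean(axis=0)
    return filled


# Toy data: three complete samples and one sample missing its last feature.
complete = np.array([[1.0, 2.0, 10.0],
                     [1.5, 2.5, 20.0],
                     [9.0, 9.0, 90.0]])
incomplete = np.array([1.0, 2.0, np.nan])
imputed = knn_impute_row(incomplete, complete, k=2)
```

Here the two nearest complete samples are the first two rows, so the missing value is filled with the mean of 10 and 20.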

2.4. Machine Learning Algorithm

2.4.1. Logistic Regression

Logistic Regression (LR) is a generalized linear model used for classification. LR uses a functional form $f$ and a parameter vector $\alpha$ to determine $P(y|x)$, the probability of the class label $y$ given the input variable $x$, as

$$P(y|x) = f(x, \alpha).$$

Specifically, it calculates the class membership probability for one of the two categories in the data set,

$$P(1|x, \alpha) = \frac{1}{1 + e^{-(\alpha \cdot x)}},$$

and

$$P(0|x, \alpha) = 1 - P(1|x, \alpha).$$

The hyperplane of all points $x$ satisfying $\alpha \cdot x = 0$, for which $P(1|x, \alpha) = P(0|x, \alpha) = 0.5$, forms the decision boundary between the two classes (Dreiseitl and Ohno-Machado, 2002).
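The two class probabilities and the decision boundary can be checked numerically. The sketch below assumes no separate intercept term (it can be folded into the parameter vector). The parameter vector and input points are hypothetical values chosen for illustration.

```python
import math


def p_positive(x, alpha):
    """P(1 | x, alpha) = 1 / (1 + exp(-(alpha . x)))."""
    score = sum(a * xi for a, xi in zip(alpha, x))
    return 1.0 / (1.0 + math.exp(-score))


def p_negative(x, alpha):
    """P(0 | x, alpha) = 1 - P(1 | x, alpha)."""
    return 1.0 - p_positive(x, alpha)


alpha = [2.0, -1.0]        # hypothetical parameter vector
on_boundary = [1.0, 2.0]   # alpha . x = 0, so the point lies on the hyperplane
positive_side = [2.0, 1.0] # alpha . x = 3, so class 1 is more likely

p_boundary = p_positive(on_boundary, alpha)
p_inside = p_positive(positive_side, alpha)
```

A point on the hyperplane receives probability 0.5 for both classes, while a positive score pushes the class-1 probability above 0.5.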

2.4.2. Random Forest

The Random Forest (RF) algorithm is a popular machine learning algorithm in which an ensemble of decision trees votes for the most popular class. Methods used to grow each tree in the ensemble include bagging (bootstrap aggregating), random split selection, and the random subspace method (Breiman, 2001). The subspace randomization scheme is combined with bagging, which resamples the training dataset with replacement every time a single tree is grown. The RF has a variable importance measure which shows the interaction of the predictor variables in the model (Breiman, 2001). According to Breiman (2001), each individual tree is grown using the CART method without pruning.

2.4.3. Gradient Boosting Machine

Gradient Boosting Machine (GBM) is a tree boosting method used for classification and regression (Friedman, 2001). The GBM iteratively combines several “weak learners” to reduce the loss function, creating a “strong learner” with improved predictive performance. Four hyper-parameters need to be tuned in the GBM: the depth of decision trees, the number of iterations or decision trees, the learning rate, and the fraction of data that is used at each iterative step (Touzani et al., 2018).
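The four hyper-parameters map onto constructor arguments of scikit-learn's `GradientBoostingClassifier`. The sketch below uses synthetic data and arbitrary values. It illustrates the mapping only and does not reproduce the settings selected in this study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

gbm = GradientBoostingClassifier(
    max_depth=3,        # depth of each decision tree
    n_estimators=50,    # number of boosting iterations (trees)
    learning_rate=0.1,  # shrinkage applied to each tree's contribution
    subsample=0.8,      # fraction of the data used at each iteration
    random_state=0,
)
gbm.fit(X, y)

# Predicted class probabilities for the first five samples.
proba = gbm.predict_proba(X[:5])
```

Setting `subsample` below 1.0 gives stochastic gradient boosting, in which each tree is fitted on a random fraction of the training data.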

2.4.4. Extreme Gradient Boosting

Extreme Gradient Boosting, or XGBoost (XGB), is an open-source scalable machine learning system for tree boosting (Chen and Guestrin, 2016). It is popular for its scalability: the system runs more than ten times faster than existing popular solutions on a single machine. This scalability is due to innovations such as a highly scalable end-to-end tree boosting system, a weighted quantile sketch for efficient proposal calculation, a sparsity-aware algorithm for parallel tree learning, and a cache-aware block structure for out-of-core tree learning (Chen and Guestrin, 2016).

2.5. Model Development

The models were developed in Python using scikit-learn (Pedregosa et al., 2011), a machine learning module, on a MacBook Air (13-inch, Early 2015) with a 1.6 GHz Intel Core i5 processor. The missing values mentioned in Section 2.2 were imputed using the KNN imputer (see Section 2.3). In addition, the features were scaled using StandardScaler for Logistic Regression. The dataset was randomly split into a training set and a test set at a 7:3 ratio. The split was stratified to maintain the ratio of positive and negative outcomes in both sets. Further details of the model development plan are presented as a flowchart in Fig. 2.
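A stratified 7:3 split followed by scaling for Logistic Regression might look like the sketch below. The data are synthetic, with a 27% positive class to mirror the description in Section 2.2, and the random seeds are arbitrary. Fitting the scaler inside a pipeline is an assumption about good practice, not a detail stated in the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))        # synthetic features
y = np.array([1] * 54 + [0] * 146)   # 27% positive class

# Stratify on y so both sets keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# The pipeline fits the scaler on the training set only, then reuses it
# whenever the model predicts on new data.
lr = make_pipeline(StandardScaler(), LogisticRegression())
lr.fit(X_train, y_train)
test_proba = lr.predict_proba(X_test)[:, 1]
```

Keeping the scaler inside the pipeline avoids leaking test-set statistics into the training step.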

Fig. 2. Flowchart of machine learning analysis model.

Hyper-parameter tuning is an optimization problem that seeks the hyper-parameter values giving the best predictive performance for a particular machine learning algorithm (Mantovani et al., 2015). Mantovani et al. (2015) stated that the predictive performance of models tuned with Random Search is comparable to that of meta-heuristics and Grid Search at a lower computational cost. A random search with 10-fold cross-validation (CV) was performed 10 times for each split to tune the hyper-parameters of each machine learning model. The AUC-ROC (Area Under the Curve of the Receiver Operating Characteristic) was the metric for choosing the best model, which was then used to predict the test set to evaluate the model’s performance. The model with the best performance will be used to predict As contamination in groundwater.
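A random search with 10-fold cross-validation scored by AUC-ROC can be sketched with scikit-learn's `RandomizedSearchCV`. The search space, the number of sampled candidates (`n_iter=5`), and the synthetic data below are hypothetical. The study does not report its search space in this section.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic data standing in for the training set.
X, y = make_classification(n_samples=150, n_features=5, random_state=0)

# Hypothetical search space over the four GBM hyper-parameters.
param_distributions = {
    "max_depth": [2, 3, 4],
    "n_estimators": [20, 40],
    "learning_rate": [0.05, 0.1, 0.2],
    "subsample": [0.7, 1.0],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # number of random candidates to evaluate
    scoring="roc_auc",   # AUC-ROC selects the best candidate
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)

# The best candidate is refitted on all of the data passed to fit().
best_model = search.best_estimator_
```

The refitted `best_model` would then be applied to the held-out test set to obtain the confusion matrix and curves.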

2.6. Variable Importance and Partial Dependence Plot

The variable importance measures the impact or causal effect of the predictor variables (features) in predicting the dependent variable (Strobl et al., 2008). The variable importance was obtained for the RF and GBM algorithms to find out which features dominated the prediction of the model. The partial dependence plot (PDP) shows the dependence between the dependent variable and a feature or set of features (Pedregosa et al., 2011; Hastie et al., 2009). The PDPs of the relevant features were plotted to study how each feature affects the As contamination risk.

3.1. Model Performance

The confusion matrix entries, True Negative (TN), False Positive (FP), False Negative (FN), and True Positive (TP), are listed in Table 1 for each model. These values were used to determine classification performance on the test set. The GBM had the highest TP and the lowest FN, 40 and 2, respectively. It also predicted the negative outcomes perfectly, with TN and FP equal to 112 and 0, respectively. This indicates that the GBM is the best model among the four methods.

Table 1 Confusion matrix of each model, with its best hyperparameters, on the test set.

Model   TN    FP   FN   TP
LR      110   2    11   31
RF      111   1    8    34
GBM     112   0    2    40
XGB     111   1    10   32

TN: True Negative, FP: False Positive, FN: False Negative, TP: True Positive



Based on the model assessment, the GBM and the RF performed best and second best, respectively. The models were evaluated by their accuracy, precision, sensitivity, and specificity, which are presented for the GBM and RF in Table 2.

Table 2 Accuracy, precision, sensitivity, and specificity of GBM and RF in percent (%)

Model   Accuracy   Precision   Sensitivity   Specificity
GBM     98.7       100         95.2          100
RF      94.1       97.1        80.9          99.1
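As a cross-check, the four metrics can be recomputed from the confusion-matrix counts using their standard definitions. The sketch below is illustrative rather than the authors' code. The counts are those read from Table 1 (154 test samples per model), and the function name `metrics` is a hypothetical helper.

```python
def metrics(tn, fp, fn, tp):
    """Return accuracy, precision, sensitivity, and specificity in percent."""
    total = tn + fp + fn + tp
    return {
        "accuracy": 100 * (tp + tn) / total,
        "precision": 100 * tp / (tp + fp),
        "sensitivity": 100 * tp / (tp + fn),   # true positive rate (recall)
        "specificity": 100 * tn / (tn + fp),   # true negative rate
    }


# (TN, FP, FN, TP) for each model, as listed in Table 1.
confusion = {
    "LR": (110, 2, 11, 31),
    "RF": (111, 1, 8, 34),
    "GBM": (112, 0, 2, 40),
    "XGB": (111, 1, 10, 32),
}
table = {name: metrics(*cells) for name, cells in confusion.items()}
```

With these counts the GBM row reproduces Table 2 after rounding. The unrounded RF accuracy and sensitivity are about 94.16% and 80.95%, which suggests that the RF values in Table 2 were truncated rather than rounded.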


Fig. 3 shows the (3a) Precision-Recall Curve (PRC) and (3b) the Receiver Operating Characteristic (ROC) curve for all models. The PRC curves are located at the upper right with the area under the curve (AUC) values ranging from 88% to 92%. The ROC curves are located at the upper left and show good model performance with AUC values ranging from 94% to 96%. The GBM model had the highest AUC for both curves with 92% and 96% for the PRC and ROC curves, respectively.

Fig. 3. (a) Precision-recall curve and (b) Receiver Operating Characteristic (ROC) curve for all models.
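The two areas can be computed from predicted probabilities with scikit-learn, as sketched below on a four-sample toy example. Average precision is used here as the summary of the precision-recall curve. That choice is an assumption, since the text does not say how the PRC area was calculated.

```python
from sklearn.metrics import average_precision_score, roc_auc_score

# Toy labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Area under the ROC curve: the share of (positive, negative) pairs
# in which the positive sample receives the higher score.
roc_auc = roc_auc_score(y_true, y_score)

# Average precision: precision weighted by the recall gained at each threshold.
pr_auc = average_precision_score(y_true, y_score)
```

In this example three of the four positive-negative pairs are ranked correctly, so the ROC AUC is 0.75, and the average precision is about 0.83.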

3.2. Variable Importance and PDP

The variable importance projection was plotted in Fig. 4 for the RF (4a) and GBM (4b) models. Both models have Eh as the most important variable, followed by Fe. Low values of Eh indicate reducing conditions which trigger the reductive dissolution of Fe oxides that releases the As from the sediments and into the groundwater.

Fig. 4. Variable importance projection of (a) RF and (b) GBM model.

High As concentrations are usually found in groundwater in contact with shallower, younger sediments (Wallis et al., 2020). The partial dependence of As probability for the RF model is shown in Fig. 5. The PDP for depth (Fig. 5a) shows a higher probability of elevated As in shallower aquifers with depth < 50 m; the probability decreases sharply at depth > 65 m.

Fig. 5. PDP of As concentration with (a) depth, (b) Eh, (c) SO₄, (d) PO₄³⁻ and (e) DOC for the RF model.

Fig. 5b shows that elevated As is more probable at negative (reducing) Eh values. Fig. 5c likewise shows a higher probability of As at low SO₄ concentrations. High levels of SO₄ trigger SO₄ reduction, which produces sulfide that binds As and inhibits its release to the groundwater (Fendorf et al., 2010; Buschmann and Berg, 2009).

However, Fig. 5d and Fig. 5e show a high probability of As at high levels of PO₄³⁻ and DOC. The PO₄³⁻ may compete with As for the adsorption sites on Fe oxides, keeping the As dissolved in groundwater (Kinniburgh, 2001; Fendorf et al., 2010). A source of organic carbon is one of the requirements for an increase in As levels in groundwater because it drives the reductive dissolution of Fe oxides (Guo et al., 2019).

Machine learning models are a good initial step for detecting As contamination of groundwater. This study predicted the As contamination risk of wells in the Red River Delta, Vietnam, using four machine learning methods on publicly available data. The best model was the GBM, with accuracy, precision, sensitivity, and specificity of 98.7%, 100%, 95.2%, and 100%, respectively. The GBM also had the highest AUC, 92% and 96% for the PRC and ROC curves, respectively. The GBM built the best prediction model in this study because it performed best in the model assessment. This is attributed to the method’s flexibility: it can optimize different loss functions and offers several hyper-parameter tuning options that make the function fit very flexible. Beyond predicting groundwater contamination, the GBM has repeatedly proven to be one of the most powerful techniques for building prediction models in both classification and regression, providing predictions with high accuracy.

The two most important variables for the model were Eh and Fe. The PDPs of the five relevant features showed that the probability of elevated As is likely to be high at shallow well depth, high PO₄³⁻, negative Eh (reducing conditions), low SO₄ concentration, and high DOC. These conditions trigger the reductive dissolution of Fe oxides, which releases As into the groundwater.

The main limitation of this study is the amount of data. Larger amounts of data are required to build better models that generalize well. In addition, the presentation of the prediction results should be improved: a hazard map or open-access website could not be produced because of time limitations. These limitations could be addressed in future work.

This study shows that machine learning is an alternative method for predicting As contamination in groundwater that is faster, easier, and effective. The outcome of the prediction models is important as a basis for estimating groundwater As in areas without measurements. It could be used in decision making by local government and to raise awareness among residents of the Red River Delta.

The authors highly appreciate Lenny H. E. Winkel, Pham Thi Kim Trang, Vi Mai Lan, Caroline Stengel, Manouchehr Amini, Nguyen Thi Ha, Pham Hung Viet, and Michael Berg for the publicly-available hydrochemical As data of Red River Delta.

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2021R1A2C1094272).

Descriptive statistics of the features used for the model are provided separately in Appendix A: Supplementary data.

  1. Ayotte, J.D., Medalie, L., Qi, S.L., Backer, L.C. and Nolan, B.T. (2017) Estimating the high-arsenic domestic-well population in the conterminous United States. Env. Sci. Tech., v.51(21), p.12443-12454. doi: 10.1021/acs.est.7b02881
  2. Berg, M., Stengel, C., Trang, P.T.K., Viet, P.H., Sampson, M.L., Leng, M., Samreth, S. and Fredericks, D. (2007) Magnitude of arsenic pollution in the Mekong and Red River deltas-Cambodia and Vietnam. Sci. Tot. Environ., v.372, p.413-425. doi: 10.1016/j.scitotenv.2006.09.010
  3. Berg, M., Tran, H.C., Nguyen, T.C., Pham, H.V., Schertenleib, R. and Giger, W. (2001) Arsenic contamination of groundwater and drinking water in Vietnam: a human health threat. Env. Sci. Tech., v.35, p.2621-2626. doi: 10.1021/es010027y
  4. Breiman, L. (2001) Random forests. Mach. Learn., v.45, p.5-32. https://doi.org/10.1023/A:1010933404324
  5. Buschmann, J. and Berg, M. (2009) Impact of sulfate reduction on the scale of arsenic contamination in groundwater of the Mekong, Bengal and Red River deltas. Appl. Geochem., v.24, p.1278-1286. doi: 10.1016/j.apgeochem.2009.04.002
  6. Carrard, N., Foster, T. and Willetts, J. (2019) Groundwater as a source of drinking water in Southeast Asia and the Pacific: A multi-country review of current reliance and resource concerns. Water, v.11(8), 1605. https://doi.org/10.3390/w11081605
  7. Chen, T. and Guestrin, C. (2016) Xgboost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD, pp. 785-794. https://doi.org/10.1145/2939672.2939785
  8. Choi, K.-W., Park, S.-S., Kang, C.-U., Lee, J.H. and Kim, S.J. (2021) A comparison study of alum sludge and ferric hydroxide based absorbents for arsenic adsorption from mine water. Econ. Environ. Geol., v.54, p.689-698. https://doi.org/10.9719/EEG.2021.54.6.689
  9. Dramsch, J.S. (2020) 70 years of machine learning in geoscience in review. ADGEO. https://doi.org/10.1016/bs.agph.2020.08.002
  10. Dreiseitl, S. and Ohno-Machado, L. (2002) Logistic regression and artificial neural network classification models: a methodology review. JBI, v.35, p.352-359. https://doi.org/10.1016/S1532-0464(03)00034-0
  11. Fendorf, S., Michael, H.A. and van Geen, A. (2010) Spatial and temporal variations of groundwater arsenic in South and Southeast Asia. Science, v.328, p.1123-1127. doi: 10.1126/science.1172974
  12. Friedman, J.H. (2001) Greedy function approximation: a gradient boosting machine. Ann. Stat., p.1189-1232.
  13. Guo, H., Li, X., Xiu, W., He, W., Cao, Y., Zhang, D. and Wang, A. (2019) Controls of organic matter bioreactivity on arsenic mobility in shallow aquifers of the Hetao Basin, P.R. China. Journal of Hydrology, v.571, p.448-459. https://doi.org/10.1016/j.jhydrol.2019.01.076
  14. Hastie, T., Tibshirani, R. and Friedman, J. (2009) The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media. https://doi.org/10.1007/978-0-387-21606-5
  15. Kinniburgh, D. (2001) Arsenic contamination of groundwater in Bangladesh. Final report. http://www.bgs.ac.uk/research/groundwater/health/arsenic/Bangladesh/reports.
  16. Kwon, O.-H., Park, H.-S., Lee, J.S. and Ji, W.H. (2020) A field study on the application of pilot-scale vertical flow reactor system into the removal of Fe, As and Mn in mine drainage. Econ. Environ. Geol., v.53, p.695-701. https://doi.org/10.9719/EEG.2020.53.6.695
  17. Lee, J.H., Ji, W.H., Lee, J.S., Park, S.S., Choi, K.W., Kang, C.U. and Kim, S.J. (2020) A study of fluoride and arsenic adsorption from aqueous solution using alum sludge based absorbent. Econ. Environ. Geol., v.53, p.667-675. https://doi.org/10.9719/EEG.2020.53.6.667
  18. Liu, X., Lai, X. and Zhang, L. (2020) A hierarchical missing value imputation method by correlation-based K-Nearest Neighbors. In: Bi Y., Bhatia R., Kapoor S. (eds.) Intelligent Systems and Applications. Advances in Intelligent Systems and Computing, 1037. Springer, Cham. https://doi.org/10.1007/978-3-030-29516-5_38
  19. Mantovani, R.G., Rossi, A.L., Vanschoren, J., Bischl, B. and De Carvalho, A.C. (2015) Effectiveness of random search in svm hyper-parameter tuning, in: 2015 International Joint Conference on Neural Networks (IJCNN), IEEE. pp. 1-8. doi: 10.1109/IJCNN.2015.7280664
  20. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011) Scikit-learn: Machine learning in Python. J. Mach. Learn. Res., v.12, p.2825-2830. https://dl.acm.org/doi/10.5555/1953048.2078195
  21. Podgorski, J. and Berg, M. (2020) Global threat of arsenic in groundwater. Science, v.368, p.845-850. doi: 10.1126/science.aba1510
  22. Podgorski, J., Wu, R., Chakravorty, B. and Polya, D.A. (2020) Groundwater arsenic distribution in India by machine learning geospatial modeling. Int. J. Environ. Res. Public Health, v.17, p.7119. https://doi.org/10.3390/ijerph17197119
  23. Podgorski, J.E., Labhasetwar, P., Saha, D. and Berg, M. (2018) Prediction modelling and mapping of groundwater fluoride contamination throughout India. Environ. Sci. Technol., v.52(17), p.9889-9898. https://doi.org/10.1021/acs.est.8b01679
  24. Postma, D., Larsen, F., Thai, N.T., Trang, P.T.K., Jakobsen, R., Nhan, P.Q., Long, T.V., Viet, P.H. and Murray, A.S. (2012) Groundwater arsenic concentrations in Vietnam controlled by sediment age. Nat. Geosci., v.5, p.656-661. https://doi.org/10.1038/ngeo1540
  25. Ravenscroft, P., Brammer, H. and Richards, K. (2011) Arsenic pollution: a global synthesis. volume 94. John Wiley & Sons.
  26. Ravenscroft, P., Burgess, W.G., Ahmed, K.M., Burren, M. and Perrin, J. (2005) Arsenic in groundwater of the Bengal Basin, Bangladesh: Distribution, field relations, and hydrogeological setting. Hydrogeol. J., v.13, p.727-751. doi: 10.1007/s10040-003-0314-0
  27. Smith, R., Knight, R. and Fendorf, S. (2018) Overpumping leads to California groundwater arsenic threat. Nature Communications, v.9(2), p.115. doi: 10.1038/s41467-018-04475-3
  28. Strobl, C., Boulesteix, A.L., Kneib, T., Augustin, T. and Zeileis, A. (2008) Conditional variable importance for random forests. BMC Bioinformatics, v.9, p.307. doi: 10.1186/1471-2105-9-307
  29. Tan, Z., Yang, Q. and Zheng, Y. (2020) Machine learning models of groundwater arsenic spatial distribution in Bangladesh: Influence of Holocene sediment depositional history. Environ. Sci. Technol., v.54, p.9454-9463. https://doi.org/10.1021/acs.est.0c03617
  30. Touzani, S., Granderson, J. and Fernandes, S. (2018) Gradient boosting machine for modeling the energy consumption of commercial buildings. Energy and Buildings, v.158, p.1533-1543. https://doi.org/10.1016/j.enbuild.2017.11.039
  31. United Nations, Department of Economic and Social Affairs Population Division (2019) World Population Prospects 2019, Volume II: Demographic Profiles Vietnam. Available at: https://population.un.org/wpp/Graphs/1_Demographic%20Profiles/Viet%20Nam.pdf.
  32. Wallis, I., Prommer, H., Berg, M., Siade, A.J., Sun, J. and Kipfer, R. (2020) The river-groundwater interface as a hotspot for arsenic release. Nature Geoscience, v.13, p.288-295. doi: 10.1038/s41561-020-0557-6
  33. Winkel, L., Berg, M., Amini, M., Hug, S.J. and Johnson, C.A. (2008) Predicting groundwater arsenic contamination in Southeast Asia from surface parameters. Nat. Geosci., v.1, p.536-542.
  34. Winkel, L.H., Trang, P.T.K., Lan, V.M., Stengel, C., Amini, M., Ha, N.T., Viet, P.H. and Berg, M. (2011) Arsenic pollution of groundwater in Vietnam exacerbated by deep aquifer exploitation for more than a century. PNAS, v.108, p.1246-1251. doi: 10.1073/pnas.1011915108

Article

Research Paper

Econ. Environ. Geol. 2022; 55(2): 127-135

Published online April 30, 2022 https://doi.org/10.9719/EEG.2022.55.2.127

Copyright © THE KOREAN SOCIETY OF ECONOMIC AND ENVIRONMENTAL GEOLOGY.

Predicting As Contamination Risk in Red River Delta using Machine Learning Algorithms

Zheina J. Ottong1, Reta L. Puspasari1, Daeung Yoon2, Kyoung-Woong Kim1,*

1School of Earth Sciences and Environmental Engineering, Gwangju Institute of Science and Technology (GIST), Gwangju 61005, South Korea
2Chonnam National University, Gwangju 61186, South Korea

Correspondence to:*Corresponding author : kwkim@gist.ac.kr

Received: March 3, 2022; Revised: March 31, 2022; Accepted: April 3, 2022

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided original work is properly cited.

Abstract

Excessive presence of As level in groundwater is a major health problem worldwide. In the Red River Delta in Vietnam, several million residents possess a high risk of chronic As poisoning. The As releases into groundwater caused by natural process through microbially-driven reductive dissolution of Fe (III) oxides. It has been extracted by Red River residents using private tube wells for drinking and daily purposes because of their unawareness of the contamination. This long-term consumption of As-contaminated groundwater could lead to various health problems. Therefore, a predictive model would be useful to expose contamination risks of the wells in the Red River Delta Vietnam area. This study used four machine learning algorithms to predict the As probability of study sites in Red River Delta, Vietnam. The GBM was the best performing model with the accuracy, precision, sensitivity, and specificity of 98.7%, 100%, 95.2%, and 100%, respectively. In addition, it resulted the highest AUC of 92% and 96% for the PRC and ROC curves, with Eh and Fe as the most important variables. The partial dependence plot of As concentration on the model parameters showed that the probability of high level of As is related to the low number of wells’ depth, Eh, and SO4, along with high PO43− and NH4+. This condition triggers the reductive dissolution of iron phases, thus releasing As into groundwater.

Keywords groundwater arsenic, machine learning, predictive model, random forest, gradient boosting

Research Highlights

  • Arsenic contamination in groundwater is naturally occurred and affecting millions of populations around the world.

  • Machine learning algorithms were utilized to predict As contamination in groundwater.

  • Four machine learning algorithms were included to build the models.

  • The GBM and RF are the best performing models to predict As groundwater contamination.

1. Introduction

Excessive arsenic (As) in groundwater, often a product of natural processes, is a major health problem that affects several hundred million people around the world (Winkel et al., 2011) (Ravenscroft et al., 2011). For most affected populations, the problem arises mainly from the daily consumption of groundwater containing elevated levels of As. Long-term exposure to As through drinking water may lead to arsenicosis, whose symptoms include muscular weakness, skin pigmentation, and painful skin lesions, among others (Ravenscroft et al., 2011). Exposure to As also causes certain cancers and diseases of the liver, kidney, heart, brain, and lungs (Ravenscroft et al., 2011). Therefore, treatment of As in water is an emerging issue (Lee et al., 2020) (Kwon et al., 2020) (Choi et al., 2021).

Vietnam is a developing country in Southeast Asia and the 15th most populous country in the world, with a population of 96.5 million in 2019 (UN, 2019). According to Carrard et al. (2019), more than 55 million people in Vietnam rely on groundwater for drinking and daily use. Three million people in the Red River Delta are exposed to high levels of As (>10 µg L⁻¹, the WHO standard), and a further one million consume As levels above approximately 50 µg L⁻¹ daily by drinking untreated groundwater from private tube wells (Winkel et al., 2011) (Berg et al., 2001). The Red River Delta is one of the river basins in Vietnam that drain the Himalayas, leading to a downstream deposition of sediments containing As-bearing Fe oxides (Fendorf et al., 2010). Under anoxic conditions with a large supply of organic carbon, the microbial reduction of Fe(III) oxides releases As into the groundwater (Fendorf et al., 2010). Areas of the Red River Delta with high As concentrations also show elevated PO₄³⁻, NH₄⁺, and dissolved organic carbon (DOC) concentrations and negative redox potentials (Eh) (Winkel et al., 2011). A predictive model is therefore a useful preliminary step for identifying contaminated tube wells in the Red River Delta, where groundwater is the main drinking water source (Winkel et al., 2011) (Berg et al., 2007).

Machine learning (ML) is a computational approach that builds models (for classification, regression, and clustering) using inference and pattern recognition instead of a set of rules defined by the user (Dramsch, 2020). A model is built using a training set, and its performance is then evaluated on a test set of unseen data. Machine learning has been used to predict As risk in Bangladesh (Tan et al., 2020), India (Podgorski et al., 2020), and globally (Podgorski and Berg, 2020).

Smith et al. (2018) stated that random forest was the best fit for their research on the California groundwater arsenic threat: “The random forest models account for nonlinear relationships between multiple variables and handle outliers well”. The random forest algorithm produced models with an accuracy of up to 0.86, where model scores in general range from 0.5 (random model) to 1 (perfect model). In recent studies, random forest models have generally given better results than models built with other methods. They achieved an accuracy of 0.84 in a fluoride contamination study in India (Podgorski et al., 2018) and 0.82 in an arsenic contamination study in the USA (Ayotte et al., 2017). In the references mentioned, other methods such as gradient boosting machine, extreme gradient boosting, and logistic regression were also used to build models.

The purpose of this study is to provide information on As contamination in groundwater that is currently unavailable because of technological limitations and cost. To that end, this study provides prediction models of As contamination risk in the Red River Delta wells using four machine learning algorithms: random forest, gradient boosting machine, extreme gradient boosting, and logistic regression.

2. Materials and Methods

2.1. Study Site

The study site is situated in the Red River Delta in Vietnam, which stretches from 19°54′ to 21°36′ N latitude and from 105°00′ to 107°12′ E longitude. The soil profile is composed of Pleistocene and older sediments overlain by late Pleistocene-Holocene sediments through deposition (Winkel et al., 2008) (Ravenscroft et al., 2005). High levels of As are commonly found in shallower Holocene aquifers, while deeper Pliocene-Pleistocene aquifers contain significantly lower levels of As (Wallis et al., 2020) (Postma et al., 2012). Further details regarding the study site are presented in Figure 1, where the circle colour corresponds to the As concentration at each sampling site.

Figure 1. Map of the Red River Delta (adapted from Winkel et al., 2011).

2.2. Data Acquisition and Description

Groundwater hydrochemical data were obtained from the literature (Winkel et al., 2011). A total of 512 samples with 38 hydrochemical parameters, collected from 2005 to 2007, were adopted to develop the predictive models in this study. The As concentration ranged from 0 to 810 µg/L (median: 2 µg/L). It was binarized at a threshold of 10 µg/L, the water-quality standard set by the WHO: 1 for samples with As concentration > 10 µg/L and 0 otherwise. Twenty-seven percent of the samples had As values exceeding 10 µg/L and thus belong to the positive class (1). The binarized data were used as the label, or dependent variable, for the machine learning algorithms. The distribution of the As data is skewed to the right.
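The labelling rule described above can be sketched in a few lines of Python. This is only an illustration: the concentrations below are made-up values, not study data, and the names `binarize` and `WHO_LIMIT_UG_PER_L` are hypothetical.

```python
# Hypothetical As concentrations in µg/L (illustrative values, not study data).
as_ug_per_l = [0.0, 2.0, 10.0, 10.5, 810.0]

WHO_LIMIT_UG_PER_L = 10.0  # WHO drinking-water standard used as the threshold


def binarize(concentrations, threshold=WHO_LIMIT_UG_PER_L):
    """Return 1 for samples strictly above the threshold, otherwise 0."""
    return [1 if c > threshold else 0 for c in concentrations]


labels = binarize(as_ug_per_l)
positive_fraction = sum(labels) / len(labels)
print(labels, positive_fraction)
```

Note that a sample at exactly 10 µg/L falls in the negative class, because the rule uses a strict inequality.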

The features used in this study include well depth, which varies from 5 m to 135 m (median: 30 m), temperature (T) from 20°C to 29°C (median: 26°C), and Fe from 0 mg/L to 140 mg/L (median: 2 mg/L). Eh ranged from -203 mV to 504 mV (median: -64 mV), HCO₃⁻ from 0 mg/L to 1,540 mg/L (median: 170 mg/L), and SO₄ from 0 to 890 mg/L (median: 1 mg/L). DOC ranged from 0 to 58 mg/L (median: 2 mg/L), total I from 2 µg/L to 480 µg/L (median: 28 µg/L), and PO₄³⁻ from 0 to 6 mg/L (median: 0 mg/L). The features also include pH from 3 to 8 (median: 7), dissolved oxygen from 0 mg/L to 5 mg/L (median: 0 mg/L), and PO₄-P from 0 to 6 mg/L (median: 0 mg/L). Other features include Mn from 0 mg/L to 16 mg/L (median: 0 mg/L), Sr from 0 mg/L to 620 mg/L (median: 68 mg/L), and Li from 0 µg/L to 29 µg/L (median: 3 µg/L). Some values were missing for depth, T, HCO₃⁻, I, dissolved O₂, Li, and Sr. Further details on these features are presented in the Supplementary data.

2.3. K-Nearest Neighbor

According to Liu et al. (2020), the K-Nearest Neighbor (KNN) imputation method estimates a missing value by computing the distances between each incomplete record and the complete records. This method is used in this study to fill the missing values in the hydrochemical data so that better prediction models can be built.

The distances determine the K nearest neighbors, and the mean of their values is used to impute the missing value in the dataset. This method is widely used because of its relatively high accuracy. The Euclidean distance is given by

$$ d_{ij} = \sqrt{(x_i - w_j)^{T}(x_i - w_j)}, $$

where d_ij is the Euclidean distance between the two points x_i and w_j, and the superscript T denotes the transpose, which gives the 2-norm (Euclidean) distance.
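A minimal, dependency-free sketch of this idea is given below. It is a simplified illustration and not the scikit-learn KNN imputer used in the study: distances are computed only over the features observed in the incomplete record, and neighbors are drawn only from complete records. The function name `knn_impute` and the sample records are hypothetical.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest complete rows."""
    complete = [r for r in rows if None not in r]
    filled = []
    for row in rows:
        if None not in row:
            filled.append(list(row))
            continue
        observed = [j for j, v in enumerate(row) if v is not None]

        def distance(other):
            # Euclidean distance restricted to the features observed in `row`.
            return math.sqrt(sum((row[j] - other[j]) ** 2 for j in observed))

        neighbours = sorted(complete, key=distance)[:k]
        filled.append([
            v if v is not None else sum(n[j] for n in neighbours) / len(neighbours)
            for j, v in enumerate(row)
        ])
    return filled


# Hypothetical (depth in m, temperature in °C) records; None marks a missing value.
data = [
    [10.0, 25.0],
    [12.0, 26.0],
    [100.0, 22.0],
    [11.0, None],
]
imputed = knn_impute(data, k=2)
print(imputed[3])
```

In this toy case the two wells closest in depth to the incomplete record supply the imputed temperature, while the distant 100 m well is ignored.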

2.4. Machine Learning Algorithm

2.4.1. Logistic Regression

Logistic Regression (LR) is a generalized linear model used for classification. LR uses a functional form f and a parameter vector α to express P(y|x), the probability of the class label y given the input variable x, as

$$ P(y \mid x) = f(x, \alpha). $$

Specifically, it calculates the class membership probability for one of the two categories in the data set,

$$ P(1 \mid x, \alpha) = \frac{1}{1 + e^{-(\alpha \cdot x)}}, $$

and

$$ P(0 \mid x, \alpha) = 1 - P(1 \mid x, \alpha). $$

The hyperplane of all points x satisfying α · x = 0, that is, the points for which P(1|x, α) = P(0|x, α) = 0.5, forms the decision boundary between the two classes (Dreiseitl and Ohno-Machado, 2002).
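The decision boundary can be checked numerically with a short sketch, assuming the standard logistic form P(1|x, α) = 1 / (1 + exp(-(α · x))). The parameter vector and the two input points are hypothetical values chosen so that one point lies exactly on the boundary.

```python
import math


def prob_positive(x, alpha):
    """P(1 | x, alpha) = 1 / (1 + exp(-(alpha . x)))."""
    z = sum(a * xi for a, xi in zip(alpha, x))
    return 1.0 / (1.0 + math.exp(-z))


alpha = [1.0, -2.0]          # hypothetical parameter vector
on_boundary = [2.0, 1.0]     # alpha . x = 0
positive_side = [3.0, 1.0]   # alpha . x = 1

p1 = prob_positive(on_boundary, alpha)
p0 = 1.0 - p1                # P(0 | x, alpha)
print(p1, p0)
print(prob_positive(positive_side, alpha))
```

A point on the hyperplane receives equal probability for both classes, and moving to the side where α · x > 0 raises the probability of the positive class above 0.5.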

2.4.2. Random Forest

The Random Forest (RF) algorithm is a popular machine learning algorithm that lets an ensemble of decision trees vote for the most popular class. Methods used to grow each tree in the ensemble include bagging (bootstrap aggregating), random split selection, and the random subspace method (Breiman, 2001). The subspace randomization scheme is combined with bagging, which resamples the training dataset with replacement every time a single tree is grown. RF also provides a variable importance measure, which indicates how much each predictor variable contributes to the model (Breiman, 2001). According to Breiman (2001), each individual tree is grown using the CART method without pruning.
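Two of the ingredients named above, resampling with replacement and majority voting, can be sketched as follows. This is not a full random forest (no trees are grown); the tree votes are hypothetical, and the helper names `bootstrap_sample` and `majority_vote` are introduced only for illustration.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the bagging step)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(votes):
    """Return the class predicted by most trees."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(0)
training_rows = list(range(10))  # stand-ins for training samples
sample = bootstrap_sample(training_rows, rng)

tree_votes = [1, 0, 1, 1, 0]  # hypothetical predictions of five trees for one well
print(len(sample), majority_vote(tree_votes))
```

Because sampling is done with replacement, some training rows typically appear more than once in a bootstrap sample while others are left out, which is what makes the individual trees differ.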

2.4.3. Gradient Boosting Machine

Gradient Boosting Machine (GBM) is a tree boosting method used for classification and regression (Friedman, 2001). The GBM iteratively combines several “weak learners” to reduce the loss function, creating a “strong learner” with improved predictive performance. Four hyper-parameters need to be tuned in the GBM: the depth of decision trees, the number of iterations or decision trees, the learning rate, and the fraction of data that is used at each iterative step (Touzani et al., 2018).
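The iterative combination of weak learners can be illustrated with a deliberately small sketch. Several simplifications are assumed here: squared loss is used instead of a classification loss, each weak learner is a depth-1 tree (a stump) on a single feature, and no subsampling is applied. The data (Eh-like values with binary labels) and the function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on one feature that minimises squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean


def boost(xs, ys, n_trees=20, learning_rate=0.3):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps, losses = [], []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]  # negative gradient
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, losses


# Hypothetical Eh values (mV) and binary labels (1 = high As).
xs = [-200.0, -150.0, -100.0, -50.0, 50.0, 100.0, 200.0, 400.0]
ys = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
base, stumps, losses = boost(xs, ys)
print(losses[0], losses[-1])
```

In this sketch `n_trees` and `learning_rate` correspond to two of the four hyper-parameters listed above; tree depth is fixed at one and the data fraction per step is fixed at one.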

2.4.4. Extreme Gradient Boosting

Extreme Gradient Boosting, or XGBoost (XGB), is an open-source scalable machine learning system for tree boosting (Chen and Guestrin, 2016). It is popular for its scalability: it runs more than ten times faster than existing popular solutions on a single machine. This scalability is due to innovations such as a highly scalable end-to-end tree boosting system, a weighted quantile sketch for efficient proposal calculation, a sparsity-aware algorithm for parallel tree learning, and a cache-aware block structure for out-of-core tree learning (Chen and Guestrin, 2016).

2.5. Model Development

The models were developed using scikit-learn (Pedregosa et al., 2011), a machine learning module for Python, on a MacBook Air (13-inch, Early 2015) with a 1.6 GHz Intel Core i5 processor. The missing values mentioned in Section 2.2 were imputed using the KNN imputer (see Section 2.3). In addition, the features were scaled using StandardScaler for the Logistic Regression model. The dataset was randomly split into training and test sets at a 7:3 ratio. The split was stratified to maintain the ratio of positive and negative outcomes in both sets. Further details of the model development plan are presented as a flowchart in Fig. 2.
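The stratified 7:3 split and the standardization step can be sketched without any library, as below. This is an illustration of the two ideas only, not the scikit-learn code used in the study; the labels are hypothetical and the helper names are introduced here. The scaling uses the population standard deviation of the training values.

```python
import random
from statistics import mean, pstdev


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) keeping the class ratio in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return sorted(train_idx), sorted(test_idx)


def standardize(train_values, values):
    """Scale values with the mean and standard deviation of the training set."""
    mu, sigma = mean(train_values), pstdev(train_values)
    return [(v - mu) / sigma for v in values]


labels = [1] * 30 + [0] * 70   # hypothetical labels, 30% positive
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx), sum(labels[i] for i in test_idx))
```

Computing the scaling statistics from the training set alone keeps information from the test set out of the model fitting.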

Figure 2. Flowchart of machine learning analysis model.

Hyper-parameter tuning is an optimization problem that seeks the hyper-parameter values giving the best predictive performance for a particular machine learning algorithm (Mantovani et al., 2015). Mantovani et al. (2015) stated that the predictive performance of models using Random Search is comparable to meta-heuristics and Grid Search at a lower computational cost. A random search with 10-fold cross-validation (CV) was performed 10 times for each split to tune the hyper-parameters of each machine learning model. The AUC-ROC (area under the receiver operating characteristic curve) was the metric for choosing the best model, which was then used to predict the test set to evaluate the model’s performance. The model with the best performance is then used to predict As contamination in groundwater.
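The structure of a random search with k-fold cross-validation can be outlined as follows. This is a schematic sketch under stated assumptions: `toy_score` is a hypothetical stand-in for "fit a model on the training folds and return its AUC-ROC on the validation fold", and the parameter names and values in `space` are illustrative, not the search space used in the study.

```python
import random


def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k roughly equal folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, val in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val


def random_search(param_space, score_fn, n_samples, n_iter=10, k=10, seed=0):
    """Try n_iter random configurations; keep the one with the best mean CV score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in param_space.items()}
        scores = [score_fn(params, tr, va) for tr, va in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def toy_score(params, train_idx, val_idx):
    """Hypothetical stand-in for fitting on train_idx and scoring on val_idx."""
    return 1.0 - abs(params["learning_rate"] - 0.1) - 0.01 * abs(params["max_depth"] - 3)


space = {"learning_rate": [0.01, 0.1, 0.3], "max_depth": [2, 3, 5]}
best_params, best_score = random_search(space, toy_score, n_samples=100)
print(best_params, best_score)
```

Unlike a grid search, only `n_iter` randomly drawn configurations are evaluated, which is the source of the lower computational cost noted by Mantovani et al. (2015).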

2.6. Variable Importance and Partial Dependence Plot

Variable importance measures the impact or causal effect of the predictor variables (features) in predicting the dependent variable (Strobl et al., 2008). The variable importance was obtained for the RF and GBM algorithms to find out which features dominated the prediction of the model. The partial dependence plot (PDP) shows the dependence between the dependent variable and a (set of) feature(s) (Pedregosa et al., 2011) (Hastie et al., 2009). The PDPs of the relevant features were plotted to study how each feature affects the As contamination risk.
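The computation behind a one-feature partial dependence curve can be sketched as follows: the feature of interest is set to each grid value for every sample, and the model predictions are averaged. The model `toy_model`, its coefficients, and the sample rows are hypothetical and serve only to make the example runnable.

```python
def partial_dependence(model, rows, feature_index, grid):
    """Average model prediction when one feature is forced to each grid value."""
    curve = []
    for value in grid:
        preds = []
        for row in rows:
            modified = list(row)
            modified[feature_index] = value
            preds.append(model(modified))
        curve.append(sum(preds) / len(preds))
    return curve


def toy_model(row):
    """Hypothetical risk score: higher for negative Eh and shallow wells."""
    eh_mv, depth_m = row
    return (0.6 if eh_mv < 0 else 0.1) + (0.3 if depth_m < 50 else 0.0)


# Hypothetical (Eh in mV, depth in m) samples.
rows = [[-100.0, 20.0], [150.0, 30.0], [-50.0, 80.0], [200.0, 120.0]]
eh_curve = partial_dependence(toy_model, rows, feature_index=0, grid=[-100.0, 100.0])
print(eh_curve)
```

The averaging over all samples is what separates the effect of the chosen feature from the values the other features happen to take.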

3. Results and Discussions

3.1. Model Performance

The confusion matrix results, True Negative (TN), False Positive (FP), False Negative (FN) and True Positive (TP), are listed in Table 1. These values were used to determine the performance of each classifier on the test set. The GBM had the highest TP and lowest FN, 40 and 2, respectively. It also predicted the negative outcomes perfectly, with TN and FP equal to 112 and 0, respectively. This indicates that the GBM is the best-performing model among the four methods.

Table 1. Selection of the best hyperparameter from the confusion matrix to predict the test set.

| Model | TN | FP | FN | TP |
|---|---|---|---|---|
| LR | 110 | 2 | 11 | 31 |
| RF | 111 | 1 | 8 | 34 |
| GBM | 112 | 0 | 2 | 40 |
| XGB | 111 | 1 | 10 | 32 |

TN: True Negative, FP: False Positive, FN: False Negative, TP: True Positive.



Based on the model assessment, the GBM and the RF performed best and second best, respectively. The assessment is based on each model’s accuracy, precision, sensitivity, and specificity, which are presented for the GBM and RF in Table 2.

Table 2. Accuracy, precision, sensitivity, and specificity of GBM and RF in percent (%).

| Model | Accuracy | Precision | Sensitivity | Specificity |
|---|---|---|---|---|
| GBM | 98.7 | 100 | 95.2 | 100 |
| RF | 94.1 | 97.1 | 80.9 | 99.1 |
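The four metrics in Table 2 follow directly from the confusion-matrix counts in Table 1. The sketch below uses the standard definitions (accuracy = (TP + TN) / total, precision = TP / (TP + FP), sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)); the function name `metrics` is introduced here for illustration.

```python
def metrics(tn, fp, fn, tp):
    """Return accuracy, precision, sensitivity and specificity in percent."""
    total = tn + fp + fn + tp
    return {
        "accuracy": 100 * (tp + tn) / total,
        "precision": 100 * tp / (tp + fp),
        "sensitivity": 100 * tp / (tp + fn),
        "specificity": 100 * tn / (tn + fp),
    }


# Confusion-matrix counts as listed in Table 1.
gbm = metrics(tn=112, fp=0, fn=2, tp=40)
rf = metrics(tn=111, fp=1, fn=8, tp=34)
print({k: round(v, 1) for k, v in gbm.items()})
print({k: round(v, 1) for k, v in rf.items()})
```

The computed values agree with Table 2 to within 0.1 percentage point; the small differences for the RF accuracy and sensitivity are consistent with the table values having been truncated rather than rounded.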


Fig. 3 shows (3a) the Precision-Recall Curve (PRC) and (3b) the Receiver Operating Characteristic (ROC) curve for all models. The PRCs lie toward the upper right, with area under the curve (AUC) values ranging from 88% to 92%. The ROC curves lie toward the upper left and show good model performance, with AUC values ranging from 94% to 96%. The GBM model had the highest AUC for both curves, 92% for the PRC and 96% for the ROC curve.

Figure 3. (a) Precision-recall curve and (b) Receiver Operating Characteristic (ROC) curve for all models.
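For readers unfamiliar with the AUC of the ROC curve, the following sketch computes it from scratch by sweeping a threshold over the predicted scores and integrating the resulting (false positive rate, true positive rate) points with the trapezoidal rule. The labels and scores are hypothetical and do not come from the models in this study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the trapezoidal rule over score thresholds."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical true labels and predicted probabilities for six wells.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(roc_auc(labels, scores))
```

An AUC of 1 corresponds to a model that ranks every positive sample above every negative one, and 0.5 to a random ranking, which is why curves closer to the upper left indicate better performance.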

3.2. Variable Importance and PDP

The variable importance projection was plotted in Fig. 4 for the RF (4a) and GBM (4b) models. Both models have Eh as the most important variable, followed by Fe. Low values of Eh indicate reducing conditions which trigger the reductive dissolution of Fe oxides that releases the As from the sediments and into the groundwater.

Figure 4. Variable importance projection of (a) RF and (b) GBM model.

High As concentrations are usually found in groundwater in contact with shallower, younger sediments (Wallis et al., 2020). The PDP for the RF model, shown in Fig. 5 (5a), indicates a higher probability of elevated As in shallower aquifers at depths < 50 m; the probability decreases sharply at depths > 65 m.

Figure 5. PDP of As concentration with (a) depth, (b) Eh, (c) SO₄, (d) PO₄³⁻ and (e) DOC for the RF model.

Fig. 5 shows that elevated As is more probable at negative (reducing) Eh values (5b). Panel (5c) likewise shows a higher probability of As at low SO₄ concentrations. High levels of SO₄ trigger SO₄ reduction, which produces sulfide that binds As and inhibits its release into the groundwater (Fendorf et al., 2010) (Buschmann and Berg, 2009).

In contrast, (5d) and (5e) show a high probability of As at high levels of PO₄³⁻ and DOC. PO₄³⁻ may compete with As for the adsorption sites on Fe oxides, keeping the As dissolved in groundwater (Kinniburgh, 2001) (Fendorf et al., 2010). A source of organic carbon is one of the requirements for elevated As levels in groundwater because it drives the reductive dissolution of Fe oxides (Guo et al., 2019).

4. Conclusions

Machine learning models are a good initial step for detecting As contamination of groundwater. This study predicted the As contamination risk of wells in the Red River Delta, Vietnam using four machine learning methods on publicly available data. The best model that emerged was the GBM, with an accuracy, precision, sensitivity, and specificity of 98.7%, 100%, 95.2%, and 100%, respectively. The GBM also had the highest AUC, 92% and 96% for the PRC and ROC curves, respectively. This performance is attributed to the method’s flexibility: it can optimize different loss functions and offers several hyper-parameter tuning options that make the function fit very flexible. Beyond groundwater contamination, the GBM has repeatedly proven to be one of the most powerful techniques for building prediction models in both classification and regression.

The two most important variables for the model were Eh and Fe. The PDPs of the five relevant features showed that As probability is likely to be high at shallower well depth, high PO₄³⁻, negative Eh (reducing conditions), low SO₄ concentration, and high DOC, conditions that trigger the reductive dissolution of Fe oxides, releasing As into the groundwater.

The main limitation of this study is the amount of data. Larger amounts of data are required to build better models that generalize more reliably. In addition, the presentation of the prediction results should be improved: a hazard map or open-access website could not be produced because of time constraints. These limitations could be addressed in future work.

This study shows that machine learning offers an alternative method for predicting As contamination in groundwater that is faster, easier, and effective. The outcome of the prediction models is important as a basis for estimating As in groundwater in unsampled areas. It could support decision making by local governments and raise awareness among residents of the Red River Delta.

Acknowledgments

The authors highly appreciate Lenny H. E. Winkel, Pham Thi Kim Trang, Vi Mai Lan, Caroline Stengel, Manouchehr Amini, Nguyen Thi Ha, Pham Hung Viet, and Michael Berg for the publicly-available hydrochemical As data of Red River Delta.

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2021R1A2C1094272).

Appendix A Supplementary data

Descriptive statistics of the features used for the model are provided separately in Appendix A: Supplementary data.


References

  1. Ayotte, J.D., Medalie, L., Qi, S.L., Backer, L.C. and Nolan, B.T. (2017) Estimating the high-arsenic domestic-well population in the conterminous United States. Env. Sci. Tech., v.51(21), p.12443-12454. doi: 10.1021/acs.est.7b02881
  2. Berg, M., Stengel, C., Trang, P.T.K., Viet, P.H., Sampson, M.L., Leng, M., Samreth, S. and Fredericks, D. (2007) Magnitude of arsenic pollution in the Mekong and red river deltas-Cambodia and Vietnam. Sci. Tot. Environ., v.372, p.413-425. doi: 10.1016/j.scitotenv.2006.09.010
  3. Berg, M., Tran, H.C., Nguyen, T.C., Pham, H.V., Schertenleib, R., Giger, W. (2001) Arsenic contamination of groundwater and drinking water in Vietnam: a human health threat. Env. Sci. Tech., v.35, p.2621-2626. doi: 10.1021/es010027y
  4. Breiman, L. (2001) Random forests. Mach. Learn 45, 5-32. doi: https://doi.org/10.1023/A:1010933404324
  5. Buschmann, J. and Berg, M. (2009) Impact of sulfate reduction on the scale of arsenic contamination in groundwater of the Mekong, Bengal and red river deltas. Appl. Geochem., v.24, p.1278-1286. doi: 10.1016/j.apgeochem.2009.04.002
  6. Carrard, N., Foster, T. and Willetts, J. (2019) Groundwater as a source of drinking water in Southeast Asia and the Pacific: A multi-country review of current reliance and resource concerns. Water 2019 11, 1605. https://doi.org/10.3390/w11081605
  7. Chen, T. and Guestrin, C. (2016) Xgboost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD, pp. 785-794. https://doi.org/10.1145/2939672.2939785
  8. Choi, K.-W., Park, S.-S., Kang, C.-U., Lee, J.H. and Kim, S.J. (2021) A comparison study of alum sludge and ferric hydroxide based absorbents for arsenic adsorption from mine water. Econ. Environ. Geol., v.54, p.689-698. https://doi.org/10.9719/EEG.2021.54.6.689
  9. Dramsch, J.S. (2020) 70 years of machine learning in geoscience in review. ADGEO. https://doi.org/10.1016/bs.agph.2020.08.002
  10. Dreiseitl, S. and Ohno-Machado, L. (2002) Logistic regression and artificial neural network classification models: a methodology review. JBI, v.35, p.352-359. https://doi.org/10.1016/S1532-0464(03)00034-0
  11. Fendorf, S., Michael, H.A. and van Geen, A. (2010) Spatial and temporal variations of groundwater arsenic in South and Southeast Asia. Science, v.328, p.1123-1127. doi: 10.1126/science.1172974
  12. Friedman, J.H. (2001) Greedy function approximation: a gradient boosting machine. Ann. Stat., p.1189-1232.
  13. Guo, H., Li, X., Xiu, W., He, W., Cao, Y., Zhang, D. and Wang, A. (2019) Controls of organic matter bioreactivity on arsenic mobility in shallow aquifers of the Hetao Basin, P.R. China. Journal of Hydrology, v.571, p.448-459. https://doi.org/10.1016/j.jhydrol.2019.01.076
  14. Hastie, T., Tibshirani, R. and Friedman, J. (2009) The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media. https://doi.org/10.1007/978-0-387-21606-5
  15. Kinniburgh, D. (2001) Arsenic contamination of groundwater in Bangladesh. Final report. http://www.bgs.ac.uk/research/groundwater/health/arsenic/Bangladesh/reports.
  16. Kwon, O.-H., Park, H.-S., Lee, J.S. and Ji, W.H. (2020) A field study on the application of pilot-scale vertical flow reactor system into the removal of Fe, As and Mn in mine drainage. Econ. and Environ. Geol., v.53, p.695-701. https://doi.org/10.9719/EEG.2020.53.6.695
  17. Lee, J.H., Ji, W.H., Lee, J.S., Park, S.S., Choi, K.W., Kang, C.U. and Kim, S.J. (2020) A study of fluoride and arsenic adsorption from aqueous solution using alum sludge based absorbent. Econ. Environ. Geol., v.53, p.667-675. https://doi.org/10.9719/EEG.2020.53.6.667
  18. Liu, X., Lai, X. and Zhang, L. (2020) A hierarchical missing value imputation method by correlation-based K-Nearest Neighbors. In: Bi Y., Bhatia R., Kapoor S. (eds.) Intelligent Systems and Applications. Advances in Intelligent Systems and Computing, 1037. Springer, Cham. https://doi.org/10.1007/978-3-030-29516-5_38
  19. Mantovani, R.G., Rossi, A.L., Vanschoren, J., Bischl, B. and De Carvalho, A.C. (2015) Effectiveness of random search in svm hyper-parameter tuning, in: 2015 International Joint Conference on Neural Networks (IJCNN), IEEE. pp. 1-8. doi: 10.1109/IJCNN.2015.7280664
  20. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011) Scikit-learn: Machine learning in Python. J. Mach. Learn. Res., v.12, p.2825-2830. https://dl.acm.org/doi/10.5555/1953048.2078195
  21. Podgorski, J. and Berg, M. (2020) Global threat of arsenic in groundwater. Science, v.368, p.845-850. doi: 10.1126/science.aba1510
  22. Podgorski, J., Wu, R., Chakravorty, B. and Polya, D.A. (2020) Groundwater arsenic distribution in India by machine learning geospatial modeling. Int. J. Environ. Res. Public Health, v.17, p.7119. https://doi.org/10.3390/ijerph17197119
  23. Podgorski, J.E., Labhasetwar, P., Saha, D. and Berg, M. (2018) Prediction modelling and mapping of groundwater fluoride contamination throughout India. Environ. Sci. Technol., v.52(17), p.9889-9898. https://doi.org/10.1021/acs.est.8b01679
  24. Postma, D., Larsen, F., Thai, N.T., Trang, P.T.K., Jakobsen, R., Nhan, P.Q., Long, T.V., Viet, P.H. and Murray, A.S. (2012) Groundwater arsenic concentrations in Vietnam controlled by sediment age. Nat. Geosci., v.5, p.656-661. https://doi.org/10.1038/ngeo1540
  25. Ravenscroft, P., Brammer, H. and Richards, K. (2011) Arsenic pollution: a global synthesis. volume 94. John Wiley & Sons.
  26. Ravenscroft, P., Burgess, W.G., Ahmed, K.M., Burren, M. and Perrin, J. (2005) Arsenic in groundwater of the bengal basin, bangladesh: Distribution, field relations, and hydrogeological setting. Hydrogeol. J., v.13, p.727-751. doi: 10.1007/s10040-003-0314-0
  27. Smith, R., Knight, R. and Fendorf, S. (2018) Overpumping leads to California groundwater arsenic threat. Nature Communications, v.9(2), p.115. doi: 10.1038/s41467-018-04475-3
  28. Strobl, C., Boulesteix, A.L., Kneib, T., Augustin, T. and Zeileis, A. (2008) Conditional variable importance for random forests. BMC Bioinformatics, v.9, p.307. doi: 10.1186/1471-2105-9-307
  29. Tan, Z., Yang, Q. and Zheng, Y. (2020) Machine learning models of groundwater arsenic spatial distribution in bangladesh: Influence of holocene sediment depositional history. Environ. Sci. Technol., v.54, p.9454-9463. https://doi.org/10.1021/acs.est.0c03617
  30. Touzani, S., Granderson, J. and Fernandes, S. (2018) Gradient boosting machine for modeling the energy consumption of commercial buildings. Energy and Buildings, v.158, p.1533-1543. https://doi.org/10.1016/j.enbuild.2017.11.039
  31. United Nations, Department of Economic and Social Affairs Population Division (2019) World Population Prospects 2019, Volume II: Demographic Profiles Vietnam. Available at: https://population.un.org/wpp/Graphs/1_Demographic%20Profiles/Viet%20Nam.pdf.
  32. Wallis, I., Prommer, H., Berg, M., Siade, A.J., Sun, J. and Kipfer, R. (2020) The river-groundwater interface as a hotspot for arsenic release. Nature Geoscience, v.13, p.288-295. doi: 10.1038/s41561-020-0557-6
  33. Winkel, L., Berg, M., Amini, M., Hug, S.J. and Johnson, C.A. (2008) Predicting groundwater arsenic contamination in Southeast Asia from surface parameters. Nat. Geosci., v.1, p.536-542.
  34. Winkel, L.H., Trang, P.T.K., Lan, V.M., Stengel, C., Amini, M., Ha, N.T., Viet, P.H. and Berg, M. (2011) Arsenic pollution of groundwater in Vietnam exacerbated by deep aquifer exploitation for more than a century. PNAS, v.108, p.1246-1251. doi: 10.1073/pnas.1011915108
Economic and Environmental Geology

pISSN 1225-7281
eISSN 2288-7962