Econ. Environ. Geol. 2024; 57(3): 329-342
Published online June 30, 2024
https://doi.org/10.9719/EEG.2024.57.3.329
© THE KOREAN SOCIETY OF ECONOMIC AND ENVIRONMENTAL GEOLOGY
Correspondence to : *vellingiri.j@vit.ac.in
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided original work is properly cited.
The rapid increase in nitrate pollution of river water poses an immediate threat to public health and the environment. This contamination is primarily due to human activities, including the overuse of nitrogenous fertilizers in agriculture and the discharge of nitrate-rich industrial effluents into rivers. As a result, the accurate prediction and identification of contaminated areas has become a crucial and challenging task for researchers. To address this problem, this work predicts nitrate contamination using machine learning approaches. This paper presents a novel approach that combines Grey Wolf Optimizer (GWO) based feature selection with a stacked ensemble for predicting nitrate pollution in the Cauvery Delta region of Tamilnadu, India. The proposed method is evaluated using a Cauvery River dataset from the Tamilnadu Pollution Control Board. The proposed method shows excellent performance, achieving an accuracy of 93.31%, a precision of 93%, a sensitivity of 97.53%, a specificity of 94.28%, an F1-score of 95.23%, and an ROC score of 95%. These results demonstrate the effectiveness of the proposed method in accurately predicting nitrate pollution in river water and can ultimately support informed decisions to tackle these critical environmental problems.
Keywords nitrate prediction, machine learning, stacked ensemble, decision tree, random forest
Nitrate contamination in river water occurs naturally and affects millions worldwide.
Machine learning algorithms were used to predict nitrate (NO3) contamination in river water.
The study utilized a grey wolf optimization (GWO) algorithm to select relevant features from the dataset.
Models were built using a stacked ensemble and four individual machine learning algorithms.
The GWO-stacked ensemble model outperformed the others in predicting NO3 river water contamination.
The health of billions of people worldwide faces a significant threat due to the extensive pollution of rivers with high levels of nitrogen compounds, particularly ammonia and nitrate (NO3) (Bagherzadeh et al., 2021). This critical issue arises from the regular consumption of river water, which often contains elevated nitrate (NO3) concentrations, posing serious health risks. Prolonged exposure to nitrate (NO3) in drinking water can result in a range of health conditions such as blue baby syndrome, diabetes, miscarriages, stomach cancer, and thyroid disorders (Yang et al., 2021). The detrimental impact of these health hazards is substantial, contributing to a significant portion of global diseases and cancers (Chen et al., 2017). As a result, researchers globally are actively exploring innovative approaches to address and mitigate the consequences of river water contamination (Kumar et al., 2020).
Tamil Nadu, a rapidly growing state projected to become the third most populous with over 8 million residents, faces significant water challenges. Keerthan et al. (2023) highlight that more than five million individuals in Tamil Nadu rely on the Cauvery River for their daily water requirements. However, the river water in many areas consistently exceeds the permissible nitrate (NO3) limit of 45 mg/L (Bis, 2012) throughout the year. The Cauvery River delta, a vital agricultural region, grapples with heightened nitrate (NO3) levels attributed to extensive nitrogen absorption from farming practices. Human-induced factors like agricultural runoff, sewage plant discharges, and nitrogenous waste oxidation in humans and animals are key contributors to the elevated nitrate (NO3) concentrations in the Cauvery Delta region.
The regions within the Cauvery River delta exhibiting elevated nitrate (NO3) levels also demonstrate increased concentrations of Ca, Cl, K, Mg, and Na, alongside reduced levels of SO4 (RamyaPriya et al., 2023; Tamilmani et al., 2023). Predicting nitrate (NO3) levels accurately in river systems poses a significant challenge for environmental engineers due to the complex interplay of various factors. In response to this challenge, recent advancements in machine learning and deep learning techniques have shown promise in environmental science risk prediction. These advanced techniques excel in unravelling intricate relationships within vast datasets, handling complex patterns, and adapting continuously, offering a more robust approach compared to traditional statistical methods.
Several machine-learning methods play a crucial role in predicting river water quality, including Artificial neural networks (He et al., 2011), Adaptive network-based fuzzy inference system (Azad et al., 2018), Decision Tree (Lu et al., 2022), Random Forest (Wheeler et al., 2015), and Support vector machines (Arabgol et al., 2016). Despite the effectiveness of these techniques in water quality prediction, their application in assessing nitrate (NO3) contamination risks remains limited, lacking an integrated approach. To address this gap, a novel framework is proposed in this study to comprehensively evaluate the risk of nitrate (NO3) pollution. The framework focuses on developing a water quality assessment system that predicts contamination by selecting significant features to enhance classification accuracy, improve detection quality, and reduce processing time. The feature selection process relies heavily on Grey Wolf Optimization (GWO) due to its robustness and ability to identify relevant features efficiently. GWO aligns well with practical engineering challenges as it is simple, fast, precise, and easy to implement (Sharma et al., 2023). Additionally, the study introduces stacked machine learning techniques to enhance the accuracy of nitrate (NO3) contamination prediction, particularly when dealing with intricate datasets from diverse sources and incomplete information.
The main contributions of this study are discussed below.
• The primary aim is to introduce a novel machine-learning approach for predicting nitrate (NO3) pollution levels in the Cauvery Delta region.
• The method proposed in this study leverages a grey wolf optimization algorithm to select relevant features from the dataset.
• Stacking, an ensemble machine learning technique, is employed for the classification task. This approach combines predictions from multiple base learners to enhance prediction accuracy. Each base classifier is trained to predict the class of the reference data, and the final model prediction is generated by the meta-learner.
• Lastly, a comparative analysis was conducted between the proposed technique and state-of-the-art methods to showcase the algorithm's effectiveness.
The rest of the paper is structured as follows: Section 2 discusses techniques for predicting water quality. Section 3 gives an introduction to GWO with Feature selection and Stacked Ensemble for water quality prediction, while Section 4 presents the results and discussion. Section 5 concludes the article.
Many research investigations have been conducted to predict nitrate contamination in rivers in India and other countries. For example, Wagh et al. (2017) proposed a technique using ANN to predict nitrate concentrations in the Kadava River catchment. They collected data from 40 groundwater monitoring wells in the Nashik district and achieved an R2 value of 0.75, indicating good model performance. However, the dataset size is very small. Rodriguez-Galiano et al. (2018) developed CART, RF, and SVM models to predict the relevance of characteristics associated with nitrate-related groundwater contamination. This research utilized data gathered from remote sensing technology. Embedded, filter, and wrapper techniques were used to evaluate feature importance. The RF-SSFS method performed better than other methods, with an AUC of 0.92. However, this study is limited to a particular area of focus. Benzer et al. (2018) created an ANN model for predicting nitrate concentrations in surface waters in a river basin in Turkey. They gathered data from 30 stations in the Yeşilırmak Watershed. The ANN model successfully predicted nitrate levels for 2020 and 2030, staying within safe drinking water standards. However, the study's limitation is that it was not tested for applicability to other regions or different contaminants. Rahmati et al. (2019) used KNN, RF, and SVM models to estimate nitrate concentration in streams in the Andimeshk-Dezful region, Iran. They used data from 114 groundwater monitoring wells in Iran and found that their RF model outperformed traditional regression models, with an R2 of 0.72 and an RMSE of 10.41. The primary limitation of this study lies in the sampling of nitrate concentrations with respect to assessing seasonal and interannual fluctuations. Knoll et al. (2019) studied different artificial intelligence methods to predict nitrate levels in groundwater in Hesse, Germany.
They found that a combination of machine learning models using GBR performed best, with an R2 of 0.75 and RMSE of 9.38 mg/l, surpassing individual models like RF, SVR, and KNN. The findings offer useful tools for water managers to forecast and control groundwater nitrate pollution, supporting environmental planning and sustainable groundwater management. However, the system they developed did not enhance accuracy. Jafari et al. (2019) created four machine learning models to forecast Total Dissolved Solids (TDS) in the Tabriz plain aquifer. These models, including ANFIS, SVM, MLP, and GEP, were trained on a dataset of 1742 groundwater samples collected from 2002 to 2012, which included various physicochemical parameters. The GEP model outperformed the others with the lowest RMSE (58.93) and the highest correlation coefficient (R = 0.998), indicating a very accurate prediction of TDS values. However, these machine learning techniques did not reduce the time complexity as expected.
Band et al. (2020) studied four machine learning models (BANN, Cubist, RF, and SVM) to predict nitrate levels in the Marvdasht watershed, Iran. They analyzed data from 67 groundwater monitoring wells and discovered that the RF model outperformed other methods with an R2 of 0.89, compared to Cubist (0.87), SVM (0.74), and Bayesian-ANN (0.79). Bedi et al. (2020) compared three ML methods (ANN, XGB, and SVM) for predicting nitrate and pesticide contamination in agricultural groundwater resources. The models were assessed using a dataset consisting of 303 wells across 12 Midwestern states in the USA. The XGB model performed the best, with an RMSE value of 3.91. However, a significant limitation of this study is the scarcity of labeled data for training advanced models, which poses a challenge that requires attention in the future. Hà et al. (2020) designed an RF model to estimate nitrate and phosphorus concentrations in the Tri An reservoir. They gathered data every two months from 2009 to 2014, including parameters like TSS, TDS, COD, BOD5, EC, and turbidity. The findings demonstrated the RF model outperformed the traditional statistical methods, with an R2 value of 0.92. However, the study's limitation is that it focused solely on the specific region of the Tri An reservoir. Latif et al. (2020) developed an ANN model to forecast nitrate levels in the Feitsui reservoir in Taiwan. They input dissolved oxygen (DO), ammonium (NH3), phosphate (PO4), nitrogen dioxide (NO2), and nitrate (NO3) parameters. The study revealed that the ANN model outperformed conventional methods, achieving an accuracy score of 0.94. However, a limitation of this study was the use of only five parameters. Stamenković et al. (2020) developed ANN and MLR models to predict the concentration of nitrates in river water. They used data from ten monitoring stations along the Danube River in Serbia from 2011 to 2016. 
The ANN models demonstrated good predictive ability, with a mean absolute error of 0.53 and 0.42 mg/L for the test data. However, a limitation of the study was that out of 26 parameters, only 8 showed significant deviations from the skewness and kurtosis limit values.
Alizamir et al. (2021) introduced a hybrid Bat-ELM model to forecast daily chlorophyll-a (Chl-a) levels in rivers. They used data from two USGS stations with input variables such as turbidity, pH, specific conductance, water temperature, and periodicity. The model achieved an R2 value of 0.89. However, the key factors influencing chlorophyll-a concentration can differ based on the particular ecosystem. Pham et al. (2021) used three machine learning methods (ANN, ANFIS, and GMDH) to estimate Water Quality Index (WQI) in surface wetlands. They monitored water quality parameters like conductivity, suspended solids, BOD, ammonia, COD, dissolved oxygen, temperature, pH, phosphate, nitrite, and nitrate at seventeen wetland points over 14 months. The ANFIS method performed the best, with a low MAE of 0.0219 and a high NSE of 0.96. However, deep neural networks did not incorporate prior knowledge effectively, leading to lower prediction accuracy and longer processing times. Lu et al. (2022) developed GBRT, LSTM, and RF models to predict total phosphorus and nitrogen concentrations in Taihu Lake. The data used for the study were gathered between 2011 and 2018, focusing on the highest monthly amounts of nitrogen and phosphorus within the lake. Results showed that the LSTM performed better than the other models, with an RMSE value of 0.11. Ottong et al. (2022) introduced four machine learning models (LR, SVM, RF, GBM) to forecast arsenic contamination risk in the Red River Delta. They used 512 data points with 38 hydrochemical parameters from 2005 to 2007. The top-performing model was GBM, achieving high accuracy, precision, sensitivity, and specificity at 98.7%, 100%, 95.2%, and 100%, respectively. One drawback of this study is the limited amount of data. Having more data is important for creating improved models that can better adapt to different situations.
Hu et al. (2023) developed an XGB model to forecast nitrogen and phosphorus levels in Taihu Lake using 13 years of historical data. The model utilized water quality and meteorological data, achieving R2 values of 0.91 and 0.95. Sulaiman et al. (2023) compared seven machine learning models to predict nitrate concentrations for a spectroscopic dataset, with the RF-PCA hybrid method performing the best at 92.7% accuracy. However, they did not choose specific features to simplify the prediction process and reduce spatial complexity. Liang et al. (2024) developed four machine learning models (GB, RF, XGB, AD) to predict nitrogen levels in Chongqing city. They analyzed 595 groundwater samples using various predictors like topography, remote sensing, hydrogeological data, climate factors, nitrate input, and socio-economic information. The GB model performed the best with an R2 of 0.627, MAE of 0.529, RMSE of 0.705, and PICP of 0.924. However, the study's limitation is that it was tested in a small area. Mehdaoui et al. (2024) introduced MLR and RBF-NN models to forecast nitrate levels in the Cheliff basin. They analyzed monthly data over a 10-year period. The RBF-NN model performed the best, with an R2 of 0.957. One significant limitation is that the RBF-NN model is specifically suited to this particular location. The overall literature review is summarized in Table 1, which organizes the articles concisely according to selected criteria.
Table 1 Literature Survey
Paper | Machine Learning Model | Performance Metrics | Key Findings | Limitation |
---|---|---|---|---|
Wagh et al. (2017) | ANN | R2= 0.75 | ANN model outperformed other methods in predicting nitrate concentrations in the Kadava River catchment | Small dataset size |
Rodriguez-Galiano et al. (2018) | CART, RF, SVM | AUC = 0.92 | RF-SSFS method outperformed others in nitrate-related groundwater contamination | Limited to a specific area |
Benzer et al. (2018) | ANN | Accuracy = 96 | ANN model effectively predicted nitrate concentrations in surface waters in a river basin in Turkey. | The application of the model has not been tested in other regions. |
Rahmati et al. (2019) | KNN, RF, SVM | R2 = 0.72, RMSE = 10.41 | RF model outperformed traditional regression models in estimating nitrate concentration in streams in Iran. | The model depends on assessing seasonal and interannual fluctuations of the nitrate concentrations. |
Knoll et al. (2019) | GBR, CART, MLR, RF | R2= 0.75 | GBR could more accurately estimate nitrate levels | The System did not enhance accuracy |
Jafari et al. (2019) | ANFIS, SVM, MLP, GEP | RMSE = 58.93, R = 0.998 | GEP model provided accurate TDS prediction in Tabriz plain aquifer | Machine learning techniques did not reduce time complexity |
Band et al. (2020) | BANN, Cubist, RF, SVM | R2 = 0.89 | RF model outperformed others in Marvdasht watershed, Iran | Limited to a specific region |
Bedi et al. (2020) | ANN, XGB, SVM | RMSE = 3.91 | XGB model excelled in predicting nitrate and pesticide contamination | Scarcity of labeled data for training advanced models. |
Hà et al. (2020) | RF | R2 = 0.92 | RF model performed better in estimating nitrate and phosphorus concentrations in Tri An reservoir | Focused solely on the Tri An reservoir. |
Latif et al. (2020) | ANN | Accuracy score = 0.94 | ANN model was superior in forecasting nitrate levels in Feitsui reservoir, Taiwan | Limited to five input parameters |
Stamenković et al. (2020) | ANN, MLR | MAE = 0.53 | ANN models showed good predictive ability for nitrates in river water | Limited significant deviations in parameters |
Alizamir et al. (2021) | Hybrid Bat-ELM | R2= 0.89 | Hybrid model effectively predicted daily chlorophyll-a concentration in rivers. | Key factors influencing chlorophyll-a concentration may vary by ecosystem |
Pham et al. (2021) | ANN, ANFIS, GMDH | MAE = 0.0219, NSE = 0.96 | ANFIS method excelled in estimating Water Quality Index in surface wetlands | Deep neural networks lacked effective incorporation of prior knowledge |
Lu et al. (2022) | GBRT, LSTM, RF | RMSE = 0.11 | LSTM model performed best in predicting total phosphorus and nitrogen concentrations in Taihu Lake | Focused on monthly data, limited temporal scope |
Ottong et al. (2022) | LR, SVM, RF, GBM | Accuracy = 98.7%, Precision = 100%, Sensitivity = 95.2%, Specificity = 100% | GBM model effectively forecasted arsenic contamination risk in the Red River Delta | Limited data points for model training |
Hu et al. (2023) | XGB | R2= 0.91 | XGB model effectively predicted nitrogen and phosphorus concentrations in Taihu Lake. | - |
Sulaiman et al. (2023) | KNN, SVM, DT, NB, RF, GB, XGB | Accuracy: 92.8% | RF-PCA hybrid method outperformed other models in predicting nitrate concentrations for hydroponic plants. | Limited Input Size |
Liang et al. (2024) | GB | R2= 0.627, MAE: 0.529, RMSE: 0.705 | Developed models to predict nitrogen levels in Chongqing city using various predictors | Tested in a small area |
Mehdaoui et al. (2024) | RBF-NN | Accuracy = 0.957 | Introduced MLR and RBF-NN models to forecast nitrate levels in the Cheliff basin | This model is specifically suited to this location |
As the literature review shows, existing machine learning approaches to water pollution prediction still have notable limitations, such as small datasets, region-specific models, and limited gains in accuracy or processing time. In this study, a novel machine learning technique known as GWO-stacked ensemble learning is applied to forecast nitrate contamination, as described below. The main objective of this work is to improve the accuracy and speed of nitrate contamination prediction by using stacked ensemble learning approaches. Stacked generalization is an approach that allows researchers to combine several prediction algorithms into one. Figure 1 depicts the workflow for this study. The experiment comprises several steps. First, the dataset was obtained from the Tamilnadu Pollution Control Board (TNPCB). It contains water-quality data from the Cauvery River and incorporates all the important water quality indicators. Second, the data are pre-processed, which typically entails converting raw data into an informative format. This is a crucial stage because datasets may contain errors, missing data, redundancy, and noise. The next phase involves extracting relevant features via feature selection using GWO. The advantages of feature selection include improving prediction accuracy, removing duplicate data from the dataset, and reducing the number of features without losing essential information. Several machine learning models, such as DT, KNN, MLP, and RF, are then compared. Because each model has different classification skills, selecting the best combination of models is a difficult task in the research process. Finally, the results were assessed using several performance metrics: accuracy, precision, sensitivity, specificity, F1-score, ROC, and MCC.
This water quality dataset of the Cauvery River was collected by the TNPCB between 2018 and 2019. The dataset contains 792 samples and 26 features. The samples were taken at 33 monitoring sites in the Cauvery River catchment area. The water quality characteristics are described in Table 2. Data cleaning and labeling are performed before the data are used. In addition, Z-score normalization is applied in the data pre-processing step so that all features are on a comparable scale. A 70–30 train-test split was used to validate the models.
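The Z-score step mentioned above can be illustrated with a minimal sketch. The sample values and the helper name `zscore_columns` are hypothetical and are not taken from the TNPCB dataset; the sketch only assumes the usual definition, (value − mean) / standard deviation, applied per feature.

```python
from statistics import mean, pstdev

def zscore_columns(rows):
    """Standardize each column to zero mean and unit (population) variance."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Constant columns have zero spread; use 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Hypothetical readings; columns are [NO3, pH, Cl].
samples = [
    [12.0, 7.1, 180.0],
    [48.0, 7.9, 260.0],
    [30.0, 6.8, 210.0],
    [55.0, 8.2, 300.0],
]
scaled = zscore_columns(samples)
```

After scaling, every feature has mean 0 and standard deviation 1, so distance-based learners such as KNN are not dominated by features with large numeric ranges.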
Table 2 Attribute of water quality dataset
Variable | Description | Bureau of Indian Standards limit |
---|---|---|
NO3 | Nitrate | 10 |
pH | Potential of Hydrogen | 6.5-8.5 |
Cl | Chloride | 250 |
BOD | Biological oxygen demand | Not mentioned |
DO | Dissolved Oxygen | Not mentioned |
FC | Fecal coliforms | 0.2 |
TC | Total coliforms | Not mentioned |
Tu | Turbidity | Not mentioned |
Pa | Phenolphthalein Alkalinity | Not mentioned |
Tal | Total Alkalinity | 200 |
EC | Electrical conductivity | Not mentioned |
N | Nitrogen | 4 |
COD | Chemical Oxygen Demand | Not mentioned |
NH3 | Ammonia | 50 |
Ca | Calcium | 75 |
Th | Total hardness | 300 |
K | Potassium | 0.4 |
Mg | Magnesium | 30 |
SO4 | Sulphate | 200 |
Na | Sodium | 4 |
TDS | Total Dissolved Solids | 500 |
PO4 | Phosphate | Not mentioned |
TFS | Total Fixed Solids | 500 |
Br | Boron | 0.3 |
TSS | Total Suspended Solids | 500 |
F | Fluoride | 1 |
Grey wolf optimization (GWO) was proposed by Mirjalili et al. (2014) and has been reported to outperform other optimization algorithms such as differential evolution (DE), the gravitational search algorithm (GSA), the genetic algorithm (GA), and particle swarm optimization (PSO). GWO has been applied in many real-world applications because of its superior search ability and its use of three leading solutions to guide the search towards a global optimum (Ullah et al., 2022). This algorithm is used in a variety of applications, including wind turbines (Yang et al., 2017), feature selection (Al-Tashi et al., 2019), and image classification (Raju et al., 2018).
The algorithm is based on the social hierarchy and hunting behavior of grey wolves in the wild. The grey wolf pack has a rigid social structure comprising alpha (α), beta (β), delta (δ), and omega (ω) wolves (Figure 2). As pack leader, the alpha wolf assigns tasks to the other wolves. The beta wolf acts as a bridge between the alpha wolf and the other wolves in the pack, and its position can help the other wolves explore new regions in the search space. Delta wolves are called the heart of the pack, and their main job is hunting. The omega wolves are at the bottom of the pack and mostly serve as babysitters. Figure 3 is a flowchart explaining the operation of GWO.
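In the standard GWO formulation (Mirjalili et al., 2014), the grey wolf position vector may be defined as

$$\vec{X}(z+1) = \vec{X}_p(z) - \vec{A} \cdot \vec{D}$$

In GWO, the hunting process behavior is described as follows:

$$\vec{D} = \left| \vec{C} \cdot \vec{X}_p(z) - \vec{X}(z) \right|$$

where z is the current iteration, $\vec{X}_p$ is the position of the prey, and $\vec{X}$ is the position of a grey wolf. The coefficient vectors are

$$\vec{A} = 2\vec{a} \cdot \vec{s}_1 - \vec{a}, \qquad \vec{C} = 2\vec{s}_2$$

where $\vec{s}_1$ and $\vec{s}_2$ are random vectors in [0, 1] and the components of $\vec{a}$ decrease linearly from 2 to 0 over the iterations.

The positions of the grey wolves are adjusted according to their positions relative to the alpha, beta, and delta wolves in the hunting area. Figure 4 illustrates the updated positions of grey wolves during hunting.

$$\vec{X}(z+1) = \frac{\vec{X}_1 + \vec{X}_2 + \vec{X}_3}{3}$$

where

$$\vec{X}_1 = \vec{X}_\alpha - \vec{A}_1 \cdot \vec{D}_\alpha, \qquad \vec{X}_2 = \vec{X}_\beta - \vec{A}_2 \cdot \vec{D}_\beta, \qquad \vec{X}_3 = \vec{X}_\delta - \vec{A}_3 \cdot \vec{D}_\delta$$

$$\vec{D}_\alpha = \left| \vec{C}_1 \cdot \vec{X}_\alpha - \vec{X} \right|, \qquad \vec{D}_\beta = \left| \vec{C}_2 \cdot \vec{X}_\beta - \vec{X} \right|, \qquad \vec{D}_\delta = \left| \vec{C}_3 \cdot \vec{X}_\delta - \vec{X} \right|$$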
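To make the hunting behaviour concrete, the following is a minimal sketch of GWO used for feature selection. It follows the standard GWO update (leader-guided estimates averaged over alpha, beta, and delta) and assumes a simple binarization in which a feature is selected when its coordinate exceeds 0.5. The function names, the threshold, and the toy fitness function are illustrative assumptions, not the configuration used in the study.

```python
import random

def gwo_feature_selection(fitness, n_features, n_wolves=8, n_iter=30, seed=0):
    """Basic grey wolf optimizer on [0, 1]^n; returns the best mask and its fitness."""
    rng = random.Random(seed)

    def to_mask(position):
        # Assumed binarization: select feature j when coordinate j exceeds 0.5.
        return [p > 0.5 for p in position]

    wolves = [[rng.random() for _ in range(n_features)] for _ in range(n_wolves)]
    best_mask, best_score = None, float("-inf")

    for z in range(n_iter + 1):
        # Rank the pack; the three fittest wolves act as alpha, beta and delta.
        ranked = sorted(wolves, key=lambda w: fitness(to_mask(w)), reverse=True)
        leaders = [list(w) for w in ranked[:3]]
        alpha_score = fitness(to_mask(leaders[0]))
        if alpha_score > best_score:
            best_mask, best_score = to_mask(leaders[0]), alpha_score
        if z == n_iter:
            break  # final evaluation only, no further movement

        a = 2.0 * (1 - z / n_iter)  # control parameter, decreases linearly from 2
        for wolf in wolves:
            for j in range(n_features):
                estimate = 0.0
                for leader in leaders:
                    A = 2 * a * rng.random() - a
                    C = 2 * rng.random()
                    D = abs(C * leader[j] - wolf[j])
                    estimate += leader[j] - A * D
                # New coordinate: mean of the three estimates, clipped to [0, 1].
                wolf[j] = min(1.0, max(0.0, estimate / 3))
    return best_mask, best_score


# Hypothetical fitness: rewards three "informative" features, penalizes subset size.
INFORMATIVE = {0, 2, 5}

def toy_fitness(mask):
    return sum(mask[i] for i in INFORMATIVE) - 0.1 * sum(mask)

mask, score = gwo_feature_selection(toy_fitness, n_features=8)
```

In a real pipeline the fitness function would typically be the cross-validated accuracy of a classifier trained on the selected columns, possibly with a penalty on the number of features; that choice is left open here.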
The machine learning techniques DT, MLP, KNN, RF, and Stacked ensemble were used to predict water quality to accomplish this objective.
Decision Tree (DT): The DT has three distinct components (inner nodes, branches, and leaf nodes) that function similarly to a traditional tree. Each inner node tests a variable, each branch represents an outcome of the test, and each leaf node contains a class label. An entropy-based criterion is employed to select the variable that will serve as the root of the decision tree. The data are then divided into multiple subsets based on the values of the test attribute, and the procedure is repeated recursively for each subset until all subsets are resolved. This recursive partitioning separates the population into subpopulations depending on dichotomous variables, yielding a decision tree that assigns each sample to a class (Myles et al., 2004).
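The entropy criterion can be illustrated with a short sketch that scores a candidate split by its information gain. The nitrate readings, labels, and thresholds below are hypothetical values chosen only for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting on values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    weighted = sum(
        len(part) / len(labels) * entropy(part) for part in (left, right) if part
    )
    return entropy(labels) - weighted

# Hypothetical nitrate readings with labels: 1 = contaminated, 0 = acceptable.
no3 = [12.0, 30.0, 41.0, 48.0, 55.0, 60.0]
label = [0, 0, 0, 1, 1, 1]
gain_at_45 = information_gain(no3, label, threshold=45.0)
gain_at_20 = information_gain(no3, label, threshold=20.0)
```

A tree builder would evaluate many candidate thresholds per feature and keep the split with the largest gain; here the split at 45 separates the two classes completely, so it scores higher than the split at 20.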
KNN Algorithm: KNN is a lazy machine learning technique that can be utilized for classification and regression problems. This algorithm is widely used in data mining, pattern recognition, and intrusion detection. It uses distance calculations to make predictions for new samples based on previously observed data. The most commonly used measures for this calculation are the Euclidean distance, the Mahalanobis distance, and the city-block distance. The K nearest neighbors are the known points closest to the test sample, and their labels determine the prediction. The advantages of KNN classification are its simplicity and its non-parametric nature.
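A minimal sketch of KNN classification with two of the distance measures mentioned above (Euclidean and city-block) is shown below. The training points are hypothetical, already-standardized two-feature samples, and the helper names are illustrative.

```python
from collections import Counter
from math import dist  # Euclidean distance between two points (Python 3.8+)

def cityblock(p, q):
    """City-block (Manhattan) distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))

def knn_predict(train_X, train_y, query, k=3, metric=dist):
    """Predict the majority label among the k training points nearest to query."""
    nearest = sorted(
        zip(train_X, train_y), key=lambda pair: metric(pair[0], query)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical standardized samples: label 1 = contaminated, 0 = acceptable.
train_X = [(-1.0, -0.8), (-0.9, -1.1), (-0.7, -0.6), (0.8, 1.0), (1.1, 0.9), (0.9, 0.7)]
train_y = [0, 0, 0, 1, 1, 1]

pred_euclidean = knn_predict(train_X, train_y, (1.0, 0.8), k=3)
pred_cityblock = knn_predict(train_X, train_y, (-0.8, -0.9), k=3, metric=cityblock)
```

No model is fitted in advance: all the work happens at prediction time, which is why KNN is described as a lazy learner.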
Multi-layer perceptron (MLP): MLP is a kind of feed-forward ANN composed of multiple layers of perceptrons. An input layer, one or more hidden layers, and an output layer are the three components of an MLP. The MLP uses forward propagation to transmit data from the input nodes to the output nodes. The learning capacity of MLP is determined by the connection weights. The performance of the network increases over time by repeatedly adjusting the connection weights (Atangana et al., 2020; Joy et al., 2020). MLP is a supervised ML technique that is mostly used to classify patterns (Guo et al., 2020).
Random Forest: RF is a supervised ML technique for regression and classification. It comprises several decision trees built with the bagging (bootstrap aggregating) approach. As an ensemble learning technique, random forest solves complex problems and increases accuracy by merging individual models. The majority vote of all trees determines the final classification outcome (Chen et al., 2020).
Stacking is an ensemble learning approach that integrates multiple base models to improve overall prediction performance. It combines models at a higher level than techniques such as bagging and boosting, which focus on creating multiple models with different random subsets of the data or on modifying the weights of training examples. The basic idea behind stacking is to use a set of diverse base models that are trained on different subsets of the data, using different algorithms and hyperparameters. Each base model makes its predictions, which are then combined by the meta-model to produce a final output. Figure 5 depicts the general form of the proposed stacked ensemble model. Random forest, multi-layer perceptron, decision tree, and KNN are the base models used in this study.
The pseudo-code for the stacked ensemble technique is given below
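As an illustration of the stacking idea, the sketch below builds a stacked classifier with scikit-learn from the four base learners named in the text. It is not the authors' pseudo-code: the synthetic data, the hyperparameters, the scaling pipelines, and the choice of logistic regression as meta-learner are all assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the water quality data (the TNPCB dataset is not used here).
X, y = make_classification(n_samples=300, n_features=9, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Base learners named in the text; hyperparameters are illustrative assumptions.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
    ("mlp", make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0),
    )),
]

# Assumed meta-learner: logistic regression trained on cross-validated base predictions.
stack = StackingClassifier(
    estimators=base_learners, final_estimator=LogisticRegression(), cv=5
)
stack.fit(X_train, y_train)
test_accuracy = stack.score(X_test, y_test)
```

With `cv=5`, the meta-learner is trained on out-of-fold predictions of the base learners, which limits leakage of the training labels into the second level.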
The following technique was performed for a hybrid GWO-stacked ensemble
Step 1: Collect the Cauvery River data from the Tamilnadu Pollution Control Board.
Step 2: Data pre-processing techniques are implemented using Z score normalization.
Step 3: The GWO feature selection approach is used to extract the essential features from the dataset.
Step 4: Divide the dataset into train and test sets.
Step 5: Training samples are analyzed using the stacked ensemble classification algorithm.
Step 6: The trained classifier is used on experimental data samples to predict whether NO3 contamination is at an acceptable level or not.
Step 7: Finally, the results recommend a suitable model for the prediction of NO3.
All experiments in this study were conducted with Python using the Jupyter Notebook framework on a Dell laptop, Intel Core™ i5-10210U CPU @ 1.60 GHz and 16 GB RAM. Pandas, NumPy, and Matplotlib libraries were used. The performance of the GWO-stacked model was evaluated using the following metrics, represented mathematically as follows.
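For reference, the standard confusion-matrix definitions of these metrics are given below, where TP, FP, FN, and TN denote true positives, false positives, false negatives, and true negatives. It is an assumption that the study uses exactly these forms.

```latex
\begin{align}
\text{Accuracy}    &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision}   &= \frac{TP}{TP + FP} \\
\text{Sensitivity} &= \frac{TP}{TP + FN} \\
\text{Specificity} &= \frac{TN}{TN + FP} \\
\text{F1-score}    &= \frac{2 \cdot \text{Precision} \cdot \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \\
\text{MCC}         &= \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\end{align}
```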
Feature selection is a crucial step in the machine learning pipeline. It involves selecting a subset of relevant features from the original dataset to improve the performance and interpretability of the model. To show the superiority of the proposed GWO stacked ensemble method, four standalone ML models were also tested, and their performances were compared with those of the GWO stacked ensemble. To speed up the sensitivity analyses, the experiments were performed with different input variables, i.e., BOD, Ca, Cl, K, Mg, Na, NH3, N, and SO4. The best prediction accuracies were obtained with the GWO stacked ensemble.
The experimental results of the GWO stacked ensemble method are evaluated in comparison with different machine learning techniques such as DT, KNN, MLP, and RF. The GWO stacked ensemble method is tested in the Python environment using the Cauvery River data obtained from TNPCB. The confusion matrix results, i.e., True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN), are shown in Table 3. These values were used to determine the classification performance on the test data. The GWO stacked ensemble method had the highest TP and the lowest FN of 76 and 6, respectively. In addition, the negative class was also well predicted, with a TN and FP of 75 and 4, respectively. This indicates that the GWO stacked ensemble technique classifies the test data better than the other methods.
Table 3 Confusion matrix result for test data
S.NO | Classifier | TP | FP | FN | TN | Accuracy |
---|---|---|---|---|---|---|
1 | GWO-Stacked Ensemble (Proposed) | 76 | 4 | 6 | 75 | 0.93 |
2 | RF | 73 | 6 | 7 | 76 | 0.90 |
3 | DT | 71 | 8 | 9 | 73 | 0.88 |
4 | MLP | 72 | 10 | 10 | 69 | 0.87 |
5 | KNN | 70 | 10 | 11 | 69 | 0.86 |
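The way such counts translate into performance metrics can be sketched as follows, using the standard confusion-matrix definitions and the counts listed for the proposed model in Table 3. The function name is illustrative.

```python
from math import sqrt

def classification_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "mcc": (tp * tn - fp * fn)
        / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

# Counts for the GWO-Stacked Ensemble row of Table 3.
proposed = classification_metrics(tp=76, fp=4, fn=6, tn=75)
for name, value in proposed.items():
    print(f"{name}: {value:.4f}")
```

The same function can be applied to the other rows of Table 3 to compare the classifiers on a common footing.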
Table 4 shows that our proposed model performs well compared to all other models in terms of other performance parameters. For the Matthews correlation coefficient (MCC), the RF classifier achieved the second-highest score of 80%. The findings demonstrated no significant differences between DT and MLP, with precision and specificity ratings of 86% and 83%, respectively. However, the KNN classifier has the lowest accuracy, sensitivity, F1-score, and ROC scores, with values of 85%, 89%, 86%, and 90%, respectively. Figure 6 shows the performance comparison of each model. In this figure, the GWO-stacked model outperforms the other ML models on most metrics. Figure 3 shows the ROC curve of the predictive performance of all models. According to the graph, the GWO-stacked model reached the maximum value of 0.95.
Table 4 Comparison of the proposed model with the base classifiers
Classifier | Accuracy | Precision | Sensitivity | Specificity | F1-Score | MCC |
---|---|---|---|---|---|---|
GWO-Stacked Ensemble (Proposed) | 0.93 | 0.95 | 0.92 | 0.92 | 0.93 | 0.85 |
RF | 0.90 | 0.92 | 0.91 | 0.90 | 0.91 | 0.83 |
DT | 0.88 | 0.89 | 0.88 | 0.89 | 0.89 | 0.78 |
MLP | 0.87 | 0.87 | 0.86 | 0.85 | 0.87 | 0.74 |
KNN | 0.86 | 0.87 | 0.87 | 0.86 | 0.86 | 0.72 |
Table 5 presents a performance assessment of the proposed method against the base classifiers to select the best model for predicting nitrate contamination. The optimal model is determined by dividing the data into training and test sets, with split ratios varying from 60%–40% to 90%–10%. The performance is assessed using different evaluation metrics. Table 5 demonstrates the performance of the GWO-stacked algorithm compared to the other classifiers, with the strongest results obtained when the data is split at a ratio of 70:30. The outcomes of the GWO-stacked method are assessed using the seven metrics listed above, and this information is used to determine the best data separation threshold for predicting nitrate pollution in river water.
Table 5 Comparison of the performance of the proposed methods against the basic classifier using data splitting validation
Metrics | Data Split Ratio | GWO-Stacked (Proposed) | RF | DT | MLP | KNN |
---|---|---|---|---|---|---|
Accuracy | 60-40 | 91.5 | 88 | 86.5 | 85.3 | 83.55 |
Accuracy | 70-30 | 93.21 | 89.78 | 88 | 87.65 | 85.52 |
Accuracy | 80-20 | 92.98 | 89.16 | 87.05 | 87.23 | 84.89 |
Accuracy | 90-10 | 91.94 | 88.94 | 86.83 | 85.32 | 82.94 |
Precision | 60-40 | 91.56 | 85.09 | 84.78 | 84.35 | 81.36 |
Precision | 70-30 | 93 | 87.26 | 86.72 | 86.72 | 83.88 |
Precision | 80-20 | 92.13 | 86.58 | 85.09 | 85.39 | 81.36 |
Precision | 90-10 | 91.04 | 85.98 | 84.21 | 83.96 | 80 |
Sensitivity | 60-40 | 95.13 | 91.89 | 91.27 | 89.28 | 87.18 |
Sensitivity | 70-30 | 97.53 | 93.33 | 93 | 91.86 | 89.10 |
Sensitivity | 80-20 | 96.41 | 91.56 | 91.04 | 89.74 | 88.72 |
Sensitivity | 90-10 | 95.23 | 90.86 | 90.12 | 88.21 | 87.23 |
Specificity | 60-40 | 86.94 | 82.92 | 81.89 | 81.11 | 78.27 |
Specificity | 70-30 | 88.56 | 84.23 | 83.98 | 83.15 | 80 |
Specificity | 80-20 | 87.88 | 83.55 | 82.45 | 82.18 | 78.89 |
Specificity | 90-10 | 86.52 | 82.10 | 81.89 | 81.23 | 77.25 |
F1-Score | 60-40 | 92.41 | 88.72 | 86.94 | 85.36 | 85.10 |
F1-Score | 70-30 | 94.28 | 90 | 89.56 | 88.78 | 86.23 |
F1-Score | 80-20 | 93.88 | 88.72 | 87.90 | 87.45 | 85.63 |
F1-Score | 90-10 | 92.58 | 86.28 | 85.92 | 85.11 | 84.23 |
ROC | 60-40 | 93.88 | 92.12 | 90.25 | 89.41 | 88.23 |
ROC | 70-30 | 95.23 | 94.15 | 92.10 | 91 | 90.05 |
ROC | 80-20 | 94.73 | 93.78 | 91.65 | 89.41 | 88.14 |
ROC | 90-10 | 93.12 | 92.89 | 92.10 | 88.11 | 87.23 |
MCC | 60-40 | 81.56 | 77.89 | 76.23 | 75.96 | 72.18 |
MCC | 70-30 | 83 | 79.52 | 78.36 | 75.12 | 74.89 |
MCC | 80-20 | 82.16 | 78.23 | 77.65 | 73.96 | 73.98 |
MCC | 90-10 | 81.23 | 77.23 | 76.63 | 72.13 | 71.08 |
Table 6 compares the accuracy of various advanced techniques with the proposed system, showing that GWO-Stacked had the highest accuracy in predicting nitrate concentration. Among the existing studies, Latif et al. (2020) achieved the highest accuracy of 93% using ANN, while Sulaiman et al. (2023) and Bhattarai et al. (2021) reached 92.8% accuracy each. On the other hand, Alizamir et al. (2021) and Knoll et al. (2019) had models with less than 90% accuracy in predicting nitrate concentration.
Table 6 Comparison of performance between the proposed method and existing research
Author | Model | Accuracy (%) |
---|---|---|
Proposed Model | GWO-Stacked Ensemble | 93 |
Latif et al. (2020) | ANN | 93 |
Sulaiman et al. (2023) | KNN, SVM, DT, NB, RF, GB, XGB | 92.8 |
Bhattarai et al. (2021) | KNN, NB, RF, GB, SVM | 92.8 |
Alizamir et al. (2021) | Hybrid Bat-ELM | 89 |
Knoll et al. (2019) | GBR, CART, MLR, RF | 75 |
In this study, a machine learning approach called the GWO-stacked ensemble is proposed for predicting nitrate pollution in the Cauvery River Delta region. The model involves data preprocessing to handle missing values and normalization, followed by feature selection using the grey wolf optimization technique. This method efficiently selects relevant features for input into the stacked ensemble algorithm, which mitigates issues like variance and overfitting seen in single-classifier models. The GWO-stacked ensemble outperformed the DT, RF, MLP, and KNN models with an accuracy of 93%, precision of 93%, sensitivity of 97%, specificity of 88%, and F1-score of 94%. The ROC score was also highest with this technique, at 95%. Although the research achieved its goals, it is limited by its reliance on a few factors. This narrow focus helps forecast nitrate levels even when data is scarce, enhancing the usefulness of the models. Therefore, it is important for future studies to identify additional factors that could enhance the predictive power of machine learning algorithms in this specific field.
The data that support the findings of this study are available from the corresponding author, [Vellingiri. J], upon reasonable request.
No potential conflict of interest was reported by the authors.
Kalaivanan K, Vellingiri J*
School of Computer Science Engineering and Information Systems, Vellore Institute of Technology, Vellore-632014, India
Nitrate contamination in river water occurs naturally and affects millions worldwide.
Machine learning algorithms were used to predict nitrate (NO3) contamination in river water.
The study utilized a grey wolf optimization (GWO) algorithm to select relevant features from the dataset.
Models were built using a stacked ensemble and four individual machine learning algorithms.
The GWO-stacked ensemble model outperformed the others in predicting NO3 river water contamination.
The health of billions of people worldwide faces a significant threat due to the extensive pollution of rivers with high levels of nitrogen compounds, particularly ammonia and nitrate (NO3) (Bagherzadeh et al., 2021). This critical issue arises from the regular consumption of river water, which often contains elevated nitrate (NO3) concentrations, posing serious health risks. Prolonged exposure to nitrate (NO3) in drinking water can result in a range of health conditions such as blue baby syndrome, diabetes, miscarriages, stomach cancer, and thyroid disorders (Yang et al., 2021). The detrimental impact of these health hazards is substantial, contributing to a significant portion of global diseases and cancers (Chen et al., 2017). As a result, researchers globally are actively exploring innovative approaches to address and mitigate the consequences of river water contamination (Kumar et al., 2020).
Tamil Nadu, a rapidly growing state projected to become the third most populous with over 8 million residents, faces significant water challenges. Keerthan et al. (2023) highlight that more than five million individuals in Tamil Nadu rely on the Cauvery River for their daily water requirements. However, the river water in many areas consistently exceeds the permissible nitrate (NO3) limit of 45 mg/L (BIS, 2012) throughout the year. The Cauvery River delta, a vital agricultural region, grapples with heightened nitrate (NO3) levels attributed to extensive nitrogen absorption from farming practices. Human-induced factors like agricultural runoff, sewage plant discharges, and nitrogenous waste oxidation in humans and animals are key contributors to the elevated nitrate (NO3) concentrations in the Cauvery Delta region.
The regions within the Cauvery River delta exhibiting elevated nitrate (NO3) levels also demonstrate increased concentrations of Ca, Cl, K, Mg, and Na, alongside reduced levels of SO4 (RamyaPriya et al., 2023; Tamilmani et al., 2023). Predicting nitrate (NO3) levels accurately in river systems poses a significant challenge for environmental engineers due to the complex interplay of various factors. In response to this challenge, recent advancements in machine learning and deep learning techniques have shown promise in environmental science risk prediction. These advanced techniques excel in unravelling intricate relationships within vast datasets, handling complex patterns, and adapting continuously, offering a more robust approach compared to traditional statistical methods.
Several machine-learning methods play a crucial role in predicting river water quality, including artificial neural networks (He et al., 2011), adaptive network-based fuzzy inference systems (Azad et al., 2018), decision trees (Lu et al., 2022), random forests (Wheeler et al., 2015), and support vector machines (Arabgol et al., 2016). Despite the effectiveness of these techniques in water quality prediction, their application in assessing nitrate (NO3) contamination risks remains limited, lacking an integrated approach. To address this gap, a novel framework is proposed in this study to comprehensively evaluate the risk of nitrate (NO3) pollution. The framework focuses on developing a water quality assessment system that predicts contamination by selecting significant features to enhance classification accuracy, improve detection quality, and reduce processing time. The feature selection process relies heavily on Grey Wolf Optimization (GWO) due to its robustness and ability to identify relevant features efficiently. GWO aligns well with practical engineering challenges as it is simple, fast, precise, and easy to implement (Sharma et al., 2023). Additionally, the study introduces stacked machine learning techniques to enhance the accuracy of nitrate (NO3) contamination prediction, particularly when dealing with intricate datasets from diverse sources and incomplete information.
The main contributions of this study are discussed below.
• The primary aim is to introduce a novel machine-learning approach for predicting nitrate (NO3) pollution levels in the Cauvery Delta region.
• The method proposed in this study leverages a grey wolf optimization algorithm to select relevant features from the dataset.
• Stacking, an ensemble classifier machine learning technique, is employed for task classification. This approach combines predictions from multiple base learners to enhance prediction accuracy. Each base classifier is trained to predict the reference data class, and the final model prediction is generated by the meta-learner.
• Lastly, a comparative analysis was conducted between the proposed technique and state-of-the-art methods to showcase the algorithm's effectiveness.
The rest of the paper is structured as follows: Section 2 discusses techniques for predicting water quality. Section 3 gives an introduction to GWO with Feature selection and Stacked Ensemble for water quality prediction, while Section 4 presents the results and discussion. Section 5 concludes the article.
Many research investigations have been conducted to predict nitrate contamination in rivers in India and other countries. For example, Wagh et al. (2017) proposed a technique using ANN to predict nitrate concentrations in the Kadava River catchment. They collected data from 40 groundwater monitoring wells in the Nashik district and achieved an R2 value of 0.75, indicating a good model performance. However, the dataset size is very small. Rodriguez-Galiano et al. (2018) developed CART, RF, and SVM models to predict the relevance of characteristics associated with nitrate-related groundwater contamination. This research utilized data gathered from remote sensing technology. Embedded, filter, and wrapper techniques were used to evaluate the importance of the features. The RF-SSFS method performed better than other methods, with an AUC of 0.92. However, this study is limited to a particular area of focus. Benzer et al. (2018) created an ANN model for predicting nitrate concentrations in surface waters in a river basin in Turkey. They gathered data from 30 stations in the Yeşilırmak Watershed. The ANN model successfully predicted nitrate levels for 2020 and 2030, staying within safe drinking water standards. However, the study's limitation is that it was not tested for applicability to other regions or different contaminants. Rahmati et al. (2019) used KNN, RF, and SVM models to estimate nitrate concentration in streams in the Andimeshk-Dezful region, Iran. They used data from 114 groundwater monitoring wells in Iran and found that their RF model outperformed traditional regression models, with an R2 of 0.72 and an RMSE of 10.41. The primary limitation of this study is based on the sampling of nitrate concentrations, assessing seasonal and interannual fluctuations in the concentrations. Knoll et al. (2019) studied different artificial intelligence methods to predict nitrate levels in groundwater in Hesse, Germany. They found that a combination of machine learning models using GBR performed best, with an R2 of 0.75 and RMSE of 9.38 mg/l, surpassing individual models like RF, SVR, and KNN. The findings offer useful tools for water managers to forecast and control groundwater nitrate pollution, supporting environmental planning and sustainable groundwater management. However, the system they developed did not enhance accuracy. Jafari et al. (2019) created four machine learning models to forecast Total Dissolved Solids (TDS) in the Tabriz plain aquifer. These models, including ANFIS, SVM, MLP, and GEP, were trained on a dataset of 1742 groundwater samples collected from 2002 to 2012, which included various physicochemical parameters. The GEP model outperformed the others with the lowest RMSE (58.93) and the highest correlation coefficient (R = 0.998), indicating a very accurate prediction of TDS values. However, these machine learning techniques did not reduce the time complexity as expected.
Band et al. (2020) studied four machine learning models (BANN, Cubist, RF, and SVM) to predict nitrate levels in the Marvdasht watershed, Iran. They analyzed data from 67 groundwater monitoring wells and discovered that the RF model outperformed other methods with an R2 of 0.89, compared to Cubist (0.87), SVM (0.74), and Bayesian-ANN (0.79). Bedi et al. (2020) compared three ML methods (ANN, XGB, and SVM) for predicting nitrate and pesticide contamination in agricultural groundwater resources. The models were assessed using a dataset consisting of 303 wells across 12 Midwestern states in the USA. The XGB model performed the best, with an RMSE value of 3.91. However, a significant limitation of this study is the scarcity of labeled data for training advanced models, which poses a challenge that requires attention in the future. Hà et al. (2020) designed an RF model to estimate nitrate and phosphorus concentrations in the Tri An reservoir. They gathered data every two months from 2009 to 2014, including parameters like TSS, TDS, COD, BOD5, EC, and turbidity. The findings demonstrated that the RF model outperformed the traditional statistical methods, with an R2 value of 0.92. However, the study's limitation is that it focused solely on the specific region of the Tri An reservoir. Latif et al. (2020) developed an ANN model to forecast nitrate levels in the Feitsui reservoir in Taiwan. They input dissolved oxygen (DO), ammonium (NH3), phosphate (PO4), nitrogen dioxide (NO2), and nitrate (NO3) parameters. The study revealed that the ANN model outperformed conventional methods, achieving an accuracy score of 0.94. However, a limitation of this study was the use of only five parameters. Stamenković et al. (2020) developed ANN and MLR models to predict the concentration of nitrates in river water. They used data from ten monitoring stations along the Danube River in Serbia from 2011 to 2016. The ANN models demonstrated good predictive ability, with a mean absolute error of 0.53 and 0.42 mg/L for the test data. However, a limitation of the study was that out of 26 parameters, only 8 showed significant deviations from the skewness and kurtosis limit values.
Alizamir et al. (2021) introduced a hybrid Bat-ELM model to forecast daily chlorophyll-a (Chl-a) levels in rivers. They used data from two USGS stations with input variables such as turbidity, pH, specific conductance, water temperature, and periodicity. The model achieved an R2 value of 0.89. However, the key factors influencing chlorophyll-a concentration can differ based on the particular ecosystem. Pham et al. (2021) used three machine learning methods (ANN, ANFIS, and GMDH) to estimate Water Quality Index (WQI) in surface wetlands. They monitored water quality parameters like conductivity, suspended solids, BOD, ammonia, COD, dissolved oxygen, temperature, pH, phosphate, nitrite, and nitrate at seventeen wetland points over 14 months. The ANFIS method performed the best, with a low MAE of 0.0219 and a high NSE of 0.96. However, deep neural networks did not incorporate prior knowledge effectively, leading to lower prediction accuracy and longer processing times. Lu et al. (2022) developed GBRT, LSTM, and RF models to predict total phosphorus and nitrogen concentrations in Taihu Lake. The data on nitrogen and phosphorus levels in Taihu Lake were gathered between 2011 and 2018, focusing on the highest monthly amounts of these substances within the lake. Results showed that the LSTM performed better than the other models, with an RMSE value of 0.11. Ottong et al. (2022) introduced four machine learning models (LR, SVM, RF, GBM) to forecast arsenic contamination risk in the Red River Delta. They used 512 data points with 38 hydrochemical parameters from 2005 to 2007. The top-performing model was GBM, achieving high accuracy, precision, sensitivity, and specificity at 98.7%, 100%, 95.2%, and 100%, respectively. One drawback of this study is the limited amount of data. Having more data is important for creating improved models that can better adapt to different situations.
Hu et al. (2023) developed an XGB model to forecast nitrogen and phosphorus levels in Taihu lakes using 13 years of historical data. The model utilized water quality and meteorological data, achieving R2 values of 0.91 and 0.95. Sulaiman et al. (2023) compared seven machine learning models to predict nitrate concentrations from a spectroscopic dataset, with the RF-PCA hybrid method performing the best at 92.7% accuracy. However, they did not choose specific features to simplify the prediction process and reduce spatial complexity. Liang et al. (2024) developed four machine learning models (GB, RF, XGB, AD) to predict nitrogen levels in Chongqing city. They analyzed 595 groundwater samples using various predictors like topography, remote sensing, hydrogeological data, climate factors, nitrate input, and socio-economic information. The GB model performed the best with an R2 of 0.627, MAE of 0.529, RMSE of 0.705, and PICP of 0.924. However, the study's limitation is that it was tested in a small area. Mehdaoui et al. (2024) introduced MLR and RBF-NN models to forecast nitrate levels in the Cheliff basin. They analyzed monthly data over a 10-year period. The RBF-NN model performed the best, with an R2 of 0.957. One significant limitation is that the RBF-NN model is specifically suited to this particular location. The overall literature review is summarized in Table 1, which organizes multiple articles based on selected criteria in a concise manner.
Table 1 Literature survey
Paper | Machine Learning Model | Performance Metrics | Key Findings | Limitation |
---|---|---|---|---|
Wagh et al. (2017) | ANN | R2= 0.75 | ANN model outperformed other methods in predicting nitrate concentrations in the Kadava River catchment | Small dataset size |
Rodriguez-Galiano et al. (2018) | CART, RF, SVM | AUC = 0.92 | RF-SSFS method outperformed others in nitrate-related groundwater contamination | Limited to a specific area |
Benzer et al. (2018) | ANN | Accuracy = 96 | ANN model effectively predicted nitrate concentrations in surface waters in a river basin in Turkey. | The application of the model has not been tested in other regions. |
Rahmati et al. (2019) | KNN, RF, SVM | R2 = 0.72, RMSE = 10.41 | RF model outperformed traditional regression models in estimating nitrate concentration in streams in Iran. | The model depends on the assessing seasonal and interannual fluctuations of the nitrate concentrations. |
Knoll et al. (2019) | GBR, CART, MLR, RF | R2= 0.75 | GBR could more accurately estimate nitrate levels | The System did not enhance accuracy |
Jafari et al. (2019) | ANFIS, SVM, MLP, GEP | RMSE = 58.93, R = 0.998 | GEP model provided accurate TDS prediction in Tabriz plain aquifer | Machine learning techniques did not reduce time complexity |
Band et al. (2020) | BANN, Cubist, RF, SVM | R2 = 0.89 | RF model outperformed others in Marvdasht watershed, Iran | Limited to a specific region |
Bedi et al. (2020) | ANN, XGB, SVM | RMSE = 3.91 | XGB model excelled in predicting nitrate and pesticide contamination | Scarcity of labeled data for training advanced models. |
Hà et al. (2020) | RF | R2 = 0.92 | RF model performed better in estimating nitrate and phosphorus concentrations in Tri An reservoir | Focused solely on the Tri An reservoir. |
Latif et al. (2020) | ANN | Accuracy score = 0.94 | ANN model was superior in forecasting nitrate levels in Feitsui reservoir, Taiwan | Limited to five input Parameter |
Stamenković et al. (2020) | ANN, MLR | MAE = 0.53 | ANN models showed good predictive ability for nitrates in river water | Limited significant deviations in parameters |
Alizamir et al. (2021) | Hybrid Bat-ELM | R2= 0.89 | Hybrid model effectively predicted daily chlorophyll-a concentration in rivers. | Key factors influencing chlorophyll-a concentration may vary by ecosystem |
Pham et al. (2021) | ANN, ANFIS, GMDH | MAE = 0.0120.0219, NSE = 0.96 | ANFIS method excelled in estimating Water Quality Index in surface wetlands | Deep neural networks lacked effective incorporation of prior knowledge |
Lu et al. (2022) | GBRT, LSTM, RF | RMSE = 0.11 | LSTM model performed best in predicting total phosphorus and nitrogen concentrations in Taihu Lake | Focused on monthly data, limited temporal scope |
Ottong et al. (2022) | LR, SVM, RF, GBM | Accuracy = 87%, Precision = 100%, Sensitivity = 95.2%, Specificity = 100% | GBM model effectively forecasted arsenic contamination risk in the Red River Delta | Limited data points for model training |
Hu et al. (2023) | XGB | R2= 0.91 | XGB model effectively predicted nitrogen and phosphorus concentrations in Taihu lakes. | - |
Sulaiman et al. (2023) | KNN, SVM, DT, NB, RF, GB, XGB | Accuracy: 92.8% | RF-PCA hybrid method outperformed other models in predicting nitrate concentrations for hydroponic plants. | Limited Input Size |
Liang et al. (2024) | GB | R2= 0.627, MAE: 0.529, RMSE: 0.705 | Developed models to predict nitrogen levels in Chongqing city using various predictors | Tested in a small area |
Mehdaoui et al. (2024) | RBF-NN | Accuracy = 0.957 | Introduced MLR and RBF-NN models to forecast nitrate levels in the Cheliff basin | The model is specifically suited to this location |
The application of machine learning techniques to predict water pollution has been unsuccessful in many situations, as mentioned in the literature review section. In this study, a novel machine learning technique known as GWO-stacked ensemble learning is applied to forecast nitrate contamination, which is described below. The main objective of this work is to improve the accuracy and speed of nitrate contamination prediction by using stacked ensemble learning approaches. Stack generalization is an approach that allows researchers to combine several prediction algorithms into one. Figure 1 depicts the workflow for this study. There are various steps to the experiment. First, the Tamilnadu Pollution Control Board (TNPCB) provided the dataset. It incorporates all the important water quality indicators. Water-quality data from the Cauvery River was used in this study. Typically, data pre-processing entails converting raw data into an informative format. This is a very crucial stage because datasets may contain errors, missing data, data redundancy, and noise. To solve the above issue, data pre-processing steps might be required. The next phase involves extracting relevant features via feature selection approaches using GWO. The advantages of feature selection include improving prediction accuracy, removing duplicate data from the dataset, and reducing the number of features without losing essential information. The next section compares several machine learning models, such as DT, KNN, MLP, and RF. Because each model has different classification skills, selecting the best-combined models is a difficult task in the research process. Finally, the results were assessed using several performance metrics in terms of accuracy, precision, sensitivity, specificity, F1-score, ROC, and MCC values.
This water quality dataset of the Cauvery River was collected by the TNPCB between 2018 and 2019. The dataset contains 792 samples and 26 features. The samples were taken at 33 monitoring sites in the Cauvery River catchment area. The water quality characteristics are described in Table 2. Data cleaning and labeling are two steps that need to be performed before using the data. In addition, Z-score normalization is applied in the data pre-processing step to put all features on a common scale. A 70–30 train-test hold-out validation scheme was used to ensure the reliability of our test.
Table 2 Attributes of the water quality dataset
Variable | Description | Bureau of Indian Standards limit |
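The pre-processing and validation steps described here can be illustrated with a short standard-library sketch. It assumes population statistics for the Z-score and a shuffled 70–30 hold-out split. The data are random stand-ins with the same number of samples as the study, and the labeling rule (nitrate above 45 mg/L) as well as the helper names `zscore_columns` and `holdout_split` are hypothetical, chosen only for illustration. In practice, many workflows compute the scaling statistics on the training portion only, to avoid leaking test information.

```python
import random
import statistics

def zscore_columns(rows):
    """Standardize each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def holdout_split(rows, labels, test_fraction=0.30, seed=0):
    """Shuffle the sample indices and hold out `test_fraction` of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    take = lambda seq, ids: [seq[i] for i in ids]
    return (take(rows, train_idx), take(rows, test_idx),
            take(labels, train_idx), take(labels, test_idx))

# Synthetic stand-in for the 792-sample dataset (three made-up features).
rng = random.Random(42)
raw = [[rng.uniform(0, 50), rng.uniform(6, 9), rng.uniform(0, 500)]
       for _ in range(792)]
labels = [int(row[0] > 45) for row in raw]   # hypothetical contamination label

scaled = zscore_columns(raw)
X_train, X_test, y_train, y_test = holdout_split(scaled, labels)
print(len(X_train), len(X_test))   # 554 238
```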
---|---|---|
NO3 | Nitrate | 10 |
pH | Potential of Hydrogen | 6.5-8.5 |
Cl | Chloride | 250 |
BOD | Biological oxygen demand | Not mentioned |
DO | Dissolved Oxygen | Not mentioned |
FC | Fecal coliforms | 0.2 |
TC | Total coliforms | Not mentioned |
Tu | Turbidity | Not mentioned |
Pa | Phenolphthalein Alkalinity | Not mentioned |
Tal | Total Alkalinity | 200 |
EC | Electrical conductivity | Not mentioned |
N | Nitrogen | 4 |
COD | Chemical Oxygen Demand | Not mentioned |
NH3 | Ammonia | 50 |
Ca | Calcium | 75 |
Th | Total hardness | 300 |
K | Potassium | 0.4 |
Mg | Magnesium | 30 |
SO4 | Sulphate | 200 |
Na | Sodium | 4 |
TDS | Total Dissolved Solids | 500 |
PO4 | Phosphate | Not mentioned |
TFS | Total Fixed Solids | 500 |
Br | Boron | 0.3 |
TSS | Total Suspended Solids | 500 |
F | Fluoride | 1 |
Grey wolf optimization (GWO) was proposed by Mirjalili et al. (2014) and is reported to be more successful than other optimization algorithms such as differential evolution (DE), gravitational search algorithm (GSA), genetic algorithm (GA), and particle swarm optimization (PSO). GWO has been applied in many real-world applications because of its superior search ability and its use of three solutions to generate an optimal global solution (Ullah et al., 2022). This algorithm is used in a variety of applications, including wind turbines (Yang et al., 2017), feature selection (Al-Tashi et al., 2019), and image classification (Raju et al., 2018).
The algorithm is based on the social hierarchy and hunting behavior of grey wolves in the wild. The grey wolf pack has a rigid social structure comprising alpha (α), beta (β), delta (δ), and omega (Figure 2). As pack leader, the alpha wolf assigns tasks to the other wolves. The beta wolf acts as a bridge between the alpha wolf and the other wolves in the pack, and its position can help the other wolves explore new regions in the search space. Delta wolves are called the heart of the pack, and their main job is hunting. The Omega wolves are at the bottom of the swarm and mostly serve as babysitters. Figure 3 is a flowchart explaining the operation of GWO.
The Grey wolf position vector may be defined as
In GWO, the hunting process behavior is described as follows
Where z= current iteration,
Where s1 and s2 are randomly initialized variables and represent a decrease in iteration from 2 to 0.
The presence of alpha, beta, and delta wolves in the hunting area has caused the status of grey wolves to be adjusted according to their relative positions to these wolves. Figure 4 illustrates the updated status of grey wolves in the hunting section.
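For reference, the standard GWO update rules, as commonly stated following Mirjalili et al. (2014), can be written as below. The notation ($t$, $\vec{r}_1$, $\vec{r}_2$, $\vec{a}$) is that of the standard formulation and is an assumption here; the iteration counter $z$ and the random variables $s_1$ and $s_2$ mentioned in the text presumably play the roles of $t$, $\vec{r}_1$, and $\vec{r}_2$.

Encircling the prey at position $\vec{X}_p$:

$$\vec{D} = \left|\vec{C}\cdot\vec{X}_p(t) - \vec{X}(t)\right|, \qquad \vec{X}(t+1) = \vec{X}_p(t) - \vec{A}\cdot\vec{D}$$

$$\vec{A} = 2\vec{a}\cdot\vec{r}_1 - \vec{a}, \qquad \vec{C} = 2\vec{r}_2$$

Here $\vec{r}_1$ and $\vec{r}_2$ are random vectors in $[0, 1]$, and the components of $\vec{a}$ decrease linearly from 2 to 0 over the iterations.

Hunting, guided by the alpha, beta, and delta wolves:

$$\vec{D}_\alpha = \left|\vec{C}_1\cdot\vec{X}_\alpha - \vec{X}\right|, \quad \vec{D}_\beta = \left|\vec{C}_2\cdot\vec{X}_\beta - \vec{X}\right|, \quad \vec{D}_\delta = \left|\vec{C}_3\cdot\vec{X}_\delta - \vec{X}\right|$$

$$\vec{X}_1 = \vec{X}_\alpha - \vec{A}_1\cdot\vec{D}_\alpha, \quad \vec{X}_2 = \vec{X}_\beta - \vec{A}_2\cdot\vec{D}_\beta, \quad \vec{X}_3 = \vec{X}_\delta - \vec{A}_3\cdot\vec{D}_\delta$$

$$\vec{X}(t+1) = \frac{\vec{X}_1 + \vec{X}_2 + \vec{X}_3}{3}$$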
The machine learning techniques DT, MLP, KNN, RF, and Stacked ensemble were used to predict water quality to accomplish this objective.
Decision Tree (DT): The DT has three distinct components (an inner node, a branch node, and a leaf node) that function similarly to a traditional tree. Each inner node acts as a test variable, each branch indicates the result of the test, and each leaf node contains the class label. The entropy technique is employed to select the variable that will serve as the root of a decision tree. The tree is then divided into multiple subsets based on the values of the test attributes. This recursive approach is performed for each subset until they are all resolved. This recursive partitioning procedure separates the population into subpopulations depending on dichotomous variables, yielding a decision tree that appropriately identifies each person (Myles et al., 2004).
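The entropy-based choice of a split can be made concrete with a small sketch. It assumes Shannon entropy in bits and a single numeric threshold per candidate split; the feature values and labels below are made up for illustration, and the function names are not from the study.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting on `value <= threshold`."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0   # a split that separates nothing gains nothing
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

# Hypothetical feature values with binary contamination labels.
values = [1, 2, 3, 10, 11, 12]
labels = [0, 0, 0, 1, 1, 1]

gain_good = information_gain(values, labels, threshold=5)   # separates the classes
gain_poor = information_gain(values, labels, threshold=1)   # isolates one sample
print(gain_good, gain_poor)
```

A tree builder would evaluate many candidate thresholds per feature and place the split with the largest gain at the current node.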
KNN Algorithm: KNN is a lazy machine learning technique that can be utilized for classification and regression problems. This algorithm is widely used in data mining, pattern recognition, and intrusion detection. This approach uses distance calculations to make predictions based on data that has already been observed. The most commonly used methods for this calculation are the Euclidean distance, the Mahalanobis distance, and the city-block distance. The class of a test point is determined by the K known points that lie closest to it. The advantage of KNN classification is its simplicity and its non-parametric nature.
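A minimal sketch of KNN classification with the Euclidean distance follows. The two-dimensional points and the choice of K = 3 are illustrative assumptions, not values from the study.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    by_distance = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Two well-separated toy clusters (hypothetical, already normalized features).
train_X = [[0.1, 0.2], [0.0, 0.4], [0.3, 0.1],
           [2.0, 2.1], [2.2, 1.9], [1.8, 2.3]]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, [0.2, 0.2]))   # 0
print(knn_predict(train_X, train_y, [2.0, 2.0]))   # 1
```

Because KNN defers all computation to prediction time, feature scaling such as the Z-score step matters: an unscaled feature with a large range would dominate the distance.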
Multi-layer perceptron (MLP): MLP is a kind of feed-forward ANN built from multiple layers of perceptrons. An input layer, one or more hidden layers, and an output layer are the three components used to create an MLP. MLP uses forward propagation to transmit data from the input nodes to the output nodes. The learning capacity of MLP is determined by connection weights. The performance of the network increases over time by repeatedly adjusting the connection weights (Atangana et al., 2020; Joy et al., 2020). MLP is a supervised ML technique that is mostly used to classify patterns (Guo et al., 2020).
Random Forest (RF): RF is a supervised ML technique for regression and classification. It comprises several decision trees built with the bagging (bootstrap aggregating) approach. As an ensemble learning technique, random forest solves complex problems and increases accuracy by merging individual models. The majority vote of all trees determines the final classification outcome (Chen et al., 2020).
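Bagging and majority voting can be illustrated with a deliberately simplified forest. In this sketch each "tree" is only a one-feature threshold stump rather than a full decision tree, and the data are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows, rng):
    """Fit a one-feature threshold 'tree' on a bootstrap sample of (features, label) rows."""
    sample = [rng.choice(rows) for _ in rows]      # bootstrap: draw with replacement
    feature = rng.randrange(len(rows[0][0]))       # random feature choice per tree
    by_class = {0: [], 1: []}
    for features, label in sample:
        by_class[label].append(features[feature])
    if not by_class[0] or not by_class[1]:         # degenerate sample with one class only
        only = 0 if by_class[0] else 1
        return lambda x: only
    mean0 = sum(by_class[0]) / len(by_class[0])
    mean1 = sum(by_class[1]) / len(by_class[1])
    threshold = (mean0 + mean1) / 2                # midpoint between the class means
    return lambda x: int((x[feature] > threshold) == (mean1 > mean0))

def forest_predict(stumps, x):
    """Final class = majority vote over all trees."""
    return Counter(stump(x) for stump in stumps).most_common(1)[0][0]

# Hypothetical data: class 0 lies in [0, 0.5] and class 1 in [2, 2.5] on both features.
rows = [((i * 0.025, 0.5 - i * 0.025), 0) for i in range(20)]
rows += [((2 + i * 0.025, 2.5 - i * 0.025), 1) for i in range(20)]

rng = random.Random(0)
forest = [fit_stump(rows, rng) for _ in range(7)]
high = forest_predict(forest, (2.2, 2.1))
low = forest_predict(forest, (0.1, 0.3))
```

An odd number of trees is used so that a binary vote cannot end in a tie.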
Stacking is an ensemble learning approach that integrates multiple base models to improve overall predictive performance. It combines models at a higher level than techniques such as bagging and boosting, which create multiple models from different random subsets of the data or modify the weights of training examples. The basic idea behind stacking is to use a set of diverse base models trained on different subsets of the data with different algorithms and hyperparameters. Each base model makes its own predictions, which a meta-model then combines to produce the final output. Figure 5 depicts the general form of the proposed stacked ensemble model. Random forest, multi-layer perceptron, decision tree, and KNN are the base models used in this study.
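A stacked ensemble of the four base models could be assembled as in the sketch below. It assumes scikit-learn, which is not among the libraries listed for this study, uses synthetic stand-in data, and picks logistic regression as the meta-model because the text does not name one; the hyperparameters are likewise illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the Cauvery River dataset is not reproduced here.
X, y = make_classification(n_samples=300, n_features=9, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# The four base learners named in the text, with illustrative hyperparameters.
base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("mlp", MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)),
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]

# Assumption: logistic regression as the meta-model.
stack = StackingClassifier(estimators=base_models, final_estimator=LogisticRegression(), cv=5)
stack.fit(X_train, y_train)
test_accuracy = stack.score(X_test, y_test)
```

`StackingClassifier` fits the meta-model on cross-validated predictions of the base models, which keeps the meta-model from simply learning the base models' training-set fit.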
The pseudo-code for the stacked ensemble technique is given below
The following procedure was performed for the hybrid GWO-stacked ensemble:
Step 1: Collect the Cauvery River data from the Tamilnadu Pollution Control Board.
Step 2: The data are pre-processed using Z-score normalization.
Step 3: The GWO feature selection approach is used to extract the essential features from the dataset.
Step 4: Divide the dataset into train and test sets.
Step 5: Training samples are analyzed using the stacked ensemble classification algorithm.
Step 6: The trained classifier is used on experimental data samples to predict whether NO3 contamination is at an acceptable level or not.
Step 7: Finally, the results recommend a suitable model for the prediction of NO3.
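Steps 2 and 3 could be sketched as follows. This is a generic binary grey wolf optimizer written with NumPy, not the exact implementation used in this study: the threshold-based binarization, the nearest-centroid fitness model, the subset-size penalty, the wolf and iteration counts, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def zscore(X):
    """Column-wise Z-score normalization: (x - mean) / std."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def centroid_accuracy(X_tr, y_tr, X_va, y_va):
    """Validation accuracy of a nearest-centroid classifier (stand-in fitness model)."""
    c0 = X_tr[y_tr == 0].mean(axis=0)
    c1 = X_tr[y_tr == 1].mean(axis=0)
    closer_to_c1 = np.linalg.norm(X_va - c1, axis=1) < np.linalg.norm(X_va - c0, axis=1)
    return float((closer_to_c1.astype(int) == y_va).mean())

def gwo_select(X_tr, y_tr, X_va, y_va, n_wolves=8, n_iter=20, seed=0):
    """Binary grey wolf optimizer; returns (best feature mask, best fitness)."""
    rng = np.random.default_rng(seed)
    n_feat = X_tr.shape[1]

    def fitness(mask):
        if not mask.any():
            return 0.0
        acc = centroid_accuracy(X_tr[:, mask], y_tr, X_va[:, mask], y_va)
        return acc - 0.01 * mask.sum() / n_feat    # small penalty for large subsets

    pos = rng.random((n_wolves, n_feat))           # continuous positions in [0, 1]
    best_mask, best_fit = None, -np.inf
    for it in range(n_iter):
        masks = pos > 0.5                          # a feature is selected if position > 0.5
        fits = np.array([fitness(m) for m in masks])
        order = np.argsort(fits)[::-1]
        if fits[order[0]] > best_fit:
            best_fit, best_mask = fits[order[0]], masks[order[0]].copy()
        leaders = pos[order[:3]].copy()            # alpha, beta, and delta wolves
        a = 2 - 2 * it / n_iter                    # exploration factor decays from 2 towards 0
        new_pos = np.zeros_like(pos)
        for leader in leaders:
            A = 2 * a * rng.random(pos.shape) - a
            C = 2 * rng.random(pos.shape)
            D = np.abs(C * leader - pos)           # distance of each wolf to this leader
            new_pos += leader - A * D
        pos = np.clip(new_pos / 3, 0, 1)           # average of the three leader-guided moves
    return best_mask, float(best_fit)

# Synthetic data: only the first two of six columns carry class signal.
rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 6))
X[:, 0] += 3 * y
X[:, 1] -= 3 * y
X = zscore(X)                                      # Step 2: normalization

X_tr, X_va, y_tr, y_va = X[:140], X[140:], y[:140], y[140:]
mask, fit = gwo_select(X_tr, y_tr, X_va, y_va)     # Step 3: feature selection
```

The returned boolean mask can then be used to restrict the training and test sets to the selected columns before the classifier in Step 5 is trained.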
All experiments in this study were conducted with Python using the Jupyter Notebook framework on a Dell laptop with an Intel Core™ i5-10210U CPU @ 1.60 GHz and 16 GB RAM. The Pandas, NumPy, and Matplotlib libraries were used. The performance of the GWO-stacked model was evaluated using seven metrics: accuracy, precision, sensitivity, specificity, F1-score, ROC score, and the Matthews correlation coefficient (MCC).
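The confusion-matrix-based metrics follow their standard definitions, which can be written out as in the sketch below. The counts in the example call are hypothetical and are not results from this study; the ROC score is not included because it requires predicted scores rather than a single confusion matrix.

```python
import math

def classification_metrics(tp, fp, fn, tn):
    """Standard metrics computed from binary confusion-matrix counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)            # also called recall or true positive rate
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),      # true negative rate
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "mcc": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

# Hypothetical counts, used only to exercise the formulas.
scores = classification_metrics(tp=40, fp=10, fn=5, tn=45)
```

Unlike accuracy, the MCC uses all four cells of the confusion matrix symmetrically, which makes it informative when the two classes are imbalanced.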
Feature selection is a crucial step in the machine learning pipeline. It involves selecting a subset of relevant features from the original dataset to improve the performance and interpretability of the model. To show the superiority of the proposed GWO stacked ensemble method, four standalone ML models were also tested, and their performances were compared with those of the GWO stacked ensemble. To speed up the sensitivity analyses, the experiments were run with different input variables, i.e., BOD, Ca, Cl, K, Mg, Na, NH3, N, and SO4; the best prediction accuracies were obtained with the GWO stacked ensemble.
The experimental results of the GWO stacked ensemble method are evaluated in comparison with different machine learning techniques, namely DT, KNN, MLP, and RF. The GWO stacked ensemble method is tested in the Python environment using the Cauvery River data obtained from TNPCB. The confusion matrix results, i.e., True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN), are shown in Table 3. These values were used to determine the performance of each classifier on the test data. The GWO stacked ensemble method had the highest TP and the lowest FN, of 76 and 6, respectively. In addition, the negative class was also well predicted, with a TN of 75 and an FP of 4. This indicates that the GWO stacked ensemble technique classifies the test data more reliably than the other methods.
Table 3. Confusion matrix results for test data.
S.NO | Classifier | TP | FP | FN | TN | Accuracy |
---|---|---|---|---|---|---|
1 | GWO-Stacked Ensemble (Proposed) | 76 | 4 | 6 | 75 | 0.93 |
2 | RF | 73 | 6 | 7 | 76 | 0.90 |
3 | DT | 71 | 8 | 9 | 73 | 0.88 |
4 | MLP | 72 | 10 | 10 | 69 | 0.87 |
5 | KNN | 70 | 10 | 11 | 69 | 0.86 |
Table 4 shows that our proposed model performs well compared to all other models on the remaining performance metrics. For the Matthews correlation coefficient (MCC), the RF classifier achieved the second-highest score of 80%. The findings demonstrated no significant differences between DT and MLP, with precision and specificity ratings of 86% and 83%, respectively. However, the KNN classifier has the lowest accuracy, sensitivity, F1-score, and ROC scores, with values of 85%, 89%, 86%, and 90%, respectively. Figure 6 shows the performance comparison of each model. In this figure, the GWO-stacked model outperforms other ML models in many situations. Figure 3 shows the ROC curve of the predictive performance of all models. According to the graph, the GWO-stacked model reached the maximum value of 0.95.
Table 4. Comparison of the proposed model with the base classifiers.
Classifier | Accuracy | Precision | Sensitivity | Specificity | F1-Score | MCC |
---|---|---|---|---|---|---|
GWO-Stacked Ensemble (Proposed) | 0.93 | 0.95 | 0.92 | 0.92 | 0.93 | 0.85 |
RF | 0.90 | 0.92 | 0.91 | 0.90 | 0.91 | 0.83 |
DT | 0.88 | 0.89 | 0.88 | 0.89 | 0.89 | 0.78 |
MLP | 0.87 | 0.87 | 0.86 | 0.85 | 0.87 | 0.74 |
KNN | 0.86 | 0.87 | 0.87 | 0.86 | 0.86 | 0.72 |
Table 5 presents a performance assessment of the proposed method against the base classifiers to select the best model for predicting nitrate contamination. The optimal model is determined by dividing the data into training and test sets at ratios ranging from 60%–40% to 90%–10%, and the performance is assessed using the seven evaluation metrics listed above. Table 5 shows that the GWO-stacked algorithm achieves its highest scores on all seven metrics when the data is split at a ratio of 70:30. These outcomes are used to determine the best data split ratio for predicting nitrate pollution in river water.
Table 5. Comparison of the performance of the proposed method against the base classifiers using data splitting validation.
Metrics | Data Split Ratio | GWO-Stacked (Proposed) | RF | DT | MLP | KNN |
---|---|---|---|---|---|---|
Accuracy | 60-40 | 91.5 | 88 | 86.5 | 85.3 | 83.55 |
Accuracy | 70-30 | 93.21 | 89.78 | 88 | 87.65 | 85.52 |
Accuracy | 80-20 | 92.98 | 89.16 | 87.05 | 87.23 | 84.89 |
Accuracy | 90-10 | 91.94 | 88.94 | 86.83 | 85.32 | 82.94 |
Precision | 60-40 | 91.56 | 85.09 | 84.78 | 84.35 | 81.36 |
Precision | 70-30 | 93 | 87.26 | 86.72 | 86.72 | 83.88 |
Precision | 80-20 | 92.13 | 86.58 | 85.09 | 85.39 | 81.36 |
Precision | 90-10 | 91.04 | 85.98 | 84.21 | 83.96 | 80 |
Sensitivity | 60-40 | 95.13 | 91.89 | 91.27 | 89.28 | 87.18 |
Sensitivity | 70-30 | 97.53 | 93.33 | 93 | 91.86 | 89.10 |
Sensitivity | 80-20 | 96.41 | 91.56 | 91.04 | 89.74 | 88.72 |
Sensitivity | 90-10 | 95.23 | 90.86 | 90.12 | 88.21 | 87.23 |
Specificity | 60-40 | 86.94 | 82.92 | 81.89 | 81.11 | 78.27 |
Specificity | 70-30 | 88.56 | 84.23 | 83.98 | 83.15 | 80 |
Specificity | 80-20 | 87.88 | 83.55 | 82.45 | 82.18 | 78.89 |
Specificity | 90-10 | 86.52 | 82.10 | 81.89 | 81.23 | 77.25 |
F1-Score | 60-40 | 92.41 | 88.72 | 86.94 | 85.36 | 85.10 |
F1-Score | 70-30 | 94.28 | 90 | 89.56 | 88.78 | 86.23 |
F1-Score | 80-20 | 93.88 | 88.72 | 87.90 | 87.45 | 85.63 |
F1-Score | 90-10 | 92.58 | 86.28 | 85.92 | 85.11 | 84.23 |
ROC | 60-40 | 93.88 | 92.12 | 90.25 | 89.41 | 88.23 |
ROC | 70-30 | 95.23 | 94.15 | 92.10 | 91 | 90.05 |
ROC | 80-20 | 94.73 | 93.78 | 91.65 | 89.41 | 88.14 |
ROC | 90-10 | 93.12 | 92.89 | 92.10 | 88.11 | 87.23 |
MCC | 60-40 | 81.56 | 77.89 | 76.23 | 75.96 | 72.18 |
MCC | 70-30 | 83 | 79.52 | 78.36 | 75.12 | 74.89 |
MCC | 80-20 | 82.16 | 78.23 | 77.65 | 73.96 | 73.98 |
MCC | 90-10 | 81.23 | 77.23 | 76.63 | 72.13 | 71.08 |
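The four split ratios compared in Table 5 could be produced with a simple shuffle-and-cut helper such as the one sketched below. The function name, the fixed seed, and the sample count of 200 are hypothetical; the study's actual splitting code is not shown.

```python
import random

def split_dataset(rows, train_fraction, seed=0):
    """Shuffle `rows` and split them into (train, test) by `train_fraction`."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(200))                     # hypothetical sample indices
split_sizes = {}
for train_fraction in (0.6, 0.7, 0.8, 0.9):
    train, test = split_dataset(rows, train_fraction)
    label = f"{round(train_fraction * 100)}-{round((1 - train_fraction) * 100)}"
    split_sizes[label] = (len(train), len(test))
```

Each classifier would then be trained on `train` and scored on `test` for every ratio, giving one row of the table per metric and ratio.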
Table 6 compares the accuracy of various advanced techniques with the proposed system, showing that GWO-Stacked attained the highest accuracy in predicting nitrate concentration. Among the existing studies, Latif et al. (2020) achieved the highest accuracy of 93% using ANN, while Sulaiman et al. (2023) and Bhattarai et al. (2021) each reached 92.8%. In contrast, the models of Alizamir et al. (2021) and Knoll et al. (2019) had accuracies below 90%.
Table 6. Comparison of performance between the proposed method and existing research.
Author | Model | Accuracy (%) |
---|---|---|
Proposed Model | GWO-Stacked Ensemble | 93 |
Latif et al. (2020) | ANN | 93 |
Sulaiman et al. (2023) | KNN, SVM, DT, NB, RF, GB, XGB | 92.8 |
Bhattarai et al. (2021) | KNN, NB, RF, GB, SVM | 92.8 |
Alizamir et al. (2021) | Hybrid Bat-ELM | 89 |
Knoll et al. (2019) | GBR, CART, MLR, RF | 75 |
In this study, a machine learning approach called the GWO-stacked ensemble is proposed for predicting nitrate pollution in the Cauvery River Delta region. The model involves data preprocessing to handle missing values and normalization, followed by feature selection using the grey wolf optimization technique. This method efficiently selects relevant features for input into the stacked ensemble algorithm, which mitigates issues such as variance and overfitting seen in single-classifier models. The GWO-stacked ensemble outperformed the DT, RF, MLP, and KNN models with an accuracy of 93%, precision of 93%, sensitivity of 97%, specificity of 88%, and F1-score of 94%. The ROC score was also highest with this technique, at 95%. Although the research achieved its goals, it is limited by its reliance on a few factors. This narrow focus helps forecast nitrate levels even when data is scarce, enhancing the usefulness of the models. Therefore, it is important for future studies to identify additional factors that could enhance the power of machine learning algorithms in this specific field.
The data that support the findings of this study are available from the corresponding author, Vellingiri J., upon reasonable request.
No potential conflict of interest was reported by the authors.
Table 1. Literature survey.
Paper | Machine Learning Model | Performance Metrics | Key Findings | Limitation |
---|---|---|---|---|
Wagh et al. (2017) | ANN | R2= 0.75 | ANN model outperformed other methods in predicting nitrate concentrations in the Kadava River catchment | Small dataset size |
Rodriguez-Galiano et al. (2018) | CART, RF, SVM | AUC = 0.92 | RF-SSFS method outperformed others in nitrate-related groundwater contamination | Limited to a specific area |
Benzer et al. (2018) | ANN | Accuracy = 96 | ANN model effectively predicted nitrate concentrations in surface waters in a river basin in China. | The model has not been tested in other regions. |
Rahmati et al. (2019) | KNN, RF, SVM | R2 = 0.72, RMSE = 10.41 | RF model outperformed traditional regression models in estimating nitrate concentration in streams in Iran. | The model depends on assessing seasonal and interannual fluctuations of the nitrate concentrations. |
Knoll et al. (2019) | GBR, CART, MLR, RF | R2= 0.75 | GBR could more accurately estimate nitrate levels | The System did not enhance accuracy |
Jafari et al. (2019) | ANFIS, SVM, MLP, GEP | RMSE = 58.93, R = 0.998 | GEP model provided accurate TDS prediction in Tabriz plain aquifer | Machine learning techniques did not reduce time complexity |
Band et al. (2020) | BANN, Cubist, RF, SVM | R2 = 0.89 | RF model outperformed others in Marvdasht watershed, Iran | Limited to a specific region |
Bedi et al. (2020) | ANN, XGB, SVM | RMSE = 3.91 | XGB model excelled in predicting nitrate and pesticide contamination | Scarcity of labeled data for training advanced models. |
Hà et al. (2020) | RF | R2 = 0.92 | RF model performed better in estimating nitrate and phosphorus concentrations in Tri An reservoir | Focused solely on the Tri An reservoir. |
Latif et al. (2020) | ANN | Accuracy score = 0.94 | ANN model was superior in forecasting nitrate levels in Feitsui reservoir, Taiwan | Limited to five input parameters |
Stamenković et al. (2020) | ANN, MLR | MAE = 0.53 | ANN models showed good predictive ability for nitrates in river water | Limited significant deviations in parameters |
Alizamir et al. (2021) | Hybrid Bat-ELM | R2= 0.89 | Hybrid model effectively predicted daily chlorophyll-a concentration in rivers. | Key factors influencing chlorophyll-a concentration may vary by ecosystem |
Pham et al. (2021) | ANN, ANFIS, GMDH | MAE = 0.0120.0219, NSE = 0.96 | ANFIS method excelled in estimating Water Quality Index in surface wetlands | Deep neural networks lacked effective incorporation of prior knowledge |
Lu et al. (2022) | GBRT, LSTM, RF | RMSE = 0.11 | LSTM model performed best in predicting total phosphorus and nitrogen concentrations in Taihu Lake | Focused on monthly data, limited temporal scope |
Ottong et al. (2022) | LR, SVM, RF, GBM | Accuracy = 87%, Precision = 100%, Sensitivity = 95.2%, Specificity = 100% | GBM model effectively forecasted arsenic contamination risk in the Red River Delta | Limited data points for model training |
Hu et al. (2023) | XGB | R2= 0.91 | XGB model effectively predicted nitrogen and phosphorus concentrations in Taihu lakes. | - |
Sulaiman et al. (2023) | KNN, SVM, DT, NB, RF, GB, XGB | Accuracy: 92.8% | RF-PCA hybrid method outperformed other models in predicting nitrate concentrations for hydroponic plants. | Limited Input Size |
Liang et al. (2024) | GB | R2= 0.627, MAE: 0.529, RMSE: 0.705 | Developed models to predict nitrogen levels in Chongqing city using various predictors | Tested in a small area |
Mehdaoui et al. (2024) | RBF-NN | Accuracy = 0.957 | Introduced MLR and RBF-NN models to forecast nitrate levels in the Cheliff basin | This model is specifically suitable for this location |
Table 2. Attributes of the water quality dataset.
Variable | Description | Bureau of Indian Standard |
---|---|---|
NO3 | Nitrate | 10 |
pH | Potential of Hydrogen | 6.5-8.5 |
Cl | Chloride | 250 |
BOD | Biological oxygen demand | Not mentioned |
DO | Dissolved Oxygen | Not mentioned |
FC | Fecal coliforms | 0.2 |
TC | Total coliforms | Not mentioned |
Tu | Turbidity | Not mentioned |
Pa | Phenolphthalein Alkalinity | Not mentioned |
Tal | Total Alkalinity | 200 |
EC | Electrical conductivity | Not mentioned |
N | Nitrogen | 4 |
COD | Chemical Oxygen Demand | Not mentioned |
NH3 | Ammonia | 50 |
Ca | Calcium | 75 |
Th | Total hardness | 300 |
K | Potassium | 0.4 |
Mg | Magnesium | 30 |
SO4 | Sulphate | 200 |
Na | Sodium | 4 |
TDS | Total Dissolved Solids | 500 |
PO4 | Phosphate | Not mentioned |
TFS | Total Fixed Solids | 500 |
Br | Boron | 0.3 |
TSS | Total Suspended Solids | 500 |
F | Fluoride | 1 |