Econ. Environ. Geol. 2024; 57(3): 329-342
Published online June 30, 2024
https://doi.org/10.9719/EEG.2024.57.3.329
© THE KOREAN SOCIETY OF ECONOMIC AND ENVIRONMENTAL GEOLOGY
Correspondence to: *vellingiri.j@vit.ac.in
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided original work is properly cited.
The exponential increase in nitrate pollution of river water poses an immediate threat to public health and the environment. This contamination is primarily due to human activities, including the overuse of nitrogenous fertilizers in agriculture and the discharge of nitrate-rich industrial effluents into rivers. As a result, the accurate prediction and identification of contaminated areas has become a crucial and challenging task for researchers. To address this problem, this work predicts nitrate contamination using machine learning approaches. This paper presents a novel Grey Wolf Optimizer (GWO)-based Stacked Ensemble approach for predicting nitrate pollution in the Cauvery Delta region of Tamilnadu, India. The proposed method is evaluated using a Cauvery River dataset from the Tamilnadu Pollution Control Board. It achieves an accuracy of 93.31%, a precision of 93%, a sensitivity of 97.53%, a specificity of 94.28%, an F1-score of 95.23%, and an ROC score of 95%. These results demonstrate the effectiveness of the proposed method in accurately predicting nitrate pollution in river water and can ultimately support informed decisions to tackle these critical environmental problems.
Keywords: nitrate prediction, machine learning, stacked ensemble, decision tree, random forest
Kalaivanan K, Vellingiri J*
School of Computer Science Engineering and Information Systems, Vellore Institute of Technology, Vellore-632014, India
Table 1. Literature survey.
Paper | Machine Learning Model | Performance Metrics | Key Findings | Limitation |
---|---|---|---|---|
Wagh et al. (2017) | ANN | R² = 0.75 | ANN model outperformed other methods in predicting nitrate concentrations in the Kadava River catchment | Small dataset size |
Rodriguez-Galiano et al. (2018) | CART, RF, SVM | AUC = 0.92 | RF-SSFS method outperformed others in nitrate-related groundwater contamination | Limited to a specific area |
Benzer et al. (2018) | ANN | Accuracy = 96 | ANN model effectively predicted nitrate concentrations in surface waters in a river basin in China. | The model has not been tested in other regions. |
Rahmati et al. (2019) | KNN, RF, SVM | R² = 0.72, RMSE = 10.41 | RF model outperformed traditional regression models in estimating nitrate concentration in streams in Iran. | The model depends on assessing seasonal and interannual fluctuations in nitrate concentrations. |
Knoll et al. (2019) | GBR, CART, MLR, RF | R² = 0.75 | GBR could more accurately estimate nitrate levels | The system did not enhance accuracy |
Jafari et al. (2019) | ANFIS, SVM, MLP, GEP | RMSE = 58.93, R = 0.998 | GEP model provided accurate TDS prediction in Tabriz plain aquifer | Machine learning techniques did not reduce time complexity |
Band et al. (2020) | BANN, Cubist, RF, SVM | R² = 0.89 | RF model outperformed others in Marvdasht watershed, Iran | Limited to a specific region |
Bedi et al. (2020) | ANN, XGB, SVM | RMSE = 3.91 | XGB model excelled in predicting nitrate and pesticide contamination | Scarcity of labeled data for training advanced models. |
Hà et al. (2020) | RF | R² = 0.92 | RF model performed better in estimating nitrate and phosphorus concentrations in Tri An reservoir | Focused solely on the Tri An reservoir. |
Latif et al. (2020) | ANN | Accuracy score = 0.94 | ANN model was superior in forecasting nitrate levels in Feitsui reservoir, Taiwan | Limited to five input parameters |
Stamenković et al. (2020) | ANN, MLR | MAE = 0.53 | ANN models showed good predictive ability for nitrates in river water | Limited significant deviations in parameters |
Alizamir et al. (2021) | Hybrid Bat-ELM | R² = 0.89 | Hybrid model effectively predicted daily chlorophyll-a concentration in rivers. | Key factors influencing chlorophyll-a concentration may vary by ecosystem |
Pham et al. (2021) | ANN, ANFIS, GMDH | MAE = 0.0120.0219, NSE = 0.96 | ANFIS method excelled in estimating Water Quality Index in surface wetlands | Deep neural networks lacked effective incorporation of prior knowledge |
Lu et al. (2022) | GBRT, LSTM, RF | RMSE = 0.11 | LSTM model performed best in predicting total phosphorus and nitrogen concentrations in Taihu Lake | Focused on monthly data, limited temporal scope |
Ottong et al. (2022) | LR, SVM, RF, GBM | Accuracy = 87%, Precision = 100%, Sensitivity = 95.2%, Specificity = 100% | GBM model effectively forecasted arsenic contamination risk in the Red River Delta | Limited data points for model training |
Hu et al. (2023) | XGB | R² = 0.91 | XGB model effectively predicted nitrogen and phosphorus concentrations in Taihu lakes. | - |
Sulaiman et al. (2023) | KNN, SVM, DT, NB, RF, GB, XGB | Accuracy = 92.8% | RF-PCA hybrid method outperformed other models in predicting nitrate concentrations for hydroponic plants. | Limited input size |
Liang et al. (2024) | GB | R² = 0.627, MAE = 0.529, RMSE = 0.705 | Developed models to predict nitrogen levels in Chongqing city using various predictors | Tested in a small area |
Mehdaoui et al. (2024) | RBF-NN | Accuracy = 0.957 | Introduced MLR and RBF-NN models to forecast nitrate levels in the Cheliff basin | The model is specifically suited to this location |
Table 2. Attributes of the water quality dataset.
Variable | Description | Bureau of Indian Standards (BIS) limit |
---|---|---|
NO3 | Nitrate | 10 |
pH | Potential of Hydrogen | 6.5-8.5 |
Cl | Chloride | 250 |
BOD | Biological oxygen demand | Not mentioned |
DO | Dissolved Oxygen | Not mentioned |
FC | Fecal coliforms | 0.2 |
TC | Total coliforms | Not mentioned |
Tu | Turbidity | Not mentioned |
Pa | Phenolphthalein Alkalinity | Not mentioned |
Tal | Total Alkalinity | 200 |
EC | Electrical conductivity | Not mentioned |
N | Nitrogen | 4 |
COD | Chemical Oxygen Demand | Not mentioned |
NH3 | Ammonia | 50 |
Ca | Calcium | 75 |
Th | Total hardness | 300 |
K | Potassium | 0.4 |
Mg | Magnesium | 30 |
SO4 | Sulphate | 200 |
Na | Sodium | 4 |
TDS | Total Dissolved Solids | 500 |
PO4 | Phosphate | Not mentioned |
TFS | Total Fixed Solids | 500 |
Br | Boron | 0.3 |
TSS | Total Suspended Solids | 500 |
F | Fluoride | 1 |
Table 3. Confusion matrix results for test data.
S. No. | Classifier | TP | FP | FN | TN | Accuracy |
---|---|---|---|---|---|---|
1 | GWO-Stacked Ensemble (Proposed) | 76 | 4 | 6 | 75 | 0.93 |
2 | RF | 73 | 6 | 7 | 76 | 0.90 |
3 | DT | 71 | 8 | 9 | 73 | 0.88 |
4 | MLP | 72 | 10 | 10 | 69 | 0.87 |
5 | KNN | 70 | 10 | 11 | 69 | 0.86 |
Table 4. Comparison of the proposed model with the base classifiers.
Classifier | Accuracy | Precision | Sensitivity | Specificity | F1-Score | MCC |
---|---|---|---|---|---|---|
GWO-Stacked Ensemble (Proposed) | 0.93 | 0.95 | 0.92 | 0.92 | 0.93 | 0.85 |
RF | 0.90 | 0.92 | 0.91 | 0.90 | 0.91 | 0.83 |
DT | 0.88 | 0.89 | 0.88 | 0.89 | 0.89 | 0.78 |
MLP | 0.87 | 0.87 | 0.86 | 0.85 | 0.87 | 0.74 |
KNN | 0.86 | 0.87 | 0.87 | 0.86 | 0.86 | 0.72 |
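The metrics listed in Table 4 are commonly derived from the four confusion-matrix counts (TP, FP, FN, TN) of the kind reported in Table 3. The sketch below uses the standard textbook definitions; the source does not state which exact formulas or evaluation folds the authors used, so the function name `classification_metrics` and its dictionary keys are illustrative assumptions, and the computed values need not coincide with the rounded figures in the tables.

```python
import math


def _safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary-classification metrics from confusion-matrix counts.

    This is an illustrative helper (not taken from the paper) that applies
    the usual textbook definitions.
    """
    total = tp + fp + fn + tn
    # Denominator of the Matthews correlation coefficient (MCC).
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _safe_div(tp + tn, total),
        "precision": _safe_div(tp, tp + fp),
        "sensitivity": _safe_div(tp, tp + fn),  # also called recall
        "specificity": _safe_div(tn, tn + fp),
        "f1": _safe_div(2 * tp, 2 * tp + fp + fn),
        "mcc": _safe_div(tp * tn - fp * fn, mcc_denominator),
    }


# Example input: the counts listed for the proposed model in Table 3.
metrics = classification_metrics(tp=76, fp=4, fn=6, tn=75)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Guarding each division keeps the helper usable on degenerate matrices (for example, a classifier that never predicts the positive class), where the affected metrics are reported as 0.0 rather than raising an error.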
Table 5. Comparison of the proposed method against the base classifiers under different train-test data split ratios (values in %).
Metric | Data Split Ratio | GWO-Stacked (Proposed) | RF | DT | MLP | KNN |
---|---|---|---|---|---|---|
Accuracy | 60-40 | 91.5 | 88 | 86.5 | 85.3 | 83.55 |
Accuracy | 70-30 | 93.21 | 89.78 | 88 | 87.65 | 85.52 |
Accuracy | 80-20 | 92.98 | 89.16 | 87.05 | 87.23 | 84.89 |
Accuracy | 90-10 | 91.94 | 88.94 | 86.83 | 85.32 | 82.94 |
Precision | 60-40 | 91.56 | 85.09 | 84.78 | 84.35 | 81.36 |
Precision | 70-30 | 93 | 87.26 | 86.72 | 86.72 | 83.88 |
Precision | 80-20 | 92.13 | 86.58 | 85.09 | 85.39 | 81.36 |
Precision | 90-10 | 91.04 | 85.98 | 84.21 | 83.96 | 80 |
Sensitivity | 60-40 | 95.13 | 91.89 | 91.27 | 89.28 | 87.18 |
Sensitivity | 70-30 | 97.53 | 93.33 | 93 | 91.86 | 89.10 |
Sensitivity | 80-20 | 96.41 | 91.56 | 91.04 | 89.74 | 88.72 |
Sensitivity | 90-10 | 95.23 | 90.86 | 90.12 | 88.21 | 87.23 |
Specificity | 60-40 | 86.94 | 82.92 | 81.89 | 81.11 | 78.27 |
Specificity | 70-30 | 88.56 | 84.23 | 83.98 | 83.15 | 80 |
Specificity | 80-20 | 87.88 | 83.55 | 82.45 | 82.18 | 78.89 |
Specificity | 90-10 | 86.52 | 82.10 | 81.89 | 81.23 | 77.25 |
F1-Score | 60-40 | 92.41 | 88.72 | 86.94 | 85.36 | 85.10 |
F1-Score | 70-30 | 94.28 | 90 | 89.56 | 88.78 | 86.23 |
F1-Score | 80-20 | 93.88 | 88.72 | 87.90 | 87.45 | 85.63 |
F1-Score | 90-10 | 92.58 | 86.28 | 85.92 | 85.11 | 84.23 |
ROC | 60-40 | 93.88 | 92.12 | 90.25 | 89.41 | 88.23 |
ROC | 70-30 | 95.23 | 94.15 | 92.10 | 91 | 90.05 |
ROC | 80-20 | 94.73 | 93.78 | 91.65 | 89.41 | 88.14 |
ROC | 90-10 | 93.12 | 92.89 | 92.10 | 88.11 | 87.23 |
MCC | 60-40 | 81.56 | 77.89 | 76.23 | 75.96 | 72.18 |
MCC | 70-30 | 83 | 79.52 | 78.36 | 75.12 | 74.89 |
MCC | 80-20 | 82.16 | 78.23 | 77.65 | 73.96 | 73.98 |
MCC | 90-10 | 81.23 | 77.23 | 76.63 | 72.13 | 71.08 |
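The data split ratios in Table 5 (60-40 through 90-10) correspond to hold-out validation, where a fraction of the samples is used for training and the remainder for testing. A minimal sketch of such a split is shown below, assuming a simple random shuffle; the source does not say whether the authors used stratified or random splitting, and the function name `holdout_split`, the seed, and the placeholder data are hypothetical.

```python
import random


def holdout_split(samples, train_fraction, seed=0):
    """Shuffle a copy of `samples` and split it into (train, test) lists.

    Illustrative helper only: a plain random split with a fixed seed,
    without stratification by class label.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical placeholder data: 100 indices standing in for dataset rows.
samples = list(range(100))
split_sizes = {}
for train_fraction in (0.6, 0.7, 0.8, 0.9):
    train, test = holdout_split(samples, train_fraction, seed=42)
    split_sizes[train_fraction] = (len(train), len(test))
    print(f"{train_fraction:.0%} train: {len(train)} rows, test: {len(test)} rows")
```

In practice, each classifier would be fitted on `train` and scored on `test` for every ratio, which yields one row per metric and ratio as in Table 5.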
Table 6. Performance comparison between the proposed method and existing research.
Author | Model | Accuracy (%) |
---|---|---|
Proposed Model | GWO-Stacked Ensemble | 93 |
Latif et al. (2020) | ANN | 93 |
Sulaiman et al. (2023) | KNN, SVM, DT, NB, RF, GB, XGB | 92.8 |
Bhattarai et al. (2021) | KNN, NB, RF, GB, SVM | 92.8 |
Alizamir et al. (2021) | Hybrid Bat-ELM | 89 |
Knoll et al. (2019) | GBR, CART, MLR, RF | 75 |