Research Paper

Econ. Environ. Geol. 2022; 55(4): 353-366

Published online August 30, 2022

https://doi.org/10.9719/EEG.2022.55.4.353

© THE KOREAN SOCIETY OF ECONOMIC AND ENVIRONMENTAL GEOLOGY

Estimation of Spatial Distribution Using the Gaussian Mixture Model with Multivariate Geoscience Data

Ho-Rim Kim¹, Soonyoung Yu², Seong-Taek Yun², Kyoung-Ho Kim³, Goon-Taek Lee⁴, Jeong-Ho Lee¹, Chul-Ho Heo¹, Dong-Woo Ryu¹,*

¹ Korea Institute of Geoscience and Mineral Resources, Republic of Korea
² Korea University, Republic of Korea
³ Korea Environment Institute, Republic of Korea
⁴ National Instrumentation Center for Environmental Management, Seoul National University, Republic of Korea

Received: August 14, 2022; Revised: August 23, 2022; Accepted: August 23, 2022

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided original work is properly cited.

Abstract

Spatial estimation of geoscience data (geo-data) is challenging because of spatial heterogeneity, data scarcity, and high dimensionality, so a spatial estimation method that accounts for these characteristics is needed. In this study, we propose applying the Gaussian Mixture Model (GMM), a machine learning algorithm, to multivariate data for robust spatial prediction. The performance of the proposed approach was tested on soil chemical concentration data from a former smelting area. The concentrations of As and Pb determined by ex-situ ICP-AES were the primary variables to be interpolated, while the other metal concentrations determined by ICP-AES and all data determined by in-situ portable X-ray fluorescence (PXRF) were used as auxiliary variables in GMM and ordinary co-kriging (OCK). Important variables were selected from the multidimensional auxiliary variables using a random forest-based variable selection method. Compared with ordinary kriging (OK) and OCK using univariate or bivariate data, GMM with the important multivariate auxiliary data reduced the root mean-squared error (RMSE) by up to 0.11 for As and 0.33 for Pb and increased the correlation coefficient (r) by up to 0.31 for As and 0.46 for Pb. The use of GMM thus improved the spatial interpretation of anthropogenic metals in soil. The multivariate spatial approach can be applied to understand complex and heterogeneous geological and geochemical features.

Keywords Gaussian Mixture Model (GMM), multivariate, geoscience data (geo-data), machine learning, soil contamination

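Estimation of Spatial Distribution Using Multivariate Geoscience Data and the Gaussian Mixture Model

Ho-Rim Kim¹ · Soonyoung Yu² · Seong-Taek Yun² · Kyoung-Ho Kim³ · Goon-Taek Lee⁴ · Jeong-Ho Lee¹ · Chul-Ho Heo¹ · Dong-Woo Ryu¹*

¹ Korea Institute of Geoscience and Mineral Resources
² Korea University
³ Korea Environment Institute
⁴ NICEM, Seoul National University

Abstract

Estimating the spatial distribution of geoscience data (geo-data) is difficult because of their spatial heterogeneity, sparsity, and high dimensionality. Many applications in the geosciences therefore need a spatial estimation technique that can account for the inherent characteristics of geo-data. This study aimed to provide a spatial prediction method based on the Gaussian Mixture Model (GMM), a machine learning algorithm. To validate the performance of the proposed technique, we used soil concentration data from a former smelter site, analyzed by portable X-ray fluorescence (PXRF) and inductively coupled plasma-atomic emission spectrometry (ICP-AES). As and Pb determined by ICP-AES were the primary variables, and the remaining data served as auxiliary variables. A random forest-based variable selection method was applied to identify the important variables among the multidimensional auxiliary variables. The results of GMM using the multivariate data built from ICP-AES and PXRF were compared with those of ordinary kriging (OK) and ordinary co-kriging (OCK) using univariate and bivariate data. GMM yielded a lower root mean-squared error (RMSE; improved by up to 0.11 for As and 0.33 for Pb) and a higher correlation (r; improved by up to 0.31 for As and 0.46 for Pb) than OK and OCK. This indicates that GMM can improve the delineation of the extent of soil contamination. This study demonstrated that the multivariate spatial estimation approach can be applied effectively to understand the features of complex and heterogeneous geological and geochemical data.

Keywords Gaussian mixture model, multivariate, geoscience data (geo-data), machine learning, soil contamination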


Figure 1. Flow chart of statistical procedures for estimating soil contamination areas.

Figure 2. Correlation chart of multivariate soil data. The histogram of each variable is shown on the diagonal, with primary variables (As and Pb determined using ICP-AES) in blue and auxiliary variables (Cu, Ni, and Zn using ICP-AES and all PXRF data) in green. The values on the upper right of the diagonal indicate the correlation coefficients. The stars indicate the significance levels (*** p < 0.001, ** p < 0.01, * p < 0.05). On the bottom left of the diagonal, the bivariate scatter plots are displayed with a fitted red line. Values on the boundaries indicate the concentrations on a log scale.

Figure 3. Pairwise probability density functions fitted with the Gaussian Mixture Model (GMM) using the EVE (ellipsoidal, equal volume and orientation) model. Values on the boundaries indicate probability density.
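A fitted joint mixture such as the one in Figure 3 can be turned into a predictor by taking the conditional expectation of a primary variable given an auxiliary one. The sketch below shows that general idea for a bivariate mixture only. Whether the authors used exactly this estimator is not stated here, so treat it as an assumption; the function names, dictionary keys, and mixture parameters are all hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def gmm_conditional_mean(x, components):
    """E[y | x] under a bivariate Gaussian mixture.

    Each component is a dict with mixing weight `w`, means `mx` and `my`,
    variance `vx` of the auxiliary variable, and covariance `cxy`
    (hypothetical names chosen for this sketch).
    """
    # Responsibility of each component given the auxiliary value x.
    resp = [c["w"] * normal_pdf(x, c["mx"], c["vx"]) for c in components]
    total = sum(resp)
    # Per-component linear regression of y on x, weighted by responsibility.
    return sum(
        (r / total) * (c["my"] + c["cxy"] / c["vx"] * (x - c["mx"]))
        for r, c in zip(resp, components)
    )

# Hypothetical two-component mixture, not fitted to the study data.
components = [
    {"w": 0.6, "mx": 1.2, "my": 1.5, "vx": 0.04, "cxy": 0.03},
    {"w": 0.4, "mx": 2.0, "my": 2.2, "vx": 0.04, "cxy": 0.03},
]

prediction = gmm_conditional_mean(1.6, components)
```

At x = 1.6, midway between the two hypothetical component means, the responsibilities reduce to the mixing weights (0.6 and 0.4), so the prediction is 0.6 × 1.8 + 0.4 × 1.9 = 1.84.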

Figure 4. Spatial distributions of As concentrations in the study area: (a) the results of ordinary kriging (OK) using 49 training data; (b) and (c) the results of ordinary co-kriging (OCK) and Gaussian Mixture Model (GMM), respectively, using 49 training data and 156 auxiliary data (As determined by PXRF for OCK; ICP-AES Pb, Cu, Ni, and Zn and PXRF As, Pb, and Cu for GMM); (d) the distribution of contamination regarded as the most realistic, estimated by OK using all the ICP-AES data (n=153). The black line indicates the regulatory level (25 mg/kg) set by the Soil Environmental Conservation Act of the Republic of Korea.

Figure 5. Root mean-squared errors (RMSE) between the measured and predicted values using the validation data (n=30) at different sampling densities for training (n=30 to 107 in Table 2): (a) As, (b) Pb.

Figure 6. Pearson correlation coefficients (r) between the measured and predicted values using the validation data (n=30) at different sampling densities for training (n=30 to 107 in Table 2): (a) As, (b) Pb.
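The design behind Figures 5 and 6 (a fixed validation set of 30 samples and training sets of 30 to 107 samples) can be sketched as a simple split routine. This is a minimal illustration under stated assumptions: the function name `make_splits`, the random seed, and the choice of nested training subsets are hypothetical, since this section does not say how the authors drew the subsets.

```python
import random

def make_splits(sample_ids, n_validation, training_sizes, seed=0):
    """Hold out a fixed validation set and build training sets of given sizes."""
    rng = random.Random(seed)
    ids = list(sample_ids)
    rng.shuffle(ids)
    validation = ids[:n_validation]
    pool = ids[n_validation:]
    # Assumption: nested subsets, so each larger training set contains the smaller ones.
    training = {n: pool[:n] for n in training_sizes}
    return validation, training

# 153 ICP-AES samples, 30 held out, training sizes as listed in Table 2.
sizes = [30, 49, 61, 76, 91, 107]
validation, training = make_splits(range(153), 30, sizes)
```

Holding the validation set fixed across sampling densities keeps the RMSE and r values comparable from one training size to the next.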

Table 1. Descriptive statistics of metal(loid) contents in soil samples from ex-situ (ICP-AES) and in-situ (portable XRF) measurements. Unit: mg kg⁻¹. ICP-AES: laboratory analysis (n=153); PXRF: portable XRF measurements in the field (n=156).

| Statistic | ICP-AES As | ICP-AES Pb | ICP-AES Cu | ICP-AES Ni | ICP-AES Zn | PXRF As | PXRF Pb | PXRF Cu | PXRF Ni | PXRF Zn |
|---|---|---|---|---|---|---|---|---|---|---|
| Minimum | 4.41 | 13.71 | 6.21 | 3.87 | 16.00 | 0.50 | 12.00 | 12.00 | 14.60 | 137.00 |
| Maximum | 236.70 | 961.67 | 167.70 | 33.70 | 112.83 | 143.00 | 430.00 | 145.00 | 25.00 | 205.00 |
| Range | 232.29 | 947.96 | 161.49 | 29.83 | 96.83 | 142.50 | 418.00 | 133.00 | 10.40 | 68.00 |
| Median | 66.33 | 160.47 | 53.23 | 11.47 | 43.47 | 22.00 | 60.50 | 34.00 | 20.30 | 163.00 |
| Mean | 74.48 | 196.93 | 56.02 | 11.45 | 43.25 | 27.95 | 74.33 | 38.05 | 20.30 | 164.40 |
| SE.mean* | 4.17 | 13.42 | 2.67 | 0.30 | 1.14 | 2.04 | 4.77 | 1.46 | 0.17 | 0.79 |
| CI.mean** | 8.24 | 26.51 | 5.27 | 0.60 | 2.26 | 4.04 | 9.41 | 2.89 | 0.33 | 1.57 |
| Var.*** | 2662.33 | 27540.86 | 1087.04 | 13.89 | 200.45 | 651.63 | 3542.20 | 333.70 | 4.49 | 98.59 |
| Std.dev.**** | 51.60 | 165.95 | 32.97 | 3.73 | 14.16 | 25.53 | 59.52 | 18.27 | 2.12 | 9.93 |
| Coef.var***** | 0.69 | 0.84 | 0.59 | 0.33 | 0.33 | 0.91 | 0.80 | 0.48 | 0.10 | 0.06 |

* SE.mean: the standard error of the mean; ** CI.mean: the half-width of the confidence interval of the mean at the 0.95 confidence level; *** Var.: the variance; **** Std.dev.: the standard deviation; ***** Coef.var: the coefficient of variation, defined as the standard deviation divided by the mean.
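The summary statistics in Table 1 are linked by simple formulas, which allows a quick consistency check of the printed values. The sketch below is illustrative only: the function name `describe` and the toy sample are hypothetical, the use of the sample variance (n - 1 denominator) is an assumption, and the 1.976 multiplier is an approximate two-sided 95% Student t quantile for 152 degrees of freedom rather than a value stated by the authors.

```python
import math
import statistics

def describe(values):
    """Return Table 1 style summary statistics for a list of concentrations."""
    n = len(values)
    mean = statistics.mean(values)
    var = statistics.variance(values)   # sample variance (n - 1 denominator, assumed)
    std = math.sqrt(var)
    return {
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "median": statistics.median(values),
        "mean": mean,
        "se_mean": std / math.sqrt(n),  # standard error of the mean
        "var": var,
        "std_dev": std,
        "coef_var": std / mean,         # standard deviation divided by the mean
    }

# Hypothetical toy sample, not the study data.
toy = [4.0, 8.0, 15.0, 16.0, 23.0, 42.0]
summary = describe(toy)

# Consistency check against the printed ICP-AES As column (n = 153).
se_as = 51.60 / math.sqrt(153)   # standard error from the printed Std.dev.
cv_as = 51.60 / 74.48            # coefficient of variation from Std.dev. and Mean
ci_as = 1.976 * se_as            # assumed t quantile times the standard error
```

With the printed standard deviation of 51.60 and n = 153, `se_as`, `cv_as`, and `ci_as` round to 4.17, 0.69, and 8.24, which agree with the As column of Table 1.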


Table 2. Correlation coefficient (r) and root mean-squared error (RMSE) between the measured and predicted values using the validation data (n=30, determined by ICP-AES) at different sampling densities for training (n=30 to 107, determined by ICP-AES) for each model: OK (ordinary kriging), OCK (ordinary co-kriging), and GMM (Gaussian mixture model).

| Sampling density | Prediction method | As r | As RMSE | As r_GMM - r_geost.* | As RMSE_GMM - RMSE_geost.** | Pb r | Pb RMSE | Pb r_GMM - r_geost.* | Pb RMSE_GMM - RMSE_geost.** |
|---|---|---|---|---|---|---|---|---|---|
| 30 | OK | 0.59 | 0.28 | 0.27 | -0.05 | 0.46 | 0.37 | 0.46 | -0.2 |
| 30 | OCK | 0.64 | 0.26 | 0.22 | -0.03 | 0.6 | 0.5 | 0.32 | -0.33 |
| 30 | GMM | 0.86 | 0.23 | | | 0.92 | 0.17 | | |
| 49 | OK | 0.61 | 0.28 | 0.31 | -0.11 | 0.52 | 0.34 | 0.39 | -0.17 |
| 49 | OCK | 0.66 | 0.26 | 0.26 | -0.09 | 0.61 | 0.31 | 0.3 | -0.14 |
| 49 | GMM | 0.92 | 0.17 | | | 0.91 | 0.17 | | |
| 61 | OK | 0.67 | 0.25 | 0.26 | -0.09 | 0.54 | 0.33 | 0.43 | -0.25 |
| 61 | OCK | 0.72 | 0.24 | 0.21 | -0.08 | 0.62 | 0.31 | 0.35 | -0.23 |
| 61 | GMM | 0.93 | 0.16 | | | 0.97 | 0.08 | | |
| 76 | OK | 0.73 | 0.24 | 0.11 | -0.01 | 0.58 | 0.32 | 0.38 | -0.2 |
| 76 | OCK | 0.76 | 0.22 | 0.08 | 0.01 | 0.66 | 0.29 | 0.3 | -0.17 |
| 76 | GMM | 0.84 | 0.23 | | | 0.96 | 0.12 | | |
| 91 | OK | 0.73 | 0.24 | 0.22 | -0.1 | 0.61 | 0.31 | 0.37 | -0.23 |
| 91 | OCK | 0.77 | 0.22 | 0.18 | -0.08 | 0.68 | 0.29 | 0.3 | -0.21 |
| 91 | GMM | 0.95 | 0.14 | | | 0.98 | 0.08 | | |
| 107 | OK | 0.75 | 0.23 | 0.21 | -0.1 | 0.62 | 0.31 | 0.35 | -0.2 |
| 107 | OCK | 0.79 | 0.21 | 0.17 | -0.08 | 0.68 | 0.28 | 0.29 | -0.17 |
| 107 | GMM | 0.96 | 0.13 | | | 0.97 | 0.11 | | |

* r_GMM - r_geost.: performance comparison (r) between GMM and the geostatistical approach (OK or OCK).

** RMSE_GMM - RMSE_geost.: performance comparison (RMSE) between GMM and the geostatistical approach (OK or OCK).
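The two scores in Table 2 and the two difference columns can be reproduced with a few lines of code. The sketch below is a minimal illustration: the function names `rmse`, `pearson_r`, and `gain_over` are hypothetical, and the only data used are the printed As scores at a sampling density of 49.

```python
import math

def rmse(measured, predicted):
    """Root mean-squared error between paired measured and predicted values."""
    return math.sqrt(
        sum((m - p) ** 2 for m, p in zip(measured, predicted)) / len(measured)
    )

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Printed Table 2 scores for As at a sampling density of 49: (r, RMSE).
scores = {"OK": (0.61, 0.28), "OCK": (0.66, 0.26), "GMM": (0.92, 0.17)}

def gain_over(method):
    """Return (r_GMM - r_geost., RMSE_GMM - RMSE_geost.) for OK or OCK."""
    r_gmm, rmse_gmm = scores["GMM"]
    r_geo, rmse_geo = scores[method]
    return round(r_gmm - r_geo, 2), round(rmse_gmm - rmse_geo, 2)

gain_ok = gain_over("OK")
gain_ock = gain_over("OCK")
```

A positive difference in r and a negative difference in RMSE both indicate that GMM outperformed the geostatistical method. Here `gain_ok` is (0.31, -0.11) and `gain_ock` is (0.26, -0.09), matching the printed difference columns.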



    Economic and Environmental Geology

    pISSN 1225-7281
    eISSN 2288-7962