

Study on the Effect of Training Data Sampling Strategy on the Accuracy of the Landslide Susceptibility Analysis using Random Forest Method
Random Forest 기법을 이용한 산사태 취약성 평가 시 훈련 데이터 선택이 결과 정확도에 미치는 영향
Econ. Environ. Geol. 2019 Apr;52(2):199-212
Published online April 30, 2019.
Copyright © 2019 The Korean Society of Economic and Environmental Geology.

Kyoung-Hee Kang and Hyuck-Jin Park*
강경희 · 박혁진*

Dept. of Geoinformation Engineering, Sejong University, Seoul, Korea
세종대학교 지구정보공학과
Received February 1, 2019; Revised March 5, 2019; Accepted March 10, 2019.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
In machine learning, the sampling strategy used for the training data affects the performance of the prediction model, including its generalization ability as well as its prediction accuracy. In landslide susceptibility analysis in particular, data sampling is an essential step in preparing the training data because non-landslide points far outnumber landslide points. However, previous studies did not consider alternative sampling methods; they selected the training data randomly. In this study, the authors therefore proposed several different sampling methods and assessed the effect of the training data sampling strategy on landslide susceptibility analysis. To this end, a total of six scenarios were set up based on the sampling strategies for landslide points and non-landslide points. The Random Forest model was then trained under each of the six scenarios, and the attribute importance of each input variable was evaluated. Subsequently, landslide susceptibility maps were produced using the input variables and their attribute importances. The AUC values of the landslide susceptibility maps obtained from the six sampling strategies showed high prediction rates, ranging from 70% to 80%. This indicates that the Random Forest technique shows appropriate predictive performance and that the attribute importances obtained from Random Forest can be used as weights for the landslide conditioning factors in the susceptibility analysis. In addition, the results obtained using specific sampling strategies for the training data show higher prediction accuracy than those obtained using the conventional random sampling method.
Keywords: landslide susceptibility, machine learning, Random Forest, training data, sampling strategy
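
The abstract does not describe the six sampling scenarios in detail, so the following is only a minimal sketch of the conventional baseline it refers to: randomly undersampling the much larger set of non-landslide points to match the landslide points, and scoring a susceptibility map with a rank-based AUC. The function names (`undersample`, `auc`), the 1:1 ratio, and the toy cell identifiers are assumptions made for illustration and are not taken from the paper.

```python
import random


def undersample(landslide, non_landslide, ratio=1.0, seed=0):
    """Return labelled training samples: all landslide cells (label 1)
    plus a random subset of non-landslide cells (label 0).

    `ratio` is the assumed number of non-landslide samples per
    landslide sample (1.0 gives a balanced training set).
    """
    rng = random.Random(seed)
    n_neg = min(len(non_landslide), round(ratio * len(landslide)))
    negatives = rng.sample(non_landslide, n_neg)
    samples = [(cell, 1) for cell in landslide]
    samples += [(cell, 0) for cell in negatives]
    rng.shuffle(samples)
    return samples


def auc(scores, labels):
    """Area under the ROC curve, computed as the fraction of
    (landslide, non-landslide) pairs in which the landslide cell
    receives the higher susceptibility score (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy inventory: 20 landslide cells and 2000 non-landslide cells.
landslide_cells = list(range(20))
non_landslide_cells = list(range(20, 2020))
training = undersample(landslide_cells, non_landslide_cells, ratio=1.0, seed=42)
```

A different sampling scenario would replace only the line that draws `negatives` (or the landslide cells themselves), leaving the AUC evaluation unchanged, which is what makes the scenarios comparable.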

