Special Research Paper on “Applications of Data Science and Artificial Intelligence in Economic and Environmental Geology”

Econ. Environ. Geol. 2024; 57(5): 539-550

Published online October 29, 2024

https://doi.org/10.9719/EEG.2024.57.5.539

© THE KOREAN SOCIETY OF ECONOMIC AND ENVIRONMENTAL GEOLOGY

Development of a Large-scale Korean Language Model in the Field of Geosciences

Sang-ho Lee*

Mineral Resources Division, Korea Institute of Geoscience and Mineral Resources, Daejeon 34132, Republic of Korea

Correspondence to : *energy@kigam.re.kr

Received: August 30, 2024; Revised: October 8, 2024; Accepted: October 10, 2024

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided original work is properly cited.

Abstract

With the rapid development and commercialization of large-scale generative language models, concerns have emerged regarding the appropriateness of model outputs, their domain expertise, and data security. In particular, Korean generative language models specialized in geoscience have not yet been studied, owing to the difficulty of collecting and preprocessing suitable data and the lack of prior development cases. This study carried out the entire process of developing a Korean language model specialized in geoscience and evaluated its applicability in related fields. To this end, academic data related to geoscience were collected and preprocessed to create a dataset suitable for language model training, and the dataset was used to pre-train and fine-tune the Llama 2 model. The trained model was quantitatively evaluated using 19 evaluation datasets from various fields. The results show improved performance in scientific question answering and Korean text comprehension compared with the original model. The language model developed in this study can enhance research productivity in geoscience, for example by supporting idea generation, and its outcomes are expected to stimulate further research on, and utilization of, generative language models in geoscience.

Keywords large language model, generative model, natural language processing, artificial intelligence, geoscience

지질과학 분야 한국어 대규모 언어 모델 개발

이상호*

한국지질자원연구원 광물자원연구본부 선임연구원

요 약

최근 대규모 생성형 언어 모델의 급격한 발달과 상용화가 이루어지면서 모델 출력의 적정성, 전문성 문제 및 데이터 보안 문제가 제기되고 있다. 특히 지질과학 유관 분야에서는 가공된 자료 및 전처리의 어려움과 개발 사례의 부족으로 인해 해당 분야에 특화된 한국어 언어 모델 개발은 아직 진행된 사례가 없다. 이에 따라 본 연구에서는 지질과학 분야에 특화된 한국어 언어 모델 개발을 위한 전반적인 과정을 수행하고 이를 평가함으로써 유관 분야에서의 적용 가능성을 알아보고자 하였다. 이를 위하여 지질과학 유관 분야의 학술 자료를 수집하고 전처리하여 언어 모델의 학습에 적합한 자료를 준비하고, 이를 Llama 2 모델에 적용하여 사전학습 및 미세조정을 수행하였다. 학습된 모델은 19종의 분야별 평가용 데이터셋을 이용하여 정량적으로 평가하였으며, 그 결과 원본 모델 대비 과학 관련 질의응답 및 한국어 지문 해석 관련 기능이 향상된 것으로 나타났다. 본 연구를 통해 개발된 언어 모델은 유관 분야에서 아이디어 창출과 같은 연구 생산성 제고에 기여할 수 있으며, 향후 언어 모델을 활용한 연구 및 활용을 활성화할 수 있을 것으로 기대된다.

주요어 대규모 언어 모델, 생성형 모델, 자연어 처리, 인공지능, 지질과학

Figure 1. Evaluation results at each learning stage of the pre-training model, using 9 evaluation datasets. (a) 0-shot; (b) 5-shot.
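The 0-shot/5-shot distinction in the figures refers to how many worked examples precede the test question in the evaluation prompt. A minimal sketch of few-shot prompt construction (illustrative only; the function name, prompt template, and example questions are assumptions, not the paper's evaluation harness):

```python
def build_prompt(question: str, shots: list[tuple[str, str]]) -> str:
    """Prepend worked (question, answer) examples before the test question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in shots]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

# 0-shot: no examples; 5-shot: five examples drawn from the dataset's train split.
zero_shot = build_prompt("What mineral is the main ore of aluminum?", [])
five_shot = build_prompt("What mineral is the main ore of aluminum?",
                         [("What is the hardest natural mineral?", "Diamond")] * 5)
print(zero_shot.count("Q:"), five_shot.count("Q:"))  # 1 6
```

In likelihood-based evaluation, the model's score for each candidate answer is computed given such a prompt, and the highest-scoring candidate is taken as the prediction.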

Figure 2. Comparison of 0-shot scores using 9 evaluation datasets for models trained with each instruction tuning dataset.

Figure 3. Comparison of epoch-by-epoch evaluation results using 9 evaluation datasets for alignment tuning models. (a) 0-shot; (b) 5-shot.

Figure 4. Average score of the final evaluation results using 19 evaluation datasets.

Figure 5. Relative comparison of the final evaluation results of each stage model, with the original base model's score set to 1.

Figure 6. Comparison of evaluation results on science-related datasets by model.

Table 1. Hyperparameters and time required for the pre-training.

Hyperparameter / variable       Value
Learning rate                   1e-5
Learning rate scheduler type    constant with warmup
Warmup steps                    100
Torch data type                 bfloat16
Per-device batch size           4
Gradient accumulation steps     8
Training steps                  5,506
Training runtime (hr)           182.4
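The values in Table 1 map directly onto a standard trainer configuration. The sketch below collects them in a plain config object and derives the effective batch size (4 × 8 = 32 samples per optimizer step) and the average time per step; the class and property names are illustrative, not taken from the paper's code.

```python
from dataclasses import dataclass

# Pre-training configuration from Table 1. A real run would pass these values
# to a trainer (e.g. Hugging Face's `transformers.TrainingArguments`).
@dataclass
class PretrainConfig:
    learning_rate: float = 1e-5
    lr_scheduler_type: str = "constant_with_warmup"
    warmup_steps: int = 100
    torch_dtype: str = "bfloat16"
    per_device_batch_size: int = 4
    gradient_accumulation_steps: int = 8
    training_steps: int = 5506
    training_runtime_hr: float = 182.4

    @property
    def effective_batch_size(self) -> int:
        # Samples consumed per optimizer step: gradients are accumulated over
        # 8 micro-batches of 4 before each update.
        return self.per_device_batch_size * self.gradient_accumulation_steps

cfg = PretrainConfig()
print(cfg.effective_batch_size)  # 32
print(round(cfg.training_runtime_hr / cfg.training_steps * 3600, 1))  # ≈ 119.3 s/step
```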

Table 2. Hyperparameters and time required for the fine-tuning.

Hyperparameter / variable        Instruction tuning      Alignment tuning
Learning rate                    3e-5                    3e-5
Learning rate scheduler type     cosine                  cosine
Torch data type                  bfloat16                bfloat16
Per-device batch size            4                       4
Gradient accumulation steps      8                       8
Target modules                   q_proj, v_proj          q_proj, v_proj
Training epochs                  5                       5
Training steps per epoch         363                     484
Training runtime per epoch (hr)  5.91                    21.4
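The target-module row in Table 2 (q_proj, v_proj) indicates parameter-efficient fine-tuning of the attention projections, as in LoRA. A minimal sketch of such a setup, assuming the Hugging Face `peft` and `transformers` libraries; the checkpoint name and output path are placeholders, not the ones used in the study:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

# Placeholder base checkpoint; the paper fine-tunes a Llama 2 model.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Adapters are attached only to the attention query/value projections (Table 2).
lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)

# Trainer arguments mirroring the instruction-tuning column of Table 2.
args = TrainingArguments(
    output_dir="out",
    learning_rate=3e-5,
    lr_scheduler_type="cosine",
    bf16=True,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    num_train_epochs=5,
)
```

The alignment-tuning stage would use the same adapter configuration with a preference-optimization trainer (DPO) instead of a plain supervised trainer.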

Table 3. Examples of responses to the same prompt entered into 3 different models.


Table 4. Datasets used for the evaluation of the language model.


Table 5. Final quantitative evaluation results for each model.

Dataset         Raw      Pretrained  Instruct  DPO-1    DPO-2    DPO-3    DPO-4    DPO-5
ARC-C           0.4130   0.4172      0.4445    0.4531   0.4599   0.4642   0.4616   0.4642
ARC-E           0.6886   0.7012      0.7247    0.7319   0.7311   0.7315   0.7298   0.7298
BoolQ           0.7789   0.7495      0.7679    0.7927   0.7927   0.7969   0.8064   0.8034
COPA            0.8800   0.8700      0.8900    0.8900   0.8900   0.8800   0.8800   0.8800
AP-Geo-K        0.2889   0.2853      0.3118    0.3168   0.3118   0.3111   0.3111   0.3125
HT-A            0.4605   0.4803      0.4671    0.4474   0.4671   0.4803   0.4737   0.4737
HT-CB           0.2708   0.2986      0.3542    0.3889   0.3819   0.3958   0.3889   0.4028
HT-CC           0.2500   0.2500      0.3400    0.3300   0.3400   0.3500   0.3900   0.3800
HT-ColP         0.2255   0.2451      0.2451    0.2941   0.2941   0.3235   0.3529   0.3333
HT-ConP         0.3787   0.3234      0.4000    0.3787   0.3872   0.3660   0.3660   0.3702
HT-HG           0.3687   0.3636      0.5303    0.5303   0.5354   0.5354   0.5404   0.5404
BoolQ-K         0.5677   0.5556      0.7934    0.7080   0.8405   0.8618   0.8618   0.8597
COPA-K          0.7970   0.7770      0.7860    0.7900   0.7930   0.7920   0.7920   0.7910
HellaSwag-K     0.5000   0.4780      0.5040    0.5000   0.4940   0.4980   0.4960   0.4960
SNeg-K          0.5340   0.7128      0.7053    0.7103   0.7305   0.7380   0.7456   0.7431
OBQA            0.3260   0.3020      0.3220    0.3180   0.3200   0.3240   0.3220   0.3240
PIQA            0.7802   0.7720      0.7824    0.7851   0.7818   0.7840   0.7840   0.7824
SciQ            0.9120   0.9220      0.9330    0.9430   0.9430   0.9460   0.9440   0.9420
WG              0.7040   0.7080      0.7040    0.7103   0.7088   0.7088   0.7064   0.7088
Averaged Score  0.5329   0.5375      0.5792    0.5799   0.5896   0.5941   0.5975   0.5967
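The "Averaged Score" row in Table 5 is the unweighted mean of the 19 per-dataset scores in each column. For instance, averaging the raw base model's column reproduces the reported 0.5329:

```python
# Per-dataset scores for the raw (untrained) base model, read down the
# "Raw" column of Table 5 in row order (ARC-C ... WG).
raw_scores = [
    0.4130, 0.6886, 0.7789, 0.8800, 0.2889, 0.4605, 0.2708, 0.2500,
    0.2255, 0.3787, 0.3687, 0.5677, 0.7970, 0.5000, 0.5340, 0.3260,
    0.7802, 0.9120, 0.7040,
]
average = sum(raw_scores) / len(raw_scores)
print(round(average, 4))  # 0.5329, matching the table's Averaged Score
```

The same unweighted mean reproduces the other columns as well (e.g. 0.5375 for the pretrained model), confirming that no per-dataset weighting is applied.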
