DOI: 10.18178/wcse.2025.06.036
Relevance-Based Data Selection for BERT-Based Anomaly Detection Using Unstructured Logs
Abstract— As software systems grow in scale and complexity, automated log-based anomaly detection has become essential for maintaining system reliability. However, machine learning- and deep learning-based detection methods typically require labeled training data, which is difficult to obtain given the vast volume and repetitive nature of the logs that large systems generate. Furthermore, log formats often evolve with system updates, so traditional log parsers are prone to errors that degrade anomaly detection performance. To address these challenges, this study proposes a robust anomaly detection system that processes unstructured logs directly. The system tokenizes logs with a BERT tokenizer and uses relevance-based selection and clustering to extract high-quality training data, amounting to less than 0.01% of the input, from millions of unlabeled logs. BERT is also used to capture both sequential and semantic information in the logs, enabling automated detection of normal and anomalous patterns. Experimental results show that the proposed method achieves an F1-score exceeding 0.96 across four supercomputer datasets and an F1-score above 0.91 when detecting previously unseen events.
Index Terms— anomaly detection, data selection, relevance, unstructured logs
Wei-Ting Chang, Ren-Hung Hwang, Jian-Liang Pan
National Yang Ming Chiao Tung University, Taiwan
Cite: Wei-Ting Chang, Ren-Hung Hwang, Jian-Liang Pan, "Relevance-Based Data Selection for BERT-Based Anomaly Detection Using Unstructured Logs", 2025 the 15th International Workshop on Computer Science and Engineering (WCSE 2025), pp. 224-234, Jeju Island, South Korea, June 28-30, 2025.
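
For readers unfamiliar with the metric reported in the abstract, the F1-score is the harmonic mean of precision and recall. The following is a minimal sketch of the standard definition only; the function name and the counts are hypothetical illustrations and are not taken from the paper's experiments.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return the F1-score (harmonic mean of precision and recall).

    tp: anomalies correctly flagged (true positives)
    fp: normal logs wrongly flagged (false positives)
    fn: anomalies that were missed (false negatives)
    """
    if tp == 0:
        # No correct detections: precision and recall are both 0 (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical confusion counts, NOT results from the paper.
score = f1_score(tp=96, fp=4, fn=4)
print(f"F1 = {score:.2f}")
```

Because F1 penalizes both missed anomalies and false alarms, it is a common choice for log datasets in which anomalous entries are far rarer than normal ones.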
