Impact of log file processing on learning speed and defect classification accuracy

Anton Kaiafiuk

doi:10.30857/2786-5371.2025.2.2

Authors

Anton Kaiafiuk, Kyiv National University of Technologies and Design, Ukraine

DOI:

https://doi.org/10.30857/2786-5371.2025.2.2

Keywords:

regular expressions, lemmatisation, vectorisation, machine learning, test automation

Abstract

The purpose of the study was to investigate the effect of preprocessing automated-testing log files on the speed of vectorisation and training of machine learning models. The HDFS_v3_TraceBench dataset was used, which contains more than 370 thousand traces collected in a Hadoop Distributed File System environment. Preprocessing included noise removal, lemmatisation, and duplicate reduction. The data were vectorised using the term frequency-inverse document frequency (TF-IDF) method, and a RandomForestClassifier model was then trained. The experimental results showed that optimising the input data reduced the total processing time almost fivefold. Both text vectorisation and model training took less time, which sped up work with large volumes of logs. At the same time, classification accuracy was not only preserved but also improved slightly: the F1-score and the Matthews correlation coefficient remained consistently high. The Log Loss value also decreased, indicating that the model became more confident in its predictions. This is especially important for the imbalanced classes that are characteristic of defect classification problems. A detailed analysis showed that a significant part of the service and repetitive information in the logs is not critical for training the model; on the contrary, removing it improves the quality of data preparation. The study also confirmed that the resulting target labels for the logs correspond to typical error classes. The implemented log file processing therefore not only reduces computational costs but also maintains or improves prediction quality. These results confirmed the feasibility of including a log cleaning and optimisation step in the overall process of building machine learning models for automated testing. The results can be integrated into automated pipelines for classifying defects and generating bug reports. This will help to reduce the amount of manual labour and increase the efficiency of teams.
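
The pipeline summarised in the abstract (regular-expression noise removal, lemmatisation, duplicate reduction, TF-IDF vectorisation, a RandomForestClassifier, and scoring with F1, the Matthews correlation coefficient, and Log Loss) might look roughly like the Python sketch below. It is an illustration under stated assumptions, not the study's implementation: the noise patterns, the placeholder tokens, the optional `lemmatise` hook, the macro-averaged F1, and the function names `clean_line`, `preprocess_trace`, and `train_and_score` are all hypothetical. The scikit-learn calls are real, but their parameters here are left at defaults rather than taken from the paper.

```python
import re

# Hypothetical noise patterns; the study's actual regular expressions are not given.
NOISE_PATTERNS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?"), " "),  # timestamps
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}(?::\d+)?\b"), " ip "),  # IPv4 address, optional port
    (re.compile(r"\bblk_-?\d+\b"), " blk "),  # HDFS block identifiers
    (re.compile(r"\b\d+\b"), " num "),  # any remaining standalone numbers
]


def clean_line(line, lemmatise=None):
    """Replace volatile fields with placeholders, lowercase, and normalise whitespace."""
    for pattern, replacement in NOISE_PATTERNS:
        line = pattern.sub(replacement, line)
    tokens = line.lower().split()
    if lemmatise is not None:
        # Assumed hook: any callable mapping a token to its lemma can be plugged in here.
        tokens = [lemmatise(token) for token in tokens]
    return " ".join(tokens)


def preprocess_trace(lines, lemmatise=None):
    """Clean every line of one trace and keep only the first copy of each cleaned line."""
    seen = set()
    unique = []
    for line in lines:
        cleaned = clean_line(line, lemmatise)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    return " ".join(unique)


def train_and_score(train_docs, train_labels, test_docs, test_labels):
    """Fit TF-IDF + random forest on preprocessed traces and report the three metrics."""
    # scikit-learn is imported lazily so the helpers above run without it installed.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import f1_score, log_loss, matthews_corrcoef
    from sklearn.pipeline import make_pipeline

    model = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=0))
    model.fit(train_docs, train_labels)
    predicted = model.predict(test_docs)
    probabilities = model.predict_proba(test_docs)
    return {
        # Macro averaging is an assumption; the abstract does not state the averaging mode.
        "f1_macro": f1_score(test_labels, predicted, average="macro"),
        "mcc": matthews_corrcoef(test_labels, predicted),
        "log_loss": log_loss(test_labels, probabilities, labels=model.classes_),
    }


if __name__ == "__main__":
    trace = [
        "2015-10-18 18:01:47,978 INFO Receiving block blk_-1608 from 10.250.19.102:54106",
        "2015-10-18 18:01:48,120 INFO Receiving block blk_7503 from 10.250.10.6:40524",
        "2015-10-18 18:01:49,001 ERROR Timeout after 30 seconds",
    ]
    print(preprocess_trace(trace))
```

Replacing timestamps, addresses, and block identifiers with fixed placeholders makes lines that differ only in those fields identical, so duplicate reduction can collapse them. That shrinks both the text passed to the vectoriser and the resulting vocabulary, which is one plausible source of the time savings the abstract reports.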

Author Biography

Anton Kaiafiuk, Kyiv National University of Technologies and Design, Ukraine

https://orcid.org/0009-0003-9917-0834
