Active learning for text classification in cyber security

Detta är en Master-uppsats från KTH/Skolan för elektroteknik och datavetenskap (EECS)

Sammanfattning: In the domain of cyber security, machine learning promises advanced threat detection. However, the volume of available unlabeled data poses challenges for efficient data management. This study investigates the potential for active learning, a subset of interactive machine learning, to reduce the effort required for manual data labelling. Through different query strategies, the most informative unlabeled data points were selected for manual labelling. The performance of different query strategies was assessed by testing a transformer model’s ability to accurately distinguish tweets mentioning names of advanced persistent threats. The findings suggest that the K-means diversity-based query strategy outperformed both the uncertainty-based approach and the random data point selection, when the amount of labelled training data was limited. This study also evaluated the cost-effective active learning approach, which incorporates high-confidence data points into the training dataset. However, this was shown to be the least effective strategy. Lastly, the study acknowledges that the computational time taken for each query strategy varies significantly between strategies. Hence, an optimal query strategy selection requires a balanced consideration of F-score performance taken together with time efficiency.

  HÄR KAN DU HÄMTA UPPSATSEN I FULLTEXT. (följ länken till nästa sida)