Sequential Anomaly Detection for Log Data Using Deep Learning

This is a Master's thesis from the University of Gothenburg/Department of Mathematical Sciences

Abstract: Software development with continuous integration requires frequent testing to assess changes. Analyzing the test output manually is time-consuming, and automating this process could benefit an organization. The goal of this thesis project is to automate anomaly detection in software test output files provided by Volvo Group Trucks Technology. To achieve this, we evaluated four neural network architectures: two recurrent neural networks with long short-term memory (LSTM), one unidirectional and one bidirectional, and two autoencoders (an LSTM-based sequence-to-sequence model and a Transformer) that aim to reconstruct a sequence from the files. Two datasets were used to evaluate the performance of the architectures. The first is from the Hadoop Distributed File System (HDFS), a publicly available dataset in which all logs are labelled as either anomalous or non-anomalous. The second consists of log files from software testing provided by Volvo Group Trucks Technology, which contain no labels. The networks were evaluated in two settings when trained on the HDFS data. In the first setting, the logs labelled as anomalous were filtered out, making it a semi-supervised approach; in the second setting, the logs labelled as anomalous were kept, making it an unsupervised approach. Lastly, the networks were trained on the unlabelled data provided by Volvo Group Trucks Technology; the objective of this approach is to evaluate how the networks perform in an unsupervised setting.
In addition, an analysis of the size of the datasets used to train the networks was performed. The results show that, for the data provided by Volvo Group Trucks Technology, the size of the training dataset influenced anomaly detection performance: networks trained on a smaller dataset performed better than those trained on a larger one. For the HDFS dataset, a smaller training dataset was also better than a larger one in the unsupervised setting. However, for the HDFS data the semi-supervised approach outperformed the unsupervised setting regardless of the size of the training dataset.
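
The two HDFS training settings described in the abstract differ only in whether the sequences labelled as anomalous are removed before training. The following is a minimal sketch of that data-preparation step, not the thesis's actual code: the `(sequence, is_anomalous)` representation, the function name `build_training_set`, and the toy event IDs are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation: a log sequence as a list of event IDs,
# paired with its anomaly label.
LabelledSequence = Tuple[Sequence[int], bool]


def build_training_set(
    data: List[LabelledSequence], setting: str
) -> List[Sequence[int]]:
    """Return the sequences used for training under the given setting.

    "semi-supervised": sequences labelled anomalous are filtered out.
    "unsupervised":    all sequences are kept; labels are ignored.
    """
    if setting == "semi-supervised":
        return [seq for seq, is_anomalous in data if not is_anomalous]
    if setting == "unsupervised":
        return [seq for seq, _ in data]
    raise ValueError(f"unknown setting: {setting!r}")


# Toy data (made up for illustration): three sequences, one labelled anomalous.
toy_data: List[LabelledSequence] = [
    ([5, 5, 22, 11], False),
    ([5, 22, 9, 9], True),
    ([5, 5, 11, 26], False),
]

semi_supervised_train = build_training_set(toy_data, "semi-supervised")
unsupervised_train = build_training_set(toy_data, "unsupervised")
```

In both settings the labels are never passed to the model itself; under this sketch they are used only to decide which sequences enter the training set. Labels are used this way only for HDFS, since the Volvo Group Trucks Technology data is unlabelled.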
