Anomaly Detection in Log Files Using Machine Learning Techniques

This is a Master's (Magister) thesis from Blekinge Tekniska Högskola/Fakulteten för datavetenskaper

Abstract: Context: Most larger computer systems today produce log files, which contain highly valuable information about the behavior of the system and are therefore consulted frequently when analyzing that behavior. Because some systems produce a very large number of log entries, however, it is extremely difficult to find relevant information in these files manually. Computer-based log analysis techniques are therefore indispensable for finding relevant data in log files.
Objectives: The main problem is to find important events in log files. Events in the test suite such as connection errors or disruptions are not considered abnormal events; rather, the events that cause a system interruption are considered abnormal. The goal is to use machine learning techniques to "learn" what the "expected" behavior of a particular test suite is. This means that the system must learn to distinguish, based on previous sequences, between a log file that contains an anomaly and one that does not.
Methods: Various algorithms are implemented and their performance is compared with that of other existing algorithms. The algorithms are run on a parsed set of labeled log files and are evaluated in an experiment by analyzing the anomalous events contained in those files. The algorithms used were Local Outlier Factor, Random Forest, and Term Frequency Inverse Document Frequency. Clustering with KMeans and PCA is then used to gain insights from the data by observing groups of data points in order to find the anomalous events.
Results: The experiment, which is discussed in detail in the thesis, shows that the Term Frequency Inverse Document Frequency method finds the anomalous events in the data better than the other two approaches.
Conclusions: The results will help developers find anomalous events without manually reading the log file row by row. The model reports the events that behave differently from the rest of the events in the log and that cause the system to interrupt.
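The abstract does not say how Term Frequency Inverse Document Frequency weights are turned into an anomaly ranking, so the following is only a minimal sketch of one plausible reading, not the thesis's implementation. It assumes each log line is treated as a document, uses a smoothed IDF, and scores a line by the sum of its TF-IDF weights, so lines made of rare tokens rank highest. The function names, the tokenizer, and the sample log lines are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(line):
    """Lowercase a log line and keep alphabetic tokens only (numbers and IPs are dropped)."""
    return re.findall(r"[a-z]+", line.lower())


def tfidf_scores(lines):
    """Return one score per line: the sum of its TF-IDF weights.

    Assumption: TF is the relative frequency of a token within the line,
    so the score equals the average IDF of the tokens in that line.
    """
    docs = [tokenize(line) for line in lines]
    n_docs = len(docs)
    # Document frequency: in how many lines each token appears at least once.
    df = Counter(token for doc in docs for token in set(doc))
    scores = []
    for doc in docs:
        if not doc:
            scores.append(0.0)
            continue
        tf = Counter(doc)
        score = 0.0
        for token, count in tf.items():
            # Smoothed IDF (an assumption): log((1 + N) / (1 + df)) + 1
            idf = math.log((1 + n_docs) / (1 + df[token])) + 1
            score += (count / len(doc)) * idf
        scores.append(score)
    return scores


# Hypothetical sample log: routine connection events plus one unusual line.
log_lines = [
    "INFO connection established to host 10.0.0.1",
    "INFO connection established to host 10.0.0.2",
    "INFO connection closed by host 10.0.0.1",
    "INFO connection established to host 10.0.0.3",
    "ERROR kernel panic watchdog timeout",
    "INFO connection closed by host 10.0.0.2",
]

scores = tfidf_scores(log_lines)
# Index of the line with the highest score, i.e. the most unusual vocabulary.
most_anomalous = max(range(len(scores)), key=scores.__getitem__)
print(log_lines[most_anomalous])
```

In this sketch the routine connection lines share most of their tokens and receive low scores, while the line built entirely from tokens seen nowhere else receives the highest one. A real pipeline would work on parsed event templates rather than raw lines and would need a threshold chosen against the labeled data.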
