ON EVALUATING MACHINE LEARNING APPROACHES FOR EFFICIENT CLASSIFICATION OF TRAFFIC PATTERNS

This is a Master's thesis from Blekinge Tekniska Högskola (Blekinge Institute of Technology), Department of Computer Science and Engineering

Abstract: Context. With the increased use of mobile devices and the internet, cellular network traffic has grown tremendously. This growth has led to more frequent communication failures among network nodes. Each communication failure between nodes is defined as a bad event, and one bad event can trigger several consecutive bad events. Together, these bad events may eventually lead to node failures, in which a node can no longer respond to any data requests. Implementing workarounds for node failures costs telecom companies considerable human effort and money, so there is a need to prevent node failures from happening. This can be done by classifying the traffic patterns between nodes in the network, identifying the bad events in them and delivering the verdict immediately after they are detected. Objectives. Through this study, we aim to find the most suitable machine learning algorithm for efficiently classifying the traffic patterns of SGSN-MME, where SGSN stands for Serving GPRS (General Packet Radio Service) Support Node and MME for Mobility Management Entity. SGSN-MME is a network management tool designed to support the functionalities of two nodes, namely SGSN and MME. We do this by evaluating the classification performance of four machine learning classification algorithms, namely Support vector machines (SVMs), Naïve Bayes, Decision trees and Random forests, on the traffic patterns of SGSN and MME. The selected classification algorithm will be developed so that, whenever it detects a bad event, it notifies the user with the message “Something bad is happening”. Methods. We conducted an experiment to evaluate the classification performance of the four chosen classification algorithms on a dataset provided by Ericsson AB, Gothenburg.
The experimental dataset is a combination of three logs: one represents the traffic patterns in a real network, and the other two contain synthetic traffic patterns that were generated manually. The dataset is unlabeled and contains 720 data instances and 4019 attributes. K-means clustering is performed to divide the data instances into groups, which are then labeled as good or bad events. Also, since the number of attributes in the experimental dataset exceeds the number of instances, feature selection is performed to select the subset of relevant attributes that best represents the whole data. All the chosen classification algorithms are trained and tested with ten-fold cross-validation using the selected subset of attributes, and the resulting performance measures (classification accuracy, F1 score and training time) are analyzed and compared to select the most suitable algorithm. Finally, the chosen algorithm is tested on unlabeled real data, and the performance measures are analyzed to check whether it is able to detect the bad events correctly. Results. Experimental results showed that, in terms of classification accuracy and F1 score, Random forests outperformed Support vector machines, Naïve Bayes and Decision trees, with an average classification accuracy of 99.72% and an average F1 score of 99.6. In terms of training time, on the other hand, Naïve Bayes outperformed Support vector machines, Decision trees and Random forests, with an average training time of 0.010 seconds. Also, the classification accuracy and F1 score of Random forests on unlabeled data were found to be 100% and 100 respectively. Conclusions. Since our study focuses on classifying the traffic patterns of SGSN-MME accurately, classification accuracy and F1 score are more important than the training time of the algorithm.
Therefore, based on the experimental results, we conclude that Random forests is the most suitable machine learning algorithm for classifying the traffic patterns of SGSN-MME. However, Naïve Bayes can also be used if classification has to be performed in the least time possible and moderate accuracy (around 70%) is acceptable.
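
The experimental procedure summarized in the abstract (K-means labeling, feature selection, then ten-fold cross-validation of four classifiers) could look roughly like the following Python sketch. It is an illustration only, not the thesis code: the use of scikit-learn, the synthetic `make_blobs` data standing in for the Ericsson dataset, the ANOVA-based `SelectKBest` selector, the value `k=10`, and all hyperparameters are assumptions, since the abstract does not name the tools or settings used. Deciding which cluster corresponds to "bad" events is also left out, because the abstract does not describe how that mapping was made.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: small synthetic data in place of the real, unlabeled dataset
# (which has 720 instances and 4019 attributes).
X, _ = make_blobs(n_samples=200, n_features=50, centers=2, random_state=0)

# Step 1: cluster the unlabeled instances into two groups and use the
# cluster ids (0/1) as provisional class labels.
y = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# The four classifier families compared in the study (default settings assumed).
candidates = {
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Steps 2 and 3: feature selection followed by ten-fold cross-validation.
# Selection is placed inside the pipeline so it is refit on each training fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
results = {}
for name, clf in candidates.items():
    model = make_pipeline(SelectKBest(f_classif, k=10), clf)
    scores = cross_validate(model, X, y, cv=cv, scoring=("accuracy", "f1"))
    results[name] = {
        "accuracy": scores["test_accuracy"].mean(),
        "f1": scores["test_f1"].mean(),
        # Mean time to fit the whole pipeline (selector + classifier) per fold.
        "fit_time_s": scores["fit_time"].mean(),
    }

for name, r in results.items():
    print(f"{name}: acc={r['accuracy']:.3f} f1={r['f1']:.3f} fit={r['fit_time_s']:.4f}s")
```

Note that `fit_time_s` here includes the feature selection step, and that scikit-learn reports F1 on a 0 to 1 scale rather than the 0 to 100 scale quoted in the abstract. The numbers printed for this synthetic data say nothing about the results reported in the thesis.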
