Evaluating Random Forest and k-Nearest Neighbour Algorithms on Real-Life Data Sets

This is a Bachelor's thesis from KTH/School of Electrical Engineering and Computer Science (EECS)

Abstract: Computers can be used to classify various types of data, for example to filter email messages, detect computer viruses, detect diseases, etc. This thesis explores two classification algorithms, random forest and k-nearest neighbour, to understand how accurately and how quickly they classify data. A literature study was conducted to identify the various prerequisites and to find suitable data sets. Five data sets, leukemia, credit card, heart failure, mushrooms and breast cancer, were gathered and classified by each algorithm. Both a train/test split and 4-fold cross-validation were used for each data set. The Rust library SmartCore, which includes numerous classification methods and tools, was used to perform the classification. The results indicated that the train/test split yielded better classification results than 4-fold cross-validation. However, it could not be determined whether any attribute of a data set affects the classification accuracy. Random forest achieved the best classification results on the two data sets heart failure and leukemia, whilst k-nearest neighbour achieved the best classification results on the remaining three data sets. In general, the classification results of the two algorithms were similar.

Based on the results, the execution time of random forest depended on the number of trees in the "forest": a greater number of trees resulted in a longer execution time. In contrast, a higher k value did not increase the execution time of k-nearest neighbour. It was also found that data sets with only binary values (0 and 1) ran much faster under random forest than data sets with arbitrary values. A larger number of instances in a data set also led to a longer execution time for random forest, despite a small number of features. The same applied to k-nearest neighbour, but the number of features also affected its execution time, since time is needed to compute distances between data points. Random forest achieved the fastest execution time on the two data sets credit card and mushrooms, whilst k-nearest neighbour executed faster on the remaining three data sets. The difference in execution time between the algorithms varied considerably, depending on the parameter values chosen for the respective algorithms.
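The claim that k-nearest neighbour's execution time grows with both the number of instances and the number of features can be illustrated with a minimal, hypothetical sketch in plain Rust (std only; the thesis itself used SmartCore, whose API is not reproduced here). Predicting one query point requires a distance computation against every training instance, and each distance costs time proportional to the number of features:

```rust
use std::collections::HashMap;

// Euclidean distance between two feature vectors: O(number of features).
fn euclidean_distance(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

/// Classify `query` by majority vote among its `k` nearest training points.
/// One prediction touches every training instance, so the cost is
/// O(instances * features) before the sort.
fn knn_predict(train_x: &[Vec<f64>], train_y: &[u8], query: &[f64], k: usize) -> u8 {
    // Distance from the query to every training instance.
    let mut dists: Vec<(f64, u8)> = train_x
        .iter()
        .zip(train_y)
        .map(|(x, &y)| (euclidean_distance(x, query), y))
        .collect();
    // Sort by distance and count the labels of the k closest points.
    dists.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap());
    let mut votes: HashMap<u8, usize> = HashMap::new();
    for &(_, label) in dists.iter().take(k) {
        *votes.entry(label).or_insert(0) += 1;
    }
    // Return the majority label.
    votes.into_iter().max_by_key(|&(_, n)| n).map(|(l, _)| l).unwrap()
}

fn main() {
    // Tiny made-up two-class example.
    let train_x = vec![vec![0.0, 0.0], vec![0.1, 0.2], vec![1.0, 1.0], vec![0.9, 1.1]];
    let train_y = vec![0u8, 0, 1, 1];
    let label = knn_predict(&train_x, &train_y, &[0.95, 1.0], 3);
    println!("{}", label); // prints 1: the three closest points are mostly class 1
}
```

Note that changing k only changes how many sorted entries are inspected, which is consistent with the observation that a higher k value did not increase k-nearest neighbour's execution time.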
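For readers unfamiliar with the 4-fold cross-validation used alongside the train/test split, the index bookkeeping can be sketched as follows. This is a hypothetical, std-only illustration, not SmartCore's model-selection code: each fold serves once as the test set while the remaining three folds form the training set, and the reported accuracy would be the mean over the four folds.

```rust
/// Split `n_samples` indices into `k` (train, test) index pairs.
/// Each sample appears in exactly one test fold.
fn k_fold_indices(n_samples: usize, k: usize) -> Vec<(Vec<usize>, Vec<usize>)> {
    let fold_size = n_samples / k;
    (0..k)
        .map(|fold| {
            let start = fold * fold_size;
            // The last fold absorbs any remainder so every sample is tested once.
            let end = if fold == k - 1 { n_samples } else { start + fold_size };
            let test: Vec<usize> = (start..end).collect();
            let train: Vec<usize> =
                (0..n_samples).filter(|i| *i < start || *i >= end).collect();
            (train, test)
        })
        .collect()
}

fn main() {
    // With 10 samples and 4 folds, three folds test 2 samples and the last tests 4.
    for (fold, (train, test)) in k_fold_indices(10, 4).iter().enumerate() {
        println!("fold {}: {} train / {} test samples", fold, train.len(), test.len());
    }
}
```

Because every instance is classified once as test data, cross-validation runs the classifier k times, which is one reason it can trade extra execution time for a less optimistic accuracy estimate than a single train/test split.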
