A Comparative Study on the Effects of Removing the Most Important Feature on Random Forest and Support Vector Machine

This is a Bachelor's thesis from KTH/School of Electrical Engineering and Computer Science (EECS)

Authors: Henrik Åkesson; Hampus Fridlund; [2023]


Abstract: Machine learning (ML) models for classification are largely regarded as "black boxes", in that it is difficult to fully understand how a model reaches a decision and how changes to the input affect the output. Exploring the inner workings of classification models is therefore of interest, both for expanding the current knowledge base and for providing guidelines for choosing a classification model better suited to a specific problem. In this study we focus on how the classification performance of two classifiers, Support Vector Machine (SVM) and Random Forest (RF), is affected when the feature ranked as most important is removed from two datasets, with importance determined by two different methods: SHAP for SVM and Gini impurity for RF. The two models were first trained on the full datasets, then on the datasets with the most important feature removed. Removing the most important feature reduced the accuracy of both models, but the reduction was greater for SVM, while RF remained more stable. This may indicate that SVM is more dependent on the most important feature than RF. Consistent with a previous study, our results show that RF does not vary as much in accuracy as SVM when a subset of features is selected.
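The experiment described above can be sketched roughly as follows, assuming scikit-learn and the shap package. This is a minimal illustration, not the thesis's actual code: the breast-cancer dataset stands in for the datasets used in the study, and the SVM's SHAP values are computed on its decision function via the model-agnostic KernelExplainer.

    # Minimal sketch of the feature-ablation experiment (assumptions noted above).
    import numpy as np
    import shap
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Placeholder dataset; the thesis uses two other datasets.
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Random Forest: rank features by Gini (impurity-based) importance.
    rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    rf_top = int(np.argmax(rf.feature_importances_))

    # SVM: rank features by mean absolute SHAP value.
    svm = SVC().fit(X_train, y_train)
    background = shap.sample(X_train, 50, random_state=0)  # small background set keeps KernelExplainer tractable
    explainer = shap.KernelExplainer(svm.decision_function, background)
    shap_values = explainer.shap_values(X_test[:25])
    svm_top = int(np.argmax(np.abs(shap_values).mean(axis=0)))

    def score_without(model, feature):
        """Retrain with the given feature column removed and report test accuracy."""
        Xtr = np.delete(X_train, feature, axis=1)
        Xte = np.delete(X_test, feature, axis=1)
        return model.fit(Xtr, y_train).score(Xte, y_test)

    print("RF  accuracy, full vs. ablated:",
          rf.score(X_test, y_test),
          score_without(RandomForestClassifier(random_state=0), rf_top))
    print("SVM accuracy, full vs. ablated:",
          svm.score(X_test, y_test),
          score_without(SVC(), svm_top))

Comparing each model's accuracy before and after the ablation gives the kind of stability contrast the abstract reports, with RF expected to degrade less than SVM.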
