Does the removal of correlated variables affect the classification accuracy of machine learning algorithms?

This is a Bachelor's thesis from Uppsala universitet/Statistiska institutionen (Uppsala University, Department of Statistics)

Abstract: The last decades have seen an increase in both the amount and the complexity of the data used in modern business and technology industries. A key element in managing these data sets is the use of machine learning algorithms to process structures and find patterns. Variable selection is used to facilitate and improve these processes by finding and removing redundant variables. One way to achieve this is to eliminate variables based on how strongly they correlate with each other, which is the premise of this thesis. This study examines how a reduction of correlated variables affects the predictive accuracy of six different machine learning algorithms. Two delimitations are made. First, the correlation between the explanatory variables is set to a high level. Second, each explanatory variable's correlation with the dependent variable is set to a modest level. The hypothesis states that removing highly correlated explanatory variables should not negatively affect accuracy. A Monte Carlo simulation with three models, each consisting of a different number of correlated variables, makes it possible to compare and evaluate the change in accuracy. The results suggest that accuracy decreases for all algorithms except one. The differences are relatively small, with the largest decrease being 5.49 percentage points. The conclusion is that the hypothesis does not hold when the explanatory variables have a modest level of correlation with the dependent variable.
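The kind of experiment described in the abstract can be illustrated with a small Monte Carlo sketch. It is not the thesis's implementation: the abstract does not name the six algorithms, the correlation levels, or the sample sizes, so everything below is an assumption chosen for illustration. The classifier is a nearest-centroid rule used as a stand-in, and the values `rho_x=0.9`, `rho_y=0.3`, `threshold=0.8`, and all function names are hypothetical.

```python
import numpy as np

def make_cov(n_vars, rho_x, rho_y):
    """Correlation matrix of (latent, x_1..x_p): the x's share pairwise
    correlation rho_x, and each x has correlation rho_y with the latent."""
    cov = np.full((n_vars + 1, n_vars + 1), rho_x)
    cov[0, :] = rho_y
    cov[:, 0] = rho_y
    np.fill_diagonal(cov, 1.0)
    return cov

def simulate(n_obs, cov, rng):
    """Draw predictors X and a binary class y (sign of the latent variable)."""
    draws = rng.multivariate_normal(np.zeros(cov.shape[0]), cov, size=n_obs)
    y = (draws[:, 0] > 0).astype(int)
    return draws[:, 1:], y

def drop_correlated(X, threshold):
    """Greedy filter: keep a column only if its absolute correlation with
    every column already kept is at most `threshold`."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return kept

def centroid_accuracy(X_train, y_train, X_test, y_test):
    """Nearest-centroid classifier (illustrative stand-in, not from the thesis)."""
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
    dists = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float((dists.argmin(axis=1) == y_test).mean())

def run(n_reps=50, n_obs=1000, n_vars=5, rho_x=0.9, rho_y=0.3,
        threshold=0.8, seed=0):
    """Average test accuracy with all variables and after correlation filtering."""
    rng = np.random.default_rng(seed)
    cov = make_cov(n_vars, rho_x, rho_y)
    full, reduced = [], []
    for _ in range(n_reps):
        X_tr, y_tr = simulate(n_obs, cov, rng)
        X_te, y_te = simulate(n_obs, cov, rng)
        kept = drop_correlated(X_tr, threshold)  # chosen on training data only
        full.append(centroid_accuracy(X_tr, y_tr, X_te, y_te))
        reduced.append(centroid_accuracy(X_tr[:, kept], y_tr, X_te[:, kept], y_te))
    return float(np.mean(full)), float(np.mean(reduced))

acc_full, acc_reduced = run()
print(f"all variables: {acc_full:.3f}, after filtering: {acc_reduced:.3f}")
```

Comparing the two averages over many replications mirrors the comparison in the abstract, where the change in accuracy is evaluated across models with different numbers of correlated variables. The size and sign of the difference in this sketch depend entirely on the assumed settings and say nothing about the thesis's reported figures.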
