Evaluation of Calibration Methods to Adjust for Infrequent Values in Data for Machine Learning

This is a Master's thesis from Högskolan Dalarna/Mikrodataanalys

Author: Felipe Dutra Calainho; [2018]

Keywords: Data mining; resampling; ensemble

Abstract: The performance of supervised machine learning algorithms is highly dependent on the distribution of the target variable. Infrequent values are more difficult to predict, as there are fewer examples from which the algorithm can learn the patterns that contain those values. These infrequent values are a common problem with real data and are the object of interest in many fields, such as medical research, finance and economics. For classification, the problem has been comprehensively studied. For regression, on the other hand, few contributions are available. In this work, two ensemble methods from classification are adapted to the regression case. Additionally, an existing oversampling technique, namely SmoteR, is tested. The aim of this research is therefore to examine the influence of oversampling and ensemble techniques on the accuracy of regression models when predicting infrequent values. To assess the performance of the proposed techniques, two data sets are used: one concerning house prices and the other concerning patients with Parkinson's Disease. The findings corroborate the usefulness of the techniques for reducing the prediction error of infrequent observations. In the best case, the proposed Random Distribution Sample Ensemble reduced the overall RMSE by 8.09% and the RMSE for infrequent values by 6.44% when compared with the best performing benchmark for the housing data set.
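
The two error measures quoted above, overall RMSE and RMSE restricted to infrequent values, can be illustrated with a minimal sketch. This is not the thesis's code: the function names, the toy numbers, and in particular the rule that "infrequent" means a target at or above a fixed cutoff are assumptions made only for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error over paired observations."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def rmse_infrequent(y_true, y_pred, threshold):
    """RMSE over observations whose target is at or above `threshold`.

    Assumption: 'infrequent' is defined here by a simple cutoff on the
    target value; the thesis may define it differently.
    """
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t >= threshold]
    if not pairs:
        raise ValueError("no observations at or above threshold")
    return rmse([t for t, _ in pairs], [p for _, p in pairs])


def relative_reduction(baseline, candidate):
    """Percentage by which `candidate` error is lower than `baseline`."""
    return 100.0 * (baseline - candidate) / baseline


# Toy data (invented): one rare, large target that is predicted poorly.
y_true = [1.0, 2.0, 3.0, 10.0]
y_pred = [1.0, 2.0, 3.0, 6.0]

overall = rmse(y_true, y_pred)
rare = rmse_infrequent(y_true, y_pred, threshold=10.0)
```

On this toy data the error on the single rare observation is larger than the overall error, which is the gap that resampling and ensemble techniques aim to narrow. A percentage such as those reported above would correspond to `relative_reduction(benchmark_rmse, method_rmse)` under this sketch's definitions.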
