Evaluating Feature Selection Methods for Automated Computer Based Diagnostics of Diabetes

This is a Bachelor's thesis from KTH/Computer Science

Authors: Erik Wachtmeister; Noel Karlsson Johansson; [2022]


Abstract: Diabetes affects roughly 8% of the world population over the age of 18 and causes approximately 5 million deaths among people over the age of 20 each year. Furthermore, perhaps half of all people with diabetes are undiagnosed. Automated computer-based diagnostics could help diagnose them, but machine learning has its challenges. Effective diagnosis requires high-quality data representations, and this is where feature selection can help. Feature selection (FS) aims to improve the quality of a given dataset, ideally reducing computational cost and improving the accuracy and interpretability of the resulting models. We have studied feature selection methods for diabetes datasets. The results suggest that FS, for diabetes datasets, can reduce the dataset size by half or more without sacrificing performance. In general, the improvement from FS depends on the classifier, although no significant performance gain was demonstrated. Of the methods studied, the best at reducing dataset size while retaining performance were f-score and mutual information. Linear and non-linear classifiers combined with feature selection are compared using Naive Bayes and Support Vector Machines with a linear kernel and a radial basis function kernel. The results do not yield enough evidence to support the hypothesis that FS methods affect linear and non-linear classifiers differently. Which features are useful for predicting diabetes, from a feature selection perspective, is highly dataset dependent. However, if multiple datasets are examined, FS could provide another interesting perspective on which features might be diabetes risk factors.
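
To make the idea of filter-based feature selection concrete, the following is a minimal sketch of ranking features by f-score, one of the two methods the abstract singles out. It is not the authors' implementation: it assumes that "f-score" means the one-way ANOVA F-statistic computed per feature, and the toy data, the column meanings (glucose, age, noise) and the function names `anova_f_score` and `select_top_k` are hypothetical and chosen only for illustration.

```python
from statistics import mean


def anova_f_score(values, labels):
    """One-way ANOVA F-statistic of a single feature across class labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    grand_mean = mean(values)
    # Between-class variability: how far class means lie from the overall mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    # Within-class variability: spread of samples around their own class mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def select_top_k(rows, labels, k):
    """Return the indices of the k highest-scoring features, plus all scores."""
    n_features = len(rows[0])
    scores = [anova_f_score([row[j] for row in rows], labels)
              for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores


# Hypothetical toy data (not from the thesis): columns are
# [glucose, age, noise]; label 1 marks a diabetic sample.
rows = [
    [85, 25, 3], [90, 30, 7], [88, 45, 5], [92, 50, 4],
    [150, 35, 6], [160, 40, 3], [155, 55, 5], [165, 60, 7],
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

selected, scores = select_top_k(rows, labels, k=2)
print("selected feature indices:", selected)
print("f-scores:", [round(s, 3) for s in scores])
```

In a study like the one described, the selected columns would then be passed to each classifier (Naive Bayes, linear SVM, RBF SVM) and the performance compared against training on the full feature set. The score is computed one feature at a time, so it cannot detect features that are only informative in combination, which is one reason a method such as mutual information may rank features differently.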
