Measures of statistical dependence for feature selection: A computational study

This is a Master's thesis from Umeå universitet/Statistik

Author: Mohamad Alshalabi; [2022]

Keywords:

Abstract: The importance of feature selection for statistical and machine learning models derives from the explainability it lends models and from its ability to uncover new relationships, leading to new discoveries. Straightforward feature selection methods measure the dependence between the candidate features and the response variable. This thesis studies the selection of features according to a maximal statistical dependency criterion based on generalized Pearson's correlation coefficients, e.g., Wijayatunga's coefficient. I present a framework for feature selection based on these coefficients for high-dimensional feature variables. The results are compared with those obtained by applying elastic net regression (suited to high-dimensional data). The generalized Pearson's correlation coefficient is a metric-based measure in which the metric is the Hellinger distance, regarded as a distance between probability distributions. Wijayatunga's coefficient was originally proposed for the discrete case; here it is generalized to continuous variables by discretization and kernelization. It is of particular interest how the discretization behaves as the bins are made finer. The study employs both synthetic and real-world data to illustrate the validity and power of this feature selection process. Moreover, a new normalization method for mutual information is included. The results show that both measures have considerable potential for detecting associations. The feature selection experiment shows that elastic net regression is superior to the proposed method; nevertheless, further investigation of this subject is warranted.
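To make the dependence criterion concrete, below is a minimal sketch of a Hellinger-distance-based dependence measure between two variables after discretization: the Hellinger distance between the empirical joint distribution and the product of its marginals. The function name hellinger_dependence, the equal-width binning, and the lack of any further normalization are illustrative assumptions; Wijayatunga's coefficient as defined in the thesis may be normalized differently.

# Sketch (assumption): Hellinger distance between the empirical joint
# distribution of (X, Y) and the product of its marginals. This is not
# necessarily the thesis's exact normalization of Wijayatunga's coefficient.
import numpy as np

def hellinger_dependence(x, y, bins=10):
    """Dependence measure for two continuous variables via discretization."""
    # Discretize each variable into equal-width bins and form the joint table.
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()              # empirical joint distribution
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of X
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of Y
    p_indep = p_x * p_y                     # joint under independence
    # Hellinger distance between the joint and the independence model.
    return np.sqrt(0.5 * np.sum((np.sqrt(p_xy) - np.sqrt(p_indep)) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
print(hellinger_dependence(x, x + rng.normal(size=5000)))  # dependent pair
print(hellinger_dependence(x, rng.normal(size=5000)))      # ~independent pair

The measure is zero exactly when the joint distribution factorizes, which is what makes it usable as a maximal-dependency ranking criterion for candidate features; refining the bins, as the abstract discusses, changes the empirical tables and hence the estimate.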
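For the comparison baseline, here is a hedged sketch of elastic-net-based feature selection with scikit-learn, where features with nonzero fitted coefficients are retained. The cross-validation grid and the synthetic data are illustrative assumptions, not the thesis's experimental setup.

# Sketch (assumption): elastic net as a feature selection baseline.
# Features with nonzero coefficients after cross-validated fitting are kept;
# the settings below are illustrative, not the thesis's exact configuration.
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
n, p = 200, 50                         # n samples, p candidate features
X = rng.normal(size=(n, p))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=n)  # 2 true features

# Cross-validate both the penalty strength and the L1/L2 mixing ratio.
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5).fit(X, y)
selected = np.flatnonzero(model.coef_)  # indices of retained features
print("selected features:", selected)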
