An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing

Detta är en Kandidat-uppsats från KTH/Skolan för elektroteknik och datavetenskap (EECS)

Sammanfattning: Machine learning methods can be used for solving important binary classification tasks in domains such as display advertising and recommender systems. In many of these domains categorical features are common and often of high cardinality. Using one-hot encoding in such circumstances lead to very high dimensional vector representations, causing memory and computability concerns for machine learning models. This thesis investigated the viability of a binary encoding scheme in which categorical values were mapped to integers that were then encoded in a binary format. This binary scheme allowed for representing categorical features using log2(d)-dimensional vectors, where d is the dimension associated with a one-hot encoding. To evaluate the performance of the binary encoding, it was compared against one-hot and feature hashed representations with the use of linear logistic regression and neural networks based models. These models were trained and evaluated using data from two publicly available datasets: Criteo and Census. The results showed that a one-hot encoding with a linear logistic regression model gave the best performance according to the PR-AUC metric. This was, however, at the expense of using 118 and 65,953 dimensional vector representations for Census and Criteo respectively. A binary encoding led to a lower performance but used only 35 and 316 dimensions respectively. For Criteo, binary encoding suffered significantly in performance and feature hashing was perceived as a more viable alternative. It was also found that employing a neural network helped mitigate any loss in performance associated with using binary and feature hashed representations.

  HÄR KAN DU HÄMTA UPPSATSEN I FULLTEXT. (följ länken till nästa sida)