Purging Sensitive Data in Logs Using Machine Learning

This is a thesis for a professional degree at advanced level from Uppsala University, Department of Information Technology

Author: Simon Ljus; [2020]

Keywords: GDPR; LSTM; RNN; GloVe; personal data; machine learning

Abstract: This thesis investigates how to remove personal data from logs using machine learning when rule-based scripts are not enough and manual scanning is too extensive. Three machine learning models were created and compared: a word model using logistic regression, a word model using LSTM, and a sentence model also using LSTM. Log data were cleaned and annotated using rule-based scripts, datasets from various countries and dictionaries from various languages. The dataset created for the sentence-based model was imbalanced, and a light form of data augmentation was applied. A hyperparameter optimization library was used to find the best hyperparameter combination. The models learned the training and validation sets well but performed worse on the test set, which consisted of log data from a different server that logged other types of data.
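
To illustrate the kind of rule-based cleaning and annotation the abstract mentions, here is a minimal sketch in Python. The thesis does not list its actual rules, so the two patterns (`EMAIL`, `IPV4`), the function names `annotate` and `purge`, the mask string, and the sample log line are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical rules: the thesis does not specify which patterns it used.
EMAIL = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")
IPV4 = re.compile(r"^(\d{1,3}\.){3}\d{1,3}$")


def annotate(line):
    """Split a log line on whitespace and label each token.

    Label 1 means the token matched one of the rules above, 0 means it did not.
    """
    return [
        (tok, int(bool(EMAIL.match(tok) or IPV4.match(tok))))
        for tok in line.split()
    ]


def purge(line, mask="<REDACTED>"):
    """Replace every rule-matched token with a mask string."""
    return " ".join(mask if label else tok for tok, label in annotate(line))


# Invented sample line, not taken from the thesis data.
example = "login ok user=anna anna.berg@example.com from 192.168.0.12"
print(purge(example))
# → login ok user=anna <REDACTED> from <REDACTED>
```

In this sketch the e-mail address and the IP address are masked, but the name inside `user=anna` passes through untouched. Cases like this, which fixed patterns do not cover, are where a learned word or sentence model could complement the rules.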
