Genomsökning av filsystem för att hitta personuppgifter : Med Linear chain conditional random field och Regular expression

Detta är en Kandidat-uppsats från Mittuniversitetet/Avdelningen för informationssystem och -teknologi

Sammanfattning: The new General Data Protection Regulation (GDPR) Act will apply to all companies within the European Union after 25 May. This means stricter legal requirements for companies that in some way store personal data. The goal of this project is therefore to make it easier for companies to meet the new legal requirements. This by creating a tool that searches file systems and visually shows the user in a graphical user interface which files contain personal data. The tool uses Named entity recognition with the Linear chain conditional random field algorithm which is a type of supervised learning method in machine learning. This algorithm is used in the project to find names and addresses in files. The different models are trained with different parameters and the training is done using the stanford NER library in Java. The models are tested by a test file containing 45,000 words where the models themselves can predict all classes to the words in the file. The models are then compared with each other using the measurements of precision, recall and F-score to find the best model. The tool also uses Regular Expression to find emails, IP numbers, and social security numbers. The result of the final machine learning model shows that it does not find all names and addresses, but that can be improved by increasing exercise data. However, this is something that requires a more powerful computer than the one used in this project. An analysis of how the Swedish language is built would also need to be done to apply the most appropriate parameters for the training of the model.

  HÄR KAN DU HÄMTA UPPSATSEN I FULLTEXT. (följ länken till nästa sida)