Dataset versioning for Hops File System: Snapshotting solution for reliable and reproducible data science experiments

This is a Master's thesis from KTH/School of Information and Communication Technology (ICT)

Author: Braulio Grana Gutiérrez; [2017]

Abstract: As awareness of the potential of Big Data grows, more and more companies are creating their own Data Science divisions, and their projects are becoming large and complex, handled by large multidisciplinary teams. Furthermore, with the expansion of fields such as Deep Learning, Data Science is becoming a very popular research field in both companies and universities. In this context, it becomes crucial for data scientists to be able to reproduce their experiments and test them against earlier models developed on previous versions of a dataset. This Master's thesis presents the design and implementation of a snapshotting system for HopsFS, a distributed file system based on Apache HDFS and developed at the Swedish Institute of Computer Science (SICS). The project improves on previous solutions designed for both HopsFS and HDFS by solving problems such as the handling of incomplete blocks in snapshots, and it adds new features such as automatic snapshots, which allow users to undo the last few changes made to a file. Finally, the implementation was analyzed to compare it with the previous state of HopsFS and to measure the impact of the solution on the different operations performed by the system. The analysis showed an increase of around 40% in the time needed to perform operations such as read and write under different workloads, due mostly to the new database queries used in this solution.
