  1. Ablation Programming for Machine Learning

    Master's thesis, KTH/School of Electrical Engineering and Computer Science (EECS)

    Author: Sina Sheikholeslami; [2019]
    Keywords: Distributed Machine Learning; Distributed Systems; Ablation Studies; Apache Spark; Keras; Hopsworks;

    As machine learning systems are used in a growing number of applications, from analysis of satellite sensory data and health-care analytics to smart virtual assistants and self-driving cars, they are also becoming more complex. This means that more time and computing resources are needed to train the models, and the number of design choices and hyperparameters increases as well.

  2. Intelligent Resource Management for Large-scale Data Stream Processing

    Professional degree thesis (advanced level), Uppsala University/Department of Information Technology

    Author: Oliver Stein; [2019]

    With the increasing use of cloud computing resources, utilizing these resources efficiently becomes more and more important. Data stream processing is a paradigm gaining in popularity, with tools such as Apache Spark Streaming or Kafka widely available, and companies are shifting towards real-time monitoring of data in areas such as sensor networks, financial data, and anomaly detection.

  3. Performance assessment of Apache Spark applications

    Bachelor's thesis, Linnaeus University/Department of Computer Science and Media Technology (DM)

    Author: Salam AL Jorani; [2019]
    Keywords: Big Data; Apache Spark; BigBlu; Lazy evaluation of Spark;

    This thesis addresses the challenges of large software and data-intensive systems. We discuss a Big Data software system that consists of a substantial amount of Linux configuration, some Scala code, and a set of frameworks that work together to make the system perform smoothly.

  4. Geo-distributed multi-layer stream aggregation

    Master's thesis, KTH/School of Electrical Engineering and Computer Science (EECS)

    Author: Pietro Cannalire; [2018]
    Keywords: stream processing; geo-distributed; architecture; algorithms; windowing; data synopses; Apache Spark Structured Streaming; Apache Kafka; Misra-Gries algorithm;

    Standard processing architectures are sufficient for many applications, since existing stream processing frameworks can manage distributed data processing. In some specific cases, however, geographically distributed data sources require spreading the processing further over a large area by employing a geographically distributed architecture.

  5. Hive, Spark, Presto for Interactive Queries on Big Data

    Master's thesis, KTH/School of Electrical Engineering and Computer Science (EECS)

    Author: Nikita Gureev; [2018]
    Keywords: Hadoop; SQL; interactive analysis; Hive; Spark; Spark SQL; Presto; Big Data;

    Traditional relational database systems cannot be used efficiently to analyze data of large volume and varied formats, i.e. big data. Apache Hadoop is one of the first open-source tools to provide a distributed data storage system and resource manager.