Comparison of Popular Data Processing Systems

This is a Master's thesis from KTH/School of Electrical Engineering and Computer Science (EECS)

Author: Kamil Nasr; [2021]

Keywords: Apache Spark; Apache Flink; Apache Beam; Spark Runner; Flink Runner; Direct Runner; Big Data Analytics; Data Processing Systems; Benchmarking; Kaggle

Abstract: Data processing is generally defined as the collection and transformation of data to extract meaningful information. It involves a multitude of processes, such as validation, sorting, summarization, and aggregation, to name a few. Many analytics engines exist today for large-scale data processing, namely Apache Spark, Apache Flink and Apache Beam. Each of these engines has its own advantages and drawbacks. In this thesis report, we used all three of these engines to process data from the Carbon Monoxide Daily Summary Dataset to determine the emission levels per area and unit of time. Then, we compared the performance of these three engines using different metrics. The results showed that Apache Beam, while offering greater convenience when writing programs, was slower than Apache Flink and Apache Spark. Spark Runner in Beam was the fastest runner and Apache Spark was the fastest data processing framework overall.

