Auto-Tuning Apache Spark Parameters for Processing Large Datasets

This is a Master's thesis from KTH/School of Electrical Engineering and Computer Science (EECS)

Abstract: Apache Spark is a popular open-source distributed processing framework that enables efficient processing of large amounts of data. It has a large number of configuration parameters that strongly affect performance. Selecting an optimal configuration for an Apache Spark application deployed in a cloud environment is a complex task, and a poor choice may result not only in poor performance but also in increased costs. Manually adjusting the Apache Spark configuration parameters can take a lot of time and may not lead to the best outcomes, particularly in a cloud environment where computing resources are allocated dynamically and workloads can fluctuate significantly. The focus of this thesis project is the development of an auto-tuning approach for Apache Spark configuration parameters. Four machine learning models are formulated and evaluated to predict Apache Spark’s performance. Additionally, two models for Apache Spark configuration parameter search are created and evaluated to identify the parameters that yield the shortest execution time. The results demonstrate that, by adjusting Apache Spark configuration parameters with the developed auto-tuning approach, Apache Spark applications can achieve a shorter execution time than with the default parameters. The developed auto-tuning approach gives improved cluster utilization and shorter job execution time, with an average performance improvement of 49.98%, 53.84%, and 64.16% for the three different types of Apache Spark applications benchmarked.
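
The two-stage idea described in the abstract (a model that predicts performance, plus a search over configurations that minimizes the prediction) can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not name its models or search methods, so the sketch assumes plain random search and a hand-written toy cost function standing in for a trained predictor. The parameter names are real Spark settings, but the candidate values, the baseline values, and the functions `predict_runtime` and `random_search` are hypothetical.

```python
import random

# Hypothetical search space. The keys are real Spark settings; the candidate
# values are illustrative assumptions (memory is given in GiB as a plain int).
SEARCH_SPACE = {
    "spark.executor.memory": [1, 2, 4, 8],
    "spark.executor.cores": [1, 2, 4],
    "spark.sql.shuffle.partitions": [32, 64, 128, 200],
}

# Assumed baseline configuration used only for comparison in this sketch.
BASELINE = {
    "spark.executor.memory": 1,
    "spark.executor.cores": 1,
    "spark.sql.shuffle.partitions": 200,
}


def predict_runtime(config):
    """Toy stand-in for a trained performance model (seconds, made up)."""
    mem = config["spark.executor.memory"]
    cores = config["spark.executor.cores"]
    parts = config["spark.sql.shuffle.partitions"]
    # More cores and memory reduce the predicted time; partition counts far
    # from an arbitrary sweet spot of 64 add a small penalty.
    return 100.0 / cores + 40.0 / mem + 0.05 * abs(parts - 64)


def random_search(space, predict, n_trials=50, seed=0):
    """Sample configurations at random and keep the lowest predicted time."""
    rng = random.Random(seed)
    best_config, best_time = None, float("inf")
    for _ in range(n_trials):
        candidate = {name: rng.choice(values) for name, values in space.items()}
        predicted = predict(candidate)
        if predicted < best_time:
            best_config, best_time = candidate, predicted
    return best_config, best_time


best_config, best_time = random_search(SEARCH_SPACE, predict_runtime)
baseline_time = predict_runtime(BASELINE)
improvement = 100.0 * (baseline_time - best_time) / baseline_time

print("best configuration:", best_config)
print(f"predicted improvement over baseline: {improvement:.2f}%")
```

The printed percentage comes from the toy cost function and says nothing about real Spark workloads. In practice the predictor would be trained on measured job runs, and the search would be evaluated against actual execution times.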
