Optimisation of Ad-hoc analysis of an OLAP cube using SparkSQL

Detta är en Uppsats för yrkesexamina på avancerad nivå från Uppsala universitet/Avdelningen för beräkningsvetenskap

Författare: Milja Aho; [2017]

Nyckelord: ;

Sammanfattning: An Online Analytical Processing (OLAP) cube is a way to represent a multidimensional database. The multidimensional database often uses a star schema and populates it with the data from a relational database. The purpose of using an OLAP cube is usually to find valuable insights in the data like trends or unexpected data and is therefore often used within Business intelligence (BI). Mondrian is a tool that handles OLAP cubes that uses the query language MultiDimensional eXpressions (MDX) and translates it to SQL queries. Apache Kylin is an engine that can be used with Apache Hadoop to create and query OLAP cubes with an SQL interface. This thesis investigates whether the engine Apache Spark running on a Hadoop cluster is suitable for analysing OLAP cubes and what performance that can be expected. The Star Schema Benchmark (SSB) has been used to provide Ad-Hoc queries and to create a large database containing over 1.2 billion rows. This database was created in a cluster in the Omicron office consisting of five worker nodes and one master node. Queries were then sent to the database using Mondrian integrated into the BI platform Pentaho. Amazon Web Services (AWS) has also been used to create clusters with 3, 6 and 15 slaves to see how the performance scales. Creating a cube in Apache Kylin on the Omicron cluster was also tried, but was not possible due to the cluster running out of memory. The results show that it took between 8.2 to 11.9 minutes to run the MDX queries on the Omicron cluster. On both the Omicron cluster and the AWS cluster, the SQL queries ran faster than the MDX queries. The AWS cluster ran the queries faster than the Omicron cluster, even though fewer nodes were used. It was also noted that the AWS cluster did not scale linearly, neither for the MDX nor the SQL queries.

  HÄR KAN DU HÄMTA UPPSATSEN I FULLTEXT. (följ länken till nästa sida)