BigDataCube: Distributed Multidimensional Data Cube Over Apache Spark : An OLAP framework that brings Multidimensional Data Analysis to modern Distributed Storage Systems

Detta är en Master-uppsats från KTH/Skolan för informations- och kommunikationsteknik (ICT)

Författare: Pradeep Peiris Weherage; [2017]

Nyckelord: ;

Sammanfattning: Multidimensional Data Analysis is an important subdivision of Data Analytic paradigm. Data Cube provides the base abstraction for Multidimensional Data Analysis and helps in discovering useful insights of a dataset. On-Line Analytical Processing (OLAP) enhanced it to the next level supporting online responses to analytical queries with the underlying technique that precomputes (materializes) the data cubes. Data Cube Materialization is significant for OLAP, but it is an expensive task in term of data processing and storage. Most of the early decision support system benefits the value of multidimensional data analysis with a standard data architecture that extract, transform and load data from multiple data sources into a centralized database called Data Warehouse, on which OLAP engines provides the data cube abstraction. But this architecture and traditional OLAP engines do not hold with modern intensive datasets. Today, we have distributed data storage systems that keep data on a cluster of computer nodes, in which distributed data processing engines like MapReduce, Spark, Storm, etc. provide more ad-hoc style data analytical capabilities. Yet, there is no proper distributed system approach available for multidimensional data analysis, nor any distributed OLAP engine is available that follows distributed data cube materialization. It is essential to have a proper Distributed Data Cube Materialization mechanism to support multidimensional data analysis over the present distributed storage systems. Various research work available today which considered MapReduce for data cube materialization. Also, Apache Spark recently enabled CUBE operator as part of their DataFrame API. The thesis raises the problem statement, the best-distributed system approach for Data Cube Materialization, MapReduce or Spark? and contributes with experiments that compare the two distributed systems in materializing data cubes over the number of records, dimensions and cluster size. The results confirm Spark is more scalable and efficient in data cube materialization than MapReduce. The thesis further contributed with a novel framework, BigDataCube, which uses Spark DataFrames underneath for materializing data cubes and fulfills the need of multidimensional data analysis for modern distributed storage systems.

  HÄR KAN DU HÄMTA UPPSATSEN I FULLTEXT. (följ länken till nästa sida)